Paper analysis: Building An Elastic Query Engine on Disaggregated Storage (NSDI 2020), by fxjwind

Shared-nothing architectures have been the foundation of traditional query execution engines and data warehousing systems. In such architectures, persistent data (e.g., customer data stored as tables) is partitioned across a set of compute nodes, each of which is responsible only for its local data. Such shared-nothing architectures have enabled query execution engines that scale well, provide cross-job isolation and good data locality, resulting in high performance for a variety of workloads.

Problems with shared-nothing

Changes in today's data and workloads aggravate these problems. Traditional data warehousing systems were designed to operate on recurring queries on data with predictable volume and rate, e.g., data coming from within the organization: transactional systems, enterprise resource planning applications, customer relationship management applications, etc. The situation has changed significantly. Today, an increasingly large fraction of data comes from less controllable, external sources (e.g., application logs, social media, web applications, mobile systems, etc.), resulting in ad-hoc, time-varying, and unpredictable query workloads. For such workloads, shared-nothing architectures beget high cost, inflexibility, poor performance and inefficiency, which hurts production applications and cluster deployments.

To address these problems, the paper proposes Snowflake; the key insight is the separation of compute and storage. To overcome these limitations, we designed Snowflake, an elastic, transactional query execution engine with SQL support comparable to state-of-the-art databases. The key insight in Snowflake design is that the aforementioned limitations of shared-nothing architectures are rooted in tight coupling of compute and storage, and the solution is to decouple the two! Snowflake thus disaggregates compute from persistent storage; customer data is stored in a persistent data store (e.g., Amazon S3 [5], Azure Blob Storage [8], etc.) that provides high availability and on-demand elasticity. Compute elasticity is achieved using a pool of pre-warmed nodes that can be assigned to customers on an on-demand basis.

The paper covers the following points: ephemeral (intermediate) storage, query scheduling, elasticity and multi-tenancy. Snowflake system has now been active for several years and today serves thousands of customers executing millions of queries over petabytes of data on a daily basis. This paper describes Snowflake system design, with a particular focus on ephemeral storage system design, query scheduling, elasticity and efficiently supporting multi-tenancy.

The paper analyzes a 14-day query dataset and reports the following findings: the mix of query types; intermediate result sizes that differ by several orders of magnitude across queries; a very small local store used as a cache still achieves a good hit rate; good elasticity; peak resource utilization is high while average utilization is low.

We also use statistics collected during execution of 70 million queries over a period of 14 contiguous days in February 2018 to present a detailed study of network, compute and storage characteristics in Snowflake. Our key findings are:

The paper proposes three directions for future research: decoupling compute from ephemeral storage; a deeper storage hierarchy; sub-second billing.

Our study both corroborates (confirms) exciting ongoing research directions in the community, as well as highlights several interesting avenues for future research:

- Decoupling of compute and ephemeral storage: Snowflake decouples compute from persistent storage to achieve elasticity. However, currently, compute and ephemeral storage are still tightly coupled. As we show in §4, the ratio of compute capacity and ephemeral storage capacity in our production clusters can vary by several orders of magnitude, leading to either underutilization of CPU or thrashing (contention) of ephemeral storage, for ad-hoc query processing workloads. To that end, recent academic work on decoupling compute from ephemeral storage [22, 27] is of extreme interest. However, more work is needed in ephemeral storage system design, especially in terms of providing fine-grained elasticity, multi-tenancy, and cross-query isolation (§4, §7).

- Deep storage hierarchy:

Snowflake ephemeral storage system, similar to recent work on compute-storage disaggregation [14, 15], uses caching of frequently read persistent data to both reduce the network traffic and to improve data locality.

However, existing mechanisms for improving caching and data locality were designed for two-tier storage systems (memory as the main tier and HDD/SSD as the second tier).

As we discuss in §4, the storage hierarchy in our production clusters is getting increasingly deeper, and new mechanisms are needed that can efficiently exploit the emerging deep storage hierarchy.

- Pricing at sub-second timescales:

Snowflake achieves compute elasticity at fine-grained timescales by serving customers using a pool of pre-warmed nodes.

This was cost-efficient with cloud pricing at hourly granularity.

However, most cloud providers have recently transitioned to sub-second pricing [6], leading to new technical challenges in efficiently achieving resource elasticity and resource sharing across multiple tenants.

Resolving these challenges may require design decisions and tradeoffs that may be different from those in Snowflake's current design (§7).

A note on how far the findings generalize across systems, workloads and infrastructure:

Our study has an important caveat (a point to note).

It focuses on a specific system (Snowflake), a specific workload (SQL queries), and a specific cloud infrastructure (S3).

While our system is large-scale, has thousands of customers executing millions of queries, and runs on top of one of the most prominent infrastructures, it is nevertheless limited.

We leave to future work an evaluation of whether our study and observations generalize to other systems, workloads and infrastructures.

However, we are hopeful that just like prior workload studies on network traffic characteristics [9] and cloud workloads [28] (each of which also focused on a specific system implementation running a specific workload on a specific infrastructure) fueled (spurred) and aided research in the past, our study and publicly released data will be useful for the community.

We provide an overview of Snowflake design.

Snowflake treats persistent and intermediate data differently;

we describe these in §2.1, followed by a high-level overview of Snowflake architecture (§2.2) and query execution process (§2.3).

The system design starts from the storage hierarchy, with three kinds of state: persistent data, intermediate data and metadata.

Like most query execution engines and data warehousing systems, Snowflake has three forms of application state:

Persistent data is customer data stored as tables in the database.

Each table may be read by many queries, over time or even concurrently.

These tables are thus long-lived and require strong durability and availability guarantees.

Intermediate data is generated by query operators (e.g., joins) and is usually consumed by nodes participating in executing that query.

Intermediate data is thus short-lived.

Moreover, to avoid nodes being blocked on intermediate data access, low-latency, high-throughput access to intermediate data is preferred over strong durability guarantees.

Indeed, in case of failures happening during the (short) lifetime of intermediate data, one can simply rerun the part of the query that produced it.

Metadata such as object catalogs, mapping from database tables to corresponding files in persistent storage, statistics, transaction logs, locks, etc.

This paper primarily focuses on persistent and intermediate data, as the volume of metadata is typically relatively small and does not introduce interesting systems challenges.

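Figure 1 shows the high-level architecture for Snowflake.

The architecture has four layers:

- the service layer (control plane, SQL, scheduling);

- the compute layer, whose core is a pre-warmed ECS pool (EC2 instances in the paper);

- the intermediate storage layer, a purpose-built distributed store co-located with the compute nodes, where adding or removing nodes requires no repartitioning;

- the persistent storage layer.

It has four main components: a centralized service for orchestrating (coordinating) end-to-end query execution, a compute layer, a distributed ephemeral storage system and a persistent data store.

We describe each of these below.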

Centralized Control via Cloud Services.

All Snowflake customers interact with and submit queries to a centralized layer called Cloud Services (CS) [12].

This layer is responsible for access control, query optimization and planning, scheduling, transaction management, concurrency control, etc.

CS is designed and implemented as a multi-tenant and long-lived service with sufficient replication for high availability and scalability.

Thus, failure of individual service nodes does not cause loss of state or availability, though some of the queries may fail and be re-executed transparently.

Elastic Compute via Virtual Warehouse abstraction.

Customers are given access to computational resources in Snowflake through the abstraction of a Virtual Warehouse (VW).

Each VW is essentially a set of AWS EC2 instances on top of which customer queries execute in a distributed fashion.

Customers pay for compute time based on the VW size.

Each VW can be elastically scaled on an on-demand basis upon customer request.

To support elasticity at fine-grained timescales (e.g., tens of seconds), Snowflake maintains a pool of pre-warmed EC2 instances;

upon receiving a request, we simply add/remove EC2 instances to/from that VW (in case of addition, we are able to support most requests directly from our pool of pre-warmed instances, thus avoiding instance startup time).

Each VW may run multiple concurrent queries.

In fact, many of our customers run multiple VWs (e.g., one for data ingestion, and one for executing OLAP queries).

Elastic Local Ephemeral Storage.

Intermediate data has different performance requirements compared to persistent data (§2.1).

Unfortunately, existing persistent data stores do not meet these requirements (e.g., S3 does not provide the desired low-latency and high-throughput properties needed for intermediate data to ensure minimal blocking of compute nodes);

hence, we built a distributed ephemeral storage system custom-designed to meet the requirements of intermediate data in our system.

The system is co-located with compute nodes in VWs, and is explicitly designed to automatically scale as nodes are added or removed.

We provide more details in §4 and §6, but note here that as nodes are added and removed, our ephemeral storage system does not require data repartitioning or reshuffling (thus alleviating one of the core limitations of shared-nothing architectures).

Each VW runs its own independent distributed ephemeral storage system which is used only by queries running on that particular VW.

Elastic Remote Persistent Storage.

Snowflake stores all its persistent data in a remote, disaggregated, persistent data store.

We store persistent data in S3 despite the relatively modest (a polite way of saying poor) latency and throughput performance because of S3's elasticity, high availability and durability properties. (These are the pros and cons of S3.)

S3 supports storing immutable files: files can only be overwritten in full and do not even allow append operations.

However, S3 supports read requests for parts of a file.

To store tables in S3, Snowflake partitions them horizontally into large, immutable files that are equivalent to blocks in traditional database systems [12].

Within each file, the values of each individual attribute or column are grouped together and compressed, as in PAX [2]. (Files are organized PAX-style in row groups, with a structure similar to ORC and Parquet.)

Each file has a header that stores the offset of each column within the file, enabling us to use the partial read functionality of S3 to only read columns that are needed for query execution.
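To make the header-plus-partial-read idea concrete, here is a minimal Python sketch. The on-disk layout (a 4-byte length prefix, a JSON header mapping column name to offset and length, then zlib-compressed column bytes) and the names `build_file`, `read_range` and `read_column` are assumptions made for illustration; the paper does not describe Snowflake's actual file format. `read_range` stands in for a ranged read against the object store and simply slices an in-memory blob.

```python
import json
import struct
import zlib


def build_file(columns: dict[str, bytes]) -> bytes:
    """Pack columns as [4-byte header length][JSON header][compressed column bytes]."""
    body = b""
    header = {}
    for name, raw in columns.items():
        compressed = zlib.compress(raw)
        # Offsets are relative to the start of the body (hypothetical layout).
        header[name] = [len(body), len(compressed)]
        body += compressed
    header_bytes = json.dumps(header).encode()
    return struct.pack(">I", len(header_bytes)) + header_bytes + body


def read_range(blob: bytes, start: int, length: int) -> bytes:
    """Stand-in for a partial (byte-range) read against the remote store."""
    return blob[start:start + length]


def read_column(blob: bytes, name: str) -> bytes:
    """Fetch only the header and the one column that the query needs."""
    (header_len,) = struct.unpack(">I", read_range(blob, 0, 4))
    header = json.loads(read_range(blob, 4, header_len))
    offset, length = header[name]
    body_start = 4 + header_len
    return zlib.decompress(read_range(blob, body_start + offset, length))
```

A query that touches a single column thus issues three small ranged reads (length prefix, header, column) instead of downloading the whole file.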

All VWs belonging to the same customer have access to the same shared tables via the remote persistent store, and hence do not need to physically copy data from one VW to another.

To recap why an ephemeral storage system is needed:

Snowflake uses a custom-designed distributed storage system for management and exchange of intermediate data, due to two limitations in existing persistent data stores [5, 8].

First, they fall short of providing the necessary latency and throughput performance to avoid compute tasks being blocked on intermediate data exchange.

Second, they provide much stronger availability and durability semantics than what is needed for intermediate data.

Our ephemeral storage system allows us to overcome both these limitations.

Tasks executing query operations (e.g., joins) on a given compute node write intermediate data locally; and tasks consuming the intermediate data read it either locally or remotely over the network (depending on the node where the task is scheduled, §5).

The basic design choice is that intermediate data, besides living in memory, may also go to SSD or S3; the reason is simply that it does not always fit.

We made two important design decisions in our ephemeral storage system.

First, rather than designing a pure in-memory storage system, we decided to use both memory and local SSDs: tasks write as much intermediate data as possible to their local memory;

when memory is full, intermediate data is spilled to local SSDs.

Our rationale (underlying reasoning) is that while purely in-memory systems can achieve superior performance when the entire data fits in memory, they are too restrictive to handle the variety of our target workloads.

Figure 3 (left) shows that there are queries that exchange hundreds of gigabytes or even terabytes of intermediate data; for such queries, it is hard to fit all intermediate data in main memory.

The second design decision was to allow intermediate data to spill into the remote persistent data store in case the local SSD capacity is exhausted.

Spilling intermediate data to S3, instead of other compute nodes, is preferable for a number of reasons:

it does not require keeping track of intermediate data location, it alleviates the need for explicitly handling out-of-memory or out-of-disk errors for large queries, and overall, it allows us to keep our ephemeral storage system thin and highly performant.
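The two design decisions amount to a simple placement order for intermediate data: memory first, then local SSD, then the remote store. The sketch below illustrates that order only; the class name `SpillingStore`, the byte-counting bookkeeping and the `"remote"` label are assumptions for illustration, not Snowflake's implementation.

```python
class SpillingStore:
    """Places intermediate chunks in memory, then local SSD, then the remote store."""

    def __init__(self, memory_bytes: int, ssd_bytes: int):
        # Free capacity per local tier; the remote tier is treated as unbounded.
        self.free = {"memory": memory_bytes, "ssd": ssd_bytes}
        self.placement = {}

    def write(self, chunk_id: str, size: int) -> str:
        """Return the tier that received the chunk."""
        for tier in ("memory", "ssd"):
            if self.free[tier] >= size:
                self.free[tier] -= size
                self.placement[chunk_id] = (tier, size)
                return tier
        # Local capacity exhausted: spill to the remote persistent store
        # instead of failing with an out-of-memory or out-of-disk error.
        self.placement[chunk_id] = ("remote", size)
        return "remote"

    def release(self, chunk_id: str) -> None:
        """Free a chunk once the query no longer needs it (short-lived data)."""
        tier, size = self.placement.pop(chunk_id)
        if tier != "remote":
            self.free[tier] += size
```

Because the last tier never refuses a write, the caller needs no error path for oversized queries, which is one of the reasons the paper gives for spilling to S3 rather than to other compute nodes.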

Because the resources a query will use and the size of its intermediate results cannot be estimated, it is hard to guarantee that a sudden burst of intermediate data will not exhaust local resources, so the overflow has to go to S3.

Solving this requires first decoupling the compute layer from the intermediate-data layer, so that each can be provisioned independently according to a query's needs; the intermediate-data layer must also support fine-grained elasticity.

Future Directions. For performance-critical queries, we want intermediate data to entirely fit in memory, or at least in SSDs, and not spill to S3.

This requires accurate resource provisioning (supplying). However, provisioning CPU, memory and storage resources while achieving high utilization turns out to be challenging due to two reasons.

The first reason is the limited number of available node instances (each providing a fixed amount of CPU, memory and storage resources), and significantly more diverse resource demands across queries.

For instance, Figure 3 (center) shows that, across queries, the ratio of compute requirements and intermediate data sizes can vary by as much as six orders of magnitude.

The available node instances simply do not provide enough options to accurately match node hardware resources with such diverse query demands.

Second, even if we could match node hardware resources with query demands, accurately provisioning memory and storage resources requires a priori (advance) knowledge of the intermediate data size generated by the query.

However, our experience is that predicting the volume of intermediate data generated by a query is hard, or even impossible, for most queries.

As shown in Figure 3, intermediate data sizes not only vary over multiple orders of magnitude across queries, but also have little or no correlation with the amount of persistent data read or the expected execution time of the query.

To resolve the first challenge, we could decouple compute from ephemeral storage.

This would allow us to match available node resources with query resource demands by independently provisioning individual resources.

However, the challenge of unpredictable intermediate data sizes is harder to resolve.

For such queries, simultaneously achieving high performance and high resource utilization would require both decoupling of compute and ephemeral storage, as well as efficient techniques for fine-grained elasticity of the ephemeral storage system.

We discuss the latter in more detail in §6.

Intermediate result sets are short-lived: large at peak but small on average, so they can share the local disk with a cache.

The sharing is opportunistic, with intermediate data taking priority.

One of the key observations we made during early phases of ephemeral storage system design is that intermediate data is short-lived.

Thus, while storing intermediate data requires large memory and storage capacity at peak, the demand is low on average.

This allows statistical multiplexing of our ephemeral storage system capacity between intermediate data and frequently accessed persistent data.

This improves performance since (1) queries in data warehouse systems exhibit highly skewed access patterns over persistent data [10]; and

(2) ephemeral storage system performance is significantly better than that of (existing) remote persistent data stores.

Snowflake enables statistical multiplexing of ephemeral storage system capacity between intermediate data and persistent data by “opportunistically” caching frequently accessed persistent data files,

where opportunistically refers to the fact that intermediate data storage is always prioritized over caching persistent data files.

However, a persistent data file cannot be cached on any node: Snowflake assigns input file sets for the customer to nodes using consistent hashing over persistent data file names.

A file can only be cached at the node to which it consistently hashes; each node uses a simple LRU policy to decide caching and eviction of persistent data files.
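The placement rule (consistent hashing over file names, plus a per-node LRU cache) can be sketched as follows. This is a generic hash ring with virtual nodes and an `OrderedDict`-based LRU; the class names, the use of MD5, the `vnodes` parameter and counting capacity in files rather than bytes are all assumptions made for illustration, since the paper does not give these details.

```python
import bisect
import hashlib
from collections import OrderedDict


def _hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hashing of file names onto nodes, with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 64):
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def owner(self, file_name: str) -> str:
        """Return the only node that is allowed to cache this file."""
        idx = bisect.bisect_right(self._keys, _hash(file_name)) % len(self._points)
        return self._points[idx][1]


class LruFileCache:
    """Per-node LRU cache of persistent data files (capacity counted in files)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._files = OrderedDict()

    def get(self, name: str):
        if name not in self._files:
            return None
        self._files.move_to_end(name)  # mark as most recently used
        return self._files[name]

    def put(self, name: str, data: bytes) -> None:
        self._files[name] = data
        self._files.move_to_end(name)
        while len(self._files) > self.capacity:
            self._files.popitem(last=False)  # evict the least recently used file
```

Since every reader computes the same `owner` for a file name, no directory of cached files is needed: a task either finds the file on its owner node or falls back to the remote store.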

Given the performance gap between our ephemeral storage system and remote persistent data store, such opportunistic caching of persistent data files improves the execution time for many queries in Snowflake.

Furthermore, since storage of intermediate data is always prioritized over caching of persistent data files, such an opportunistic performance improvement in query execution time can be achieved without impacting performance for intermediate data access.

Cached files are assigned to a particular node by consistent hashing, and consistency is ensured by making the cache write-through.

When nodes are added or removed, lazy consistent hashing is used to avoid reshuffling, i.e. moving data around.

Maintaining the right system semantics during opportunistic caching of persistent data files requires a careful design.

First, to ensure data consistency, the “view” of persistent files in the ephemeral storage system must be consistent with those in the remote persistent data store.

We achieve this by forcing the ephemeral storage system to act as a write-through cache for persistent data files.

Second, consistent hashing of persistent data files on nodes in a naive way requires reshuffling of cached data when VWs are elastically scaled.

We implement a lazy consistent hashing optimization in our ephemeral storage system that avoids such data reshuffling altogether; we describe this when we discuss Snowflake elasticity in §6.

Because the cache is write-through, an extra local copy has to be written: every write of persistent data must also update the cache synchronously.

Persistent data being opportunistically cached in the ephemeral storage system means that some subset of persistent data access requests could be served by the ephemeral storage system (depending on whether or not there is a cache hit).

Figure 4 shows the persistent data I/O traffic distribution, in terms of fraction of bytes, between the ephemeral storage system and remote persistent data store.

The write-through nature of our ephemeral storage system results in the amount of data written to ephemeral storage being roughly of the same magnitude as the amount of data written to the remote persistent data store (they are not always equal because of prioritizing storage of intermediate data over caching of persistent data).
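A small sketch can show why the two write volumes are close but not identical. Every write goes to the remote store, while the local copy is kept only if space is left after intermediate data has taken what it needs. The class `WriteThroughCache`, its byte counters and the `intermediate_bytes` field are hypothetical names used for illustration.

```python
class WriteThroughCache:
    """Write-through cache in which intermediate data has priority over cached files."""

    def __init__(self, remote: dict, capacity_bytes: int):
        self.remote = remote            # stand-in for the remote persistent store
        self.capacity = capacity_bytes  # local capacity shared with intermediate data
        self.intermediate_bytes = 0     # space currently held by intermediate data
        self.cached = {}
        self.bytes_to_remote = 0
        self.bytes_to_cache = 0

    def _free(self) -> int:
        used = self.intermediate_bytes + sum(len(v) for v in self.cached.values())
        return self.capacity - used

    def write(self, name: str, data: bytes) -> None:
        # The remote store is always written, so it stays the source of truth.
        self.remote[name] = data
        self.bytes_to_remote += len(data)
        # Drop any older local copy so the cache can never serve stale data.
        self.cached.pop(name, None)
        # Opportunistic: cache only if room is left after intermediate data.
        if self._free() >= len(data):
            self.cached[name] = data
            self.bytes_to_cache += len(data)

    def read(self, name: str) -> bytes:
        if name in self.cached:
            return self.cached[name]
        return self.remote[name]
```

When intermediate data occupies most of the local capacity, `bytes_to_cache` falls behind `bytes_to_remote`, which matches the gap the paper attributes to prioritizing intermediate data.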

The cache works quite well even though the local disk is small.

Even though our ephemeral storage capacity is significantly lower than that of a customer's persistent data (around 0.1% on average),

skewed file access distributions and temporal file access patterns common in data warehouses [7] enable reasonably high cache hit rates (the average hit rate is close to 80% for read-only queries and around 60% for read-write queries).

Figure 5 shows the hit rate distributions across queries. The median hit rates are even higher.

Future directions:

- how to balance intermediate results and the cache in their use of the limited local disk;

- with the arrival of NVM and remote ephemeral storage, the storage hierarchy will keep getting deeper, which calls for a new multi-tier caching architecture.

Future Directions. Figure 4 and Figure 5 suggest that more work is needed on caching.

In addition to locality of reference in access patterns, cache hit rate also depends on the effective cache size available to the query relative to the amount of persistent data accessed by the query.

The effective cache size, in turn, depends on both the VW size and the volume of intermediate data generated by concurrently executing queries.

Our preliminary (initial) analysis has not led to any conclusive observations on the impact of the above two factors on the observed cache hit rates, and a more fine-grained analysis is needed to understand factors that impact cache hit rates.

We highlight two additional technical problems.

First, since end-to-end query performance depends on both cache hit rate for persistent data files and I/O throughput for intermediate data, it is important to optimize how the ephemeral storage system splits capacity between the two.

Although we currently use the simple policy of always prioritizing intermediate data, it may not be the optimal policy with respect to end-to-end performance objectives (e.g., average query completion time across all queries from the same customer).

For example, it may be better to prioritize caching a persistent data file that is going to be accessed by many queries over intermediate data that is accessed by only one.

It would be interesting to explore extensions to known caching mechanisms that optimize for end-to-end query performance objectives [7] to take intermediate data into account.

Second, existing caching mechanisms were designed for two-tier storage systems (memory as the main tier and HDD/SSD as the second tier).

In Snowflake, we already have three tiers of hierarchy with compute-local memory, ephemeral storage system and remote persistent data store;

as emerging non-volatile memory devices are deployed in the cloud and as recent designs on remote ephemeral storage systems mature [22], the storage hierarchy in the cloud will get increasingly deeper.

Snowflake uses traditional two-tier mechanisms: each node implements a local LRU policy for evictions from local memory to local SSD, and an independent LRU policy for evictions from local SSD to remote persistent data store.

However, to efficiently exploit the deepening storage hierarchy, we need new caching mechanisms that can efficiently coordinate caching across multiple tiers.

We believe many of the above technical challenges are not specific to Snowflake, and would apply more broadly to any distributed application built on top of disaggregated storage.

We now describe the query execution process in Snowflake.

Customers submit their queries to the Cloud Services (CS) for execution on a specific VW.

CS performs query parsing, query planning and optimization, and creates a set of tasks to be scheduled on compute nodes of the VW.

Locality-aware task scheduling.

To fully exploit the ephemeral storage system, Snowflake colocates each task with persistent data files that it operates on using a locality-aware scheduling mechanism (recall, these files may be cached in the ephemeral storage system).

Specifically, recall that Snowflake assigns persistent data files to compute nodes using consistent hashing over table file names.

Thus, for a fixed VW size, each persistent data file is cached on a specific node.

Snowflake schedules the task that operates on a persistent data file to the node to which its file consistently hashes.

As a result of this scheduling scheme, query parallelism is tightly coupled with consistent hashing of files on nodes: a query is scheduled for cache locality and may be distributed across all the nodes in the VW.

For instance, consider a customer that has 1 million files' worth of persistent data, and is running a VW with 10 nodes.

Consider two queries, where the first query operates on 100 files, and the second query operates on 100,000 files; then, with high likelihood, both queries will run on all the 10 nodes because of files being consistently hashed onto all the 10 nodes.
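The "with high likelihood" claim can be checked with a back-of-the-envelope calculation. The sketch below assumes each file lands on a node uniformly and independently, which is an idealization of consistent hashing and not something the paper states; the function names are made up for this example. Under that assumption, a given node is missed by all k files with probability (1 - 1/n)^k, and a union bound over the n nodes gives a lower bound on the probability that every node receives at least one file.

```python
def expected_nodes_touched(num_files: int, num_nodes: int) -> float:
    """Expected number of distinct nodes that receive at least one of the files."""
    miss = (1 - 1 / num_nodes) ** num_files  # a given node gets none of the files
    return num_nodes * (1 - miss)


def prob_all_nodes_touched_lower_bound(num_files: int, num_nodes: int) -> float:
    """Union-bound lower bound on the probability that no node is left out."""
    miss = (1 - 1 / num_nodes) ** num_files
    return max(0.0, 1 - num_nodes * miss)


# The two queries from the example: 100 files and 100,000 files on 10 nodes.
small_query = prob_all_nodes_touched_lower_bound(100, 10)
large_query = prob_all_nodes_touched_lower_bound(100_000, 10)
```

Even for the 100-file query the bound is above 0.999, so both queries are expected to fan out over all 10 nodes despite a thousandfold difference in input size.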

Work stealing. It is known that consistent hashing can lead to imbalanced partitions [19]. (This is a very common practice: idle nodes steal tasks to execute.)

In order to avoid overloading of nodes and improve load balance, Snowflake uses work stealing, a simple optimization that allows a node to steal a task from another node if the expected completion time of the task (sum of execution time and waiting time) is lower at the new node.

When such work stealing occurs, the persistent data files needed to execute the task are read from the remote persistent data store rather than from the node on which the task was originally scheduled.

This avoids increasing load on an already overloaded node where the task was originally scheduled (note that work stealing happens only when a node is overloaded).
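The stealing rule compares expected completion times on the two nodes. The sketch below is one way to express that comparison; the `Node` class, modelling a queue as a list of estimated run times, and the `remote_read_penalty` parameter (the extra time to fetch input from the remote store instead of a local cache) are assumptions for illustration, not details from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Estimated run times (seconds) of queued tasks, in execution order.
    queue: list = field(default_factory=list)

    def wait_time(self) -> float:
        return sum(self.queue)


def maybe_steal(idle: Node, busy: Node, remote_read_penalty: float):
    """Move the last queued task from `busy` to `idle` if it would finish sooner there.

    Returns "remote" (where the stolen task reads its input) or None if no steal.
    """
    if not busy.queue:
        return None
    run_time = busy.queue[-1]
    # On the busy node the task waits for everything ahead of it, then runs.
    finish_on_busy = busy.wait_time()
    # On the stealing node it also pays for reading its files remotely.
    finish_on_idle = idle.wait_time() + run_time + remote_read_penalty
    if finish_on_idle < finish_on_busy:
        busy.queue.pop()
        idle.queue.append(run_time + remote_read_penalty)
        # Input is read from the remote store, not from the overloaded node.
        return "remote"
    return None
```

Reading from the remote store costs the stolen task some time, but it keeps the overloaded node from also having to serve its cached files over the network.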

Scheduling has two extremes: fully colocating tasks with their data, which avoids remote reads of persistent data but requires transferring intermediate data; or placing all tasks together, which avoids transferring intermediate results during execution.

Future Directions. Schedulers can place tasks onto nodes using two extreme options:

one is to colocate tasks with their cached persistent data, as in our current implementation.

As discussed in the example above, this may end up scheduling all queries on all nodes in the VW;

while such a scheduling policy minimizes network traffic for reading persistent data, it may lead to increased network traffic for intermediate data exchange.

The other extreme is to place all tasks on a single node. This would obviate (eliminate) the need for network transfers for intermediate data exchange but would increase network traffic for persistent data reads.

Neither of these extremes may be the right choice for all queries.

It would be interesting to codesign query schedulers that would pick just the right set of nodes to obtain a sweet spot between the two extremes, and then schedule individual tasks onto these nodes.

In this section, we discuss how Snowflake design achieves one of its core goals: resource elasticity, that is, scaling of compute and storage resources on an on-demand basis.

Disaggregating compute from persistent storage enables Snowflake to independently scale compute and persistent storage resources.

Storage elasticity is offloaded to persistent data stores [5]; compute elasticity, on the other hand, is achieved using a pre-warmed pool of nodes that can be added/removed to/from customer VWs on an on-demand basis.

By keeping a pre-warmed pool of nodes, Snowflake is able to provide compute elasticity at the granularity of tens of seconds.

One of the challenges that Snowflake had to resolve in order to achieve elasticity efficiently is related to data management in the ephemeral storage system.

Recall that our ephemeral storage system opportunistically caches persistent data files; each file can be cached only on the node to which it consistently hashes within the VW.

The problem is similar to shared-nothing architectures: any fixed partitioning mechanism (in our case, consistent hashing) requires large amounts of data to be reshuffled upon scaling of nodes;

moreover, since the very same set of nodes are also responsible for query processing, the system observes a significant performance impact during the scaling process.

Snowflake resolves this challenge using a lazy consistent hashing mechanism that completely avoids any reshuffling of data upon elastic scaling of nodes by exploiting the fact that a copy of cached data is stored at the remote persistent data store.

Specifically, Snowflake relies on the caching mechanism to eventually “converge” to the right state.

"Lazy" means that when a node is added, the cache is not reshuffled; the next time task 6 is assigned to the new node, it reads file 6 from the remote store and caches file 6 at that point.
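The lazy behavior can be sketched in a few lines. Adding a node changes only the ring membership; no cached file is copied, and a file that now hashes to the new node is fetched from the remote store on its next access. The `Warehouse` class, the `owner` helper (a hash ring rebuilt on every call, for brevity), MD5 and the unbounded per-node dictionaries are all assumptions made for illustration; in a real system the stale copies left on the old nodes would presumably age out through the LRU policy.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def owner(nodes, file_name: str, vnodes: int = 64) -> str:
    """Consistent-hash owner of a file for the current node membership."""
    points = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    keys = [p for p, _ in points]
    idx = bisect.bisect_right(keys, _hash(file_name)) % len(points)
    return points[idx][1]


class Warehouse:
    def __init__(self, nodes, remote: dict):
        self.nodes = list(nodes)
        self.remote = remote                    # file name -> bytes (remote store)
        self.caches = {n: {} for n in nodes}    # per-node cache, unbounded here
        self.remote_reads = 0

    def add_node(self, node: str) -> None:
        # Lazy: only the membership changes; no cached data is moved.
        self.nodes.append(node)
        self.caches[node] = {}

    def read(self, file_name: str):
        """Run a task on the file's current owner, filling its cache on a miss."""
        node = owner(self.nodes, file_name)
        cache = self.caches[node]
        if file_name not in cache:
            cache[file_name] = self.remote[file_name]
            self.remote_reads += 1
        return node, cache[file_name]
```

After scaling from three to four nodes, only the files whose owner became the new node incur one extra remote read each; all other files keep hitting their existing cache entries, so the caches converge without a bulk transfer.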

In the current design, each VW uses a dedicated set of nodes and its own ephemeral storage, which has the advantage of good isolation;

the downside is low resource utilization, because customers' peaks are always brief and staggered. Good utilization therefore requires sharing resources, and so this is a tradeoff between isolation and utilization.

Snowflake currently supports multi-tenancy through the VW abstraction.

Each VW operates on an isolated set of nodes, with its own ephemeral storage system.

This allows Snowflake to provide performance isolation to its customers.

In this section, we present a few system-wide characteristics for our VWs and use these to motivate an alternate sharing-based architecture for Snowflake.

The VW architecture in Snowflake leads to the traditional performance isolation versus utilization tradeoff.

Figure 10 (top four) shows that our VWs achieve fairly good, but not ideal, average CPU utilization; however, other resources are usually underutilized on average.

Figure 11 provides some reasons for the low average resource utilization in Figure 10 (top four):

the figure shows the variability of resource usage across VWs; specifically, we observe that for up to 30% of VWs, the standard deviation of CPU usage over time is as large as the mean itself.

This results in underutilization as customers tend to provision VWs to meet peak demand.

In terms of peak utilization, several of our VWs experience periods of heavy utilization, but such high-utilization periods are not necessarily synchronized across VWs.

An example of this is shown in Figure 10 (bottom two), where we see that over a period of two hours, there are several points when one VW's utilization is high while the other VW's utilization is simultaneously low.

While we were aware of this performance isolation versus utilization tradeoff when we designed Snowflake, recent trends are pushing us to revisit this design choice.

Specifically, maintaining a pool of pre-warmed instances was cost-efficient when infrastructure providers used to charge at an hourly granularity;

however, the recent move to per-second pricing [6] by all major cloud infrastructure providers has raised interesting challenges.

From our (provider's) perspective, we would like to exploit this finer-grained pricing model to cut down operational costs.

However, doing so is not straightforward, as this trend has also led to an increase in customer demand for finer-grained pricing.

As a result, maintaining a pre-warmed pool of nodes for elasticity is no longer cost-effective:

previously, in the hourly billing model, as long as at least one customer VW used a particular node during a one-hour duration, we could charge that customer for the entire duration.

However, with per-second billing, we cannot charge unused cycles on pre-warmed nodes to any particular customer.

This cost-inefficiency makes a strong case for moving to a sharing-based model, where compute and ephemeral storage resources are shared across customers:

in such a model we can provide elasticity by statistically multiplexing customer demands across a shared set of resources, avoiding the need to maintain a large pool of pre-warmed nodes.

In the next subsection, we highlight several technical challenges that need to be resolved to realize such a shared architecture.

The variability in resource usage over time across VWs, as shown in Figure 11, indicates that several of our customer workloads are bursty (spiky) in nature.

Hence, moving to a shared architecture would enable Snowflake to achieve better resource utilization via fine-grained statistical multiplexing.

Snowflake today exposes VW sizes to customers in abstract “T-shirt” sizes (small, large, XL, etc.), each representing different resource capacities.

Customers are not aware of how these VWs are implemented (number of nodes used, instance types, etc.).

Ideally we would like to maintain the same abstract VW interface to customers and change the underlying implementation to use shared resources instead of isolated nodes.

The challenge, however, is to achieve isolation properties close to our current architecture. (The challenge is to retain resource-isolation properties while sharing.)

The key metric of interest from customers' point of view is query performance, that is, end-to-end query completion times.

While a purely shared architecture is likely to provide good average-case performance, maintaining good performance at the tail is challenging. (Tail performance is hard to guarantee.)

The two key resources that need to be isolated in VWs are compute and ephemeral storage.

There has been a lot of work [18, 35, 36] on compute isolation in the data center context that Snowflake could leverage.

Moreover, the centralized task scheduler and uniform execution runtime in Snowflake make the problem easier than that of isolating compute in general-purpose clusters.

Here, we instead focus on the problem of isolating memory and storage, which has only recently started to receive attention in the research community [25]. (Since compute isolation has already been studied extensively, the focus is on isolating memory and storage.)

The goal here is to design a shared ephemeral storage system (using both memory and SSDs) that supports fine-grained elasticity without sacrificing isolation properties across tenants.

With respect to sharing and isolation of ephemeral storage, we outline two key challenges.

First, since our ephemeral storage system multiplexes both cached persistent data and intermediate data, both of these entities need to be jointly shared while ensuring cross-tenant isolation.

While Snowflake could leverage techniques from existing literature [11, 26] for sharing cache, we need a mechanism that is additionally aware of the co-existence of intermediate data.

Unfortunately, predicting the effective lifetime of cache entries is difficult.

Evicting idle cache entries from tenants and providing them to other tenants while ensuring hard isolation is not possible, as we cannot predict when a tenant will next access the cache entry.

Some past works [11, 33] have used techniques like idle memory taxation to deal with this issue.

We believe there is more work to be done, both in defining more reasonable isolation guarantees and designing lifetime-aware cache sharing mechanisms that can provide such guarantees.

The second challenge is that of achieving elasticity without cross-tenant interference:

scaling up the shared ephemeral storage system capacity in order to meet the demands of a particular customer should not impact other tenants sharing the system.

For example, if we were to naively use Snowflake's current ephemeral storage system, isolation properties would be trivially violated.

Since all cache entries in Snowflake are consistently hashed onto the same global address space, scaling up the ephemeral storage system capacity would end up triggering the lazy consistent hashing mechanism for all tenants.

This may result in multiple tenants seeing increased cache misses, resulting in degraded performance.

Resolving this challenge would require the ephemeral storage system to provide private address spaces to each individual tenant, and upon scaling of resources, to reorganize data only for those tenants that have been allocated additional resources.

Average memory utilization in our VWs is low (Figure 10); this is particularly concerning since DRAM is expensive.

Although resource sharing would improve CPU and memory utilization, it is unlikely to lead to optimal utilization across both dimensions.

Further, variability characteristics of CPU and memory are significantly different (Figure 11), indicating the need for independent scaling of these resources.

Memory disaggregation [1, 14, 15] provides a fundamental solution to this problem.

However, as discussed in §4.2, accurately provisioning resources is hard;

since over-provisioning memory is expensive, we need efficient mechanisms to share disaggregated memory across multiple tenants while providing isolation guarantees.

In this section we discuss related work and other systems similar to Snowflake.

Our previous work [12] discusses SQL-related aspects of Snowflake and presents related literature on those aspects.

This paper focuses on the disaggregation, ephemeral storage, caching, task scheduling, elasticity and multi-tenancy aspects of Snowflake;

in the related work discussion below, we primarily focus on these aspects.

SQL-as-a-Service systems.

There are several other systems that offer SQL functionality as a service in the cloud.

These include Amazon Redshift [16], Aurora [4], Athena [3], Google BigQuery [30] and Microsoft Azure Synapse Analytics [24].

While there are papers that describe the design and operational experience of some of these systems,

we are not aware of any prior work that undertakes a data-driven analysis of workload and system characteristics similar to ours.

Redshift [16] stores primary replicas of persistent data within compute VM clusters (S3 is only used for backup); (Redshift is shared-nothing: compute and storage are not separated.)

thus, it may not be able to achieve the benefits that Snowflake achieves by decoupling compute from persistent storage.

Aurora [4] and BigQuery [30] (based on the architecture of Dremel [23]) decouple compute and persistent storage similar to Snowflake. (Aurora does separate them, but relies on a specially designed storage service.)

Aurora, however, relies on a custom-designed persistent storage service that is capable of offloading database log processing, instead of a traditional blob store.

Decoupling compute and ephemeral storage systems.

Previous work [20] makes the case for flash storage disaggregation by studying a key-value store workload from Facebook.

Our observations corroborate (confirm) this argument and further extend it in the context of data warehousing workloads.

Pocket [22] and Locus [27] are ephemeral storage systems designed for serverless analytics applications.

If we were to disaggregate compute and ephemeral storage in Snowflake, such systems would be good candidates.

However, these systems do not provide fine-grained resource elasticity during the lifetime of a query.

Thus, they either have to assume a priori knowledge of intermediate data sizes (for provisioning resources at the time of submitting queries),

or suffer from performance degradation if such knowledge is not available in advance.

As discussed in §4.1, predicting intermediate data sizes is extremely hard.

It would be nice to extend these systems to provide fine-grained elasticity and cross-query isolation.

Technologies for high-performance access to remote flash storage [13, 17, 21] would also be integral to efficiently realize decoupling of compute and ephemeral storage system.

Multi-tenant resource sharing.

ESX server [33] pioneered techniques for multi-tenant memory sharing in the virtual machine context, including ballooning and idle-memory taxation.

Memshare [11] considers multi-tenant sharing of cache capacity in DRAM caches in the single-machine context, sharing un-reserved capacity among applications in a way that maximizes hit rate.

FairRide [26] similarly considers multi-tenant cache sharing in the distributed setting while taking into account sharing of data between tenants.

Mechanisms for sharing and isolation of cache resources similar to the ones used in these works would be important in enabling Snowflake to adopt a resource-shared architecture.

As discussed previously, it would be interesting to extend these mechanisms to make them aware of the different characteristics and requirements of intermediate and persistent data.
