Paper walkthrough: Building An Elastic Query Engine on Disaggregated Storage (NSDI 2020), by fxjwind

Shared-nothing architectures have been the foundation of traditional query execution engines and data warehousing systems. In such architectures, persistent data (e.g., customer data stored as tables) is partitioned across a set of compute nodes, each of which is responsible only for its local data. Such shared-nothing architectures have enabled query execution engines that scale well, provide cross-job isolation and good data locality resulting in high performance for a variety of workloads.

Problems with shared-nothing

Changes in today's data and workloads make the problem worse.

Traditional data warehousing systems were designed to operate on recurring queries on data with predictable volume and rate, e.g., data coming from within the organization: transactional systems, enterprise resource planning applications, customer relationship management applications, etc. The situation has changed significantly. Today, an increasingly large fraction of data comes from less controllable, external sources (e.g., application logs, social media, web applications, mobile systems, etc.) resulting in ad-hoc, time-varying, and unpredictable query workloads. For such workloads, shared-nothing architectures beget high cost, inflexibility, poor performance and inefficiency, which hurts production applications and cluster deployments.

To address these problems the paper proposes Snowflake; the key insight is to separate compute from storage.

To overcome these limitations, we designed Snowflake, an elastic, transactional query execution engine with SQL support comparable to state-of-the-art databases. The key insight in Snowflake design is that the aforementioned limitations of shared-nothing architectures are rooted in tight coupling of compute and storage, and the solution is to decouple the two! Snowflake thus disaggregates compute from persistent storage; customer data is stored in a persistent data store (e.g., Amazon S3 [5], Azure Blob Storage [8], etc.) that provides high availability and on-demand elasticity. Compute elasticity is achieved using a pool of pre-warmed nodes that can be assigned to customers on an on-demand basis.

The paper is organized around the following topics: ephemeral (intermediate) storage, query scheduling, elasticity, and multi-tenancy.

The Snowflake system has now been active for several years and today serves thousands of customers executing millions of queries over petabytes of data, on a daily basis. This paper describes Snowflake system design, with a particular focus on ephemeral storage system design, query scheduling, elasticity and efficiently supporting multi-tenancy.

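The paper analyzes a 14-day query dataset and reports the following findings: the mix of query types; intermediate result sizes that differ by several orders of magnitude across queries; a very small local store used as a cache still achieves a good hit rate; good elasticity; and high peak resource utilization but low average utilization.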

We also use statistics collected during execution of 70 million queries over a period of 14 contiguous days in February 2018 to present a detailed study of network, compute and storage characteristics in Snowflake. Our key findings are:

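The paper proposes three directions for future research: decoupling compute from ephemeral storage; a deeper storage hierarchy; and sub-second billing.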

Our study both corroborates exciting ongoing research directions in the community and highlights several interesting avenues for future research:

- Decoupling of compute and ephemeral storage: Snowflake decouples compute from persistent storage to achieve elasticity. However, currently, compute and ephemeral storage are still tightly coupled. As we show in §4, the ratio of compute capacity and ephemeral storage capacity in our production clusters can vary by several orders of magnitude, leading to either underutilization of CPU or thrashing of ephemeral storage, for ad-hoc query processing workloads. To that end, recent academic work on decoupling compute from ephemeral storage [22, 27] is of extreme interest. However, more work is needed in ephemeral storage system design, especially in terms of providing fine-grained elasticity, multi-tenancy, and cross-query isolation (§4, §7).

- Deep storage hierarchy: Snowflake's ephemeral storage system, similar to recent work on compute-storage disaggregation [14, 15], uses caching of frequently read persistent data to both reduce the network traffic and to improve data locality. However, existing mechanisms for improving caching and data locality were designed for two-tier storage systems (memory as the main tier and HDD/SSD as the second tier). As we discuss in §4, the storage hierarchy in our production clusters is getting increasingly deeper, and new mechanisms are needed that can efficiently exploit the emerging deep storage hierarchy.

- Pricing at sub-second timescales: Snowflake achieves compute elasticity at fine-grained timescales by serving customers using a pool of pre-warmed nodes. This was cost-efficient with cloud pricing at hourly granularity. However, most cloud providers have recently transitioned to sub-second pricing [6], leading to new technical challenges in efficiently achieving resource elasticity and resource sharing across multiple tenants. Resolving these challenges may require design decisions and tradeoffs that may be different from those in Snowflake's current design (§7).

A word on how far this system, workload, and infrastructure generalize.

Our study has an important caveat. It focuses on a specific system (Snowflake), a specific workload (SQL queries), and a specific cloud infrastructure (S3). While our system is large-scale, has thousands of customers executing millions of queries, and runs on top of one of the most prominent infrastructures, it is nevertheless limited. We leave to future work an evaluation of whether our study and observations generalize to other systems, workloads and infrastructures. However, we are hopeful that just like prior workload studies on network traffic characteristics [9] and cloud workloads [28] (each of which also focused on a specific system implementation running a specific workload on a specific infrastructure) fueled and aided research in the past, our study and publicly released data will be useful for the community.

We provide an overview of Snowflake design. Snowflake treats persistent and intermediate data differently; we describe these in §2.1, followed by a high-level overview of Snowflake architecture (§2.2) and query execution process (§2.3).

The system design starts from the storage hierarchy, which has three kinds of state: persistent, intermediate, and metadata.

Like most query execution engines and data warehousing systems, Snowflake has three forms of application state:

Persistent data is customer data stored as tables in the database. Each table may be read by many queries, over time or even concurrently. These tables are thus long-lived and require strong durability and availability guarantees.

Intermediate data is generated by query operators (e.g., joins) and is usually consumed by nodes participating in executing that query. Intermediate data is thus short-lived. Moreover, to avoid nodes being blocked on intermediate data access, low-latency high-throughput access to intermediate data is preferred over strong durability guarantees. Indeed, in case of failures happening during the (short) lifetime of intermediate data, one can simply rerun the part of the query that produced it.

Metadata includes object catalogs, the mapping from database tables to corresponding files in persistent storage, statistics, transaction logs, locks, etc.

This paper primarily focuses on persistent and intermediate data, as the volume of metadata is typically relatively small and does not introduce interesting systems challenges.

Figure 1 shows the high-level architecture for Snowflake.

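The architecture has four layers:

- Service layer (control plane, SQL, scheduling)

- Compute layer, built around a pool of pre-warmed EC2 instances

- Ephemeral storage layer, a purpose-built distributed store co-located with the compute nodes; adding or removing nodes requires no repartitioning

- Persistent storage layer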

It has four main components: a centralized service for orchestrating end-to-end query execution, a compute layer, a distributed ephemeral storage system and a persistent data store. We describe each of these below.

Centralized Control via Cloud Services.

All Snowflake customers interact with and submit queries to a centralized layer called Cloud Services (CS) [12]. This layer is responsible for access control, query optimization and planning, scheduling, transaction management, concurrency control, etc. CS is designed and implemented as a multi-tenant and long-lived service with sufficient replication for high availability and scalability. Thus, failure of individual service nodes does not cause loss of state or availability, though some of the queries may fail and be re-executed transparently.

Elastic Compute via Virtual Warehouse abstraction.

Customers are given access to computational resources in Snowflake through the abstraction of a Virtual Warehouse (VW). Each VW is essentially a set of AWS EC2 instances on top of which customer queries execute in a distributed fashion. Customers pay for compute time based on the VW size. Each VW can be elastically scaled on an on-demand basis upon customer request. To support elasticity at fine-grained timescales (e.g., tens of seconds), Snowflake maintains a pool of pre-warmed EC2 instances; upon receiving a request, we simply add/remove EC2 instances to/from that VW (in case of addition, we are able to support most requests directly from our pool of pre-warmed instances, thus avoiding instance startup time). Each VW may run multiple concurrent queries. In fact, many of our customers run multiple VWs (e.g., one for data ingestion, and one for executing OLAP queries).

Elastic Local Ephemeral Storage.

Intermediate data has different performance requirements compared to persistent data (§2.1). Unfortunately, existing persistent data stores do not meet these requirements (e.g., S3 does not provide the desired low-latency and high-throughput properties needed for intermediate data to ensure minimal blocking of compute nodes); hence, we built a distributed ephemeral storage system custom-designed to meet the requirements of intermediate data in our system. The system is co-located with compute nodes in VWs, and is explicitly designed to automatically scale as nodes are added or removed. We provide more details in §4 and §6, but note here that as nodes are added and removed, our ephemeral storage system does not require data repartitioning or reshuffling (thus alleviating one of the core limitations of shared-nothing architectures). Each VW runs its own independent distributed ephemeral storage system which is used only by queries running on that particular VW.

Elastic Remote Persistent Storage.

Snowflake stores all its persistent data in a remote, disaggregated, persistent data store. We store persistent data in S3 despite the relatively modest latency and throughput performance because of S3's elasticity, high availability and durability properties.

Note on S3's strengths and weaknesses: "modest" here is a polite way of saying the performance is poor.

S3 supports storing immutable files: files can only be overwritten in full and do not even allow append operations. However, S3 supports read requests for parts of a file. To store tables in S3, Snowflake partitions them horizontally into large, immutable files that are equivalent to blocks in traditional database systems [12]. Within each file, the values of each individual attribute or column are grouped together and compressed, as in PAX [2].

Note: files are organized PAX-style in row groups, with a structure similar to ORC and Parquet.

Each file has a header that stores the offset of each column within the file, enabling us to use the partial read functionality of S3 to only read columns that are needed for query execution.
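As a rough illustration of how a per-file column header enables partial reads, the sketch below turns a header into HTTP byte ranges, one per needed column. The header layout (column name mapped to offset and length) and the function name `column_byte_ranges` are assumptions made for this example; the paper does not describe Snowflake's actual file format at this level of detail.

```python
from typing import Dict, List, Tuple

# Hypothetical header layout: column name -> (byte offset, byte length).
Header = Dict[str, Tuple[int, int]]


def column_byte_ranges(header: Header, needed: List[str]) -> List[str]:
    """Build one HTTP Range header value per needed column."""
    ranges = []
    for name in needed:
        offset, length = header[name]
        # HTTP byte ranges are inclusive on both ends.
        ranges.append(f"bytes={offset}-{offset + length - 1}")
    return ranges


# Illustrative header for a file with three columns.
header = {"id": (128, 4000), "name": (4128, 12000), "price": (16128, 8000)}

# A query that touches only `id` and `price` skips the `name` column entirely.
print(column_byte_ranges(header, ["id", "price"]))
# -> ['bytes=128-4127', 'bytes=16128-24127']
```

Each returned string could be sent as the `Range` header of a GET request against the object store, so only the bytes of the requested columns cross the network.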

All VWs belonging to the same customer have access to the same shared tables via the remote persistent store, and hence do not need to physically copy data from one VW to another.

Restating the earlier point: why an ephemeral storage system is needed.

Snowflake uses a custom-designed distributed storage system for management and exchange of intermediate data, due to two limitations in existing persistent data stores [5, 8]. First, they fall short of providing the necessary latency and throughput performance to avoid compute tasks being blocked on intermediate data exchange. Second, they provide much stronger availability and durability semantics than what is needed for intermediate data. Our ephemeral storage system allows us to overcome both these limitations. Tasks executing query operations (e.g., joins) on a given compute node write intermediate data locally; and tasks consuming the intermediate data read it either locally or remotely over the network (depending on the node where the task is scheduled, §5).

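The basic design choice: besides memory, intermediate data may also be placed on SSD or in S3, for the simple reason that it does not always fit in memory.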

We made two important design decisions in our ephemeral storage system. First, rather than designing a pure in-memory storage system, we decided to use both memory and local SSDs: tasks write as much intermediate data as possible to their local memory; when memory is full, intermediate data is spilled to local SSDs. Our rationale is that while purely in-memory systems can achieve superior performance when the entire data fits in memory, they are too restrictive to handle the variety of our target workloads. Figure 3 (left) shows that there are queries that exchange hundreds of gigabytes or even terabytes of intermediate data; for such queries, it is hard to fit all intermediate data in main memory.

The second design decision was to allow intermediate data to spill into the remote persistent data store in case the local SSD capacity is exhausted. Spilling intermediate data to S3, instead of other compute nodes, is preferable for a number of reasons: it does not require keeping track of intermediate data location, it alleviates the need for explicitly handling out-of-memory or out-of-disk errors for large queries, and overall, it allows us to keep our ephemeral storage system thin and highly performant.
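The two design decisions amount to a three-tier placement rule, which can be sketched as below. This is a simplified model written for this walkthrough, not Snowflake's implementation: the class name `SpillingWriter`, the byte-count bookkeeping, and the assumption that each chunk lands whole in a single tier are all illustrative.

```python
class SpillingWriter:
    """Place intermediate data in memory first, then local SSD, then the remote store."""

    def __init__(self, mem_capacity: int, ssd_capacity: int):
        self.capacity = {"memory": mem_capacity, "ssd": ssd_capacity}
        self.used = {"memory": 0, "ssd": 0, "remote": 0}

    def write(self, nbytes: int) -> str:
        """Store a chunk in the first tier with room and return that tier's name."""
        for tier in ("memory", "ssd"):
            if self.used[tier] + nbytes <= self.capacity[tier]:
                self.used[tier] += nbytes
                return tier
        # The remote store is modeled as unbounded, so a write never fails
        # with an out-of-memory or out-of-disk error.
        self.used["remote"] += nbytes
        return "remote"


writer = SpillingWriter(mem_capacity=100, ssd_capacity=200)
placements = [writer.write(60) for _ in range(6)]
print(placements)
# -> ['memory', 'ssd', 'ssd', 'ssd', 'remote', 'remote']
```

Because the last tier always accepts the write, the caller needs no error path for large queries, which is the simplification the paragraph above attributes to spilling into S3.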

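Because neither the resources a query will use nor the size of its intermediate results can be estimated, it is hard to guarantee that a sudden burst of intermediate data will not exhaust local resources, so the overflow has to go to S3.

Solving this requires first decoupling the compute layer from the intermediate-data layer, so that each can be provisioned independently according to query demand, and the intermediate-data layer must also support fine-grained elasticity.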

Future Directions. For performance-critical queries, we want intermediate data to entirely fit in memory, or at least in SSDs, and not spill to S3. This requires accurate resource provisioning. However, provisioning CPU, memory and storage resources while achieving high utilization turns out to be challenging due to two reasons. The first reason is the limited number of available node instances (each providing a fixed amount of CPU, memory and storage resources), and significantly more diverse resource demands across queries. For instance, Figure 3 (center) shows that, across queries, the ratio of compute requirements and intermediate data sizes can vary by as much as six orders of magnitude. The available node instances simply do not provide enough options to accurately match node hardware resources with such diverse query demands. Second, even if we could match node hardware resources with query demands, accurately provisioning memory and storage resources requires a priori knowledge of the intermediate data size generated by the query. However, our experience is that predicting the volume of intermediate data generated by a query is hard, or even impossible, for most queries. As shown in Figure 3, intermediate data sizes not only vary over multiple orders of magnitude across queries, but also have little or no correlation with the amount of persistent data read or the expected execution time of the query.

To resolve the first challenge, we could decouple compute from ephemeral storage. This would allow us to match available node resources with query resource demands by independently provisioning individual resources. However, the challenge of unpredictable intermediate data sizes is harder to resolve. For such queries, simultaneously achieving high performance and high resource utilization would require both decoupling of compute and ephemeral storage, as well as efficient techniques for fine-grained elasticity of the ephemeral storage system. We discuss the latter in more detail in §6.

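Intermediate result sets are short-lived: large at peak but small on average, so they can share the local disk with the cache.

The sharing is opportunistic, with intermediate results taking priority.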

One of the key observations we made during early phases of ephemeral storage system design is that intermediate data is short-lived. Thus, while storing intermediate data requires large memory and storage capacity at peak, the demand is low on average. This allows statistical multiplexing of our ephemeral storage system capacity between intermediate data and frequently accessed persistent data. This improves performance since (1) queries in data warehouse systems exhibit highly skewed access patterns over persistent data [10]; and (2) ephemeral storage system performance is significantly better than that of (existing) remote persistent data stores.

Snowflake enables statistical multiplexing of ephemeral storage system capacity between intermediate data and persistent data by "opportunistically" caching frequently accessed persistent data files, where opportunistically refers to the fact that intermediate data storage is always prioritized over caching persistent data files. However, a persistent data file cannot be cached on just any node: Snowflake assigns input file sets for the customer to nodes using consistent hashing over persistent data file names. A file can only be cached at the node to which it consistently hashes; each node uses a simple LRU policy to decide caching and eviction of persistent data files. Given the performance gap between our ephemeral storage system and the remote persistent data store, such opportunistic caching of persistent data files improves the execution time for many queries in Snowflake. Furthermore, since storage of intermediate data is always prioritized over caching of persistent data files, such an opportunistic performance improvement in query execution time can be achieved without impacting performance for intermediate data access.
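The priority rule described above (intermediate data first, cached files only in leftover space, LRU among cached files) can be sketched for a single node as follows. The class `OpportunisticStore` and its byte-count accounting are assumptions made for illustration; the paper states the policy but not its implementation.

```python
from collections import OrderedDict


class OpportunisticStore:
    """One node's ephemeral capacity, shared by intermediate data and cached files."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.intermediate_bytes = 0
        self.cache = OrderedDict()  # file name -> size, least recently used first

    def _free(self) -> int:
        return self.capacity - self.intermediate_bytes - sum(self.cache.values())

    def _evict_until(self, needed: int) -> None:
        # Drop least recently used cached files until `needed` bytes are free.
        while self.cache and self._free() < needed:
            self.cache.popitem(last=False)

    def write_intermediate(self, nbytes: int) -> bool:
        """Intermediate data always wins: cached files are evicted to make room."""
        if self.intermediate_bytes + nbytes > self.capacity:
            return False  # the caller would spill to the remote store instead
        self._evict_until(nbytes)
        self.intermediate_bytes += nbytes
        return True

    def cache_file(self, name: str, size: int) -> bool:
        """Cache a persistent file only in space that intermediate data is not using."""
        if size > self.capacity - self.intermediate_bytes:
            return False
        self.cache.pop(name, None)
        self._evict_until(size)
        self.cache[name] = size
        return True

    def read_file(self, name: str) -> bool:
        """Return True on a cache hit and mark the file as most recently used."""
        if name in self.cache:
            self.cache.move_to_end(name)
            return True
        return False


store = OpportunisticStore(capacity=100)
store.cache_file("a", 40)
store.cache_file("b", 40)
store.read_file("a")              # "a" becomes most recently used
store.write_intermediate(50)      # evicts "b", the least recently used file
print(list(store.cache))          # -> ['a']
print(store.cache_file("c", 60))  # -> False: only 50 bytes are not intermediate data
```

Note that `write_intermediate` never fails because of cached files, which mirrors the claim that caching does not affect intermediate data access.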

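Cached files are assigned to a specific node by consistent hashing, and consistency is guaranteed by making the cache write-through.

When nodes are added or removed, a lazy form of consistent hashing avoids reshuffling, that is, moving data around.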

Maintaining the right system semantics during opportunistic caching of persistent data files requires a careful design. First, to ensure data consistency, the "view" of persistent files in the ephemeral storage system must be consistent with that in the remote persistent data store. We achieve this by forcing the ephemeral storage system to act as a write-through cache for persistent data files. Second, consistent hashing of persistent data files on nodes in a naive way requires reshuffling of cached data when VWs are elastically scaled. We implement a lazy consistent hashing optimization in our ephemeral storage system that avoids such data reshuffling altogether; we describe this when we discuss Snowflake elasticity in §6.

With a write-through cache an extra local copy has to be written: every write of persistent data must update the cache synchronously.

Persistent data being opportunistically cached in the ephemeral storage system means that some subset of persistent data access requests could be served by the ephemeral storage system (depending on whether or not there is a cache hit). Figure 4 shows the persistent data I/O traffic distribution, in terms of fraction of bytes, between the ephemeral storage system and the remote persistent data store. The write-through nature of our ephemeral storage system results in the amount of data written to ephemeral storage being roughly of the same magnitude as the amount of data written to the remote persistent data store (they are not always equal because of prioritizing storage of intermediate data over caching of persistent data).
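A minimal sketch of write-through semantics, assuming plain dictionaries as stand-ins for the remote store and one node's ephemeral storage (the class `WriteThroughCache` is hypothetical):

```python
class WriteThroughCache:
    """Local cache whose writes always reach the remote store as well."""

    def __init__(self, remote: dict):
        self.remote = remote  # stand-in for the remote persistent data store
        self.local = {}       # stand-in for ephemeral storage on one node

    def write(self, name: str, data: bytes) -> None:
        # Write-through: the remote copy is written on every write,
        # so the local copy is never the only one.
        self.remote[name] = data
        self.local[name] = data

    def read(self, name: str):
        """Return (data, 'hit') from the local copy or (data, 'miss') from remote."""
        if name in self.local:
            return self.local[name], "hit"
        data = self.remote[name]
        self.local[name] = data
        return data, "miss"

    def evict(self, name: str) -> None:
        # Safe at any time, because the remote store always holds the data.
        self.local.pop(name, None)


remote_store = {}
cache = WriteThroughCache(remote_store)
cache.write("part-0001", b"rows")
print(cache.read("part-0001"))  # -> (b'rows', 'hit')
cache.evict("part-0001")
print(cache.read("part-0001"))  # -> (b'rows', 'miss')
```

Since every cached file also exists remotely, eviction (whether by LRU or to make room for intermediate data) never loses data, which is also what makes the lazy rehashing discussed later possible.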

The cache works quite well, even though the local disk is small.

Even though our ephemeral storage capacity is significantly lower than that of a customer's persistent data (around 0.1% on average), skewed file access distributions and temporal file access patterns common in data warehouses [7] enable reasonably high cache hit rates (the average hit rate is close to 80% for read-only queries and around 60% for read-write queries). Figure 5 shows the hit rate distributions across queries. The median hit rates are even higher.

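Future directions:

How to balance the limited local disk between intermediate results and the cache.

With the arrival of NVM and remote ephemeral storage, the storage hierarchy will keep getting deeper, which calls for a new multi-tier caching architecture.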

Future Directions. Figure 4 and Figure 5 suggest that more work is needed on caching. In addition to locality of reference in access patterns, cache hit rate also depends on the effective cache size available to the query relative to the amount of persistent data accessed by the query. The effective cache size, in turn, depends on both the VW size and the volume of intermediate data generated by concurrently executing queries. Our preliminary analysis has not led to any conclusive observations on the impact of the above two factors on the observed cache hit rates, and a more fine-grained analysis is needed to understand factors that impact cache hit rates.

We highlight two additional technical problems. First, since end-to-end query performance depends on both the cache hit rate for persistent data files and the I/O throughput for intermediate data, it is important to optimize how the ephemeral storage system splits capacity between the two. Although we currently use the simple policy of always prioritizing intermediate data, it may not be the optimal policy with respect to end-to-end performance objectives (e.g., average query completion time across all queries from the same customer). For example, it may be better to prioritize caching a persistent data file that is going to be accessed by many queries over intermediate data that is accessed by only one. It would be interesting to explore extensions to known caching mechanisms that optimize for end-to-end query performance objectives [7] to take intermediate data into account.

Second, existing caching mechanisms were designed for two-tier storage systems (memory as the main tier and HDD/SSD as the second tier). In Snowflake, we already have three tiers of hierarchy with compute-local memory, the ephemeral storage system and the remote persistent data store; as emerging non-volatile memory devices are deployed in the cloud and as recent designs on remote ephemeral storage systems mature [22], the storage hierarchy in the cloud will get increasingly deeper. Snowflake uses traditional two-tier mechanisms: each node implements a local LRU policy for evictions from local memory to local SSD, and an independent LRU policy for evictions from local SSD to the remote persistent data store. However, to efficiently exploit the deepening storage hierarchy, we need new caching mechanisms that can efficiently coordinate caching across multiple tiers.

We believe many of the above technical challenges are not specific to Snowflake, and would apply more broadly to any distributed application built on top of disaggregated storage.

We now describe the query execution process in Snowflake. Customers submit their queries to the Cloud Services (CS) for execution on a specific VW. CS performs query parsing, query planning and optimization, and creates a set of tasks to be scheduled on compute nodes of the VW.

Locality-aware task scheduling.

To fully exploit the ephemeral storage system, Snowflake colocates each task with the persistent data files that it operates on using a locality-aware scheduling mechanism (recall, these files may be cached in the ephemeral storage system). Specifically, recall that Snowflake assigns persistent data files to compute nodes using consistent hashing over table file names. Thus, for a fixed VW size, each persistent data file is cached on a specific node. Snowflake schedules the task that operates on a persistent data file to the node to which its file consistently hashes.

As a result of this scheduling scheme, query parallelism is tightly coupled with consistent hashing of files on nodes: a query is scheduled for cache locality and may be distributed across all the nodes in the VW. For instance, consider a customer that has 1 million files worth of persistent data, and is running a VW with 10 nodes. Consider two queries, where the first query operates on 100 files, and the second query operates on 100,000 files; then, with high likelihood, both queries will run on all the 10 nodes because of files being consistently hashed onto all the 10 nodes.
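The 10-node example can be reproduced with a small consistent hashing ring. The ring below (MD5-based, with virtual nodes) is a textbook construction chosen for illustration; the paper does not say which hash function or how many virtual nodes Snowflake uses, and the names `HashRing` and `node_for` are hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit point on the ring."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hashing ring with virtual nodes (an assumed, generic design)."""

    def __init__(self, nodes, vnodes: int = 100):
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def node_for(self, file_name: str) -> str:
        """Return the node owning the first ring point after the file's hash."""
        idx = bisect.bisect_right(self._keys, _hash(file_name)) % len(self._keys)
        return self._points[idx][1]


ring = HashRing([f"node{i}" for i in range(10)])

small_query = [f"file{i}" for i in range(100)]
large_query = [f"file{i}" for i in range(100_000)]

# Each task goes to the node its input file hashes to.
nodes_small = {ring.node_for(f) for f in small_query}
nodes_large = {ring.node_for(f) for f in large_query}
print(len(nodes_small), len(nodes_large))
```

Even the 100-file query is expected to touch most or all of the 10 nodes, because the placement of a task is decided by its file's hash and not by the size of the query.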

Work stealing. It is known that consistent hashing can lead to imbalanced partitions [19].

Note: this is a very common technique; idle nodes steal tasks and execute them.

In order to avoid overloading of nodes and improve load balance, Snowflake uses work stealing, a simple optimization that allows a node to steal a task from another node if the expected completion time of the task (sum of execution time and waiting time) is lower at the new node. When such work stealing occurs, the persistent data files needed to execute the task are read from the remote persistent data store rather than from the node on which the task was originally scheduled. This avoids increasing load on an already overloaded node where the task was originally scheduled (note that work stealing happens only when a node is overloaded).
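The stealing condition is a comparison of expected completion times, which can be written down directly. The dataclass `NodeEstimate` and the sample numbers are illustrative assumptions; how Snowflake estimates waiting and execution time is not described in the text.

```python
from dataclasses import dataclass


@dataclass
class NodeEstimate:
    wait_time: float  # estimated queueing delay on this node, in seconds
    exec_time: float  # estimated execution time of the task on this node, in seconds

    @property
    def completion_time(self) -> float:
        return self.wait_time + self.exec_time


def should_steal(owner: NodeEstimate, thief: NodeEstimate) -> bool:
    """Steal only if the task is expected to finish sooner on the stealing node."""
    return thief.completion_time < owner.completion_time


# The thief reads the file from the remote store, so its execution time is
# assumed to be higher (12 s) than on the owner with a warm cache (5 s).
overloaded_owner = NodeEstimate(wait_time=30.0, exec_time=5.0)
lightly_loaded_owner = NodeEstimate(wait_time=2.0, exec_time=5.0)
idle_thief = NodeEstimate(wait_time=0.0, exec_time=12.0)

print(should_steal(overloaded_owner, idle_thief))      # -> True  (12 s < 35 s)
print(should_steal(lightly_loaded_owner, idle_thief))  # -> False (12 s > 7 s)
```

The second case shows why stealing only happens under overload: when the owner's queue is short, the cache hit outweighs the idle node's lack of waiting time.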

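Scheduling has two extremes: fully colocating tasks with their data, which avoids reading persistent data but requires intermediate data to be transferred; or placing all tasks together, which avoids transferring intermediate results during execution.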

Future Directions. Schedulers can place tasks onto nodes using two extreme options: one is to colocate tasks with their cached persistent data, as in our current implementation. As discussed in the example above, this may end up scheduling all queries on all nodes in the VW; while such a scheduling policy minimizes network traffic for reading persistent data, it may lead to increased network traffic for intermediate data exchange. The other extreme is to place all tasks on a single node. This would obviate the need for network transfers for intermediate data exchange but would increase network traffic for persistent data reads. Neither of these extremes may be the right choice for all queries. It would be interesting to codesign query schedulers that would pick just the right set of nodes to obtain a sweet spot between the two extremes, and then schedule individual tasks onto these nodes.

In this section, we discuss how Snowflake design achieves one of its core goals: resource elasticity, that is, scaling of compute and storage resources on an on-demand basis.

Disaggregating compute from persistent storage enables Snowflake to independently scale compute and persistent storage resources. Storage elasticity is offloaded to persistent data stores [5]; compute elasticity, on the other hand, is achieved using a pre-warmed pool of nodes that can be added/removed to/from customer VWs on an on-demand basis. By keeping a pre-warmed pool of nodes, Snowflake is able to provide compute elasticity at the granularity of tens of seconds.

One of the challenges that Snowflake had to resolve in order to achieve elasticity efficiently is related to data management in the ephemeral storage system. Recall that our ephemeral storage system opportunistically caches persistent data files; each file can be cached only on the node to which it consistently hashes within the VW. The problem is similar to shared-nothing architectures: any fixed partitioning mechanism (in our case, consistent hashing) requires large amounts of data to be reshuffled upon scaling of nodes; moreover, since the very same set of nodes is also responsible for query processing, the system observes a significant performance impact during the scaling process. Snowflake resolves this challenge using a lazy consistent hashing mechanism that completely avoids any reshuffling of data upon elastic scaling of nodes by exploiting the fact that a copy of cached data is stored at the remote persistent data store. Specifically, Snowflake relies on the caching mechanism to eventually "converge" to the right state.

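"Lazy" means that when a node is added, the cache is not reshuffled. The next time Task 6 is assigned to the new node, it reads file 6 from the remote store, and file 6 is cached on the new node at that point.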
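The lazy behavior can be simulated with a small model: adding a node only changes the ring, no cached data is copied, and the new owner fills its cache from the remote store the first time a remapped file is read. Everything here (the class `LazyCluster`, the MD5 ring with virtual nodes, the dictionary standing in for the remote store) is an assumption made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class LazyCluster:
    """Nodes with per-node caches; scaling never moves cached data between nodes."""

    def __init__(self, nodes, remote: dict, vnodes: int = 50):
        self.remote = remote   # stand-in for the remote persistent store
        self.vnodes = vnodes
        self.caches = {}       # node -> {file name: data}
        self.remote_reads = 0
        self._points = []
        self._keys = []
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Extend the ring only; existing caches are left untouched."""
        self.caches[node] = {}
        self._points.extend((_hash(f"{node}#{i}"), node) for i in range(self.vnodes))
        self._points.sort()
        self._keys = [point for point, _ in self._points]

    def owner(self, file_name: str) -> str:
        idx = bisect.bisect_right(self._keys, _hash(file_name)) % len(self._keys)
        return self._points[idx][1]

    def read(self, file_name: str):
        """Read on the owning node, filling its cache from remote on a miss."""
        cache = self.caches[self.owner(file_name)]
        if file_name not in cache:
            cache[file_name] = self.remote[file_name]
            self.remote_reads += 1
        return cache[file_name]


remote = {f"file{i}": f"data{i}" for i in range(1000)}
cluster = LazyCluster(["n1", "n2", "n3"], remote)
for name in remote:          # warm every cache
    cluster.read(name)

owner_before = {name: cluster.owner(name) for name in remote}
cluster.add_node("n4")       # scale up: no data is copied
moved = [name for name in remote if cluster.owner(name) != owner_before[name]]

reads_before = cluster.remote_reads
for name in remote:
    cluster.read(name)
print(cluster.remote_reads - reads_before == len(moved))  # -> True
```

Only the files that now hash to the new node incur one remote read each; the stale copies left on the old nodes are never read again and, in a real system, would eventually be evicted by LRU.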

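In the current design, each VW uses its own dedicated set of nodes and its own ephemeral storage, which gives good isolation.

The downside is low resource utilization: customers' peaks are always brief and do not coincide, so good utilization requires sharing resources. This is the trade-off between isolation and utilization.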

Snowflake currently supports multi-tenancy through the VW abstraction. Each VW operates on an isolated set of nodes, with its own ephemeral storage system. This allows Snowflake to provide performance isolation to its customers. In this section, we present a few system-wide characteristics for our VWs and use these to motivate an alternate sharing-based architecture for Snowflake.

The VW architecture in Snowflake leads to the traditional performance isolation versus utilization tradeoff. Figure 10 (top four) shows that our VWs achieve fairly good, but not ideal, average CPU utilization; however, other resources are usually underutilized on average. Figure 11 provides some reasons for the low average resource utilization in Figure 10 (top four): the figure shows the variability of resource usage across VWs; specifically, we observe that for up to 30% of VWs, the standard deviation of CPU usage over time is as large as the mean itself. This results in underutilization as customers tend to provision VWs to meet peak demand. In terms of peak utilization, several of our VWs experience periods of heavy utilization, but such high-utilization periods are not necessarily synchronized across VWs. An example of this is shown in Figure 10 (bottom two), where we see that over a period of two hours, there are several points when one VW's utilization is high while the other VW's utilization is simultaneously low.
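Two quantities in this paragraph are easy to compute on toy data: the ratio of standard deviation to mean (the coefficient of variation), and the capacity needed when each VW is provisioned for its own peak versus when two VWs with unsynchronized peaks share capacity. The traces below are invented, idealized numbers, not data from the paper, and the use of the population standard deviation is an assumption.

```python
import statistics


def coefficient_of_variation(samples) -> float:
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(samples) / statistics.mean(samples)


# Hypothetical CPU usage traces (percent of one VW's capacity) whose peaks alternate.
vw_a = [100, 0, 100, 0]
vw_b = [0, 100, 0, 100]
combined = [a + b for a, b in zip(vw_a, vw_b)]

# Isolated VWs: each one is provisioned for its own peak.
isolated_capacity = max(vw_a) + max(vw_b)
# Shared pool: capacity only has to cover the peak of the combined demand.
shared_capacity = max(combined)

isolated_utilization = sum(combined) / (len(combined) * isolated_capacity)
shared_utilization = sum(combined) / (len(combined) * shared_capacity)

print(coefficient_of_variation(vw_a))            # -> 1.0 (std equals the mean)
print(isolated_capacity, shared_capacity)        # -> 200 100
print(isolated_utilization, shared_utilization)  # -> 0.5 1.0
```

Real traces are far less regular, so the gain from sharing would be smaller than in this idealized case, but the direction of the effect is the same.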

While we were aware of this performance isolation versus utilization tradeoff when we designed Snowflake, recent trends are pushing us to revisit this design choice. Specifically, maintaining a pool of pre-warmed instances was cost-efficient when infrastructure providers used to charge at an hourly granularity; however, the recent move to per-second pricing [6] by all major cloud infrastructure providers has raised interesting challenges. From our (the provider's) perspective, we would like to exploit this finer-grained pricing model to cut down operational costs. However, doing so is not straightforward, as this trend has also led to an increase in customer demand for finer-grained pricing. As a result, maintaining a pre-warmed pool of nodes for elasticity is no longer cost-effective: previously, in the hourly billing model, as long as at least one customer VW used a particular node during a one-hour duration, we could charge that customer for the entire duration. However, with per-second billing, we cannot charge unused cycles on pre-warmed nodes to any particular customer. This cost-inefficiency makes a strong case for moving to a sharing-based model, where compute and ephemeral storage resources are shared across customers: in such a model we can provide elasticity by statistically multiplexing customer demands across a shared set of resources, avoiding the need to maintain a large pool of pre-warmed nodes.
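The billing argument above can be made concrete with a little arithmetic. The function below computes how many seconds of a pre-warmed node's lifetime cannot be charged to any customer under a given billing granularity; the function name and the sample numbers (a node held for one hour and used for ten minutes) are assumptions for illustration.

```python
import math


def unrecovered_seconds(held_seconds: int, used_seconds: int, granularity: int) -> int:
    """Seconds the provider holds a node for but cannot bill to a customer.

    `granularity` is the billing unit in seconds: 3600 for hourly, 1 for per-second.
    Usage is rounded up to a whole number of billing units.
    """
    billed = math.ceil(used_seconds / granularity) * granularity
    return max(0, held_seconds - billed)


# A pre-warmed node is held for one hour and a customer VW uses it for 10 minutes.
print(unrecovered_seconds(3600, 600, granularity=3600))  # -> 0    (hourly billing)
print(unrecovered_seconds(3600, 600, granularity=1))     # -> 3000 (per-second billing)
```

Under hourly billing the single customer pays for the whole hour, so the pool costs the provider nothing extra; under per-second billing the remaining 50 minutes of warm capacity are an unrecovered cost.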

In the next subsection, we highlight several technical challenges that need to be resolved to realize such a shared architecture.

The variability in resource usage over time across VWs, as shown in Figure 11, indicates that several of our customer workloads are bursty in nature. Hence, moving to a shared architecture would enable Snowflake to achieve better resource utilization via fine-grained statistical multiplexing.

Snowflake today exposes VW sizes to customers in abstract "T-shirt" sizes (small, large, XL, etc.), each representing different resource capacities. Customers are not aware of how these VWs are implemented (number of nodes used, instance types, etc.). Ideally we would like to maintain the same abstract VW interface to customers and change the underlying implementation to use shared resources instead of isolated nodes.

The challenge, however, is to achieve isolation properties close to our current architecture. In other words, the challenge is to keep the resource isolation properties even when resources are shared.

The key metric of interest from customers' point of view is query performance, that is, end-to-end query completion times. While a purely shared architecture is likely to provide good average-case performance, maintaining good performance at the tail is challenging. Tail performance is hard to guarantee.

The two key resources that need to be isolated in VWs are compute and ephemeral storage. There has been a lot of work [18, 35, 36] on compute isolation in the data center context that Snowflake could leverage. Moreover, the centralized task scheduler and uniform execution runtime in Snowflake make the problem easier than that of isolating compute in general-purpose clusters. Here, we instead focus on the problem of isolating memory and storage, which has only recently started to receive attention in the research community [25].

Note: because compute isolation has already been studied extensively, the focus here is on isolating memory and storage.

The goal here is to design a shared ephemeral storage system (using both memory and SSDs) that supports fine-grained elasticity without sacrificing isolation properties across tenants.

With respect to sharing and isolation of ephemeral storage, we outline two key challenges. First, since our ephemeral storage system multiplexes both cached persistent data and intermediate data, both of these entities need to be jointly shared while ensuring cross-tenant isolation. While Snowflake could leverage techniques from existing literature [11, 26] for sharing cache, we need a mechanism that is additionally aware of the co-existence of intermediate data. Unfortunately, predicting the effective lifetime of cache entries is difficult. Evicting idle cache entries from tenants and providing them to other tenants while ensuring hard isolation is not possible, as we cannot predict when a tenant will next access the cache entry. Some past works [11, 33] have used techniques like idle memory taxation to deal with this issue. We believe there is more work to be done, both in defining more reasonable isolation guarantees and designing lifetime-aware cache sharing mechanisms that can provide such guarantees.

The second challenge is that of achieving elasticity without cross-tenant interference: scaling up the shared ephemeral storage system capacity in order to meet the demands of a particular customer should not impact other tenants sharing the system. For example, if we were to naively use Snowflake's current ephemeral storage system, isolation properties would be trivially violated. Since all cache entries in Snowflake are consistently hashed onto the same global address space, scaling up the ephemeral storage system capacity would end up triggering the lazy consistent hashing mechanism for all tenants. This may result in multiple tenants seeing increased cache misses, resulting in degraded performance. Resolving this challenge would require the ephemeral storage system to provide private address spaces to each individual tenant, and upon scaling of resources, to reorganize data only for those tenants that have been allocated additional resources.
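One possible reading of "private address spaces" is a separate hash ring per tenant, so that adding capacity for one tenant leaves every other tenant's file placement untouched. The sketch below illustrates that reading only; the class `TenantRings` and its design are hypothetical and are not proposed in the paper.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class TenantRings:
    """One consistent hashing ring per tenant (an assumed design, for illustration)."""

    def __init__(self, vnodes: int = 50):
        self.vnodes = vnodes
        self._rings = {}  # tenant -> sorted list of (ring point, node)

    def add_node(self, tenant: str, node: str) -> None:
        """Give `tenant` capacity on `node`; other tenants' rings are not touched."""
        ring = self._rings.setdefault(tenant, [])
        ring.extend((_hash(f"{tenant}/{node}#{i}"), node) for i in range(self.vnodes))
        ring.sort()

    def node_for(self, tenant: str, file_name: str) -> str:
        ring = self._rings[tenant]
        keys = [point for point, _ in ring]
        idx = bisect.bisect_right(keys, _hash(f"{tenant}/{file_name}")) % len(ring)
        return ring[idx][1]


rings = TenantRings()
for tenant in ("A", "B"):
    for node in ("n1", "n2"):
        rings.add_node(tenant, node)

files = [f"file{i}" for i in range(500)]
a_before = {f: rings.node_for("A", f) for f in files}
b_before = {f: rings.node_for("B", f) for f in files}

rings.add_node("A", "n3")  # scale up tenant A only

a_after = {f: rings.node_for("A", f) for f in files}
b_after = {f: rings.node_for("B", f) for f in files}
print(b_after == b_before)  # -> True: tenant B sees no remapping, hence no new misses
```

With a single global ring, the same scale-up would remap some of tenant B's files as well, which is the cross-tenant interference described above.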

Average memory utilization in our VWs is low (Figure 10); this is particularly concerning since DRAM is expensive. Although resource sharing would improve CPU and memory utilization, it is unlikely to lead to optimal utilization across both dimensions. Further, variability characteristics of CPU and memory are significantly different (Figure 11), indicating the need for independent scaling of these resources. Memory disaggregation [1, 14, 15] provides a fundamental solution to this problem. However, as discussed in §4.2, accurately provisioning resources is hard; since over-provisioning memory is expensive, we need efficient mechanisms to share disaggregated memory across multiple tenants while providing isolation guarantees.

In this section we discuss related work and other systems similar to Snowflake. Our previous work [12] discusses SQL-related aspects of Snowflake and presents related literature on those aspects. This paper focuses on the disaggregation, ephemeral storage, caching, task scheduling, elasticity and multi-tenancy aspects of Snowflake; in the related work discussion below, we primarily focus on these aspects.

SQL-as-a-Service systems.

There are several other systems that offer SQL functionality as a service in the cloud. These include Amazon Redshift [16], Aurora [4], Athena [3], Google BigQuery [30] and Microsoft Azure Synapse Analytics [24]. While there are papers that describe the design and operational experience of some of these systems, we are not aware of any prior work that undertakes a data-driven analysis of workload and system characteristics similar to ours.

Redshift [16] stores primary replicas of persistent data within compute VM clusters (S3 is only used for backup); thus, it may not be able to achieve the benefits that Snowflake achieves by decoupling compute from persistent storage.

Note: Redshift is shared-nothing; compute and storage are not separated.

Aurora [4] and BigQuery [30] (based on the architecture of Dremel [23]) decouple compute and persistent storage similar to Snowflake. Aurora, however, relies on a custom-designed persistent storage service that is capable of offloading database log processing, instead of a traditional blob store.

Note: Aurora does separate the two, but depends on a specially designed storage service.

Decoupling compute and ephemeral storage systems.

Previous work [20] makes the case for flash storage disaggregation by studying a key-value store workload from Facebook. Our observations corroborate this argument and further extend it in the context of data warehousing workloads. Pocket [22] and Locus [27] are ephemeral storage systems designed for serverless analytics applications. If we were to disaggregate compute and ephemeral storage in Snowflake, such systems would be good candidates. However, these systems do not provide fine-grained resource elasticity during the lifetime of a query. Thus, they either have to assume a priori knowledge of intermediate data sizes (for provisioning resources at the time of submitting queries), or suffer from performance degradation if such knowledge is not available in advance. As discussed in §4.1, predicting intermediate data sizes is extremely hard. It would be nice to extend these systems to provide fine-grained elasticity and cross-query isolation. Technologies for high-performance access to remote flash storage [13, 17, 21] would also be integral to efficiently realize decoupling of compute and ephemeral storage system.

Multi-tenant resource sharing.

ESX server [33] pioneered techniques for multi-tenant memory sharing in the virtual machine context, including ballooning and idle-memory taxation.

Memshare [11] considers multi-tenant sharing of cache capacity in DRAM caches in the single machine context, sharing un-reserved capacity among applications in a way that maximizes hit rate.

FairRide [26] similarly considers multi-tenant cache sharing in the distributed setting while taking into account sharing of data between tenants.

Mechanisms for sharing and isolation of cache resources similar to the ones used in these works would be important in enabling Snowflake to adopt a resource shared architecture.

As discussed previously, it would be interesting to extend these mechanisms to make them aware of the different characteristics and requirements of intermediate and persistent data.

THE END