Shared-nothing architectures have been the foundation of traditional query execution engines and data warehousing systems. In such architectures, persistent data (e.g., customer data stored as tables) is partitioned across a set of compute nodes, each of which is responsible only for its local data. Such shared-nothing architectures have enabled query execution engines that scale well, provide cross-job isolation and good data locality, resulting in high performance for a variety of workloads.
Problems with shared-nothing
Recent changes in data and workloads have made these problems worse.
Traditional data warehousing systems were designed to operate on recurring queries on data with predictable volume and rate, e.g., data coming from within the organization: transactional systems, enterprise resource planning applications, customer relationship management applications, etc. The situation has changed significantly. Today, an increasingly large fraction of data comes from less controllable, external sources (e.g., application logs, social media, web applications, mobile systems, etc.), resulting in ad-hoc, time-varying, and unpredictable query workloads. For such workloads, shared-nothing architectures beget high cost, inflexibility, poor performance and inefficiency, which hurts production applications and cluster deployments.
To address the problems above, Snowflake is proposed; the key insight is to separate compute from storage.
To overcome these limitations, we designed Snowflake, an elastic, transactional query execution engine with SQL support comparable to state-of-the-art databases. The key insight in the Snowflake design is that the aforementioned limitations of shared-nothing architectures are rooted in tight coupling of compute and storage, and the solution is to decouple the two. Snowflake thus disaggregates compute from persistent storage; customer data is stored in a persistent data store (e.g., Amazon S3 [5], Azure Blob Storage [8], etc.) that provides high availability and on-demand elasticity. Compute elasticity is achieved using a pool of pre-warmed nodes that can be assigned to customers on an on-demand basis.
The paper is organized around the following topics: ephemeral storage, query scheduling, elasticity, and multi-tenancy.
The Snowflake system has now been active for several years and today serves thousands of customers executing millions of queries over petabytes of data on a daily basis. This paper describes the Snowflake system design, with a particular focus on ephemeral storage system design, query scheduling, elasticity and efficiently supporting multi-tenancy.
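The paper analyzes a 14-day query dataset and reports the following findings: the mix of query types; intermediate result sizes that differ by several orders of magnitude across queries; a very small local store used as a cache still achieves a good hit rate; good scalability; peak resource utilization is high while average utilization is low.
We also use statistics collected during execution of 70 million queries over a period of 14 contiguous days in February 2018 to present a detailed study of network, compute and storage characteristics in Snowflake. Our key findings are: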
Three directions for future research are proposed: decoupling compute from ephemeral storage; deeper storage hierarchies; sub-second billing.
Our study both corroborates exciting ongoing research directions in the community and highlights several interesting avenues for future research:
- Decoupling of compute and ephemeral storage:
Snowflake decouples compute from persistent storage to achieve elasticity.
However, currently, compute and ephemeral storage are still tightly coupled.
As we show in §4, the ratio of compute capacity to ephemeral storage capacity in our production clusters can vary by several orders of magnitude, leading to either underutilization of CPU or thrashing of ephemeral storage for ad-hoc query processing workloads.
To that end, recent academic work on decoupling compute from ephemeral storage [22, 27] is of extreme interest.
However, more work is needed in ephemeral storage system design, especially in terms of providing fine-grained elasticity, multi-tenancy, and cross-query isolation (§4, §7).
- Deep storage hierarchy:
The Snowflake ephemeral storage system, similar to recent work on compute-storage disaggregation [14, 15],
uses caching of frequently read persistent data both to reduce network traffic and to improve data locality.
However, existing mechanisms for improving caching and data locality were designed for two-tier storage systems (memory as the main tier and HDD/SSD as the second tier).
As we discuss in §4, the storage hierarchy in our production clusters is getting increasingly deep, and new mechanisms are needed that can efficiently exploit the emerging deep storage hierarchy.
- Pricing at sub-second timescales:
Snowflake achieves compute elasticity at fine-grained timescales by serving customers using a pool of pre-warmed nodes.
This was cost-efficient with cloud pricing at hourly granularity.
However, most cloud providers have recently transitioned to sub-second pricing [6], leading to new technical challenges in efficiently achieving resource elasticity and resource sharing across multiple tenants.
Resolving these challenges may require design decisions and tradeoffs that differ from those in Snowflake’s current design (§7).
A caveat on generalization: the findings are specific to this system, this workload, and this infrastructure.
Our study has an important caveat.
It focuses on a specific system (Snowflake), a specific workload (SQL queries), and a specific cloud infrastructure (S3).
While our system is large-scale, has thousands of customers executing millions of queries, and runs on top of one of the most prominent infrastructures, it is nevertheless limited.
We leave to future work an evaluation of whether our study and observations generalize to other systems, workloads and infrastructures.
However, we are hopeful that, just as prior workload studies on network traffic characteristics [9] and cloud workloads [28]
(each of which also focused on a specific system implementation running a specific workload on a specific infrastructure) fueled and aided research in the past, our study and publicly released data will be useful for the community.
We provide an overview of the Snowflake design.
Snowflake treats persistent and intermediate data differently;
we describe these in §2.1, followed by a high-level overview of the Snowflake architecture (§2.2) and the query execution process (§2.3).
The design starts from the kinds of state that must be stored: persistent data, intermediate data, and metadata.
Like most query execution engines and data warehousing systems, Snowflake has three forms of application state:
Persistent data is customer data stored as tables in the database.
Each table may be read by many queries, over time or even concurrently.
These tables are thus long-lived and require strong durability and availability guarantees.
Intermediate data is generated by query operators (e.g., joins) and is usually consumed by nodes participating in executing that query.
Intermediate data is thus short-lived.
Moreover, to avoid nodes being blocked on intermediate data access, low-latency, high-throughput access to intermediate data is preferred over strong durability guarantees.
Indeed, in case of failures happening during the (short) lifetime of intermediate data, one can simply rerun the part of the query that produced it.
Metadata includes object catalogs, the mapping from database tables to corresponding files in persistent storage, statistics, transaction logs, locks, etc.
This paper primarily focuses on persistent and intermediate data, as the volume of metadata is typically relatively small and does not introduce interesting systems challenges.
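Figure 1 shows the high-level architecture for Snowflake.
The architecture has four layers:
a service layer (control plane, SQL, scheduling),
a compute layer, built around a pool of pre-warmed EC2 instances,
an ephemeral storage layer: a purpose-built distributed store, co-located with the compute nodes, that needs no repartitioning when nodes are added or removed,
a persistent storage layer.
It has four main components: a centralized service for orchestrating end-to-end query execution, a compute layer, a distributed ephemeral storage system and a persistent data store.
We describe each of these below.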
Centralized Control via Cloud Services.
All Snowflake customers interact with and submit queries to a centralized layer called Cloud Services (CS) [12].
This layer is responsible for access control, query optimization and planning, scheduling, transaction management, concurrency control, etc.
CS is designed and implemented as a multi-tenant and long-lived service with sufficient replication for high availability and scalability.
Thus, failure of individual service nodes does not cause loss of state or availability, though some of the queries may fail and be re-executed transparently.
Elastic Compute via Virtual Warehouse abstraction.
Customers are given access to computational resources in Snowflake through the abstraction of a Virtual Warehouse (VW).
Each VW is essentially a set of AWS EC2 instances on top of which customer queries execute in a distributed fashion.
Customers pay for compute time based on the VW size.
Each VW can be elastically scaled on an on-demand basis upon customer request.
To support elasticity at fine-grained timescales (e.g., tens of seconds), Snowflake maintains a pool of pre-warmed EC2 instances;
upon receiving a request, we simply add/remove EC2 instances to/from that VW (in case of addition, we are able to support most requests directly from our pool of pre-warmed instances, thus avoiding instance startup time).
Each VW may run multiple concurrent queries.
In fact, many of our customers run multiple VWs (e.g., one for data ingestion, and one for executing OLAP queries).
Elastic Local Ephemeral Storage.
Intermediate data has different performance requirements compared to persistent data (§2.1).
Unfortunately, existing persistent data stores do not meet these requirements
(e.g., S3 does not provide the desired low-latency and high-throughput properties needed for intermediate data to ensure minimal blocking of compute nodes);
hence, we built a distributed ephemeral storage system custom-designed to meet the requirements of intermediate data in our system.
The system is co-located with compute nodes in VWs, and is explicitly designed to automatically scale as nodes are added or removed.
We provide more details in §4 and §6, but note here that as nodes are added and removed, our ephemeral storage system does not require data repartitioning or reshuffling (thus alleviating one of the core limitations of shared-nothing architectures).
Each VW runs its own independent distributed ephemeral storage system, which is used only by queries running on that particular VW.
Elastic Remote Persistent Storage.
Snowflake stores all its persistent data in a remote, disaggregated, persistent data store.
We store persistent data in S3 despite the relatively modest latency and throughput performance because of S3’s elasticity, high availability and durability properties. (Note: “modest” is a polite way of saying poor; this sentence sums up the strengths and weaknesses of S3.)
S3 supports storing immutable files: a file can only be overwritten in full, and even append operations are not allowed.
However, S3 supports read requests for parts of a file.
To store tables in S3, Snowflake partitions them horizontally into large, immutable files that are equivalent to blocks in traditional database systems [12].
Within each file, the values of each individual attribute or column are grouped together and compressed, as in PAX [2]. (Note: files are organized PAX-style in row groups, with a structure similar to ORC and Parquet.)
Each file has a header that stores the offset of each column within the file, enabling us to use the partial read functionality of S3 to read only the columns that are needed for query execution.
All VWs belonging to the same customer have access to the same shared tables via the remote persistent store, and hence do not need to physically copy data from one VW to another.
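The header-plus-partial-read idea can be illustrated with a small sketch. The file layout here (a 4-byte length, a JSON header mapping each column to an offset and length, then compressed column chunks) is a hypothetical format chosen for illustration, not Snowflake's actual file format, and `range_read` merely stands in for a ranged read against an object store.

```python
import json
import struct
import zlib


def write_pax_file(columns: dict[str, list[int]]) -> bytes:
    """Pack columns into one blob: [4-byte header length][JSON header][column chunks]."""
    chunks, header, offset = [], {}, 0
    for name, values in columns.items():
        chunk = zlib.compress(json.dumps(values).encode())
        header[name] = (offset, len(chunk))  # offset is relative to the data area
        chunks.append(chunk)
        offset += len(chunk)
    header_bytes = json.dumps(header).encode()
    return struct.pack(">I", len(header_bytes)) + header_bytes + b"".join(chunks)


def range_read(blob: bytes, start: int, length: int) -> bytes:
    """Hypothetical stand-in for a ranged read of an immutable remote file."""
    return blob[start:start + length]


def read_column(blob: bytes, name: str) -> list[int]:
    """Read the header, then fetch only the bytes of the requested column."""
    (header_len,) = struct.unpack(">I", range_read(blob, 0, 4))
    header = json.loads(range_read(blob, 4, header_len))
    offset, length = header[name]
    data_start = 4 + header_len
    return json.loads(zlib.decompress(range_read(blob, data_start + offset, length)))
```

A query that needs only one column issues two small reads (header, then that column's byte range) instead of fetching the whole file.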
Restating the earlier point: why an ephemeral storage system is needed.
Snowflake uses a custom-designed distributed storage system for management and exchange of intermediate data, due to two limitations in existing persistent data stores [5, 8].
First, they fall short of providing the necessary latency and throughput performance to avoid compute tasks being blocked on intermediate data exchange.
Second, they provide much stronger availability and durability semantics than what is needed for intermediate data.
Our ephemeral storage system allows us to overcome both of these limitations.
Tasks executing query operations (e.g., joins) on a given compute node write intermediate data locally, and tasks consuming the intermediate data read it either locally or remotely over the network
(depending on the node where the task is scheduled, §5).
The basic design choice: besides memory, intermediate data may also be placed on SSD or in S3, simply because it does not always fit.
We made two important design decisions in our ephemeral storage system.
First, rather than designing a pure in-memory storage system, we decided to use both memory and local SSDs: tasks write as much intermediate data as possible to their local memory;
when memory is full, intermediate data is spilled to local SSDs.
Our rationale is that while purely in-memory systems can achieve superior performance when the entire data fits in memory, they are too restrictive to handle the variety of our target workloads.
Figure 3 (left) shows that there are queries that exchange hundreds of gigabytes or even terabytes of intermediate data; for such queries, it is hard to fit all intermediate data in main memory.
The second design decision was to allow intermediate data to spill into the remote persistent data store in case the local SSD capacity is exhausted.
Spilling intermediate data to S3, instead of to other compute nodes, is preferable for a number of reasons:
it does not require keeping track of intermediate data location, it alleviates the need for explicitly handling out-of-memory or out-of-disk errors for large queries, and overall, it allows us to keep our ephemeral storage system thin and highly performant.
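The spill order described above (memory first, then local SSD, then the remote store) can be sketched as simple capacity accounting. The class and its byte-counting model are assumptions made for illustration; the real system moves actual data, and its interfaces are not described in the text.

```python
class SpillingBuffer:
    """Tracks where intermediate bytes land: memory, then local SSD, then remote."""

    def __init__(self, memory_capacity: int, ssd_capacity: int):
        self.capacity = {"memory": memory_capacity, "ssd": ssd_capacity}
        self.used = {"memory": 0, "ssd": 0, "remote": 0}

    def write(self, nbytes: int) -> dict:
        """Place nbytes in the fastest tier with room; return bytes placed per tier."""
        placed = {}
        remaining = nbytes
        for tier in ("memory", "ssd"):
            take = min(remaining, self.capacity[tier] - self.used[tier])
            if take > 0:
                self.used[tier] += take
                placed[tier] = take
                remaining -= take
        if remaining > 0:
            # The remote store is treated as unbounded, so a write never fails
            # with an out-of-memory or out-of-disk error.
            self.used["remote"] += remaining
            placed["remote"] = remaining
        return placed
```

Because the last tier always accepts the overflow, the writer needs no error path for large queries, which is the simplification the text credits to spilling to S3.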
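Because a query's resource usage and intermediate result size cannot be estimated, it is hard to guarantee that a sudden burst of intermediate data will not exhaust local resources; the only option is to spill to S3.
Solving this requires first decoupling the compute layer from the intermediate data layer, so that each can be provisioned independently to match query demands; the intermediate data layer must also support fine-grained elasticity.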
Future Directions. For performance-critical queries, we want intermediate data to fit entirely in memory, or at least in SSDs, and not spill to S3.
This requires accurate resource provisioning. However, provisioning CPU, memory and storage resources while achieving high utilization turns out to be challenging for two reasons.
The first reason is the limited number of available node instances (each providing a fixed amount of CPU, memory and storage resources), and significantly more diverse resource demands across queries.
For instance, Figure 3 (center) shows that, across queries, the ratio of compute requirements to intermediate data sizes can vary by as much as six orders of magnitude.
The available node instances simply do not provide enough options to accurately match node hardware resources with such diverse query demands.
Second, even if we could match node hardware resources with query demands, accurately provisioning memory and storage resources requires a priori knowledge of the intermediate data size generated by the query.
However, our experience is that predicting the volume of intermediate data generated by a query is hard, or even impossible, for most queries.
As shown in Figure 3, intermediate data sizes not only vary over multiple orders of magnitude across queries, but also have little or no correlation with the amount of persistent data read or the expected execution time of the query.
To resolve the first challenge, we could decouple compute from ephemeral storage.
This would allow us to match available node resources with query resource demands by independently provisioning individual resources.
However, the challenge of unpredictable intermediate data sizes is harder to resolve.
For such queries, simultaneously achieving high performance and high resource utilization would require both decoupling of compute and ephemeral storage, and efficient techniques for fine-grained elasticity of the ephemeral storage system.
We discuss the latter in more detail in §6.
Intermediate data is short-lived: large at peak but small on average, so it can share the local disk with the cache.
The sharing is opportunistic, with intermediate data taking priority.
One of the key observations we made during early phases of ephemeral storage system design is that intermediate data is short-lived.
Thus, while storing intermediate data requires large memory and storage capacity at peak, the demand is low on average.
This allows statistical multiplexing of our ephemeral storage system capacity between intermediate data and frequently accessed persistent data.
This improves performance since (1) queries in data warehouse systems exhibit highly skewed access patterns over persistent data [10]; and
(2) ephemeral storage system performance is significantly better than that of (existing) remote persistent data stores.
Snowflake enables statistical multiplexing of ephemeral storage system capacity between intermediate data and persistent data by “opportunistically” caching frequently accessed persistent data files,
where opportunistically refers to the fact that intermediate data storage is always prioritized over caching persistent data files.
However, a persistent data file cannot be cached on just any node: Snowflake assigns input file sets for the customer to nodes using consistent hashing over persistent data file names.
A file can only be cached at the node to which it consistently hashes; each node uses a simple LRU policy to decide caching and eviction of persistent data files.
Given the performance gap between our ephemeral storage system and the remote persistent data store, such opportunistic caching of persistent data files improves the execution time for many queries in Snowflake.
Furthermore, since storage of intermediate data is always prioritized over caching of persistent data files, such an opportunistic performance improvement in query execution time can be achieved without impacting performance for intermediate data access.
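A minimal sketch of the opportunistic policy on a single node follows, assuming a byte-count capacity model and hypothetical method names. It shows only the two rules stated above: intermediate data always takes priority, and cached persistent files are evicted in LRU order.

```python
from collections import OrderedDict


class NodeStorage:
    """One node's ephemeral capacity, shared by intermediate data and a file cache."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.intermediate = 0
        self.cache = OrderedDict()  # file name -> size, least recently used first

    def _free(self) -> int:
        return self.capacity - self.intermediate - sum(self.cache.values())

    def write_intermediate(self, nbytes: int) -> int:
        """Evict cached files (LRU first) to make room for intermediate data.

        Returns the number of bytes that did not fit locally and would spill.
        """
        while self._free() < nbytes and self.cache:
            self.cache.popitem(last=False)
        stored = min(nbytes, self._free())
        self.intermediate += stored
        return nbytes - stored

    def release_intermediate(self, nbytes: int) -> None:
        """Intermediate data is short-lived; free its space when the query ends."""
        self.intermediate = max(0, self.intermediate - nbytes)

    def read_file(self, name: str, size: int) -> bool:
        """Return True on a cache hit; on a miss, cache only into spare capacity."""
        if name in self.cache:
            self.cache.move_to_end(name)
            return True
        while self._free() < size and self.cache:
            self.cache.popitem(last=False)  # cached files may evict each other...
        if self._free() >= size:            # ...but never intermediate data
            self.cache[name] = size
        return False
```

Note that `read_file` never touches `self.intermediate`: a file that does not fit alongside live intermediate data is simply not cached.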
Cached files are assigned to a specific node via consistent hashing, and a write-through cache keeps them consistent.
When nodes are added or removed, lazy consistent hashing avoids reshuffling and moving data.
Maintaining the right system semantics during opportunistic caching of persistent data files requires a careful design.
First, to ensure data consistency, the “view” of persistent files in the ephemeral storage system must be consistent with that in the remote persistent data store.
We achieve this by forcing the ephemeral storage system to act as a write-through cache for persistent data files.
Second, consistent hashing of persistent data files onto nodes in a naive way requires reshuffling of cached data when VWs are elastically scaled.
We implement a lazy consistent hashing optimization in our ephemeral storage system that avoids such data reshuffling altogether; we describe this when we discuss Snowflake elasticity in §6.
Because the cache is write-through, an extra local copy is written: every write of persistent data also updates the cache synchronously.
Persistent data being opportunistically cached in the ephemeral storage system means that some subset of persistent data access requests can be served by the ephemeral storage system (depending on whether or not there is a cache hit).
Figure 4 shows the persistent data I/O traffic distribution, in terms of fraction of bytes, between the ephemeral storage system and the remote persistent data store.
The write-through nature of our ephemeral storage system results in the amount of data written to ephemeral storage being roughly of the same magnitude
as the amount of data written to the remote persistent data store (they are not always equal because storage of intermediate data is prioritized over caching of persistent data).
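The write-through behavior can be sketched as follows. The class, the `cache_has_room` flag, and the use of dictionaries as stand-ins for both stores are illustrative assumptions; the flag models the case where intermediate data has claimed the local capacity.

```python
class WriteThroughCache:
    """Every write goes to the remote store; the local copy is best effort."""

    def __init__(self, remote: dict):
        self.remote = remote  # stand-in for the remote persistent data store
        self.local = {}       # stand-in for the node's ephemeral storage

    def write(self, name: str, data: bytes, cache_has_room: bool = True) -> None:
        self.remote[name] = data        # the remote store is always updated
        if cache_has_room:
            self.local[name] = data     # local copy written in the same operation
        else:
            self.local.pop(name, None)  # never keep a stale local copy

    def read(self, name: str) -> bytes:
        if name in self.local:
            return self.local[name]     # cache hit
        return self.remote[name]        # cache miss: fall back to the remote store
```

Since the remote store sees every write, local and remote bytes written stay close to each other and diverge only when the local copy is skipped.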
The cache works well even though the local disk is small.
Even though our ephemeral storage capacity is significantly lower than that of a customer’s persistent data (around 0.1% on average),
skewed file access distributions and temporal file access patterns common in data warehouses [7] enable reasonably high cache hit rates (the average hit rate is close to 80% for read-only queries and around 60% for read-write queries).
Figure 5 shows the hit rate distributions across queries. The median hit rates are even higher.
Future directions:
how to balance the limited local disk between intermediate data and the cache;
as NVM and remote ephemeral storage emerge, the storage hierarchy will get deeper, calling for new multi-tier caching designs.
Future Directions. Figure 4 and Figure 5 suggest that more work is needed on caching.
In addition to locality of reference in access patterns, cache hit rate also depends on the effective cache size available to the query relative to the amount of persistent data accessed by the query.
The effective cache size, in turn, depends on both the VW size and the volume of intermediate data generated by concurrently executing queries.
Our preliminary analysis has not led to any conclusive observations on the impact of the above two factors on the observed cache hit rates, and a more fine-grained analysis is needed to understand the factors that impact cache hit rates.
We highlight two additional technical problems.
First, since end-to-end query performance depends on both the cache hit rate for persistent data files and the I/O throughput for intermediate data, it is important to optimize how the ephemeral storage system splits capacity between the two.
Although we currently use the simple policy of always prioritizing intermediate data, it may not be the optimal policy with respect to end-to-end performance objectives (e.g., average query completion time across all queries from the same customer).
For example, it may be better to prioritize caching a persistent data file that is going to be accessed by many queries over intermediate data that is accessed by only one.
It would be interesting to explore extensions to known caching mechanisms that optimize for end-to-end query performance objectives [7] to take intermediate data into account.
Second, existing caching mechanisms were designed for two-tier storage systems (memory as the main tier and HDD/SSD as the second tier).
In Snowflake, we already have three tiers of hierarchy with compute-local memory, the ephemeral storage system and the remote persistent data store;
as emerging non-volatile memory devices are deployed in the cloud and as recent designs for remote ephemeral storage systems mature [22], the storage hierarchy in the cloud will get increasingly deep.
Snowflake uses traditional two-tier mechanisms: each node implements a local LRU policy for evictions from local memory to local SSD, and an independent LRU policy for evictions from local SSD to the remote persistent data store.
However, to efficiently exploit the deepening storage hierarchy, we need new caching mechanisms that can efficiently coordinate caching across multiple tiers.
We believe many of the above technical challenges are not specific to Snowflake, and would apply more broadly to any distributed application built on top of disaggregated storage.
We now describe the query execution process in Snowflake.
Customers submit their queries to the Cloud Services (CS) for execution on a specific VW.
CS performs query parsing, query planning and optimization, and creates a set of tasks to be scheduled on compute nodes of the VW.
Locality-aware task scheduling.
To fully exploit the ephemeral storage system, Snowflake colocates each task with the persistent data files that it operates on using a locality-aware scheduling mechanism (recall that these files may be cached in the ephemeral storage system).
Specifically, recall that Snowflake assigns persistent data files to compute nodes using consistent hashing over table file names.
Thus, for a fixed VW size, each persistent data file is cached on a specific node.
Snowflake schedules the task that operates on a persistent data file to the node to which its file consistently hashes.
As a result of this scheduling scheme, query parallelism is tightly coupled with consistent hashing of files on nodes: a query is scheduled for cache locality and may be distributed across all the nodes in the VW.
For instance, consider a customer that has 1 million files worth of persistent data, and is running a VW with 10 nodes.
Consider two queries, where the first query operates on 100 files, and the second query operates on 100,000 files; then, with high likelihood, both queries will run on all the 10 nodes because files are consistently hashed onto all the 10 nodes.
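The coupling between file placement and task placement can be sketched with a textbook hash ring. The ring construction (MD5, 64 virtual points per node) and the function names are assumptions for illustration; the text does not describe Snowflake's actual hashing scheme.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Consistent hash ring; each node owns several points to even out the load."""

    def __init__(self, nodes, vnodes: int = 64):
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def node_for(self, file_name: str) -> str:
        # First ring point clockwise from the file's hash, wrapping at the end.
        idx = bisect.bisect(self._keys, _hash(file_name)) % len(self._ring)
        return self._ring[idx][1]


def schedule(ring: HashRing, files) -> dict:
    """Assign each file's task to the node the file hashes to (cache locality)."""
    plan = {}
    for f in files:
        plan.setdefault(ring.node_for(f), []).append(f)
    return plan


nodes = [f"node{i}" for i in range(10)]
ring = HashRing(nodes)
small_query = schedule(ring, [f"file{i}" for i in range(100)])
print(len(small_query), "of", len(nodes), "nodes used by the 100-file query")
```

Even the 100-file query is likely to touch most or all nodes, because placement follows the hash of each file name rather than the size of the query.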
Work stealing. It is known that consistent hashing can lead to imbalanced partitions [19]. (Note: this is a very common technique; idle nodes steal tasks and run them.)
In order to avoid overloading of nodes and improve load balance, Snowflake uses work stealing, a simple optimization that allows a node to steal a task from another node
if the expected completion time of the task (sum of execution time and waiting time) is lower at the new node.
When such work stealing occurs, the persistent data files needed to execute the task are read from the remote persistent data store rather than from the node on which the task was originally scheduled.
This avoids increasing load on an already overloaded node where the task was originally scheduled (note that work stealing happens only when a node is overloaded).
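The stealing rule compares expected completion times, which can be sketched as below. The `NodeLoad` type, the time estimates, and the function name are hypothetical; how Snowflake estimates waiting and execution times is not described in the text.

```python
from dataclasses import dataclass


@dataclass
class NodeLoad:
    name: str
    queue_wait: float  # seconds a newly assigned task would wait before starting


def pick_node(home: NodeLoad, others: list[NodeLoad],
              exec_local: float, exec_remote: float) -> str:
    """Return the node with the lowest expected completion time for one task.

    exec_local:  estimated run time at the home node (file possibly cached there).
    exec_remote: estimated run time on any other node, which must read the file
                 from the remote persistent store instead of from the home node.
    """
    best_name, best_time = home.name, home.queue_wait + exec_local
    for node in others:
        completion = node.queue_wait + exec_remote
        if completion < best_time:  # steal only if strictly faster
            best_name, best_time = node.name, completion
    return best_name
```

With a home node that has a 10-second queue, an idle peer wins even though its run time is longer; with an idle home node, the task stays put.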
Scheduling has two extremes: fully colocating tasks with their data, which avoids reading persistent data over the network but requires transferring intermediate data; or placing all tasks together, which avoids transferring intermediate data during execution.
Future Directions. Schedulers can place tasks onto nodes using two extreme options:
one is to colocate tasks with their cached persistent data, as in our current implementation.
As discussed in the example above, this may end up scheduling all queries on all nodes in the VW;
while such a scheduling policy minimizes network traffic for reading persistent data, it may lead to increased network traffic for intermediate data exchange.
The other extreme is to place all tasks on a single node. This would obviate the need for network transfers for intermediate data exchange but would increase network traffic for persistent data reads.
Neither of these extremes may be the right choice for all queries.
It would be interesting to co-design query schedulers that would pick just the right set of nodes to obtain a sweet spot between the two extremes, and then schedule individual tasks onto these nodes.
In this section, we discuss how the Snowflake design achieves one of its core goals: resource elasticity, that is, scaling of compute and storage resources on an on-demand basis.
Disaggregating compute from persistent storage enables Snowflake to independently scale compute and persistent storage resources.
Storage elasticity is offloaded to persistent data stores [5]; compute elasticity, on the other hand, is achieved using a pre-warmed pool of nodes that can be added/removed to/from customer VWs on an on-demand basis.
By keeping a pre-warmed pool of nodes, Snowflake is able to provide compute elasticity at the granularity of tens of seconds.
One of the challenges that Snowflake had to resolve in order to achieve elasticity efficiently is related to data management in the ephemeral storage system.
Recall that our ephemeral storage system opportunistically caches persistent data files; each file can be cached only on the node to which it consistently hashes within the VW.
The problem is similar to that of shared-nothing architectures: any fixed partitioning mechanism (in our case, consistent hashing) requires large amounts of data to be reshuffled upon scaling of nodes;
moreover, since the very same set of nodes is also responsible for query processing, the system observes a significant performance impact during the scaling process.
Snowflake resolves this challenge using a lazy consistent hashing mechanism that completely avoids any reshuffling of data upon elastic scaling of nodes by exploiting the fact that a copy of cached data is stored at the remote persistent data store.
Specifically, Snowflake relies on the caching mechanism to eventually “converge” to the right state.
“Lazy” means that when a node is added, the cache is not reshuffled; the next time Task 6 is assigned to the new node, it reads file 6 from the remote store, and file 6 is cached on the new node at that point.
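The lazy behavior can be sketched on top of a textbook hash ring: adding a node changes only the ring, and the new node's cache fills on demand from the remote store. The `Warehouse` class, the ring construction, and the read counter are assumptions for illustration, and eviction of the now-stale copies on old nodes (LRU in the text) is not modeled.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Warehouse:
    def __init__(self, nodes, remote_store: dict):
        self.remote = remote_store            # file name -> bytes, always complete
        self.caches = {n: {} for n in nodes}  # per-node cache of persistent files
        self.remote_reads = 0
        self._rebuild()

    def _rebuild(self, vnodes: int = 64) -> None:
        self._ring = sorted((_hash(f"{n}#{i}"), n)
                            for n in self.caches for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def owner(self, file_name: str) -> str:
        idx = bisect.bisect(self._keys, _hash(file_name)) % len(self._ring)
        return self._ring[idx][1]

    def add_node(self, node: str) -> None:
        """Scale up: only the ring changes; no cached data is moved."""
        self.caches[node] = {}
        self._rebuild()

    def read(self, file_name: str):
        """Run the task at the file's current owner, filling its cache on a miss."""
        node = self.owner(file_name)
        cache = self.caches[node]
        if file_name not in cache:
            cache[file_name] = self.remote[file_name]  # lazy fill from remote
            self.remote_reads += 1
        return node, cache[file_name]
```

After `add_node`, only files whose owner changed incur one extra remote read each, and only when a task next touches them; nothing is copied between nodes.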
In the current design, each VW uses its own dedicated set of nodes and ephemeral storage, which gives good isolation.
The problem is that resource utilization is low: customers' peak loads are short and staggered, so good utilization requires sharing resources. This is the tradeoff between isolation and utilization.
Snowflake currently supports multi-tenancy through the VW abstraction.
Each VW operates on an isolated set of nodes, with its own ephemeral storage system.
This allows Snowflake to provide performance isolation to its customers.
In this section, we present a few system-wide characteristics for our VWs and use these to motivate an alternate sharing-based architecture for Snowflake.
The VW architecture in Snowflake leads to the traditional performance isolation versus utilization tradeoff.
Figure 10 (top four) shows that our VWs achieve fairly good, but not ideal, average CPU utilization; however, other resources are usually underutilized on average.
Figure 11 provides some reasons for the low average resource utilization in Figure 10 (top four):
the figure shows the variability of resource usage across VWs; specifically, we observe that for up to 30% of VWs, the standard deviation of CPU usage over time is as large as the mean itself.
This results in underutilization as customers tend to provision VWs to meet peak demand.
In terms of peak utilization, several of our VWs experience periods of heavy utilization, but such high-utilization periods are not necessarily synchronized across VWs.
An example of this is shown in Figure 10 (bottom two), where we see that over a period of two hours, there are several points when one VW’s utilization is high while the other VW’s utilization is simultaneously low.
While we were aware of this performance isolation versus utilization tradeoff when we designed Snowflake, recent trends are pushing us to revisit this design choice.
Specifically, maintaining a pool of pre-warmed instances was cost-efficient when infrastructure providers used to charge at an hourly granularity;
however, the recent move to per-second pricing [6] by all major cloud infrastructure providers has raised interesting challenges.
From our (provider’s) perspective, we would like to exploit this finer-grained pricing model to cut down operational costs.
However, doing so is not straightforward, as this trend has also led to an increase in customer demand for finer-grained pricing.
As a result, maintaining a pre-warmed pool of nodes for elasticity is no longer cost-effective:
previously, in the hourly billing model, as long as at least one customer VW used a particular node during a one-hour duration, we could charge that customer for the entire duration.
However, with per-second billing, we cannot charge unused cycles on pre-warmed nodes to any particular customer.
This cost-inefficiency makes a strong case for moving to a sharing-based model, where compute and ephemeral storage resources are shared across customers:
in such a model we can provide elasticity by statistically multiplexing customer demands across a shared set of resources, avoiding the need to maintain a large pool of pre-warmed nodes.
In the next subsection, we highlight several technical challenges that need to be resolved to realize such a shared architecture.
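The billing argument above comes down to simple arithmetic, sketched here with made-up numbers. The function and its model (the provider holds the node for a full one-hour window, and customer usage is rounded up to the billing granularity) are illustrative assumptions, not figures from the paper.

```python
def unbilled_seconds(used_seconds: int, billing_granularity: int,
                     window: int = 3600) -> int:
    """Seconds in the window that the provider holds a node but cannot bill.

    Usage is rounded up to the billing granularity and capped at the window.
    """
    units = -(-used_seconds // billing_granularity)  # ceiling division
    billed = min(window, units * billing_granularity)
    return window - billed


# A node used for 10 minutes within an hour (hypothetical numbers):
print(unbilled_seconds(600, billing_granularity=3600))  # hourly billing
print(unbilled_seconds(600, billing_granularity=1))     # per-second billing
```

Under hourly billing the whole hour is charged to the customer who used the node; under per-second billing the remaining 50 minutes of the pre-warmed node are charged to no one.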
The variability in resource usage over time across VWs, as shown in Figure 11, indicates that several of our customer workloads are bursty in nature.
Hence, moving to a shared architecture would enable Snowflake to achieve better resource utilization via fine-grained statistical multiplexing.
Snowflake today exposes VW sizes to customers in abstract “T-shirt” sizes (small, large, XL, etc.), each representing different resource capacities.
Customers are not aware of how these VWs are implemented (number of nodes used, instance types, etc.).
Ideally, we would like to maintain the same abstract VW interface to customers and change the underlying implementation to use shared resources instead of isolated nodes.
The challenge, however, is to achieve isolation properties close to our current architecture. (Note: the challenge is to keep resource isolation even when resources are shared.)
The key metric of interest from the customers’ point of view is query performance, that is, end-to-end query completion times.
While a purely shared architecture is likely to provide good average-case performance, maintaining good performance at the tail is challenging. (Note: tail performance is hard to guarantee.)
The two key resources that need to be isolated in VWs are compute and ephemeral storage.
There has been a lot of work [18, 35, 36] on compute isolation in the data center context that Snowflake could leverage.
Moreover, the centralized task scheduler and uniform execution runtime in Snowflake make the problem easier than that of isolating compute in general-purpose clusters.
Here, we instead focus on the problem of isolating memory and storage, which has only recently started to receive attention in the research community [25]. (Note: since compute isolation has already been studied extensively, the focus is on isolating memory and storage.)
The goal here is to design a shared ephemeral storage system (using both memory and SSDs) that supports fine-grained elasticity without sacrificing isolation properties across tenants.
With respect to sharing and isolation of ephemeral storage, we outline two key challenges.
First, since our ephemeral storage system multiplexes both cached persistent data and intermediate data, both of these entities need to be jointly shared while ensuring cross-tenant isolation.
While Snowflake could leverage techniques from existing literature [11, 26] for sharing cache, we need a mechanism that is additionally aware of the co-existence of intermediate data.
Unfortunately, predicting the effective lifetime of cache entries is difficult.
Evicting idle cache entries from tenants and providing them to other tenants while ensuring hard isolation is not possible, as we cannot predict when a tenant will next access the cache entry.
Some past works [11, 33] have used techniques like idle memory taxation to deal with this issue.
We believe there is more work to be done, both in defining more reasonable isolation guarantees and in designing lifetime-aware cache sharing mechanisms that can provide such guarantees.
The second challenge is that of achieving elasticity without cross-tenant interference:
scaling up the shared ephemeral storage system capacity in order to meet the demands of a particular customer should not impact other tenants sharing the system.
For example, if we were to naively use Snowflake’s current ephemeral storage system, isolation properties would be trivially violated.
Since all cache entries in Snowflake are consistently hashed onto the same global address space, scaling up the ephemeral storage system capacity would end up triggering the lazy consistent hashing mechanism for all tenants.
This may result in multiple tenants seeing increased cache misses, resulting in degraded performance.
Resolving this challenge would require the ephemeral storage system to provide private address spaces to each individual tenant, and upon scaling of resources, to reorganize data only for those tenants that have been allocated additional resources.
Average memory utilization in our VWs is low (Figure 10); this is particularly concerning since DRAM is expensive.
Although resource sharing would improve CPU and memory utilization, it is unlikely to lead to optimal utilization across both dimensions.
Further, the variability characteristics of CPU and memory are significantly different (Figure 11), indicating the need for independent scaling of these resources.
Memory disaggregation [1, 14, 15] provides a fundamental solution to this problem.
However, as discussed in §4.2, accurately provisioning resources is hard;
since over-provisioning memory is expensive, we need efficient mechanisms to share disaggregated memory across multiple tenants while providing isolation guarantees.
In this section, we discuss related work and other systems similar to Snowflake.
Our previous work [12] discusses SQL-related aspects of Snowflake and presents related literature on those aspects.
This paper focuses on the disaggregation, ephemeral storage, caching, task scheduling, elasticity and multi-tenancy aspects of Snowflake;
in the related work discussion below, we primarily focus on these aspects.
SQL-as-a-Service systems.
There are several other systems that offer SQL functionality as a service in the cloud.
These include Amazon Redshift [16], Aurora [4], Athena [3], Google BigQuery [30] and Microsoft Azure Synapse Analytics [24].
While there are papers that describe the design and operational experience of some of these systems,
we are not aware of any prior work that undertakes a data-driven analysis of workload and system characteristics similar to ours.
Redshift [16] stores primary replicas of persistent data within compute VM clusters (S3 is only used for backup); (Note: Redshift is shared-nothing; compute and storage are not separated.)
thus, it may not be able to achieve the benefits that Snowflake achieves by decoupling compute from persistent storage.
Aurora [4] and BigQuery [30] (based on the architecture of Dremel [23]) decouple compute and persistent storage similar to Snowflake. (Note: Aurora does separate them, but depends on a specially designed storage service.)
Aurora, however, relies on a custom-designed persistent storage service that is capable of offloading database log processing, instead of a traditional blob store.
Decoupling compute and ephemeral storage systems.
Previous work [20] makes the case for flash storage disaggregation by studying a key-value store workload from Facebook.
Our observations corroborate this argument and further extend it in the context of data warehousing workloads.
Pocket [22] and Locus [27] are ephemeral storage systems designed for serverless analytics applications.
If we were to disaggregate compute and ephemeral storage in Snowflake, such systems would be good candidates.
However, these systems do not provide fine-grained resource elasticity during the lifetime of a query.
Thus, they either have to assume a priori knowledge of intermediate data sizes (for provisioning resources at the time of submitting queries),
or suffer from performance degradation if such knowledge is not available in advance.
As discussed in §4.1, predicting intermediate data sizes is extremely hard.
It would be nice to extend these systems to provide fine-grained elasticity and cross-query isolation.
Technologies for high-performance access to remote flash storage [13, 17, 21] would also be integral to efficiently realizing the decoupling of compute and the ephemeral storage system.
Multi-tenant resource sharing.
ESX server [33] pioneered techniques for multi-tenant memory sharing in the virtual machine context, including ballooning and idle-memory taxation.
Memshare [11] considers multi-tenant sharing of cache capacity in DRAM caches in the single-machine context, sharing unreserved capacity among applications in a way that maximizes hit rate.
FairRide [26] similarly considers multi-tenant cache sharing in the distributed setting while taking into account sharing of data between tenants.
Mechanisms for sharing and isolation of cache resources similar to the ones used in these works would be important in enabling Snowflake to adopt a resource-shared architecture.
As discussed previously, it would be interesting to extend these mechanisms to make them aware of the different characteristics and requirements of intermediate and persistent data.