Nat. Commun.: Robust identification of perturbed cell types in single-cell RNA-seq data

Abstract

Single-cell transcriptomics has emerged as a powerful tool for understanding how different cells contribute to disease progression by identifying cell types that change across diseases or conditions. However, detecting changing cell types is challenging due to individual-to-individual and cohort-to-cohort variability, and naive approaches based on current computational tools lead to false positive findings.

To address this, we propose a computational tool, scDist, based on a mixed-effects model that provides a statistically rigorous and computationally efficient approach for detecting transcriptomic differences. By accurately recapitulating known immune cell relationships and mitigating false positives induced by individual and cohort variation, we demonstrate that scDist outperforms current methods in both simulated and real datasets, even with limited sample sizes.

Through the analysis of COVID-19 and immunotherapy datasets, scDist uncovers transcriptomic perturbations in dendritic cells, plasmacytoid dendritic cells, and FCER1G+ NK cells that provide new insights into disease mechanisms and treatment responses. As single-cell datasets continue to expand, our faster and statistically rigorous method offers a robust and versatile tool for a wide range of research and clinical applications, enabling the investigation of cellular perturbations with implications for human health and disease.

Introduction

The advent of single-cell technologies has enabled measuring transcriptomic profiles at single-cell resolution, paving the way for the identification of subsets of cells with transcriptomic profiles that differ across conditions. These cutting-edge technologies empower researchers and clinicians to study human cell types impacted by drug treatments, infections like SARS-CoV-2, or diseases like cancer.

To conduct such studies, scientists must compare single-cell RNA-seq (scRNA-seq) data between two or more groups or conditions, such as infected versus non-infected [1], responders versus non-responders to treatment [2], or treatment versus control in controlled experiments. Two related but distinct classes of approaches exist for comparing conditions in single-cell data: differential abundance prediction and differential state analysis [3].

Differential abundance approaches, such as DA-seq, Milo, and Meld [4,5,6,7], focus on identifying cell types with varying proportions between conditions. In contrast, differential state analysis seeks to detect predefined cell types with distinct transcriptomic profiles between conditions. In this study, we focus on the problem of differential state analysis. Past differential state studies have relied on manual approaches involving visually inspecting data summaries to detect differences in scRNA data.

Specifically, cells were clustered based on gene expression data and visualized using uniform manifold approximation and projection (UMAP) [8]. Cell types that appeared separated between the two conditions were identified as different [1]. Another common approach is to use the number of differentially expressed genes (DEGs) as a metric for transcriptomic perturbation.

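However, as noted by ref. 9, the number of DEGs depends on the chosen significance level and …

$$\boldsymbol{z}_{ij}=\boldsymbol{\alpha}+x_{j}\boldsymbol{\beta}+\boldsymbol{\omega}_{j}+\boldsymbol{\varepsilon}_{ij} \tag{1}$$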

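where $\boldsymbol{\alpha}$ is a vector with entries $\alpha_g$ representing the baseline expression for gene $g$, $x_j$ is a binary indicator that is 0 if individual $j$ is in the reference condition and 1 if in the alternative condition, $\boldsymbol{\beta}$ is a vector with entries $\beta_g$ representing the difference between condition means for gene $g$, $\boldsymbol{\omega}_j$ is a random effect that represents the differences between individuals, and $\boldsymbol{\varepsilon}_{ij}$ is a random vector (of length $G$) that accounts for other sources of variability.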

We assume that $\boldsymbol{\omega}_{j}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,\tau^{2}I)$, $\boldsymbol{\varepsilon}_{ij}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,\sigma^{2}I)$, and that the $\boldsymbol{\omega}_j$ and $\boldsymbol{\varepsilon}_{ij}$ are independent of each other. To obtain normalized counts, we recommend defining $\boldsymbol{z}_{ij}$ to be the vector of Pearson residuals obtained from fitting a Poisson or negative binomial GLM [12]; this normalization procedure is implemented in the scTransform function [13].

However, our proposed approach can be used with other normalization methods for which the model is appropriate. Note that in model (18), the means for the two conditions are $\boldsymbol{\alpha}$ and $\boldsymbol{\alpha}+\boldsymbol{\beta}$, respectively. Therefore, we quantify the difference in expression profile by taking the 2-norm of the vector $\boldsymbol{\beta}$:

$$D:=\lVert\boldsymbol{\beta}\rVert_{2}=\left(\boldsymbol{\beta}^{\top}\boldsymbol{\beta}\right)^{1/2}=\sqrt{\sum_{g=1}^{G}\beta_{g}^{2}}. \tag{2}$$

Here, $D$ can be interpreted as the Euclidean distance between condition means (Fig. 2A).

Fig. 2: Visual representation of the scDist method. (A) scDist estimates the distance between condition means in high-dimensional gene expression space for each cell type. (B) To improve efficiency, scDist calculates the distance in a low-dimensional embedding space (derived from PCA) and employs a linear mixed-effects model to account for sample-level and other technical variability.

This figure was created with Biorender.com and released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.

Because we expected the vector of condition differences $\boldsymbol{\beta}$ to be sparse, we improved computational efficiency by approximating $D$ with a singular value decomposition to find a $K\times G$ matrix $U$, with $K$ much smaller than $G$, and

$$D\approx D_{K}:=\sqrt{\sum_{k=1}^{K}(U\boldsymbol{\beta})_{k}^{2}}.$$

With this approximation in place, we fitted model equation (18) by replacing $\boldsymbol{z}_{ij}$ with $U\boldsymbol{z}_{ij}$ to obtain estimates of $(U\boldsymbol{\beta})_k$.

A challenge with estimating $D_K$ is that the maximum likelihood estimator can have a significant upward bias when the number of patients is small (as is typically the case). For this reason, we employed a post-hoc Bayesian procedure to shrink $(U\boldsymbol{\beta})_{k}^{2}$ towards zero and compute a posterior distribution of $D_K$ [14].

We also provided a statistical test for the null hypothesis that $D_K=0$. We refer to the resulting procedure as scDist (Fig. 2B). Technical details are provided in Methods. We applied scDist to the negative control dataset based on blood scRNA-seq from six healthy individuals, which was used to show the large number of false positives reported by Augur (Fig. 1), and found that the false positive rate was cont…

scDist detects cell types that are different in COVID-19 patients compared to controls

We applied scDist to a large COVID-19 dataset [17] consisting of 1.4 million cells of 64 types from 284 PBMC samples from 196 individuals, comprising 171 COVID-19 patients and 25 healthy donors. The large number of samples in this dataset permitted further evaluation of our approach using real data rather than simulations.

Specifically, we defined true distances between the two groups by computing the sum of squared log fold changes (across all genes) on the entire dataset and then estimated the distance on random samples of five cases versus five controls. Because Augur does not estimate distances explicitly, we assessed the two methods' ability to accurately recapitulate the ranking of cell types based on established ground truth distances.

We found that scDist recovers the rankings better than Augur (Fig. 5A, S10). When the size of the subsample is increased to 15 patients per condition, the accuracy of scDist in recovering the ground truth rank and distance improves further (Fig. S25).

Fig. 5: Comparison of scDist and Augur performance based on real data simulation. (A) Correlation between estimated ranks (based on subsamples of 5 cases and 5 controls) and true ranks for each method, with points above the diagonal line indicating better agreement of scDist with the true ranking. (B) Plot of true distance vs. distances estimated with scDist (dashed line represents y = x). Points and error bars represent the mean and the 5th/95th percentiles. (C) AUC values achieved by Augur, where color represents likely true (blue) or false (orange) positive cell types. (D) Same as C, but for distances estimated with scDist.

(E) AUC values achieved by Augur against the cell number variation in subsampled datasets (of false posi…

scDist enables the identification of genes underlying cell-specific across-condition differences

To identify transcriptomic alteration, scDist assigns an importance score to each gene based on its contribution to the overall perturbation (Methods). We assessed this importance score for CD14+ monocytes in small COVID-19 datasets.

In this cell type, scDist assigned the highest importance score to the genes S100 calcium-binding protein A8 (S100A8) and S100 calcium-binding protein A9 (S100A9) ($p<10^{-3}$, Fig. S13b). These genes are canonical markers of inflammation [21] that are upregulated during cytokine storm. Since patients with severe COVID-19 infections often experience cytokine storms, the result suggests that S100A8/A9 upregulation in CD14+ monocytes could be a marker of the cytokine storm [22].

These two genes were reported to be upregulated in COVID-19 patients in the study of 284 samples [17].

scDist identifies transcriptomic alterations associated with immunotherapy response

To demonstrate the real-world impact of scDist, we applied it to four published datasets used to understand patient responses to cancer immunotherapy in head and neck, bladder, and skin cancer patients, respectively [2,23,24,25].

We found that each individual dataset was underpowered to detect differences between responders and non-responders (Fig. S15). To potentially increase power, we combined the data from all cohorts (Fig. 6A). However, we found that analyzing the combined data without accounting for cohort-specific variations led to false positives.

For example, responder-non-responder differences estimated by Augur were highly correlated between pre- and post-treatments (Fig. 6B), suggesting a confounding effect of cohort-specific variations. Furthermore, Augur predicted that most cell types were altered in both pre-treatment and post-treatment samples (AUC > 0.5 for 41 in pre-treatment and 44 in post-treatment out of a total of 49 cell types), which is potentially due to the confounding effect of cohort-specific variations.

Fig. 6: Immunotherapy cohorts analysis using scDist. (A) Study design: discovery cohorts of four scRNA cohorts (cited in order as shown [2,23,24,25]) identify cell-type-specific differences and a differential gene signature between responders and non-responders. This signature was evaluated in validation cohorts of six bulk RNA-seq cohorts (cited in order as shown [32,33,34,35,36,37,38]). (B) Pre-treatment and post-treatment sample differences were estimated using Augur and scDist (Spearman correlation is reported on the plot). The error bars represent the 95% confidence interval for the fitted linear regression line. (C) Significance of the estimated differences (scDist). (D) Kaplan–…

scDist is computationally efficient

A key strength of the linear modeling framework used by scDist is that it is efficient on large datasets. For instance, on the COVID-19 dataset with 13 samples [1], scDist completed the analysis in around 50 seconds, while Augur required 5 minutes. To better understand how runtime depends on the number of cells, we applied both methods to subsamples of the dataset that varied in size and observed that scDist was, on average, five-fold faster (Fig. S20).

scDist is also capable of scaling to millions of cells. On simulated data, scDist required approximately 10 minutes to fit a dataset with 1,000,000 cells (Fig. S21). We also tested the sensitivity of scDist to the number of PCs used by comparing $D_K$ for various values of $K$. We observed that the estimated distances stabilize as $K$ increases (Fig. S22), justifying $K=20$ as a reasonable choice for most datasets.

Discussion

The identification of cell types influenced by infections, treatments, or biological conditions is crucial for understanding their impact on human health and disease.

We present scDist, a statistically rigorous and computationally fast method for detecting cell-type specific differences across multiple groups or conditions. By using a mixed-effects model, scDist estimates the difference between groups while quantifying the statistical uncertainty due to individual-to-individual variation and other sources of variability.

We validated scDist through the unbiased recapitulation of known relationships between immune cells and demonstrated its effectiveness in mitigating false positives from patient-level and technical variations in both simulated and real datasets. Notably, scDist facilitates biological discoveries from scRNA cohorts, even when the number of individuals is limited.

(3)

$$\log\mu_{g}=\beta_{0g}+\beta_{1g}\log r_{ij} \tag{4}$$

where $r_{ij}$ is the total number of UMI counts for the particular cell. The normalized counts are given by the Pearson residuals of the above model:

$$z_{ijg}=\frac{y_{ijg}-\hat{\mu}_{g}}{\sqrt{\hat{\mu}_{g}+\hat{\mu}_{g}^{2}/\hat{\alpha}_{g}}} \tag{5}$$
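As an illustration only, equations (4) and (5) can be evaluated for one gene in one cell as in the sketch below. The function names and all numeric values are hypothetical, and the coefficients are taken as given rather than fitted; in a real analysis the fit and the residuals would come from a tool such as scTransform.

```python
import math


def fitted_mean(beta0, beta1, total_umi):
    """Equation (4): log(mu) = beta0 + beta1 * log(r), solved for mu."""
    return math.exp(beta0 + beta1 * math.log(total_umi))


def nb_pearson_residual(y, mu_hat, alpha_hat):
    """Equation (5): residual scaled by the NB variance mu + mu^2 / alpha."""
    variance = mu_hat + mu_hat ** 2 / alpha_hat
    return (y - mu_hat) / math.sqrt(variance)


# Hypothetical gene: expected count is 0.1% of the cell's total UMI count.
mu_hat = fitted_mean(beta0=math.log(0.001), beta1=1.0, total_umi=2000)
# Hypothetical observed count of 5 with dispersion parameter alpha = 4.
z = nb_pearson_residual(y=5, mu_hat=mu_hat, alpha_hat=4.0)
print(mu_hat, z)
```

With these made-up inputs the fitted mean is 2 and the variance is 2 + 4/4 = 3, so the residual is 3/sqrt(3); a count equal to the fitted mean would give a residual of zero.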

Distance in normalized expression space

In this section, we describe the inferential procedure of scDist for cases without additional covariates. However, the procedure can be generalized to the full model (18) with arbitrary covariates (design matrix) incorporating random and fixed effects, as well as nested-effect mixed models.

For a given cell type, we model the $G$-dimensional vector of normalized counts as

$$\boldsymbol{z}_{ij}=\boldsymbol{\alpha}+x_{ij}\,\boldsymbol{\beta}+\boldsymbol{\omega}_{j}+\boldsymbol{\varepsilon}_{ij} \tag{6}$$

where $\boldsymbol{\alpha},\boldsymbol{\beta}\in\mathbb{R}^{G}$, $x_{ij}$ is a binary indicator of condition, $\boldsymbol{\omega}_{j}\sim\mathcal{N}(0,\tau^{2}I_{G})$, and $\boldsymbol{\varepsilon}_{ij}\sim\mathcal{N}(0,\sigma^{2}I_{G})$. The quantity of interest is the Euclidean distance between condition means $\boldsymbol{\alpha}$ and $\boldsymbol{\alpha}+\boldsymbol{\beta}$:

$$D:=\sqrt{\boldsymbol{\beta}^{T}\boldsymbol{\beta}}=\lVert\boldsymbol{\beta}\rVert_{2} \tag{7}$$

If $U\in\mathbb{R}^{G\times G}$ is an orthonormal matrix, we can apply $U$ to equation (6) to obtain the transformed model:

$$U\boldsymbol{z}_{ij}=U\boldsymbol{\alpha}+x_{ij}U\boldsymbol{\beta}+U\boldsymbol{\omega}_{j}+U\boldsymbol{\varepsilon}_{ij} \tag{8}$$

Since $U$ is orthogonal, $U\boldsymbol{\omega}_j$ and $U\boldsymbol{\varepsilon}_{ij}$ still have spherical normal distributions. We also have that

$$(U\boldsymbol{\beta})^{T}(U\boldsymbol{\beta})=\boldsymbol{\beta}^{T}\boldsymbol{\beta}=D^{2} \tag{9}$$
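The invariance in equation (9) can be checked numerically on a toy vector. The sketch below is only an illustration: it uses Givens rotations as a stand-in for an orthonormal $U$ (an assumption made for convenience, not the PCA-derived matrix used by scDist), and the vector `beta` is made up.

```python
import math


def rotate_pair(vec, i, j, theta):
    """Apply a Givens rotation (an orthonormal map) to coordinates i and j."""
    out = list(vec)
    c, s = math.cos(theta), math.sin(theta)
    out[i] = c * vec[i] - s * vec[j]
    out[j] = s * vec[i] + c * vec[j]
    return out


def euclidean_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def truncated_distance(vec, k):
    """Norm of the first k coordinates, i.e. the truncated sum of squares."""
    return euclidean_norm(vec[:k])


beta = [3.0, 0.0, 4.0, 0.0]  # hypothetical condition difference with D = 5
# Composition of two rotations plays the role of U applied to beta.
u_beta = rotate_pair(rotate_pair(beta, 0, 1, 0.7), 2, 3, -1.2)

d = euclidean_norm(beta)
d_rotated = euclidean_norm(u_beta)
d_k = truncated_distance(u_beta, 2)
print(d, d_rotated, d_k)
```

The full norm is unchanged by the rotation, while keeping only the first two coordinates can only lose distance. This is why the choice of $U$ matters once the sum is truncated at $K$ terms.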

This means that the distance in the transformed model is the same as in the original model. As mentioned earlier, our goal is to find $U$ such that

$$D_{K}:=\sqrt{\sum_{k=1}^{K}(U\boldsymbol{\beta})_{k}^{2}}\approx D \tag{10}$$

with $K\ll G$. Let $Z\in\mathbb{R}^{n\times G}$ be the matrix with rows $\boldsymbol{z}_{ij}$ (where $n$ is the total number of cells). Intuitively, we want to choose a $U$ such that the projection of $\boldsymbol{z}_{ij}$ onto the first $K$ rows of $U$ ($u_{1},\ldots,u_{K}\in\mathbb{R}^{G}$) minimizes the reconstruction error

$$\sum_{i=1}^{n}\lVert z_{i}-(\mu+v_{i1}u_{1}+\cdots+v_{iK}u_{K})\rVert_{2}^{2} \tag{11}$$

where $\mu\in\mathbb{R}^{G}$ is a shift vector and $(v_{ik})\in\mathbb{R}^{n\times K}$ is a matrix of coefficients. It can be shown that the PCA of $Z$ yields the (orthonormal) $u_1,\ldots,u_K$ that minimize this reconstruction error [26].

Inference

Given an estimator $\widehat{(U\boldsymbol{\beta})_{k}}$ of $(U\boldsymbol{\beta})_k$, a naive estimator of $D_K$ is given by taking the square root of the sum of squared estimates:

$$\sqrt{\sum_{k=1}^{K}\widehat{(U\boldsymbol{\beta})_{k}}^{2}}. \tag{12}$$

However, this estimator can have significant upward bias due to sampling variability. For instance, even if the true distance is 0, $\widehat{(U\boldsymbol{\beta})_{k}}$ is unlikely to be exactly zero, and that noise becomes strictly positive when squaring. To account for this, we apply a post-hoc Bayesian procedure to the $\widehat{(U\boldsymbol{\beta})_{k}}$ to shrink them towards zero before computing the sum of squares.
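A small simulation can make the upward bias concrete. The sketch below assumes a true distance of exactly zero, 20 components, and a common standard error of 0.5 for every estimate (all illustrative values, not taken from the paper), and evaluates the naive estimator (12) on pure noise.

```python
import math
import random


def naive_distance(estimates):
    """Naive estimator: square root of the sum of squared estimates."""
    return math.sqrt(sum(e * e for e in estimates))


def simulate_null_distances(k=20, se=0.5, n_rep=2000, seed=0):
    """Naive distances when every true (U beta)_k is zero and only noise remains."""
    rng = random.Random(seed)
    return [
        naive_distance([rng.gauss(0.0, se) for _ in range(k)])
        for _ in range(n_rep)
    ]


null_distances = simulate_null_distances()
mean_null = sum(null_distances) / len(null_distances)
print(mean_null, min(null_distances))
```

Every simulated distance is positive even though the true distance is zero, and the average sits near `se * sqrt(k)` rather than near zero, which is the motivation for shrinking the estimates before summing their squares.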

In particular, we adopt the spike-and-slab model of ref. 14:

$$\widehat{(U\boldsymbol{\beta})_{k}}\sim\mathcal{N}\left((U\boldsymbol{\beta})_{k},\mathrm{Var}\left[\widehat{(U\boldsymbol{\beta})_{k}}\right]\right) \tag{13}$$

$$(U\boldsymbol{\beta})_{k}\sim\pi_{0}\delta_{0}+\sum_{t=1}^{T}\pi_{t}\,\mathcal{N}(0,\tau_{t}) \tag{14}$$

where $\mathrm{Var}[\widehat{(U\boldsymbol{\beta})_{k}}]$ is the variance of the estimator $\widehat{(U\boldsymbol{\beta})_{k}}$, $\delta_0$ is a point mass at 0, and $\pi_0,\pi_1,\ldots,\pi_T$ are mixing weights (that is, they are non-negative and sum to 1). Ref. 14 provides a fast empirical Bayes approach to estimate the mixing weights and obtain posterior samples of $(U\boldsymbol{\beta})_k$.
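To show how the amount of shrinkage under (13) and (14) depends on the uncertainty of the estimate, the sketch below computes the posterior mean of a single component in closed form. It is a simplification under stated assumptions: the mixing weights are fixed by hand instead of being estimated by empirical Bayes, $\tau_t$ is read as a variance, and the function names are hypothetical; it is not the implementation from ref. 14.

```python
import math


def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def posterior_mean(theta_hat, se, weights, slab_vars):
    """Posterior mean under a point mass at zero plus zero-mean normal slabs.

    weights[0] is the spike weight; weights[1:] pair with slab_vars.
    """
    s2 = se * se
    comp_vars = [0.0] + list(slab_vars)  # the spike is a slab with variance 0
    # Marginal likelihood of theta_hat under each component: N(0, s2 + var).
    marginals = [w * normal_pdf(theta_hat, s2 + v)
                 for w, v in zip(weights, comp_vars)]
    total = sum(marginals)
    post_weights = [m / total for m in marginals]
    # Within a component the posterior mean is theta_hat * var / (var + s2).
    shrink = [v / (v + s2) for v in comp_vars]
    return sum(pw * sh * theta_hat for pw, sh in zip(post_weights, shrink))


# Same point estimate, two hypothetical levels of uncertainty.
precise = posterior_mean(2.0, se=0.5, weights=[0.5, 0.5], slab_vars=[4.0])
noisy = posterior_mean(2.0, se=2.0, weights=[0.5, 0.5], slab_vars=[4.0])
print(precise, noisy)
```

The estimate with the larger standard error is pulled much closer to zero than the precise one, which is the behaviour the text relies on when it says the shrinkage adapts to the uncertainty in the initial estimate.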

Then samples from the posterior of $D_K$ are obtained by applying formula (12) to the posterior samples of $(U\boldsymbol{\beta})_k$. We then summarize the posterior distribution by reporting the median and other quantiles. An advantage of this particular specification is that the amount of shrinkage depends on the uncertainty in the initial estimate of $(U\boldsymbol{\beta})_k$. We use the following procedure to obtain $\widehat{(U\boldsymbol{\beta})_{k}}$:

1: Use the matrix of PCA loadings as a plug-in estimator for $U$. Then $U\boldsymbol{z}_{ij}$ is the vector of PC scores for cell $i$ in sample $j$.

2: Estimate $(U\boldsymbol{\beta})_k$ by using lme4 [27] to fit the model (6) using the PC scores corresponding to the $k$-th loading (i.e., each dimension is fit independently).

Note that only the first $K$ rows of $U$ need to be stored. We are particularly interested in testing the null hypothesis of $D_K=0$ against the alternative $D_K>0$. Because the null hypothesis corresponds to $(U\boldsymbol{\beta})_k=0$ for all $1\le k\le K$, we can use the sum of individual Wald statistics as our test statistic:

$$W=\sum_{k=1}^{K}W_{k}=\sum_{k=1}^{K}\left(\frac{\widehat{(U\boldsymbol{\beta})}_{k}}{\widehat{\mathrm{se}}\left[\widehat{(U\boldsymbol{\beta})}_{k}\right]}\right)^{2} \tag{15}$$

Under the null hypothesis that $(U\boldsymbol{\beta})_k=0$, $W_k$ can be approximated by an $F_{\nu_{k},1}$ distribution. $\nu_k$ is estimated using Satterthwaite's approximation in lmerTest. This implies that

$$W\sim\sum_{k=1}^{K}F_{\nu_{k},1} \tag{16}$$

under the null. Moreover, the $W_k$ are independent because we have assumed that covariance matrices for the sample and cell-level noise are multiples of the identity. Equation (16) is not a known distribution, but quantiles can be approximated using Monte Carlo samples. To make this precise, let $W_1,\ldots,W_M$ be draws from equation (16), where $M=10^{5}$, and let $W^{*}$ be the value of equation (15) (i.e., the actual test statistic).

Then the empirical p-value [28] is computed as

$$\frac{\sum_{i=1}^{M}I(W_{i}>W^{*})+1}{M+1} \tag{17}$$
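The Monte Carlo step can be sketched in a few lines. The code below is a hedged illustration, not the scDist implementation: it assumes each null term is the distribution of a squared t statistic, i.e. an F distribution with 1 numerator and $\nu_k$ denominator degrees of freedom, and the degrees of freedom, the observed statistic and the reduced number of draws are made-up values.

```python
import random


def draw_f_1_nu(rng, nu):
    """One draw from F(1, nu): chi-square(1)/1 divided by chi-square(nu)/nu."""
    numerator = rng.gauss(0.0, 1.0) ** 2
    # chi-square(nu) is a gamma distribution with shape nu/2 and scale 2.
    denominator = rng.gammavariate(nu / 2.0, 2.0) / nu
    return numerator / denominator


def null_draws(dfs, n_draws, seed=0):
    """Draws of the summed null statistic, one F term per retained component."""
    rng = random.Random(seed)
    return [sum(draw_f_1_nu(rng, nu) for nu in dfs) for _ in range(n_draws)]


def empirical_p_value(draws, observed):
    """Equation (17): (number of draws exceeding the statistic + 1) / (M + 1)."""
    exceed = sum(1 for w in draws if w > observed)
    return (exceed + 1) / (len(draws) + 1)


# Hypothetical Satterthwaite degrees of freedom for K = 3 components.
draws = null_draws([8.0, 9.5, 12.0], n_draws=10_000)
p_value = empirical_p_value(draws, observed=15.0)
print(p_value)
```

Adding 1 to both numerator and denominator keeps the p-value strictly positive, so the smallest reportable value is 1/(M+1) no matter how extreme the observed statistic is.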

Controlling for additional covariates

Because scDist is based on a linear model, it is straightforward to control for additional covariates such as age or sex of a patient in the analysis. In particular, the model in equation (6) can be replaced with

$$\boldsymbol{z}_{ij}=\boldsymbol{\alpha}+x_{j}\boldsymbol{\beta}+\sum_{k=1}^{p}w_{ijk}\boldsymbol{\gamma}_{k}+\boldsymbol{\omega}_{j}+\boldsymbol{\varepsilon}_{ij} \tag{18}$$

where $w_{ijk}\in\mathbb{R}$ is the value of the $k$th covariate for cell $i$ in sample $j$ and $\boldsymbol{\gamma}_{k}\in\mathbb{R}^{G}$ is the gene-specific effect corresponding to the $k$th covariate.

Choosing the number of principal components

An important choice in scDist is the number of principal components $d$ (denoted $K$ above).

If $d$ is chosen too small, then estimation accuracy may suffer as the first few PCs may not capture enough of the distance. On the other hand, if $d$ is chosen too large, then the power may suffer as a majority of the PCs will simply be capturing random noise (and adding degrees of freedom to the Wald statistic).

Moreover, it is important that $d$ is chosen a priori, as choosing the $d$ that produces the lowest p-values is akin to p-hacking. If the model is correctly specified, then it is reasonable to choose $d=J-1$, where $J$ is the number of samples (or patients). To see why, notice that the mean expression in sample $1\le j\le J$ is

$$x_{\cdot j}\boldsymbol{\beta}+\boldsymbol{\omega}_{j}\in\mathbb{R}^{G} \tag{19}$$

In particular, the $J$ sample means lie on a $(J-1)$-dimensional subspace in $\mathbb{R}^{G}$. Under the assumption that the condition difference and sample-level variability are larger than the error variance $\sigma^2$, we should expect that the first $J-1$ PC vectors capture all of the variance due to differences in sample means. In practice, however, the model cannot be expected to be correctly specified.

For this reason, we find that $d=20$ is a reasonable choice when the number of samples is small (as is usually the case in scRNA-seq) and $d=50$ for datasets with a large number of samples. This is in line with other single-cell methods, where the number of PCs retained is usually between 20 and 50.

Cell type annotation and "double dipping"

scDist takes as input an annotated list of cells.

A common approach to annotate cells is to cluster based on gene expression. Since scDist also uses the gene expression data to measure the condition difference, there are concerns associated with "double-dipping", or using the data twice. In particular, if the condition difference is very large and all of the data is used to cluster, it is possible that the cells in the two conditions would be assigned to different clusters.

In this case scDist would be unable to estimate the inter-condition distance, leading to a false negative. In other words, the issue of double dipping could cause scDist to be more conservative. Note that the opposite problem occurs when performing differential expression between two estimated clusters; in this case, the p-values corresponding to genes will be anti-conservative [29]. To illustrate, we simulated a normalized count matrix with 4000 cells and 1000 genes in such a way that there are two "true" cell types and a true condition distance of 4 for both.

(20)

The weighted distance can be written in matrix form by letting $W\in\mathbb{R}^{G\times G}$ be a diagonal matrix with $W_{gg}=w_g$, so that

$$D_{w}=\boldsymbol{\beta}^{\top}W\boldsymbol{\beta} \tag{21}$$

Thus, the weighted distance can be estimated by instead considering the transformed model where $U\sqrt{W}$ is applied to each $\boldsymbol{z}_{ij}$. After this different transformed model is obtained, estimation and inference of $D_w$ proceed in exactly the same way as in the unweighted case. To test the accuracy of the weighted distance estimate, we considered a simulation where each gene had only a 10% chance of having $\beta_g\neq 0$ (in which case $\beta_{g}\sim\mathcal{N}(0,1)$).

We then considered three scenarios: $w_g=1$ if $\beta_g\neq 0$ and $w_g=0$ otherwise (correct weighting), $w_g=1$ for all $g$ (unweighted), and $w_g=1$ randomly with probability 0.1 (incorrect weights). We then quantified the performance by taking the absolute value of the error between $\sum_{g}\beta_{g}^{2}$ and the estimated distance.
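The three weighting scenarios can be illustrated on the true effect vector itself, before any estimation noise enters. The sketch below evaluates the quadratic form in equation (21) for a diagonal $W$; the vector `beta` and the particular "incorrect" weights are made-up examples, and the function name is hypothetical.

```python
def weighted_squared_distance(beta, weights):
    """beta^T W beta for a diagonal W with entries given by `weights`."""
    return sum(w * b * b for w, b in zip(weights, beta))


beta = [2.0, 0.0, 0.0, -1.0, 0.0]  # hypothetical sparse effect vector

correct = [1.0 if b != 0 else 0.0 for b in beta]  # weight only perturbed genes
unweighted = [1.0] * len(beta)                    # all genes weighted equally
incorrect = [0.0, 1.0, 0.0, 0.0, 1.0]             # weights miss the true signal

target = sum(b * b for b in beta)
for name, w in [("correct", correct), ("unweighted", unweighted),
                ("incorrect", incorrect)]:
    print(name, abs(weighted_squared_distance(beta, w) - target))
```

On the noiseless vector, correct weights and no weights both recover the target exactly, whereas weights that miss the perturbed genes lose the signal entirely. With estimated effects, the correct weights additionally avoid summing noise from unperturbed genes.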

Figure S3 shows that correct weighting slightly outperforms unweighted scDist but random weights are significantly worse. Thus, the unweighted version of scDist should be preferred unless strong a priori information is available.

Robustness to model misspecification

The scDist model assumes that the cell-specific variance $\sigma^2$ and sample-specific variance $\tau^2$ are shared across genes.

The purpose of this assumption is to ensure that the noise in the transformed model follows a spherical normal distribution. Violations of this assumption could lead to miscalibrated standard errors and hypothesis tests but should not affect estimation. To demonstrate this, we considered simulated data where each gene has $\sigma_g\sim\mathrm{Gamma}(r,r)$ and $\tau_g\sim\mathrm{Gamma}(r/2,r)$.

As $r$ varies, the quality of the distance estimates does not change significantly (Fig. S26).

Semi-simulated COVID-19 data

COVID-19 patient data for the analysis was obtained from ref. 17, containing 1.4 million cells of 64 types from 284 PB…

For each gene $g$, we computed the log fold change $L_g$ between COVID-19 cases and controls, with $L_g=E_g(\text{Covid})-E_g(\text{Control})$, where $E_g$ denotes the log-transformed expression data $\log(1+x)$.

The ground truth distance is then defined as $D=\sum_{g}L_{g}^{2}$.
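As a sketch of this definition, the code below computes the sum of squared log fold changes from per-gene expression values. It assumes that $E_g$ is the average of $\log(1+x)$ over the cells of a group (the text does not spell out the aggregation), and the gene names and values are made up.

```python
import math


def mean_log1p(values):
    """Average of log(1 + x) over cells; assumed reading of E_g."""
    return sum(math.log1p(v) for v in values) / len(values)


def ground_truth_distance(cases, controls):
    """Sum over genes of the squared log fold change L_g.

    `cases` and `controls` map gene name -> list of expression values.
    """
    total = 0.0
    for gene in cases:
        lfc = mean_log1p(cases[gene]) - mean_log1p(controls[gene])
        total += lfc * lfc
    return total


# Toy data: gene A is shifted by one log unit in cases, gene B is unchanged.
cases = {"A": [math.e - 1.0, math.e - 1.0], "B": [3.0, 1.0]}
controls = {"A": [0.0, 0.0], "B": [1.0, 3.0]}
print(ground_truth_distance(cases, controls))
```

Only gene A contributes here, so the distance is close to 1; genes with identical group averages add nothing regardless of how variable they are within a group.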

Subsequently, we excluded any cell types not present in more than 10% of the samples from further analysis. For true negative cell types, we identified the top 5 with the smallest fold change and a representation of over 20,000 cells within the entire dataset. When attempting similar filtering based on cell count alone, no cell types demonstrated a sufficiently large true distance.

Consequently, we chose the top four cell types with over 5000 cells as our true positives (Fig. S11). Using the ground truth, we performed two separate simulation analyses:

1: Simulation analyses I (Fig. 5A, B): Using one half of the dataset (712,621 cells, 132 case samples, 20 control samples), we created 100 subsamples consisting of 5 cases and 5 controls. For each subsample, we applied both scDist and Augur to estimate perturbation/distance between cases and controls for each cell type.

Then we computed the correlation between the ground truth ranking (ordering cells by sum of log fold changes on the whole dataset) and the ranking obtained by both methods. For scDist, we restricted to cell types that had a non-zero distance estimate in each subsample, and for Augur we restricted to cell types that had an AUC greater than 0.5 (Fig. 5A).

For Fig. 5B, we took the mean estimated distance across subsamples for which the given cell type had a non-zero distance estimate. This is because in some subsamples a given cell type could be completely absent.

2: Simulation analyses II (Fig. 5C–F): We subsampled the COVID-19 cohort with 284 samples (284 PBMC samples from 196 individuals: 171 with COVID-19 infection and 25 healthy controls) to create 1,000 downsampled cohorts, each containing samples from 10 individuals (5 with COVID-19 and 5 healthy controls).

We randomly selected each sample from the downsampled cohort, further downsampled the number of cells for each cell type, and selected them from the original COVID-19 cohort. This downsampling procedure increases both cohort variability and cell-number variations.

Performance Evaluation in Subsampled Cohorts: We applied scDist and Augur to each subsampled cohort, comparing the results for true positive and false positive cell types. We partitioned the sampled cohorts into 10 groups based on cell-number variation, defined as the number of cells in a sample with the highest number of cells for false-negative cell types divided by the average number of cells in cell types.

This procedure highlights the vulnerability of computational methods to cell number variation, particularly in negative cell types.
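One way to read the partitioning step is sketched below. The ratio follows the definition quoted above (largest per-sample cell count over the mean count), but which cell types enter the counts is taken as given, and splitting into near-equal-sized groups by rank is an assumption; the text only says there were 10 groups. Both function names are hypothetical and the counts are invented.

```python
def cell_number_variation(cells_per_sample):
    """Largest per-sample cell count divided by the mean per-sample cell count."""
    return max(cells_per_sample) / (sum(cells_per_sample) / len(cells_per_sample))


def assign_rank_groups(values, n_groups=10):
    """Give each value a group index 0..n_groups-1 by rank (0 = lowest values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, idx in enumerate(order):
        groups[idx] = rank * n_groups // len(values)
    return groups


# Invented cohorts: nine samples with 100 cells plus one increasingly large sample.
cohort_counts = [[100] * 9 + [100 + 50 * k] for k in range(20)]
variation = [cell_number_variation(counts) for counts in cohort_counts]
groups = assign_rank_groups(variation)
```

Method performance would then be summarised within each group to see how it changes as cell-number variation grows.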

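Analysis of immunotherapy cohorts

Data collection

We obtained single-cell data from four cohorts (refs. 2, 23, 24, 25), including expression counts and patient response information.

Model to account for cohort and sample variance

To account for cohort-specific and sample-specific batch effects, scDist models the normalized gene expression as:

$$Z \sim X + (1 \mid \gamma:\omega) \tag{22}$$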

Here, Z represents the normalized count matrix, X denotes the binary indicator of condition (responder = 1, non-responder = 0); γ and ω are cohort and sample-level random effects, and (1 ∣ γ:ω) models nested effects of samples within cohorts. The inference procedure for distance, its variance, and significance for the model with multiple cohorts is analogous to the single-cohort model.

Signature

We estimated the signature in the NK-2 cell type using differential expression between responders and non-responders.

To account for cohort-specific and patient-specific effects in differential expression estimation, we employed the linear mixed model described above for estimating distances, performing inference for each gene separately. The coefficient of X inferred from the linear mixed models was used as the estimate of differential expression:

$$Z \sim X + (1 \mid \gamma:\omega) \tag{23}$$
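Read as an lme4-style formula (the notation of the lme4 package cited in the references), Eq. (23) fits a fixed effect for condition plus one random intercept for each sample-within-cohort combination. A sketch of the expanded form for a single gene is given below. The indices c (cohort), s (sample) and i (cell), the coefficient names, and the Gaussian assumptions are the usual linear mixed model conventions, introduced here for illustration rather than taken from the text:

$$z_{csi} = \beta_0 + \beta_1 x_{cs} + u_{cs} + \varepsilon_{csi}, \qquad u_{cs} \sim \mathcal{N}(0, \tau^2), \qquad \varepsilon_{csi} \sim \mathcal{N}(0, \sigma^2)$$

Under this reading, the intercept shift u is shared by all cells from the same sample, so cells from one patient are not treated as independent replicates, and the fitted coefficient of x (written β₁ here) is the quantity used as the differential expression estimate.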

Here, Z represents the normalized count matrix, X denotes the binary indicator of condition (responder = 1, non-responder = 0); γ and ω are cohort and sample-level random effects, and (1 ∣ γ:ω) models nested effects of samples within cohorts.

Bulk RNA-seq cohorts

We obtained bulk RNA-seq data from seven cancer cohorts (refs. 32–38), comprising a total of 789 patients.

Within each cohort, we converted counts of each gene to TPM and normalized them to zero mean and unit standard deviation. We collected survival outcomes (both progression-free and overall) and radiologic-based responses (partial/complete responders and non-responders with stable/progressive disease) for each patient.

Evaluation of signature in bulk RNA-seq cohorts

We scored each bulk transcriptome (sample) for the signature using the strategy described in ref. 39. Specifically, the score was defined as the Spearman correlation between the normalized expression and differential expression in the signature. We stratified patients into two groups using the median score for patient stratification. Kaplan–Meier plots were generated using these stratifications, and the significance of survival differences was assessed using the log-rank test.
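The stratification and testing steps can be sketched with the standard library alone. The per-patient signature scores are taken as given, and the scores, follow-up times and event indicators below are invented. Assigning scores equal to the median to the "low" group is an assumption, and `logrank_test` is a textbook two-group implementation written for this illustration, not the routine used in the paper.

```python
import math
from statistics import median


def stratify_by_median(scores):
    """Label a patient 'high' if the score exceeds the cohort median, else 'low'."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]


def logrank_test(times, events, groups):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times: follow-up times; events: 1 if the event was observed, 0 if censored;
    groups: 'high' or 'low' for each patient.
    """
    observed = expected = variance = 0.0
    # Loop over the distinct times at which at least one event occurred.
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [g for ti, g in zip(times, groups) if ti >= t]
        n = len(at_risk)
        n_high = at_risk.count("high")
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d_high = sum(1 for ti, e, g in zip(times, events, groups)
                     if ti == t and e and g == "high")
        observed += d_high
        expected += d * n_high / n  # expected events in 'high' under no difference
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    stat = (observed - expected) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented example: signature scores, follow-up in months, and event indicators.
scores = [0.42, -0.10, 0.31, -0.25, 0.05, -0.40, 0.18, -0.02]
months = [30, 8, 26, 5, 22, 3, 28, 10]
died = [0, 1, 1, 1, 0, 1, 1, 1]
groups = stratify_by_median(scores)
stat, p_value = logrank_test(months, died, groups)
```

The same `groups` labels are what a Kaplan–Meier plot would be drawn from, one curve per group.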

To demonstrate the association of signature levels with radiological response, we plotted signature levels separately for non-responders, partial-responders, and responders.

Evaluating Augur Signature in Bulk RNA-Seq Cohorts

A differential signature was derived for Augur's top prediction, plasma cells, using a procedure analogous to the one described above for scDist.

This plasma signature was then assessed in bulk RNA-seq cohorts following the same evaluation strategy as applied to the scDist signature.

Statistics and reproducibility

No statistical method was.

Data availability

Table 1 gives a list of the datasets used in each figure, as well as details about how the datasets can be obtained. Source data are provided with this paper.

Code availability

scDist is available as an R package and can be downloaded from GitHub (ref. 40): github.com/phillipnicol/scDist. The repository also includes scripts to replicate some of the figures and a demo of scDist using simulated data.

References

Wilk, A. J. et al. A single-cell atlas of the peripheral immune response in patients with severe COVID-19. Nat. Med. 26, 1070–1076 (2020).

Yuen, K. C. et al. High systemic and tumor-associated IL-8 correlates with reduced clinical benefit of PD-L1 blockade. Nat. Med. 26, 693–698 (2020).

Crowell, H. L. et al. Muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nat. Commun. 11, 6077 (2020).

Helmink, B. A. et al. B cells and tertiary lymphoid structures promote immunotherapy response. Nature 577, 549–555 (2020).

Zhao, J. et al. Detection of differentially abundant cell subpopulations in scRNA-seq data. Proc. Natl. Acad. Sci. 118, e2100293118 (2021).

Dann, E., Henderson, N. C., Teichmann, S. A., Morgan, M. D. & Marioni, J. C. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat. Biotechnol. 40, 245–253 (2022).

Burkhardt, D. B. et al. Quantifying the effect of experimental perturbations at single-cell resolution. Nat. Biotechnol. 39, 619–629 (2021).

Zimmerman, K. D., Espeland, M. A. & Langefeld, C. D. A practical solution to pseudoreplication bias in single-cell studies. Nat. Commun. 12, 1–9 (2021).

Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

Townes, F. W., Hicks, S. C., Aryee, M. J. & Irizarry, R. A. Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol. 20, 1–16 (2019).

Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20, 1–15 (2019).

Stephens, M. False discovery rates: a new deal. Biostatistics 18, 275–294 (2017).

Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).

Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1141 (2018).

Ren, X. et al. COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell 184, 1895–1913 (2021).

Galati, D., Zanotta, S., Capitelli, L. & Bocchino, M. A bird's eye view on the role of dendritic cells in SARS-CoV-2 infection: Perspectives for immune-based vaccines. Allergy 77, 100–110 (2022).

Pérez-Gómez, A. et al. Dendritic cell deficiencies persist seven months after SARS-CoV-2 infection. Cell. Mol. Immunol. 18, 2128–2139 (2021).

Mellett, L. & Khader, S. A. S100A8/A9 in COVID-19 pathogenesis: impact on clinical outcomes. Cytokine Growth Factor Rev. 63, 90–97 (2022).

Luoma, A. M. et al. Tissue-resident memory and circulating T cells are early responders to pre-surgical cancer immunotherapy. Cell 185, 2918–2935 (2022).

Yost, K. E. et al. Clonal replacement of tumor-specific T cells following PD-1 blockade. Nat. Med. 25, 1251–1259 (2019).

Sade-Feldman, M. et al. Defining T cell states associated with response to checkpoint immunotherapy in melanoma. Cell 175, 998–1013 (2018).

Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2, 559–572 (1901).

Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48 (2015).

North, B. V., Curtis, D. & Sham, P. C. A note on the calculation of empirical p values from Monte Carlo procedures. Am. J. Hum. Genet. 71, 439–441 (2002).

et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).

Mariathasan, S. et al. TGFβ attenuates tumour response to PD-L1 blockade by contributing to exclusion of T cells. Nature 554, 544–548 (2018).

Weber, J. S. et al. Sequential administration of nivolumab and ipilimumab with a planned switch in patients with advanced melanoma (CheckMate 064): an open-label, randomised, phase 2 trial. Lancet Oncol. 17, 943–955 (2016).

Liu, D. et al. Integrative molecular and clinical modeling of clinical outcomes to PD1 blockade in patients with metastatic melanoma. Nat. Med. 25, 1916–1927 (2019).

McDermott, D. F. et al. Clinical activity and molecular correlates of response to atezolizumab alone or in combination with bevacizumab versus sunitinib in renal cell carcinoma. Nat. Med. 24, 749–757 (2018).

Riaz, N. et al. Tumor and microenvironment evolution during immunotherapy with nivolumab. Cell 171, 934–949 (2017).

Miao, D. et al. Genomic correlates of response to immune checkpoint therapies in clear cell renal cell carcinoma. Science 359, 801–806 (2018).

Van Allen, E. M. et al. Genomic correlates of response to CTLA-4 blockade in metastatic melanoma. Science 350, 207–211 (2015).

Sahu, A. et al. Discovery of targets for immune–metabolic antitumor drugs identifies estrogen-related receptor alpha. Cancer Discov. 13, 672–701 (2023).

Acknowledgements

We express our gratitude to Adrienne M. Luoma, Shengbao Suo, and Kai W. Wucherpfennig for providing the scRNA data (ref. 23). We also thank Zexian Zeng for assistance with downloading and accessing the bulk RNA-seq dataset.

Author information

Authors and Affiliations

Harvard University, Cambridge, MA, USA: Phillip B. Nicol & Danielle Paulson

University of California San Diego School of Medicine, San Diego, CA, USA: Gege Qian

Dana-Farber Cancer Institute, Boston, MA, USA: X. Shirley Liu & Rafael Irizarry

University of New Mexico Comprehensive Cancer Center, Albuquerque, NM, USA: Avinash D. Sahu


Contributions

P.B.N., D.P., G.Q., X.S.L., R.I., and A.D.S. conceived the study. P.B.N. and A.D.S. implemented the method and performed the experiments. P.B.N., R.I., and A.D.S. wrote the manuscript.

Corresponding authors

Correspondence to Rafael Irizarry or Avinash D. Sahu.

Ethics declarations

Competing interests

X.S.L. conducted the work while being on the faculty at DFCI, and is currently a board member and CEO of GV20 Therapeutics. P.B.N., D.P., G.Q., R.I., and A.D.S. declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Cite this article

Nicol, P.B., Paulson, D., Qian, G. et al. Robust identification of perturbed cell types in single-cell RNA-seq data.
