单细胞/时空文章代码复现——数据预处理及整合

lichengxin • 20小时前 • 杂文

在上一篇中，我们对皮肤损伤愈合文章中的损伤愈合组数据进行了伪bulk分析，接下来，我们开始正式对损伤愈合组的单细胞转录组数据进行预处理和多样本整合。
本篇用到的数据来自GEO数据库的GSE241132（sc）、GSE241124（st）和GSE265972（VU）。

1. 数据预处理

sc数据依然存放在/sc_data/目录下，每个以样本名命名的目录下为cellranger输出的三个矩阵文件（matrix.mtx.gz、features.tsv.gz、barcodes.tsv.gz）。在sc_data/下sample.list文件中写入所有样本名，每个样本一行。
根据文章的方法中给出，单细胞RNA数据的前处理包括：去除低质量（genes<500，percent.mt>20%，counts<1000）的细胞、去掉线粒体、血红蛋白基因以及在小于10个细胞中表达的基因、去掉Scrublet和DoubletFinder都认为是双胞的细胞。
首先导入所有我们需要的包，因为我们要用到Scrublet来去除双胞，而这是一个python包，所以我们还需要reticulate包和一个安装了Scrublet的python环境。

library(Seurat)
library(reticulate)
use_python("/path/to/python")
library(DoubletFinder)
library(dplyr)
library(harmony)
library(ggplot2)
###导入scrublet包
scru <- import('scrublet')
###cell cycling genes
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes

我们直接使用循环来读入所有样本的数据，每个样本的处理相同：首先过滤掉在小于10个细胞中表达的基因；然后保留基因数大于500、UMI数大于1000且线粒体基因比例小于20%的细胞；之后将线粒体基因和血红蛋白基因删除；使用DoubletFinder识别双胞，DoubletFinder中有一个参数指定了预期的双胞率，文中对每个样本设定了不同的阈值，我们的复现使用默认的0.075；再使用scrublet进行双胞识别，这里我们依然是使用默认的预期双胞率0.06；将两个软件都认为是双胞的细胞去除。

l <- read.table('./sample.list', header=FALSE)
mlist <- list()

for (i in 1:length(l$V1)){
    count <- Read10X(paste0('/sc_data/',l$V1[i],'/'))
    obj <- CreateSeuratObject(count, assay='RNA', min.cells=10, min.features=200, project=l$V1[i])

    obj[["percent.mt"]] <- PercentageFeatureSet(obj, pattern = "^MT-")

    VlnPlot(obj, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
    ggsave(paste0(l$V1[i],'_vln.pdf'), width=8, height=6)

    obj <- subset(obj, subset = nFeature_RNA > 500 & nCount_RNA > 1000 & percent.mt < 20)

    ###remove mito and hemo genes
    obj <- obj[!grepl("^MT-", rownames(obj)),]
    obj <- obj[!grepl("^HB[^(P)]", rownames(obj)),]

    ###DoubletFinder
    obj <- NormalizeData(obj)
    obj <- FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000)
    obj <- ScaleData(obj)
    obj <- RunPCA(obj)
    obj <- FindNeighbors(obj, dims=20)
    obj <- FindClusters(obj)
    obj <- RunUMAP(obj, dims = 1:20)

    sweep.res.list <- paramSweep_v3(obj, PCs = 1:20, sct = FALSE)
    sweep.stats <- summarizeSweep(sweep.res.list, GT = FALSE)   ###GT: whether there is ground truth label 
    bcmvn <- find.pK(sweep.stats)
    pK_bcmvn <- as.numeric(bcmvn$pK[which.max(bcmvn$BCmetric)])

    homotypic.prop <- modelHomotypic(obj$seurat_clusters)
    nExp_poi <- round(0.075 *nrow([email protected]))
    nExp_poi.adj <- round(nExp_poi*(1-homotypic.prop))
    
    obj <- doubletFinder_v3(obj, PCs = 1:20, pN = 0.25, pK = pK_bcmvn, nExp = nExp_poi.adj, reuse.pANN = F, sct = F)
    colnames([email protected])[length(colnames([email protected]))-1] <- 'pANN'
    colnames([email protected])[length(colnames([email protected]))] <- 'DF.classifications'

    ###scrublet
    count <- t(as.matrix(obj@assays$RNA@counts))
    scrub = scru$Scrublet(count, expected_doublet_rate=0.06)
    np = as.integer(30)
    tmp_data = scrub$scrub_doublets(min_counts=2, min_cells=3, min_gene_variability_pctl=85, n_prin_comps=np)
    obj$scr_doublet_scores <- tmp_data[[1]]
    obj$scr_predicted_doublets <- tmp_data[[2]]

    ###doublet filter
    obj <- subset(obj, subset = DF.classifications=='Singlet' | scr_predicted_doublets == 'FALSE')

    saveRDS(obj, paste0(l$V1[i], ".qc.rds"))

    mlist[[i]] <- obj
}

经过上述过滤后，我们共得到61,947个细胞，多于文章中的58,823个细胞，说明我们在双胞的认定条件设置上相对宽松，原文使用了更加严格的认定阈值。

2. 多样本数据整合

之后，我们使用Harmony进行多样本数据整合。在整合前使用SCTransform对数据进行标准化、计算细胞周期得分、选取top4000高可变基因进行PCA降维、然后使用RunHarmony函数进行整合，整合后再以resolution=0.8进行聚类。我们可以在整合前后分别画出UMAP图来对比整合的效果。

data_merge <- merge(x = mlist[[1]], y = mlist[2:length(mlist)])
###data preprocess
data_merge <- SCTransform(data_merge, variable.features.n=4000, vars.to.regress = "percent.mt", verbose = FALSE, return.only.var.genes = FALSE)
data_merge <- CellCycleScoring(data_merge, s.features = s.genes, g2m.features = g2m.genes)
data_merge <- RunPCA(data_merge, verbose = FALSE)
data_merge <- RunUMAP(data_merge, dims=1:30)
DimPlot(data_merge, reduction = "umap", group.by='orig.ident', split.by='orig.ident', ncol=4, pt.size=0.3, label = F, repel = TRUE)
ggsave('before_inte_umap.png', width=10, height=8)

data_merge <- RunHarmony(data_merge, group.by.vars = "orig.ident", dims = 1:40, reduction.use='pca', assay.use = "SCT")

data_merge <- data_merge %>%
              FindNeighbors(reduction = "harmony", dims = 1:30) %>%
              FindClusters(resolution = 0.8)

data_merge <- RunUMAP(data_merge, reduction = "harmony", dims = 1:30)

DimPlot(data_merge, reduction = "umap", group.by='orig.ident', split.by='orig.ident', ncol=4, pt.size=0.5, label = F, repel = TRUE)
ggsave('after_inte_umap.png', width=10, height=8)
DimPlot(data_merge, reduction = "umap", group.by='seurat_clusters', pt.size=1, label = F, repel = TRUE)
ggsave('cluster_umap.png', width=8, height=8)

saveRDS(data_merge, 'sc_inte.rds')

整合前各样本UMAP图

整合后各样本UMAP图

我们可以看到在整合前各时期之间的UMAP具有较大差异，而在整合后，12个样本间的批次效应基本被消除。
经过聚类，我们共得到31个cluster。

3. 差异基因筛选

在聚类后，我们将默认assay改回RNA，然后使用FindAllMarkers的MAST算法计算cluster间的差异基因，这些基因被用来进行细胞类型注释。

DefaultAssay(data_merge) <- 'RNA'
marker <- FindAllMarkers(data_merge, logfc.threshold = 0.25, min.pct = 0.25, test.use='MAST', only.pos=T)
marker <- marker %>% group_by(cluster) %>% dplyr::filter(p_val_adj < 0.05)
write.csv(marker, 'marker_filter.csv', quote=F)
top3 <- marker %>% group_by(cluster) %>% slice_head(n = 3)

DefaultAssay(data_merge) <- 'SCT'
DotPlot(data_merge, features = unique(top3$gene), col.min=0, dot.scale=4) + RotatedAxis() + scale_x_discrete("") + scale_y_discrete("") + theme(axis.text=element_text(size=8)) + scale_color_gradient(low='white', h
igh='blue')

Top3 DEG

4. 空间转录组数据的处理

同样的，st的所有数据存放至/st_data/，在st_sample.list文件中写入所有样本名，每个样本一行。
我们依然用循环来读入所有样本的数据，每个样本的处理相同：保留基因数大于100的spot；将线粒体基因、血红蛋白基因和MALAT1基因删除。

library(Seurat)
library(dplyr)
library(harmony)
library(ggplot2)

l <- read.table('./st_sample.list', header=FALSE)
mlist <- list()

for (i in 1:length(l$V1)){
    obj <- Load10X_Spatial(paste0('/st_data/',l$V1[i]), filename='/filtered_feature_bc_matrix.h5')
    obj$orig.ident <- l$V1[i]

    VlnPlot(obj, features = c("nFeature_Spatial", "nCount_Spatial"), ncol = 2)
    ggsave(paste0(l$V1[i],'_vln.pdf'), width=8, height=6)

    obj <- subset(obj, subset = nFeature_Spatial > 100)

    ###remove mito and hemo genes
    obj <- obj[!grepl("^MT-", rownames(obj)),]
    obj <- obj[!grepl("^HB[^(P)]", rownames(obj)),]
    obj <- obj[!grepl("MALAT1", rownames(obj)),]

    saveRDS(obj, paste0(l$V1[i], ".qc.rds"))
    mlist[[i]] <- obj
}

之后使用Harmony进行多样本数据整合。在整合前使用SCTransform对数据进行标准化、PCA降维、然后使用RunHarmony函数进行整合，整合后再以resolution=0.5进行聚类。我们可以在整合前后分别画出UMAP图来对比整合的效果。这里的代码与上面类似，只是要注意在SCTransform时的assay要指定为Spatial。

整合前的UMAP图

整合后的UMAP图

使用相同的方法进行差异基因筛选。

Top3 DEG

5. VU数据的处理

我们对VU数据使用与创伤愈合组相同的处理流程进行预处理、聚类、整合和差异基因筛选。
这样，我们就完成了文章中所有自产数据的预处理和整合，在下一篇，我们将着重介绍细胞类型的注释过程。

版权声明：
作者：lichengxin
链接：https://www.techfm.club/p/198392.html
来源：TechFM
文章版权归作者所有，未经允许请勿转载。

THE END

代码

二维码

30年后，为何我们还在怀念张学友的<人间道>无损flac网盘资源轩祥音乐

< <上一篇

Ac-N-Me-Tyr-Val-Ala-Asp-aldehyde,160806-26-8

下一篇>>

搜索内容