서로 다른 scRNA-seq data integration하기!_Public Data [R]

SMALL

저번부터 서로 다른 논문에서 나온 2개의 single cell RNA seq data를 분석하기 위해서 integration하는 방법을 정리하고 있습니다. 저번에는 merge하고 UMAP을 그렸을 때, 각각의 paper끼리 나누어지는 것을 볼 수 있었습니다. 이는 batch effect에 의해서 나타나는 형상인데요. 이러한 형상을 없애고, 각각 같은 cell type끼리 합쳐서 보이게 하기위해서 batch correction을 해볼 것입니다.

#Batch correction: canonical correlation analysis (CCA) + mutual nearest neighbors (MNN) using Seurat v4
#--------------------------------------------------------
# The first piece of code will identify variable genes that are highly variable in at least 2/4 datasets. We will use these variable genes in our batch correction.
# Why would we implement such a requirement?
ob.list <- list(Mock.F, h24.F, h48.F, mock.R,h24.R,h48.R)

# Identify anchors on the 6 datasets, commonly shared variable genes across samples, 
# and integrate samples.
merge.anchors <- FindIntegrationAnchors(object.list = ob.list, anchor.features = 2000, dims = 1:30)
# this command creates an 'integrated' data assay
comdined <- IntegrateData(anchorset = merge.anchors)

# specify that we will perform downstream analysis on the corrected data note that the
# original unmodified data still resides in the 'RNA' assay
DefaultAssay(comdined) <- "integrated"

# Run the standard workflow for visualization and clustering
comdined <- ScaleData(comdined, verbose = FALSE)
comdined <- RunPCA(comdined, npcs = 30, verbose = FALSE)
DimPlot(comdined, dims = c(1, 2), reduction = "pca", split.by = "time")
DimPlot(comdined, dims = c(1, 2), reduction = "pca", split.by = "paper")
DimPlot(comdined, dims = c(1, 2), reduction = "pca", split.by = "each")
comdined <- FindNeighbors(comdined, reduction = "pca", dims = 1:30)
comdined <- FindClusters(comdined, resolution = 0.15)
comdined <- RunUMAP(comdined, reduction = "pca", dims = 1:30)

저번에 했었던 것과 다르게 모든 세포들이 비슷한 모양으로 퍼져있는 것을 확인할 수 있습니다.

# After data integration, use the original expression data in all visualization and DE tests.
# The integrated data must not be used in DE tests as it violates assumptions of independence in DE tests!
DefaultAssay(comdined) <- "RNA"  

# Visualize the Louvain clustering and the batches on the UMAP. 
# Remember, the clustering is stored in @meta.data in column seurat_clusters 
# and the technology is stored in the column tech. Remember you can also use DimPlot.
p1 <- DimPlot(comdined, reduction = "umap", group.by = "seurat_clusters")
p2 <- DimPlot(comdined, reduction = "umap", group.by = "paper")
p3 <- DimPlot(comdined, reduction = "umap", group.by = "time")
p4 <- DimPlot(comdined, reduction = "umap", group.by = "each")
p1 + p2
p3 + p4

결과 cluster도 잘 나뉘어졌고, 전과 다르게 paper끼리 나누어진 것이 아니라 골고루 서로 잘 섞여있는 것을 확인할 수 있다. 어찌되었든 난 바이러스를 전공을 하고 있어서, 내가 보고자 하는 바이러스 RNA의 발현률을 확인을 해보았다.

각 paper마다 바이러스 gene에 붙여놓은 테깅이 달라서 각 paper에서 어떤 것으로 했는지 확인이 필요했다. 이런 부분은 single cell RNA seq data분석을 할 때 cell ranger라는 프로그램을 통해 preprocessing 과정이 있는데 그 부분에서 정해지는 듯 하다.

comdined[["percent.cov2"]] <- PercentageFeatureSet(comdined, pattern = "^cov2-")
comdined[["percent.cov2.2"]] <- PercentageFeatureSet(comdined, pattern = "^scv2-")
comdined[["cov2.total.percent"]] <- comdined[["percent.cov2"]]+comdined[["percent.cov2.2"]]

comdined %>% mutate(cov2.transcript=nCount_RNA*cov2.total.percent/100) -> comdined
comdined %>% mutate(log2.cov2.transcript=log2(nCount_RNA*cov2.total.percent/100+1)) -> comdined

mutate(comdined, infection.status = ifelse(cov2.total.percent > 0.1, "infected", "uninfected")) -> comdined

FeaturePlot(comdined, features = "cov2.total.percent")

그래서 바이러스 gene이 10% 이상 있는 경우는 infection되어있는 cell, 그렇지 않은 것은 uninfection으로 나누어주었다.

VlnPlot(comdined, features = "log2.cov2.transcript", group.by = "time")
VlnPlot(comdined, features = "log2.cov2.transcript", group.by = "each")

VlnPlot(comdined, features = "cov2.total.percent", group.by = "time")
VlnPlot(comdined, features = "cov2.total.percent", group.by = "each")

이렇게 우선 각기 다른 paper에서 public한 data를 integration해보았다. 하지만 이렇게 batch correction하는 방법은 다양하다고 한다. 전에 읽었던 meta-analysis paper에서는 harmony라는 것을 이용하였다고 한다.

해서 다음에는 이 package를 가지고 integration 해보려 한다.

LIST

'바이오인포메틱스' 카테고리의 다른 글

DNA sequence (sanger sequence vs NGS) and DNA microarray (0)	2022.12.22
바이오인포메틱스를 이해하기위해 알아야하는 생물학 지식!!!_1 (3)	2022.11.07
서로 다른 scRNA-seq data integration하기!_Public Data [R]_1 (3)	2022.10.20
Seurat이용해서 single cell RNA-seq 분석하기_1) Seurat Object만들기 + 구조 이해하기 (5)	2022.10.18