deseq2差异表达分析原理,deseq2差异表达分析

DESeq2的适用性

分析来自RNA-seq的计数数据，基因任务是检测差异表达基因。
也适用于其他分析：ChIP-Seq、HiC、shRNA筛选。

快速开始 dds = DESeqDataFromMatrix(countData = cts, colData = colData, design = ~batch + condition)dds = DESeq(dds)resultsNames(dds) #lists the coefficientsres = results(dds, contrast=c("condition","treated","untreated"))res = lfcShrink(dds,coef = "condition_trt_vs_untrt",type = "apeglm") 输入数据

输入数据需要归一化计数吗
不需要 输入矩阵的值为非标准计数或测序读数即可，DESeq2模型会对文库大小进行校正。

cts和colData的格式
cts列名为SampleName，行名为Gene。
colData列记录Sample的分组信息（不限一列），行名为SampleName。

预过滤
若Gene在所用Samples中的计数小于10则过滤掉。
keep = rowSums(counts(dds)) >= 10
dds = dds[keep,]

因子水平
通过设置factor的参考水平来告诉DESeq2函数与哪个水平进行比较。

1. dds$condtion = factor(dds$condition,levels = c("untreated","treated"))2. dds$condtion = relevel(dds$condtion,ref = "untreated") 差异表达分析

results函数
results(dds)生成结果表。

lfcShrink函数
logFoldChange缩小有助于基因的可视化和比较。可指定apeglm方法将dds对象传递给lfcShrink来缩小。
lfcShrink(dds,coef = "condition_treated_vs_untreated",type = "apeglm")

P值和调整后的P值
按P值对结果表进行排序。
使用summary(res)总结差异分析结果。
有多少个调整p值小于0.1？sum(res$padj < 0.1,na.rm = TRUE)
使用results函数设置padj cutoff results(dds,alpha = 0.05)]

什么时候能将P值设为NA

gene在所有样本中的计数均为0。某行包含一个极端计数异常的样本。 多因素设置
colData矩阵描述分组分组。
可对dds对象使用colData函数查看分组因子水平。
多因素设置在design参数增加比较因素。 DESeq2可视化 MA plot
图中x轴为baseMean，y轴为logFoldChange，若padj<0.1点标为红色。 plotMA(res,ylim = c(-2,2))缩小LFC可消除low readcount基因中与LFC变化相关的噪声。resLFC = lfcShrink(res,coef = "condition_treated_vs_untreated",type = "apeglm")plotMA(resLFC,ylim = c(-2,2))

可使用功能识别通过单击图来交互来检测单个基因行数：
identify(res$baseMean,res$log2FoldChange)

PlotCounts
对单个基因在各组中的读数进行可视化。计数经归一化。
plotCounts(dds,gene = which.min(res$padj),intgroup = "condition",returnData = False)

将returnData设置为TRUE，使用ggplot2绘制plotCounts。 d = plotCounts(dds,gene = which.min(res$padj),intgroup = "condition",returnData = TRUE)ggplot(d,aes(x = condition,y = count)) +geom_point(position = position_jitter(w = 0.1,h = 0)) +scale_y_log10() 数据转换和可视化

计数数据转换
变换的目的是消除方差对均值的依赖性（低均值，高方差）。在转换之后，具有相同均值的基因没有完全相同的标准差，但是整个实验范围内的趋势趋于平缓。
VST：方差稳定变换。大样本推荐选择。
rlog：正规对数。

盲散估计
VST和rlog函数有一个blind参数。当下游分析时将blind设置为FALSE。

提取转换后的值

vsd = vst(dds,blind = FALSE)rld = rlog(dds,blind = FALSE)assay(vsd)assay(rld) 数据质量评估 计数矩阵热图 library(pheatmap)select = order(rowMeans(counts(dds,normalized = TRUE)),decreasing = TRUE)[1:20]df = as.data.frame(colData(dds)[,c("condtion","type")])ntd <- normTransform(dds) #this gives log2(n+1)pheatmap(assay(ntd)[select,],annotation_col = df,cluster_rows = FALSE, cluster_cols = FALSE,show_rownames = FALSE) 样本及样本间的相关性
获得样本到样本的距离
sampleDist = dist(t(assay(vsd)))
定义距离矩阵，行名为分组信息，列名为空 sampleDistMatrix = as.matrix(sampleDist)rownames(sampleDistMatrix) = paste(vsd$condition,vsd$type,sep='_')colnames(sampleDistMatrix) = NULLpheatmap(sampleDistMatrix,clustering_distance_rows = sampleDist,clustering_distance_cols = sampleDist) PCA主成分图
plotPCA(vsd,intgroup = c("condition","type"))一分快三技巧1,h = 0)) +scale_y_log10() 数据转换和可视化

盲散估计
VST和rlog函数有一个blind参数。当下游分析时将blind设置为FALSE。

提取转换后的值