Single-cell RNA sequencing (scRNA-seq) methods are typically unable to quantify the expression levels of all genes in a cell, which creates an urgent need for the computational prediction of those missing values (‘dropout imputation’). However, most existing dropout imputation methods exclusively use the scRNA-seq dataset at hand and do not exploit external information about gene regulatory networks.
Here, we show that using co-expression relationships learned from external, independent gene expression data substantially improves dropout imputation. Using scRNA-seq datasets from three studies covering a broad range of human cell types, we demonstrate that our network-based approach outperforms published state-of-the-art methods. As a baseline reference, we imputed missing values using the sample-wide average expression level of each gene, i.e. completely ignoring cell-to-cell variation in gene expression. Unexpectedly, we found that, depending on the dataset, 52% to 77% of the genes were better predicted by this baseline method than by any of the other methods, suggesting that cell-to-cell variation of expression levels is negligible for many genes. The network-based method performed particularly well for lowly expressed genes. For example, it recovered cell-type-specific transcriptional regulators that were not correctly imputed by other approaches.
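The baseline described above can be illustrated with a minimal sketch. This is not the ADImpute code: the function name `impute_gene_mean`, the use of `None` to mark a dropout, and the choice to average only over observed values are all assumptions made for illustration.

```python
from statistics import mean


def impute_gene_mean(expr):
    """Fill dropouts with the gene's average over all cells where it was observed.

    expr maps a gene name to a list of per-cell expression values;
    None marks a missing value (an assumed encoding, not taken from the paper).
    """
    imputed = {}
    for gene, values in expr.items():
        observed = [v for v in values if v is not None]
        # Assumption: a gene with no observed value at all is filled with 0.0.
        fill = mean(observed) if observed else 0.0
        imputed[gene] = [fill if v is None else v for v in values]
    return imputed


# Toy data with hypothetical gene names.
example = {"GENE_A": [2.0, None, 4.0], "GENE_B": [None, 1.0, None]}
baseline = impute_gene_mean(example)
```

Every missing entry of a gene receives the same value, so this estimate carries no information about differences between cells, which is exactly the property the baseline is meant to test.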
Our work shows that there is no single best imputation method. Rather, the best method depends on gene-specific features, such as expression level and expression variation across cells. Therefore, we implemented an R package called ADImpute that determines the best imputation method for each gene in a dataset.
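One way to picture per-gene method selection is sketched below. The abstract does not describe how ADImpute makes this choice, so the procedure here is an assumption: observed values are held out, each method predicts them, and the method with the lowest mean squared error wins for that gene. The function names, method labels, and numbers are hypothetical.

```python
def mean_squared_error(truth, predicted):
    """Average squared difference between held-out and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(truth, predicted)) / len(truth)


def choose_method_per_gene(held_out, predictions):
    """Pick, for each gene, the method with the lowest error on held-out values.

    held_out maps gene -> true values that were masked before imputation.
    predictions maps method -> gene -> values predicted for those masked entries.
    """
    best = {}
    for gene, truth in held_out.items():
        errors = {
            method: mean_squared_error(truth, per_gene[gene])
            for method, per_gene in predictions.items()
        }
        best[gene] = min(errors, key=errors.get)
    return best


# Toy example: GENE_A barely varies across cells, GENE_B varies strongly.
held_out = {"GENE_A": [5.0, 5.2, 4.9], "GENE_B": [0.0, 3.0, 0.5]}
predictions = {
    "baseline": {"GENE_A": [5.0, 5.0, 5.0], "GENE_B": [1.2, 1.2, 1.2]},
    "network": {"GENE_A": [4.0, 6.0, 5.5], "GENE_B": [0.2, 2.6, 0.6]},
}
best_method = choose_method_per_gene(held_out, predictions)
```

In this toy data the gene with little cell-to-cell variation is best served by the average-based baseline, while the variable gene is better predicted by the hypothetical network method.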
R-package: ADImpute_V1.0.zip (version 1.0)
Network matrix (R object): network.coefficients_V2.0.rds (version 2.0)
Network matrix (txt file): Network.coefficient_V2.0.zip (version 2.0)