# Suppress warnings
options(warn = -1)
MicroRNAs (miRNAs) regulate gene expression by binding to mRNAs, and consequently reduce target gene expression levels and expression variability, which is also known as ‘noise’. Single-cell sequencing technology now enables profiling mRNA and miRNA expression in single cells. With this technology, we are able to better quantify gene expression noise by measuring cell-to-cell variability, and therefore analyze how miRNAs regulate the expression of their target genes.
This vignette will demonstrate an analysis workflow using R programming language for miRNA regulation in single cells.
We will use TargetScan to predict the target genes of miRNAs. The following data are needed in the analysis.
We will illustrate the workflow with a dataset from Wang et al. 2019. They performed co-sequencing of miRNAs and mRNAs in same K562 single cells using a half-cell genomics approach. The dataset is available at GSE114071. It contains mRNA and miRNA sequencing results of 19 successfully profiled single cells. Single-cell mRNA sequencing is performed using Smart-seq protocol.
Unluckily, raw read count matrix of mRNA is not provided in this dataset, which is required in our analysis. Therefore, we quantified gene expression from raw sequencing files using RSEM, with GENCODE v27 annotation. RSEM is a popular software for accurately quantifying gene and transcript expression levels from RNA-Seq data. The algorithm is described in Li et al. 2011, and its detailed usage is available on Github.
The following tables generated by RSEM, along with a copy of other data are available in this Github repository.
Read miRNA expression matrix. After removing redundant columns, the first column shows the name of miRNAs and column 2~20 correspond to 19 successfully profiled single cells (K562_HalfCell_01 ~ K562_HalfCell_20 excluding K562_HalfCell_06, because it did not pass quality control).
Please note that values in the matrix are normalized read counts. The raw read count of a miRNA is divided by the library size, applied with a minimum expression level of 10−4 (i.e. values less than 10−4 were set to 10−4), and log2-transformed to get the normalized read count. We call this normalized read count 'log2 fraction'. The minimum value in the matrix is log2(10−4)=−13.287712
mirna <- read.table('GSE114071_NW_scsmRNA_K562_norm_log2.gct', header=T, stringsAsFactors=F)
mirna <- mirna[,c(1,3:21)] # Column 2 containing descriptions is removed
head(mirna)
Read mRNA expression matrix. The values in the matrix are RPKM values. We match the columns of this matrix to the columns of miRNA expression matrix, such that column 2~20 correspond to 19 successfully profiled single cells.
mrna_rpkm <- read.table('GSE114071_RPKM.tsv', header=T, stringsAsFactors=F)
# We retain 19 successfully profiled single cells, excluding K562_HalfCell_06
mrna_rpkm <- mrna_rpkm[,c(1:6,8:21)]
colnames(mrna_rpkm) <- colnames(mirna)
head(mrna_rpkm)
Read target prediction table from TargetScan. We will use family_info
to get the miRNA family of each miRNA, and TS
to get the predicted target genes of each miRNA family.
# TargetScan prediction
TS <- read.table('Predicted_Targets_Info.default_predictions.txt', header=T, sep='\t', stringsAsFactors=F)
family_info <- read.table('miR_Family_Info.txt', header=T, sep='\t', stringsAsFactors=F)
In this section, we are going to estimate the expression levels (average expression) and expression noises (the variability of expression) of mRNAs in single cells with RPKM values.
To get an accurate estimate of mean and variability, mRNAs expressed in less than 5 single cells are filtered out.
# Column 2~20 correspond to 19 single cells
index_cells <- 2:20
# We retained mRNAs expressed in no less than 5 single cells.
gene_filt <- which(apply(mrna_rpkm[,index_cells] > 0, 1, sum) >= 5)
The most common measure to quantify the variability of a random variable is standard deviation (SD). Here, we first calculate the mean and SD of the expression levels of mRNAs. The meta data of retained mRNAs are saved in dataframe meta
.
meta <- data.frame(gene=mrna_rpkm[gene_filt, 'Name']) # Gene name
meta$mean_rpkm <- apply(mrna_rpkm[gene_filt, index_cells], 1, mean) # Mean RPKM
meta$sd_rpkm <- apply(mrna_rpkm[gene_filt, index_cells], 1, sd) # SD RPKM
Unluckily, SD is not a good measure to quantify expression noises of mRNAs in single cells. This is because SD RPKM values generally tend to increase when mean RPKM values increase in single cells. In other word, mean RPKM and SD RPKM are not independent. Here we can see that mean RPKM and SD RPKM values have a strong linear correlation on a logarithmic scale (R2=0.947).
# Linear regression model for log SD RPKM and log mean RPKM
fit <- lm(log10(sd_rpkm) ~ log10(mean_rpkm), meta)
summary(fit)
We can also visualize the result using ggplot2. The red line in the plot below represents the ordinary least squares (OLS) linear regression model, and the blue line represents a smooth curve fitted by Generalized Additive Models (GAM). The linear model fits well with data. Apparant discrepancy between two curves is only observed at two sides.
# Load required package for plotting and set plot size
library(ggplot2)
options(repr.plot.width=5, repr.plot.height=5)
ggplot(meta, aes(x=log10(mean_rpkm), y=log10(sd_rpkm))) +
geom_point(color='gray', size=0.1) +
geom_smooth(fill='cyan', size=1) +
geom_smooth(method=lm, color='red', fill='pink', size=1) +
labs(x=expression(log[10]~'Mean RPKM'), y=expression(log[10]~'SD RPKM')) +
annotate('text', x=3, y=-1, label=expression(R^2==0.947), size=5) +
theme_bw()
If we use SD RPKM values as expression noises of mRNAs, highly expressed mRNAs will always have higher expression noises. This is undesirable, because in this way we cannot separately measure miRNAs' regulative effect on target genes' expression levels and noises. Therefore we have to account for the effect of mean RPKM on SD RPKM.
Residual SDs are defined as the residuals of the linear model after 'regressing out' mean RPKM. Residual SD=SD RPKM−^SD RPKM where ^SD RPKM is the fitted value from the linear model. A Residual SD measures the relative noise of a mRNA compared to the mean noise of mRNAs at a same expression level. It is not linearly correlated with mean RPKM so we would like to use it as the expression noise.
# Calculate Residual SD
meta$rsd_rpkm <- fit$residuals
ggplot(meta, aes(x=log10(mean_rpkm), y=rsd_rpkm)) +
geom_point(color='gray', size=0.1) +
geom_smooth(fill='cyan', size=1) +
geom_smooth(method=lm, color='red', fill='pink', size=1) +
labs(x=expression(log[10]~'Mean RPKM'), y='Residual SD RPKM') +
theme_bw()
In summary, we will use mean RPKM values (meta$mean_rpkm
) as expression levels and Residual SD RPKM values (meta$rsd_rpkm
) as expression noise of mRNAs in the following analysis.
After quantifying the expression levels and noises of mRNAs, we are going to measure how miRNAs regulate target mRNAs' expression levels and noises. Here we define a function mir2meta_var
to get the target mRNAs' expression levels and noises of any miRNA. As TargetScan predicts target mRNAs by miRNA families, we first obtain the families of miRNAs, and then get target mRNAs. Finally, we retrieve the expression levels or noises of target mRNAs from dataframe meta
.
# Get the target mRNAs' expression levels and noises of miRNAs
# mir: vector of string, miRNA names
# meta_var: string, a variable saved in dataframe meta.
mir2meta_var <- function(mir, meta_var)
{
# Convert miRNAs into miRNA families
mirna_family <- family_info[family_info$MiRBase.ID %in% mir, 'miR.family']
# Predict target mRNAs of miRNA families
mirna_target <- unique(TS[TS$miR.Family %in% mirna_family, 'Gene.Symbol'])
# Get the expression levels or noises of target mRNAs calculated above
target_meta_var <- meta[meta$gene %in% mirna_target, meta_var]
return(target_meta_var)
}
We assume that highly expressed miRNAs will be more likely to regulate their target mRNAs. Therefore, we divide miRNAs into four groups according to their expression levels:
For miRNAs in each group, we get their target mRNAs' expression levels and noises using the function mir2meta_var
defined above.
# Calculate the mean expression levels of miRNAs
mirna_mean <- apply(mirna[,index_cells], 1, mean)
# Divide miRNAs into four groups according to their expression levels, and get expression levels (mean RPKM) in different groups
min_expression <- min(mirna[,-1])
level_background <- mir2meta_var(mirna[mirna_mean > min_expression & mirna_mean <= -12, 1], 'mean_rpkm')
level_LE_mirna <- mir2meta_var(mirna[mirna_mean > -12 & mirna_mean <= -9, 1], 'mean_rpkm')
level_ME_mirna <- mir2meta_var(mirna[mirna_mean > -9 & mirna_mean <= -6, 1], 'mean_rpkm')
level_HE_mirna <- mir2meta_var(mirna[mirna_mean > -6, 1], 'mean_rpkm')
# Divide miRNAs into four groups according to their expression levels, and get expression noises (RCV RPKM) in different groups
noise_background <- mir2meta_var(mirna[mirna_mean > min_expression & mirna_mean <= -12, 1], 'rsd_rpkm')
noise_LE_mirna <- mir2meta_var(mirna[mirna_mean > -12 & mirna_mean <= -9, 1], 'rsd_rpkm')
noise_ME_mirna <- mir2meta_var(mirna[mirna_mean > -9 & mirna_mean <= -6, 1], 'rsd_rpkm')
noise_HE_mirna <- mir2meta_var(mirna[mirna_mean > -6, 1], 'rsd_rpkm')
Here we can check the numbers of miRNAs and target mRNAs in each group, which are saved in a dataframe group_size
.
# In this dataframe, each row contains the name of a group, the number of miRNA and the number of target mRNA in the group.
group_size <- data.frame(mirna_expression=c('background','LE','ME','HE'),
mirna_number=c(sum(mirna_mean > min_expression & mirna_mean <= -12), sum(mirna_mean > -12 & mirna_mean <= -9), sum(mirna_mean > -9 & mirna_mean <= -6), sum(mirna_mean > -6)),
target_number=c(length(level_background), length(level_LE_mirna), length(level_ME_mirna), length(level_HE_mirna)))
group_size
Now we are ready to examine how miRNAs regulate their target mRNAs' expression. We assume that the regulative effect of miRNAs globally depends on their expression levels, so we are going to compare the distribution of expression levels and noises of target mRNAs in different groups. These data are saved in a dataframe meta_var_df
.
# In this dataframe, each row contains the group of a target mRNA, its expression level and expression noise.
meta_var_df <- data.frame(group=rep(group_size$mirna_expression, group_size$target_number),
level=c(level_background, level_LE_mirna, level_ME_mirna, level_HE_mirna),
noise=c(noise_background, noise_LE_mirna, noise_ME_mirna, noise_HE_mirna))
We can use Kruskal–Wallis test to test whether target mRNAs' expression levels or noises in different groups are from the same distributions. Here the small P-values strongly support the alternative hypothesis: Target mRNAs' expression levels or noises in different groups are from different distributions.
kruskal.test(level ~ group, data = meta_var_df)
kruskal.test(noise ~ group, data = meta_var_df)
To determine the statistical significance of difference between expression levels and noises from any two groups, we use Kolmogorov–Smirnov (KS) test to get the P-values and save them in a dataframe p_df
.
# In this dataframe, each row contains names of two groups, statistical significance of difference between expression levels and noises from two groups.
p_df <- data.frame(x=c('background', 'background', 'background', 'LE', 'LE', 'ME'), # group A
y=c('LE', 'ME', 'HE', 'ME', 'HE', 'HE'), # group B
p_level=c(ks.test(level_background, level_LE_mirna)$p.value,
ks.test(level_background, level_ME_mirna)$p.value,
ks.test(level_background, level_HE_mirna)$p.value,
ks.test(level_LE_mirna, level_ME_mirna)$p.value,
ks.test(level_LE_mirna, level_HE_mirna)$p.value,
ks.test(level_ME_mirna, level_HE_mirna)$p.value),
p_noise=c(ks.test(noise_background, noise_LE_mirna)$p.value,
ks.test(noise_background, noise_ME_mirna)$p.value,
ks.test(noise_background, noise_HE_mirna)$p.value,
ks.test(noise_LE_mirna, noise_ME_mirna)$p.value,
ks.test(noise_LE_mirna, noise_HE_mirna)$p.value,
ks.test(noise_ME_mirna, noise_HE_mirna)$p.value))
Now we are going to visualize results obtained above. We use empirical cumulative distribtion functions (ECDFs) to show the distributions of target mRNAs' expression levels and noises (left plots), and heatmap to show statistical significance between any two groups (right plots). We defind two functions plot_exp_level
and plot_exp_noise
for plotting, which must be run after obtaining meta_var_df
and p_df
.
In left plots, each line corresponds to the ECDF of target mRNAs' expression levels or noises in each group. The leftmost line corresponds to the group with the lowest overall expression levels or noises.
In right plots, −log10 P values from KS-tests are labeled to show the statistical significance.
library(ggpubr)
options(repr.plot.width=10, repr.plot.height=5)
# Plot ECDFs and P-value heatmap for target mRNA' expression levels in different groups
plot_exp_level <- function()
{
plot_ecdf <- ggplot(meta_var_df, aes(x=log10(level), color=group)) +
stat_ecdf(geom = "step") +
lims(x=quantile(log10(meta_var_df$level),c(0.01,0.99))) +
labs(x=expression(log[10]~'Mean RPKM') ,y='Quantile', title="miRNA regulation on target mRNAs' expression levels") +
scale_color_discrete(breaks=c('HE','ME','LE','background'),
name='miRNA mean expression',
labels=expression(log[2]~fraction > -6, -9 < log[2]~fraction <= -6, -12 < log[2]~fraction <= -9, 'All expressed')) +
theme_bw() +
theme(legend.position=c(0.7,0.2), title=element_text(size=10))
plot_p <- ggplot(p_df,aes(x=x,y=y)) +
geom_raster(aes(fill = -log10(p_level))) +
geom_text(aes(label = round(-log10(p_level),2)), size=7) +
scale_fill_gradient(low = "white", high = "red") +
scale_x_discrete(limits=c('background','LE','ME'), labels=expression('All expressed', -12 < log[2]~fraction <= -9, -9 < log[2]~fraction <= -6)) +
scale_y_discrete(limits=c('LE','ME','HE'), labels=expression(-12 < log[2]~fraction <= -9, -9 < log[2]~fraction <= -6, log[2]~' fraction > -6')) +
labs(x='',y='',title=expression(-log[10]~'P value (KS-test)')) +
theme(legend.position = 'none', axis.text = element_text(angle=30, hjust=1))
ggarrange(plot_ecdf, plot_p)
}
plot_exp_level()
# Plot ECDFs and P-value heatmap for target mRNA' expression noises in different groups
plot_exp_noise <- function()
{
plot_ecdf <- ggplot(meta_var_df, aes(x=noise, color=group)) +
stat_ecdf(geom = "step") +
lims(x=quantile(meta_var_df$noise,c(0.01,0.99))) +
labs(x='Residual SD RPKM' ,y='Quantile', title="miRNA regulation on target mRNAs' expression noises") +
scale_color_discrete(breaks=c('HE','ME','LE','background'),
name='miRNA mean expression',
labels=expression(log[2]~fraction > -6, -9 < log[2]~fraction <= -6, -12 < log[2]~fraction <= -9, 'All expressed')) +
theme_bw() +
theme(legend.position=c(0.7,0.2), title=element_text(size=10))
plot_p <- ggplot(p_df,aes(x=x,y=y)) +
geom_raster(aes(fill = -log10(p_noise))) +
geom_text(aes(label = round(-log10(p_noise),2)), size=7) +
scale_fill_gradient(low = "white", high = "red") +
scale_x_discrete(limits=c('background','LE','ME'), labels=expression('All expressed', -12 < log[2]~fraction <= -9, -9 < log[2]~fraction <= -6)) +
scale_y_discrete(limits=c('LE','ME','HE'), labels=expression(-12 < log[2]~fraction <= -9, -9 < log[2]~fraction <= -6, log[2]~fraction > -6)) +
labs(x='',y='',title=expression(-log[10]~'P value (KS-test)')) +
theme(legend.position = 'none', axis.text = element_text(angle=30, hjust=1))
ggarrange(plot_ecdf, plot_p)
}
plot_exp_noise()
From plots above, we can see that ECDF lines from left to the right are always: green-purple-blue-red. P-values show that mRNAs targeted by highly expressed miRNAs have significantly lower expression levels and noises than those of lowly expressed miRNAs. This suggest that the overall expression levels and noises of target mRNAs tend to decrease as miRNA mean expression level increases. In other word, miRNAs regulate their target mRNAs by reducing their expression levels and noises.
In the above analysis, target mRNAs in four groups are not exclusive. For example, a mRNA can be targeted by both a highly expressed miRNA and a lowly expressed miRNA, and therefore it is in more than one group. We can remove those common target mRNAs in some groups to see how results change. Here we first get the names of target mRNA in different groups.
# Get target mRNA names in different groups
gene_background <- mir2meta_var(mirna[mirna_mean > min_expression & mirna_mean <= -12, 1], 'gene')
gene_LE_mirna <- mir2meta_var(mirna[mirna_mean > -12 & mirna_mean <= -9, 1], 'gene')
gene_ME_mirna <- mir2meta_var(mirna[mirna_mean > -9 & mirna_mean <= -6, 1], 'gene')
gene_HE_mirna <- mir2meta_var(mirna[mirna_mean > -6, 1], 'gene')
As we assume that highly expressed miRNAs have stronger regulative effect, mRNAs targeted by multiple miRNAs should be predominantly regulated by highly expressed miRNAs. Therefore they should be kept in higher expression groups (groups defined by miRNAs with higher expression levels) rather than lower expression groups (groups defined by miRNAs with lower expression levels). We are going to remove mRNAs in lower expression groups that are also present in higher expression groups. This includes:
After removing common target mRNAs, the numbers of target mRNA in four groups are much closer.
# Remove mRNAs in lower expression groups that are also present in higher expression groups.
retain_ME <- !gene_ME_mirna %in% gene_HE_mirna
retain_LE <- !gene_LE_mirna %in% gene_ME_mirna & !gene_LE_mirna %in% gene_HE_mirna
retain_background <- !gene_background %in% gene_ME_mirna & !gene_background %in% gene_HE_mirna & !gene_background %in% gene_LE_mirna
level_ME_mirna <- level_ME_mirna[retain_ME]
level_LE_mirna <- level_LE_mirna[retain_LE]
level_background <- level_background[retain_background]
noise_ME_mirna <- noise_ME_mirna[retain_ME]
noise_LE_mirna <- noise_LE_mirna[retain_LE]
noise_background <- noise_background[retain_background]
# In this dataframe, each row contains the name of a group, the number of miRNA and the number of target mRNA in the group.
group_size <- data.frame(mirna_expression=c('background','LE','ME','HE'),
mirna_number=c(sum(mirna_mean > min_expression & mirna_mean <= -12), sum(mirna_mean > -12 & mirna_mean <= -9), sum(mirna_mean > -9 & mirna_mean <= -6), sum(mirna_mean > -6)),
target_number=c(length(level_background), length(level_LE_mirna), length(level_ME_mirna), length(level_HE_mirna)))
group_size
# In this dataframe, each row contains the group of a target mRNA, its expression level and expression noise.
meta_var_df <- data.frame(group=rep(group_size$mirna_expression, group_size$target_number),
level=c(level_background, level_LE_mirna, level_ME_mirna, level_HE_mirna),
noise=c(noise_background, noise_LE_mirna, noise_ME_mirna, noise_HE_mirna))
# In this dataframe, each row contains names of two groups, statistical significance of difference between expression levels and noises from two groups.
p_df <- data.frame(x=c('background', 'background', 'background', 'LE', 'LE', 'ME'), # group A
y=c('LE', 'ME', 'HE', 'ME', 'HE', 'HE'), # group B
p_level=c(ks.test(level_background, level_LE_mirna)$p.value,
ks.test(level_background, level_ME_mirna)$p.value,
ks.test(level_background, level_HE_mirna)$p.value,
ks.test(level_LE_mirna, level_ME_mirna)$p.value,
ks.test(level_LE_mirna, level_HE_mirna)$p.value,
ks.test(level_ME_mirna, level_HE_mirna)$p.value),
p_noise=c(ks.test(noise_background, noise_LE_mirna)$p.value,
ks.test(noise_background, noise_ME_mirna)$p.value,
ks.test(noise_background, noise_HE_mirna)$p.value,
ks.test(noise_LE_mirna, noise_ME_mirna)$p.value,
ks.test(noise_LE_mirna, noise_HE_mirna)$p.value,
ks.test(noise_ME_mirna, noise_HE_mirna)$p.value))
kruskal.test(level ~ group, data = meta_var_df)
kruskal.test(noise ~ group, data = meta_var_df)
plot_exp_level()
plot_exp_noise()
Here we can see that after processing common target mRNAs, much more significant differences in expression levels and noises among four groups or between any two groups are observed.
Despite single-cell sequencing has many strengths, various technical factors like amplification bias, library size differences and low RNA capture rate lead to substantial technical noise in single-cell sequencing experiments. A common feature of single-cell sequencing data is the sparsity of gene expression matrix. This is because some transcripts are not captured in reverse transcription step and it produces false zero read counts, which are also known as “dropout” events. Therefore some tools have been developed to "denoise" gene expression matrix for the downstream analysis.
Deep Count Autoencoder (DCA) is a command-line software to denoise scRNA-seq datasets. It takes a raw count matrix as input, and outputs the denoised count matrix. The algorithm and application are described in Eraslan et al. 2019, and its detailed usage is available on Github. Similar methods for single-cell data denoising include MAGIC and SAVER. Here we use DCA to denoise our mRNA expression matrix and repeat the analysis.
Before installing DCA, you have to install Python and Tensorflow. Please refer to their websites for instructions if you have not installed them. After that, you can type following commands in Shell to install and run DCA.
# Run this in shell rather than in R
# Install DCA
$ pip install dca
# Run DCA
$ dca GSE114071_count.tsv GSE114071_DCA_results
The denoised, library size-normalized count matrix is available at GSE114071_DCA_results/mean_norm.tsv. We rename it into GSE114071_DCA.tsv in the downstream analysis. Notice that if you run DCA multiple times, the output matrices will be slightly different. Here we only demonstrate the workflow using one output matrix. You can validate the result by repeating the analysis with more output matrices.
Read mRNA expression matrix by DCA. The values in the matrix are denoised counts by DCA. We match the columns of this matrix to the columns of miRNA expression matrix, such that column 2~20 correspond to 19 successfully profiled single cells.
# Read DCA output
mrna_dca <- read.table('GSE114071_DCA.tsv', header=T, stringsAsFactors=F, row.names = NULL)
mrna_dca <- mrna_dca[,c(1:6,8:21)]
colnames(mrna_dca) <- colnames(mirna)
head(mrna_dca)
DCA is designed for denoising Unique Molecule Identifier (UMI) count matrix, so it does not take gene length into account. As Smart-seq is designed for full-length mRNA sequencing and does not incorporate UMIs, normalization for gene length is necessary. Therefore we have to further normalize DCA output by dividing gene length. The resultant values, named DCA normalized counts, are similar as RPKM values but are denoised.
# Read matrix of effective gene length
mrna_length <- read.table('GSE114071_length.tsv', header=T, stringsAsFactors=F)
mrna_length <- mrna_length[,c(1:6,8:21)]
colnames(mrna_length) <- colnames(mirna)
# Normalize DCA output by dividing gene length
mrna_dca[,-1] <- mrna_dca[,-1] / mrna_length[mrna_length$Name %in% mrna_dca$Name, -1]
We match genes in DCA output to genes in RPKM matrix. Notice that DCA removes unexpressed genes so there are fewer genes (rows) in DCA output.
# Only retain the matched genes
mrna_dca <- mrna_dca[mrna_dca$Name %in% meta$gene,]
matched_rows <- which(meta$gene %in% mrna_dca$Name)
Next we repeat the analysis workflow with DCA normalized counts. Here mean DCA normalized counts (meta$mean_dca
) are used as expression levels and Residual SD DCA normalized counts (meta$rsd_dca
) are used as expression noises of mRNAs.
meta[matched_rows, 'mean_dca'] <- apply(mrna_dca[,index_cells], 1, mean) # Mean DCA normalized counts
meta[matched_rows, 'sd_dca'] <- apply(mrna_dca[,index_cells], 1, sd) # SD DCA normalized counts
# Calculate Residual SD
fit <- lm(log10(sd_dca) ~ log10(mean_dca), meta)
meta[matched_rows, 'rsd_dca'] <- fit$residuals
summary(fit)
plot_left <- ggplot(meta, aes(x=log10(mean_dca), y=log10(sd_dca))) +
geom_point(color='gray', size=0.1) +
geom_smooth(fill='cyan', size=1) +
geom_smooth(method=lm, color='red', fill='pink', size=1) +
labs(x=expression(log[10]~'Mean DCA normalized count'), y=expression(log[10]~'SD DCA normalized count')) +
annotate('text', x=0, y=-3.5, label=expression(R^2==0.968), size=5) +
theme_bw()
plot_right <- ggplot(meta, aes(x=log10(mean_rpkm), y=rsd_rpkm)) +
geom_point(color='gray', size=0.1) + theme_bw() +
geom_smooth(fill='cyan', size=1) +
geom_smooth(method=lm, color='red', fill='pink', size=1) +
labs(x=expression(log[10]~'Mean DCA normalized count'), y='Residual SD DCA normalized count')
ggarrange(plot_left, plot_right)
# Get expression levels (mean DCA normalized counts) in different groups
level_background <- mir2meta_var(mirna[mirna_mean > min_expression & mirna_mean <= -12, 1], 'mean_dca')
level_LE_mirna <- mir2meta_var(mirna[mirna_mean > -12 & mirna_mean <= -9, 1], 'mean_dca')
level_ME_mirna <- mir2meta_var(mirna[mirna_mean > -9 & mirna_mean <= -6, 1], 'mean_dca')
level_HE_mirna <- mir2meta_var(mirna[mirna_mean > -6, 1], 'mean_dca')
# Get expression noises (RCV DCA normalized counts) in different groups
noise_background <- mir2meta_var(mirna[mirna_mean > min_expression & mirna_mean <= -12, 1], 'rsd_dca')
noise_LE_mirna <- mir2meta_var(mirna[mirna_mean > -12 & mirna_mean <= -9, 1], 'rsd_dca')
noise_ME_mirna <- mir2meta_var(mirna[mirna_mean > -9 & mirna_mean <= -6, 1], 'rsd_dca')
noise_HE_mirna <- mir2meta_var(mirna[mirna_mean > -6, 1], 'rsd_dca')
# The number of miRNAs and target mRNAs in each group.
group_size <- data.frame(mirna_expression=c('background','LE','ME','HE'),
mirna_number=c(sum(mirna_mean > min_expression & mirna_mean <= -12), sum(mirna_mean > -12 & mirna_mean <= -9), sum(mirna_mean > -9 & mirna_mean <= -6), sum(mirna_mean > -6)),
target_number=c(length(noise_background), length(noise_LE_mirna), length(noise_ME_mirna), length(noise_HE_mirna)))
# In this dataframe, each row contains the group of a target mRNA, its expression level and expression noise.
meta_var_df <- data.frame(group=rep(group_size$mirna_expression, group_size$target_number),
level=c(level_background, level_LE_mirna, level_ME_mirna, level_HE_mirna),
noise=c(noise_background, noise_LE_mirna, noise_ME_mirna, noise_HE_mirna))
p_df <- data.frame(x=c('background','background','background','LE','LE','ME'),
y=c('LE','ME','HE','ME','HE','HE'),
p_level=c(ks.test(level_background, level_LE_mirna)$p.value,
ks.test(level_background, level_ME_mirna)$p.value,
ks.test(level_background, level_HE_mirna)$p.value,
ks.test(level_LE_mirna, level_ME_mirna)$p.value,
ks.test(level_LE_mirna, level_HE_mirna)$p.value,
ks.test(level_ME_mirna, level_HE_mirna)$p.value),
p_noise=c(ks.test(noise_background, noise_LE_mirna)$p.value,
ks.test(noise_background, noise_ME_mirna)$p.value,
ks.test(noise_background, noise_HE_mirna)$p.value,
ks.test(noise_LE_mirna, noise_ME_mirna)$p.value,
ks.test(noise_LE_mirna, noise_HE_mirna)$p.value,
ks.test(noise_ME_mirna, noise_HE_mirna)$p.value))
kruskal.test(level ~ group, data = meta_var_df)
kruskal.test(noise ~ group, data = meta_var_df)
plot_ecdf <- ggplot(meta_var_df, aes(x=log10(level), color=group)) +
stat_ecdf(geom = "step") +
lims(x=quantile(log10(meta_var_df$level), c(0.01,0.99))) +
labs(x=expression(log[10]~'Mean DCA normalized count') ,y='Quantile', title="miRNA regulation on target mRNAs' expression levels") +
scale_color_discrete(breaks=c('HE','ME','LE','background'),
name='miRNA Mean Expression',
labels=expression(log[2]~fraction > -6, -9 < log[2]~fraction <= -6, -12 < log[2]~fraction <= -9, 'All expressed')) +
theme_bw() +
theme(legend.position=c(0.7,0.2), title=element_text(size=10))
plot_p <- ggplot(p_df,aes(x=x,y=y)) +
geom_raster(aes(fill = -log10(p_level))) +
geom_text(aes(label = round(-log10(p_level),2)), size=7) +
scale_fill_gradient(low = "white", high = "red") +
scale_x_discrete(limits=c('background','LE','ME'), labels=expression('All expressed', -12 < log[2]~fraction <= -9, -9 < log[2]~fraction <= -6)) +
scale_y_discrete(limits=c('LE','ME','HE'), labels=expression(-12 < log[2]~fraction <= -9, -9 < log[2]~fraction <= -6, log[2]~fraction > -6)) +
labs(x='',y='',title=expression(-log[10]~'P value (KS-test)')) +
theme(legend.position = 'none', axis.text = element_text(angle=30, hjust=1))
ggarrange(plot_ecdf, plot_p)
plot_ecdf <- ggplot(meta_var_df, aes(x=noise, color=group)) +
stat_ecdf(geom = "step") +
lims(x=quantile(meta_var_df$noise, c(0.01,0.99))) +
labs(x='Residual SD DCA normalized count' ,y='Quantile', title="miRNA regulation on target mRNAs' expression noises") +
scale_color_discrete(breaks=c('HE','ME','LE','background'),
name='miRNA Mean Expression',
labels=expression(log[2]~fraction > -6, -9 < log[2]~fraction <= -6, -12 < log[2]~fraction <= -9, 'All expressed')) +
theme_bw() +
theme(legend.position=c(0.7,0.2), title=element_text(size=10))
plot_p <- ggplot(p_df,aes(x=x,y=y)) +
geom_raster(aes(fill = -log10(p_noise))) +
geom_text(aes(label = round(-log10(p_noise),2)), size=7) +
scale_fill_gradient(low = "white", high = "red") +
scale_x_discrete(limits=c('background','LE','ME'), labels=expression('All expressed', -12 < log[2]~fraction <= -9, -9 < log[2]~fraction <= -6)) +
scale_y_discrete(limits=c('LE','ME','HE'), labels=expression(-12 < log[2]~fraction <= -9, -9 < log[2]~fraction <= -6, log[2]~fraction > -6)) +
labs(x='',y='',title=expression(-log[10]~'P value (KS-test)')) +
theme(legend.position = 'none', axis.text = element_text(angle=30, hjust=1))
ggarrange(plot_ecdf, plot_p)
Compared with the results obtained from RPKM values, more significant difference in expression noises among different groups is observed when using DCA normalized counts. Besides, expression noises between any two groups also become more significantly different. These results further support that miRNAs regulate their target mRNAs by reducing their expression levels and noises.
sessionInfo()