1 Read molecular and phenotype data

Read the molecular and phenotype data stored in serialized SummarizedExperiment objects.

library(SummarizedExperiment)

seM <- readRDS(file.path("_raw_data", "se.Moufarrej22.GENCODEv44.hg38.rds"))
dim(seM)
[1] 37928   404
seD <- readRDS(file.path("_raw_data", "se.DelVecchio21.GENCODEv44.hg38.rds"))
dim(seD)
[1] 37928   134

2 Filtering of lowly-expressed genes and normalization

We will use the edgeR package for processing the raw counts and doing some exploratory analyses. We start by building the corresponding DGEList objects with the corresponding row and column data.

library(edgeR)

dgeM <- DGEList(counts=assay(seM), genes=rowData(seM), samples=colData(seM))
dgeD <- DGEList(counts=assay(seD), genes=rowData(seD), samples=colData(seD))

Filter lowly expressed genes using the function filterByExpr() and the Preeclampsia phenotype column as grouping factor. We need to filter both, the SummarizedExperiment and DGEList types of objects, to keep them parallel to each other.

mask <- filterByExpr(dgeM, group=seM$Preeclampsia)
sum(mask)
[1] 5344
seM.filt <- seM[mask, ]
dim(seM.filt)
[1] 5344  404
dgeM.filt <- dgeM[mask, , keep.lib.sizes=FALSE]
dim(dgeM.filt)
[1] 5344  404

mask <- filterByExpr(dgeD, group=seD$Preeclampsia)
sum(mask)
[1] 7313
seD.filt <- seD[mask, ]
dim(seD.filt)
[1] 7313  134
dgeD.filt <- dgeD[mask, , keep.lib.sizes=FALSE]
dim(dgeD.filt)
[1] 7313  134

Calculate normalization factors on the filtered data.

dgeM.filt <- calcNormFactors(dgeM.filt)
assays(seM.filt)$logCPM <- cpm(dgeM.filt, log=TRUE)

dgeD.filt <- calcNormFactors(dgeD.filt)
assays(seD.filt)$logCPM <- cpm(dgeD.filt, log=TRUE)

Divide samples in the (Moufarrej et al. 2022) study in two subsets, one for training in which we will put together the samples from the Discovery and Validation1 cohorts, and another one for testing formed by the samples of the Validation2 cohort. In both cases, we exclude samples drawn after delivery, labeled as Post-partum in the SamplingGAgroup column.

table(seM.filt$Cohort)

  Discovery Validation1 Validation2 
        209         106          89 
mask <- seM.filt$Cohort %in% c("Discovery", "Validation1") &
        !seM.filt$SamplingGAgroup %in% "Post-partum"
seM.filt.training <- seM.filt[, mask]
dim(seM.filt.training)
[1] 5344  244
dgeM.filt.training <- dgeM.filt[, mask]
dim(dgeM.filt.training)
[1] 5344  244
table(seM.filt.training$Preeclampsia)

 no yes 
177  67 

mask <- seM.filt$Cohort %in% "Validation2" &
        !seM.filt$SamplingGAgroup %in% "Post-partum"
seM.filt.testing <- seM.filt[, mask]
dim(seM.filt.testing)
[1] 5344   89
dgeM.filt.testing <- dgeM.filt[, mask]
dim(dgeM.filt.testing)
[1] 5344   89
table(seM.filt.testing$Preeclampsia)

 no yes 
 61  28 

From the study by (Del Vecchio et al. 2021), discard samples from non-pregnant women, drawn after delivery, or taken from the cord blood of the neonate.

table(seD.filt$SamplingGAgroup)

         1st Trimester          2nd Trimester          3rd Trimester 
                    25                     26                     25 
            cord blood               delivery Non pregnant Patient 1 
                    25                     26                      1 
Non pregnant Patient 2 Non pregnant Patient 3 Non pregnant Patient 4 
                     1                      1                      1 
Non pregnant Patient 5 Non pregnant Patient 6 Non pregnant Patient 7 
                     1                      1                      1 
mask <- !grepl("Non pregnant", seD.filt$SamplingGAgroup) &
        !seD.filt$SamplingGAgroup %in% c("delivery", "cord blood")
seD.filt.subset <- seD.filt[, mask]
dim(seD.filt.subset)
[1] 7313   76
dgeD.filt.subset <- dgeD.filt[, mask]
dim(dgeD.filt.subset)
[1] 7313   76
table(seD.filt.subset$Preeclampsia)

 no yes 
 52  24 

Figure 1 below shows samples dissimilarity labeled by the diagnosis of preeclampsia in each of the three datasets.

Sample dissimilarity labeled by the diagnosis of preeclampsia. Multidimensional scaling (MDS) plot of sample dissimilarity for the training (a) and testing (b) data from the study by @moufarrej2022early, and for the data from the study by [@delvecchio2021cell].

Figure 1: Sample dissimilarity labeled by the diagnosis of preeclampsia
Multidimensional scaling (MDS) plot of sample dissimilarity for the training (a) and testing (b) data from the study by Moufarrej et al. (2022), and for the data from the study by (Del Vecchio et al. 2021).

Save DGEList and SummarizedExperiment objects including filtered data, normalization factors and normalized log CPM units of expression.

saveRDS(dgeM.filt, file=file.path("_processed_data", "dgeM.filt.rds"))
saveRDS(seM.filt, file=file.path("_processed_data", "seM.filt.rds"))

saveRDS(dgeD.filt, file=file.path("_processed_data", "dgeD.filt.rds"))
saveRDS(seD.filt, file=file.path("_processed_data", "seD.filt.rds"))

saveRDS(dgeM.filt.training,
        file=file.path("_processed_data", "dgeM.filt.training.rds"))
saveRDS(seM.filt.training,
        file=file.path("_processed_data", "seM.filt.training.rds"))
saveRDS(dgeM.filt.testing,
        file=file.path("_processed_data", "dgeM.filt.testing.rds"))
saveRDS(seM.filt.testing,
        file=file.path("_processed_data", "seM.filt.testing.rds"))

saveRDS(dgeD.filt.subset,
        file=file.path("_processed_data", "dgeD.filt.subset.rds"))
saveRDS(seD.filt.subset,
        file=file.path("_processed_data", "seD.filt.subset.rds"))

3 Dispersion estimates and hypervariable genes

We calculate dispersion estimates and biological coefficients of variation (McCarthy, Chen, and Smyth 2012) and use them to identify hypervariable genes for different models of increasing complexity that will help us to select the most sensible one for the differential expression analysis between preeclampsia-affected and unafected mothers.

The four considered models are: (1) a model with the intercept term only, (2) a model with Preeclampsia as main explanatory variable, (3) a model with Preeclampsia as main explanatory variable and SamplingGA as a covariate, and (4) a model with Preeclampsia as main explanatory variable, and SamplingGA and surrogate variables (SVs), calculated with SVA (Leek and Storey 2007) as covariates. The estimation of SVs will be done using only the expression values of the top-10% most variable genes.

3.1 Training dataset from Moufarrej et al. (2022)

library(sva)

mod <- model.matrix(~ 1, colData(seM.filt.training))
dgeM.filt.training.interceptonly <- estimateDisp(dgeM.filt.training, design=mod,
                                                 robust=TRUE)
summary(dgeM.filt.training.interceptonly$prior.df)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.338   8.442   8.572   8.512   8.572   8.572 
mod <- model.matrix(~ Preeclampsia, colData(seM.filt.training))
dgeM.filt.training.PE <- estimateDisp(dgeM.filt.training, design=mod,
                                      robust=TRUE)
summary(dgeM.filt.training.PE$prior.df)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.394   8.394   8.639   8.572   8.639   8.639 
mod <- model.matrix(~ Preeclampsia + SamplingGA, colData(seM.filt.training))
dgeM.filt.training.PEGA <- estimateDisp(dgeM.filt.training, design=mod,
                                        robust=TRUE)
summary(dgeM.filt.training.PEGA$prior.df)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.396   8.396   8.650   8.580   8.650   8.650 
mod0 <- model.matrix(~ SamplingGA, colData(seM.filt.training))
IQRs <- apply(assays(seM.filt.training)$logCPM, 1, IQR)
mask <- IQRs > quantile(IQRs, prob=0.9)
sv <- sva(assays(seM.filt.training)$logCPM[mask, ],
          mod=mod, mod0=mod0)
Number of significant surrogate variables is:  9 
Iteration (out of 5 ):1  2  3  4  5  
mod <- cbind(mod, sv$sv)
dgeM.filt.training.PEGASVs <- estimateDisp(dgeM.filt.training,
                                           design=mod, robust=TRUE)
summary(dgeM.filt.training.PEGASVs$prior.df)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3918 31.2056 31.2056 28.6264 31.2056 31.2056 

Figure 2 shows, for each considered model, the biological coefficient of variation (BCV) as function of the average expression, in log CPM units.

Loading required package: plotrix
Biological coefficient of variation (BCV). BCV values as function of the average gene expression in log-CPM units.

Figure 2: Biological coefficient of variation (BCV)
BCV values as function of the average gene expression in log-CPM units.

Table 1 below shows the top-20 hypervariable genes with lowest prior degrees of freedom for the last model.

Table 1: Top-20 hypervariable genes for the model Preeclampsia + SamplingGA + SVs.
prior.df BCV Gene Description
0.3917911 1.0583241 HBG1 hemoglobin subunit gamma 1
0.3917932 1.1637203 HBG2 hemoglobin subunit gamma 2
0.3917943 0.7956527 ALAS2 5’-aminolevulinate synthase 2
0.3918679 1.5304564 C5orf64 chromosome 5 putative open reading frame 64
0.3921824 0.7139674 RPS26 ribosomal protein S26
0.3925444 0.6502067 IGF2 insulin like growth factor 2
0.3926117 0.9827843 CST7 cystatin F
0.3930199 0.6056745 SLC4A1 solute carrier family 4 member 1 (Diego blood group)
0.3933930 0.7048308 SNORD3B-2 small nucleolar RNA, C/D box 3B-2
0.3984848 0.7196402 RNVU1-18 RNA, variant U1 small nuclear 18
0.3984848 0.7616192 VTRNA1-2 vault RNA 1-2
0.3992817 0.5866055 OSBP2 oxysterol binding protein 2
0.3992817 0.5858608 ND6 NADH dehydrogenase subunit 6
0.4017543 0.6021897 H19 H19 imprinted maternally expressed transcript
0.4037194 0.6662309 EGR3 early growth response 3
0.4112340 0.5752273 SLC25A39 solute carrier family 25 member 39
0.4132431 0.8511395 MAGI2-AS3 MAGI2 antisense RNA 3
0.4141007 0.5521355 RN7SL767P RNA, 7SL, cytoplasmic 767, pseudogene
0.4161676 0.5409125 RNU1-3 RNA, U1 small nuclear 3
0.4311507 0.5089252 RN7SL151P RNA, 7SL, cytoplasmic 151, pseudogene

3.2 Testing dataset from Moufarrej et al. (2022)

mod <- model.matrix(~ 1, colData(seM.filt.testing))
dgeM.filt.testing.interceptonly <- estimateDisp(dgeM.filt.testing, design=mod,
                                                robust=TRUE)
summary(dgeM.filt.testing.interceptonly$prior.df)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.536   3.536   3.536   3.638   3.813   3.813 
mod <- model.matrix(~ Preeclampsia, colData(seM.filt.testing))
dgeM.filt.testing.PE <- estimateDisp(dgeM.filt.testing, design=mod,
                                     robust=TRUE)
summary(dgeM.filt.testing.PE$prior.df)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.613   3.613   3.613   3.714   3.881   3.881 
mod <- model.matrix(~ Preeclampsia + SamplingGA, colData(seM.filt.testing))
dgeM.filt.testing.PEGA <- estimateDisp(dgeM.filt.testing, design=mod,
                                       robust=TRUE)
summary(dgeM.filt.testing.PEGA$prior.df)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.606   3.606   3.606   3.709   3.878   3.878 
mod0 <- model.matrix(~ SamplingGA, colData(seM.filt.testing))
IQRs <- apply(assays(seM.filt.testing)$logCPM, 1, IQR)
mask <- IQRs > quantile(IQRs, prob=0.9)
sv <- sva(assays(seM.filt.testing)$logCPM[mask, ],
          mod=mod, mod0=mod0)
Number of significant surrogate variables is:  4 
Iteration (out of 5 ):1  2  3  4  5  
mod <- cbind(mod, sv$sv)
dgeM.filt.testing.PEGASVs <- estimateDisp(dgeM.filt.testing,
                                          design=mod, robust=TRUE)
summary(dgeM.filt.testing.PEGASVs$prior.df)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.292   7.273   7.273   7.243   7.273   7.273 

Figure 3 shows, for each considered model, the biological coefficient of variation (BCV) as function of the average expression, n log CPM units.

Biological coefficient of variation (BCV). BCV values as function of the average gene expression in log-CPM units.

Figure 3: Biological coefficient of variation (BCV)
BCV values as function of the average gene expression in log-CPM units.

Table 2: Top-20 hypervariable genes for the model Preeclampsia + SamplingGA + SVs.
prior.df BCV Gene Description
5.291823 0.8676389 MAGI2-AS3 MAGI2 antisense RNA 3
5.291823 0.8874649 TMEM70 transmembrane protein 70
5.291823 0.8941953 ABO ABO, alpha 1-3-N-acetylgalactosaminyltransferase and alpha 1-3-galactosyltransferase
5.291823 0.9219868 GCOM1 GCOM1, MYZAP-POLR2M combined locus
5.291823 0.8449318 SPNS1 SPNS lysolipid transporter 1, lysophospholipid
5.291823 0.9620111 CST7 cystatin F
5.291823 0.8165029 PVALB parvalbumin
5.707770 0.7745764 TTN titin
5.707770 0.7631880 ANXA3 annexin A3
5.707770 0.8373243 VTRNA1-2 vault RNA 1-2
5.892624 0.6684776 KLHL5 kelch like family member 5
6.252762 0.6564837 VCAN versican
6.252762 0.6669697 RBPMS2 RNA binding protein, mRNA processing factor 2
6.532833 0.5147473 MTCO2P12 MT-CO2 pseudogene 12
6.532833 0.5535638 MFN2 mitofusin 2
6.532833 0.7356862 IFI44L interferon induced protein 44 like
6.532833 0.6359629 OBSCN obscurin, cytoskeletal calmodulin and titin-interacting RhoGEF
6.532833 0.6050213 PDZK1IP1 PDZK1 interacting protein 1
6.532833 0.5860977 FCGR3B Fc gamma receptor IIIb
6.532833 0.7977558 PIGC phosphatidylinositol glycan anchor biosynthesis class C

3.3 Subset from Del Vecchio et al. (2021)

mod <- model.matrix(~ 1, colData(seD.filt.subset))
dgeD.filt.subset.interceptonly <- estimateDisp(dgeD.filt.subset, design=mod,
                                               robust=TRUE)
summary(dgeD.filt.subset.interceptonly$prior.df)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4514 11.2173 11.2173 10.3761 11.2173 11.2173 
mod <- model.matrix(~ Preeclampsia, colData(seD.filt.subset))
dgeD.filt.subset.PE <- estimateDisp(dgeD.filt.subset, design=mod,
                                    robust=TRUE)
summary(dgeD.filt.subset.PE$prior.df)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.413  11.531  11.531  10.688  11.531  11.531 
mod <- model.matrix(~ Preeclampsia + SamplingGA, colData(seD.filt.subset))
dgeD.filt.subset.PEGA <- estimateDisp(dgeD.filt.subset, design=mod,
                                      robust=TRUE)
summary(dgeD.filt.subset.PEGA$prior.df)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4149 11.7678 11.7678 10.8987 11.7678 11.7678 
mod0 <- model.matrix(~ SamplingGA, colData(seD.filt.subset))
IQRs <- apply(assays(seD.filt.subset)$logCPM, 1, IQR)
mask <- IQRs > quantile(IQRs, prob=0.9)
sv <- sva(assays(seD.filt.subset)$logCPM[mask, ],
          mod=mod, mod0=mod0)
Number of significant surrogate variables is:  8 
Iteration (out of 5 ):1  2  3  4  5  
mod <- cbind(mod, sv$sv)
dgeD.filt.subset.PEGASVs <- estimateDisp(dgeD.filt.subset,
                                         design=mod, robust=TRUE)
summary(dgeD.filt.subset.PEGASVs$prior.df)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.45   50.16   50.16   46.96   50.16   50.16 

Figure 4 shows, for each considered model, the biological coefficient of variation (BCV) as function of the average expression, n log CPM units.

Biological coefficient of variation (BCV). BCV values as function of the average gene expression in log-CPM units.

Figure 4: Biological coefficient of variation (BCV)
BCV values as function of the average gene expression in log-CPM units.

Table 3 below shows the top-20 hypervariable genes with lowest prior degrees of freedom for the last model.

Table 3: Top-20 hypervariable genes for the model Preeclampsia + SamplingGA + SVs.
prior.df BCV Gene Description
0.4500197 0.6215532 PLAC4 placenta enriched 4
0.4500197 0.6716895 ALB albumin
0.4500197 0.5194991 FHDC1 FH2 domain containing 1
0.4500197 0.4084455 TRAK2 trafficking kinesin protein 2
0.4500200 0.4612708 ALAS2 5’-aminolevulinate synthase 2
0.4500200 0.3798014 ND5 NADH dehydrogenase subunit 5
0.4500201 0.3725009 ND4 NADH dehydrogenase subunit 4
0.4500201 0.3848531 CYTB cytochrome b
0.4500214 0.3783909 COX3 cytochrome c oxidase subunit III
0.4500219 0.3541374 COX1 cytochrome c oxidase subunit I
0.4500239 0.4780198 ND6 NADH dehydrogenase subunit 6
0.4500266 0.4469658 MTND1P23 MT-ND1 pseudogene 23
0.4500268 0.4087547 IGF2 insulin like growth factor 2
0.4500287 0.5637600 HBG1 hemoglobin subunit gamma 1
0.4500317 0.4111922 MAP3K7CL MAP3K7 C-terminal like
0.4500318 0.3544274 COX2 cytochrome c oxidase subunit II
0.4500346 0.9197978 RPL23AP7 ribosomal protein L23a pseudogene 7
0.4500392 0.5096776 MTND4P12 MT-ND4 pseudogene 12
0.4500407 0.4149040 YOD1 YOD1 deubiquitinase
0.4500407 0.3507107 BNIP3L BCL2 interacting protein 3 like

4 Session information

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-apple-darwin20
Running under: macOS Ventura 13.5.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] plotrix_3.8-4               sva_3.52.0                 
 [3] BiocParallel_1.38.0         genefilter_1.86.0          
 [5] mgcv_1.9-1                  nlme_3.1-164               
 [7] edgeR_4.2.0                 limma_3.60.2               
 [9] SummarizedExperiment_1.34.0 Biobase_2.64.0             
[11] GenomicRanges_1.56.0        GenomeInfoDb_1.40.0        
[13] IRanges_2.38.0              S4Vectors_0.42.0           
[15] BiocGenerics_0.50.0         MatrixGenerics_1.16.0      
[17] matrixStats_1.3.0           kableExtra_1.4.0           
[19] knitr_1.46                  BiocStyle_2.32.0           

loaded via a namespace (and not attached):
 [1] viridisLite_0.4.2       blob_1.2.4              Biostrings_2.72.0      
 [4] fastmap_1.2.0           XML_3.99-0.16.1         digest_0.6.35          
 [7] lifecycle_1.0.4         survival_3.5-8          statmod_1.5.0          
[10] KEGGREST_1.44.0         RSQLite_2.3.6           magrittr_2.0.3         
[13] compiler_4.4.0          rlang_1.1.3             sass_0.4.9             
[16] tools_4.4.0             yaml_2.3.8              S4Arrays_1.4.1         
[19] bit_4.0.5               DelayedArray_0.30.1     xml2_1.3.6             
[22] abind_1.4-5             grid_4.4.0              xtable_1.8-4           
[25] colorspace_2.1-0        scales_1.3.0            tinytex_0.51           
[28] cli_3.6.2               rmarkdown_2.27          crayon_1.5.2           
[31] rstudioapi_0.16.0       httr_1.4.7              DBI_1.2.2              
[34] cachem_1.1.0            stringr_1.5.1           zlibbioc_1.50.0        
[37] splines_4.4.0           parallel_4.4.0          AnnotationDbi_1.66.0   
[40] BiocManager_1.30.23     XVector_0.44.0          vctrs_0.6.5            
[43] Matrix_1.7-0            jsonlite_1.8.8          bookdown_0.39          
[46] bit64_4.0.5             systemfonts_1.1.0       locfit_1.5-9.9         
[49] jquerylib_0.1.4         annotate_1.82.0         glue_1.7.0             
[52] codetools_0.2-20        stringi_1.8.4           UCSC.utils_1.0.0       
[55] munsell_0.5.1           htmltools_0.5.8.1       GenomeInfoDbData_1.2.12
[58] R6_2.5.1                evaluate_0.23           lattice_0.22-6         
[61] highr_0.10              png_0.1-8               memoise_2.0.1          
[64] bslib_0.7.0             Rcpp_1.0.12             svglite_2.1.3          
[67] SparseArray_1.4.3       xfun_0.44              

References

Del Vecchio, Giorgia, Qingjiao Li, Wenyuan Li, Shanthie Thamotharan, Anela Tosevska, Marco Morselli, Kyunghyun Sung, et al. 2021. “Cell-Free DNA Methylation and Transcriptomic Signature Prediction of Pregnancies with Adverse Outcomes.” Epigenetics 16 (6): 642–61.
Leek, Jeffrey T, and John D Storey. 2007. “Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis.” PLOS Genetics 3 (9): e161.
McCarthy, Davis J, Yunshun Chen, and Gordon K Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10): 4288–97.
Moufarrej, Mira N, Sevahn K Vorperian, Ronald J Wong, Ana A Campos, Cecele C Quaintance, Rene V Sit, Michelle Tan, et al. 2022. “Early Prediction of Preeclampsia in Pregnancy with Cell-Free RNA.” Nature 602 (7898): 689–94.