vignettes/importGMTfiles.Rmd
importGMTfiles.RmdVignette built using GSVA, version 1.53.20.
An important source of gene sets is the Molecular Signatures Database (MSigDB) (Subramanian et al. 2005), which stores them in plain text files following the so-called gene matrix transposed (GMT) format. In the GMT format, each line stores a gene set with the following values separated by tabs:
Because each different gene set may consist of a different number of genes, each
line in a GMT file may contain a different number of tab-separated values. This
means that the GMT format is not a tabular format, and therefore cannot be directly
read with base R functions such as read.table() or read.csv().
We need a specialized function to read GMT files. We can find such a function in the
GSEABase package with getGmt(), or
in the qusage package with read.gmt().
GSVA also provides such a function called readGMT(), which takes as first argument
the filename or URL of a, possibly compressed, GMT file. The call below illustrates
how to read a GMT file from MSigDB providing its URL. Note that we also load the
package
GSEABase because, by default, the value
returned by readGMT() is a GeneSetCollection object defined in that package.
library(GSEABase)
library(GSVA)
URL <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/c7.immunesigdb.v2023.2.Hs.symbols.gmt"
genesets <- readGMT(URL, geneIdType=SymbolIdentifier())
class(genesets)
#> [1] "GeneSetCollection"
#> attr(,"package")
#> [1] "GSEABase"
length(genesets)
#> [1] 4872
genesets
#> GeneSetCollection
#> names: GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN, GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_UP, ..., KAECH_NAIVE_VS_MEMORY_CD8_TCELL_UP (4872 total)
#> unique identifiers: ABCA2, ABCC5, ..., LINC00841 (20457 total)
#> types in collection:
#> geneIdType: SymbolIdentifier (1 total)
#> collectionType: NullCollection (1 total)If we have problems accessing that URL, we can also load a compressed version of this GMT file, stored in the workshop package, as follows.
fname <- system.file("extdata", "c7.immunesigdb.v2023.2.Hs.symbols.gmt.gz",
package="GSVAEuroBioC2024")
genesets <- readGMT(fname, geneIdType=SymbolIdentifier())
class(genesets)
#> [1] "GeneSetCollection"
#> attr(,"package")
#> [1] "GSEABase"
length(genesets)
#> [1] 4872
genesets
#> GeneSetCollection
#> names: GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN, GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_UP, ..., KAECH_NAIVE_VS_MEMORY_CD8_TCELL_UP (4872 total)
#> unique identifiers: ABCA2, ABCC5, ..., LINC00841 (20457 total)
#> types in collection:
#> geneIdType: SymbolIdentifier (1 total)
#> collectionType: NullCollection (1 total)Next to the filename of the GMT file, we provided the argument
geneIdType=SymbolIdentifier(). This argument adds the necessary metadata that
later on allows other software to figure out what kind of gene identifiers are
used in this collection of gene sets, to attempt mapping them to other type of
identifiers, if necessary. While this argument is optional, we should always
try to provide it.
The specification of the GMT format
establishes that duplicated gene set names are not allowed. For this reason,
the getGmt() function from the
GSEABase package prompts an error
when duplicated gene names are found, while the read.gmt() function from the
qusage package silently accepts them
in a list with duplicated element names.
The GSVA readGMT() function deals with duplicated gene set names as follows.
By default, readGMT() warns the user about a duplicated gene set name and
keeps only the first occurrence of the duplicated gene set in the returned
object. We can illustrate this situation with a GMT file from the MSigDB
database that happens to have duplicated gene set names and which we have
stored as part of this workshop package.
fname <- system.file("extdata", "c2.all.v7.5.symbols.gmt.gz",
package="GSVAEuroBioC2024")
genesets <- getGmt(fname, geneIdType=SymbolIdentifier())
#> Error in validObject(.Object): invalid class "GeneSetCollection" object: each setName must be distinctWe can see that getGmt() prompts an error. We can see below that this does
not happen with readGMT() and that, by default, all but the first occurrence of the
duplicated gene set have been removed.
genesets <- readGMT(fname, geneIdType=SymbolIdentifier())
#> Warning in deduplicateGmtLines(lines, deduplUse): GMT contains duplicated gene
#> set names; deduplicated using method: first
genesets
#> GeneSetCollection
#> names: CORONEL_RFX7_DIRECT_TARGETS_UP, FOROUTAN_TGFB_EMT_UP, ..., SA_TRKA_RECEPTOR (6363 total)
#> unique identifiers: ABAT, DIP2A, ..., IGHG3 (21728 total)
#> types in collection:
#> geneIdType: SymbolIdentifier (1 total)
#> collectionType: NullCollection (1 total)
any(duplicated(names(genesets)))
#> [1] FALSEThe parameter deduplUse in the readGMT() function allow one to apply other
policies to deal with duplicated gene set names, see its help page with
?readGMT.
list instead of a GeneSetCollection
Storing gene sets in a GeneSetCollection object offers a number of advantages,
including the addition of metadata that enables an automatic mapping of the gene
identifiers that form the gene sets. For this reason, readGMT() returns such
an object by default. However, if a user prefers using gene sets stored as a
list object of character vectors, this is possible by setting the argument
valueType="list" in the call to readGMT().
genesets <- readGMT(fname, valueType="list")
#> Warning in deduplicateGmtLines(lines, deduplUse): GMT contains duplicated gene
#> set names; deduplicated using method: first
class(genesets)
#> [1] "list"
any(duplicated(names(genesets)))
#> [1] FALSE
head(lapply(genesets, head))
#> $CORONEL_RFX7_DIRECT_TARGETS_UP
#> [1] "ABAT" "DIP2A" "KLF9" "PTMS" "TOB2" "ARL15"
#>
#> $FOROUTAN_TGFB_EMT_UP
#> [1] "ABCA1" "ACTN1" "JAG1" "ALOX5AP" "APBB2" "ARNTL"
#>
#> $FOROUTAN_TGFB_EMT_DN
#> [1] "ADORA2B" "ALDH1A3" "ALDH3A2" "ANK3" "BIRC3" "AQP3"
#>
#> $FOROUTAN_PRODRANK_TGFB_EMT_UP
#> [1] "ABCA1" "ACTN1" "JAG1" "ALOX5AP" "APBB2" "ARNTL"
#>
#> $FOROUTAN_PRODRANK_TGFB_EMT_DN
#> [1] "ADORA2B" "ALDH3A2" "ANK3" "BIRC3" "AREG" "DST"
#>
#> $FOROUTAN_INTEGRATED_TGFB_EMT_UP
#> [1] "ALOX5AP" "BMPR2" "CALD1" "CDH2" "CDH11" "COL1A1"
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GSVA_1.53.20 GSEABase_1.67.0 graph_1.83.0
#> [4] annotate_1.83.0 XML_3.99-0.17 AnnotationDbi_1.67.0
#> [7] IRanges_2.39.2 S4Vectors_0.43.2 Biobase_2.65.1
#> [10] BiocGenerics_0.51.1
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.35.1 KEGGREST_1.45.1
#> [3] rjson_0.2.22 xfun_0.47
#> [5] bslib_0.8.0 htmlwidgets_1.6.4
#> [7] rhdf5_2.49.0 lattice_0.22-6
#> [9] rhdf5filters_1.17.0 vctrs_0.6.5
#> [11] tools_4.4.1 parallel_4.4.1
#> [13] RSQLite_2.3.7 blob_1.2.4
#> [15] Matrix_1.7-0 sparseMatrixStats_1.17.2
#> [17] lifecycle_1.0.4 GenomeInfoDbData_1.2.12
#> [19] compiler_4.4.1 Biostrings_2.73.1
#> [21] codetools_0.2-20 GenomeInfoDb_1.41.1
#> [23] htmltools_0.5.8.1 sass_0.4.9
#> [25] yaml_2.3.10 crayon_1.5.3
#> [27] jquerylib_0.1.4 SingleCellExperiment_1.27.2
#> [29] BiocParallel_1.39.0 DelayedArray_0.31.11
#> [31] cachem_1.1.0 magick_2.8.4
#> [33] abind_1.4-5 rsvd_1.0.5
#> [35] digest_0.6.37 BiocSingular_1.21.2
#> [37] bookdown_0.40 fastmap_1.2.0
#> [39] grid_4.4.1 cli_3.6.3
#> [41] SparseArray_1.5.31 magrittr_2.0.3
#> [43] S4Arrays_1.5.7 UCSC.utils_1.1.0
#> [45] bit64_4.0.5 rmarkdown_2.28
#> [47] XVector_0.45.0 httr_1.4.7
#> [49] matrixStats_1.4.0 bit_4.0.5
#> [51] png_0.1-8 SpatialExperiment_1.15.1
#> [53] ScaledMatrix_1.13.0 beachmat_2.21.6
#> [55] memoise_2.0.1 HDF5Array_1.33.6
#> [57] evaluate_0.24.0 knitr_1.48
#> [59] GenomicRanges_1.57.1 irlba_2.3.5.1
#> [61] rlang_1.1.4 Rcpp_1.0.13
#> [63] xtable_1.8-4 DBI_1.2.3
#> [65] jsonlite_1.8.8 Rhdf5lib_1.27.0
#> [67] R6_2.5.1 MatrixGenerics_1.17.0
#> [69] zlibbioc_1.51.1