Vignette built using GSVA, version 1.53.20.

1 What are GMT files

An important source of gene sets is the Molecular Signatures Database (MSigDB) (Subramanian et al. 2005), which stores them in plain text files following the so-called gene matrix transposed (GMT) format. In the GMT format, each line stores a gene set with the following values separated by tabs:

  • A unique gene set identifier.
  • A gene set description.
  • One or more gene identifiers.

Because each different gene set may consist of a different number of genes, each line in a GMT file may contain a different number of tab-separated values. This means that the GMT format is not a tabular format, and therefore cannot be directly read with base R functions such as read.table() or read.csv().

2 How to read GMT files

We need a specialized function to read GMT files. We can find such a function in the GSEABase package with getGmt(), or in the qusage package with read.gmt().

GSVA also provides such a function called readGMT(), which takes as first argument the filename or URL of a, possibly compressed, GMT file. The call below illustrates how to read a GMT file from MSigDB providing its URL. Note that we also load the package GSEABase because, by default, the value returned by readGMT() is a GeneSetCollection object defined in that package.

library(GSEABase)
library(GSVA)

URL <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/c7.immunesigdb.v2023.2.Hs.symbols.gmt"
genesets <- readGMT(URL, geneIdType=SymbolIdentifier())
class(genesets)
#> [1] "GeneSetCollection"
#> attr(,"package")
#> [1] "GSEABase"
length(genesets)
#> [1] 4872
genesets
#> GeneSetCollection
#>   names: GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN, GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_UP, ..., KAECH_NAIVE_VS_MEMORY_CD8_TCELL_UP (4872 total)
#>   unique identifiers: ABCA2, ABCC5, ..., LINC00841 (20457 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: NullCollection (1 total)

If we have problems accessing that URL, we can also load a compressed version of this GMT file, stored in the workshop package, as follows.

fname <- system.file("extdata", "c7.immunesigdb.v2023.2.Hs.symbols.gmt.gz",
                     package="GSVAEuroBioC2024")
genesets <- readGMT(fname, geneIdType=SymbolIdentifier())
class(genesets)
#> [1] "GeneSetCollection"
#> attr(,"package")
#> [1] "GSEABase"
length(genesets)
#> [1] 4872
genesets
#> GeneSetCollection
#>   names: GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN, GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_UP, ..., KAECH_NAIVE_VS_MEMORY_CD8_TCELL_UP (4872 total)
#>   unique identifiers: ABCA2, ABCC5, ..., LINC00841 (20457 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: NullCollection (1 total)

Next to the filename of the GMT file, we provided the argument geneIdType=SymbolIdentifier(). This argument adds the necessary metadata that later on allows other software to figure out what kind of gene identifiers are used in this collection of gene sets, to attempt mapping them to other type of identifiers, if necessary. While this argument is optional, we should always try to provide it.

3 Dealing with duplicated gene set names

The specification of the GMT format establishes that duplicated gene set names are not allowed. For this reason, the getGmt() function from the GSEABase package prompts an error when duplicated gene names are found, while the read.gmt() function from the qusage package silently accepts them in a list with duplicated element names.

The GSVA readGMT() function deals with duplicated gene set names as follows. By default, readGMT() warns the user about a duplicated gene set name and keeps only the first occurrence of the duplicated gene set in the returned object. We can illustrate this situation with a GMT file from the MSigDB database that happens to have duplicated gene set names and which we have stored as part of this workshop package.

fname <- system.file("extdata", "c2.all.v7.5.symbols.gmt.gz",
                     package="GSVAEuroBioC2024")
genesets <- getGmt(fname, geneIdType=SymbolIdentifier())
#> Error in validObject(.Object): invalid class "GeneSetCollection" object: each setName must be distinct

We can see that getGmt() prompts an error. We can see below that this does not happen with readGMT() and that, by default, all but the first occurrence of the duplicated gene set have been removed.

genesets <- readGMT(fname, geneIdType=SymbolIdentifier())
#> Warning in deduplicateGmtLines(lines, deduplUse): GMT contains duplicated gene
#> set names; deduplicated using method: first
genesets
#> GeneSetCollection
#>   names: CORONEL_RFX7_DIRECT_TARGETS_UP, FOROUTAN_TGFB_EMT_UP, ..., SA_TRKA_RECEPTOR (6363 total)
#>   unique identifiers: ABAT, DIP2A, ..., IGHG3 (21728 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: NullCollection (1 total)
any(duplicated(names(genesets)))
#> [1] FALSE

The parameter deduplUse in the readGMT() function allow one to apply other policies to deal with duplicated gene set names, see its help page with ?readGMT.

4 Returning a list instead of a GeneSetCollection

Storing gene sets in a GeneSetCollection object offers a number of advantages, including the addition of metadata that enables an automatic mapping of the gene identifiers that form the gene sets. For this reason, readGMT() returns such an object by default. However, if a user prefers using gene sets stored as a list object of character vectors, this is possible by setting the argument valueType="list" in the call to readGMT().

genesets <- readGMT(fname, valueType="list")
#> Warning in deduplicateGmtLines(lines, deduplUse): GMT contains duplicated gene
#> set names; deduplicated using method: first
class(genesets)
#> [1] "list"
any(duplicated(names(genesets)))
#> [1] FALSE
head(lapply(genesets, head))
#> $CORONEL_RFX7_DIRECT_TARGETS_UP
#> [1] "ABAT"  "DIP2A" "KLF9"  "PTMS"  "TOB2"  "ARL15"
#> 
#> $FOROUTAN_TGFB_EMT_UP
#> [1] "ABCA1"   "ACTN1"   "JAG1"    "ALOX5AP" "APBB2"   "ARNTL"  
#> 
#> $FOROUTAN_TGFB_EMT_DN
#> [1] "ADORA2B" "ALDH1A3" "ALDH3A2" "ANK3"    "BIRC3"   "AQP3"   
#> 
#> $FOROUTAN_PRODRANK_TGFB_EMT_UP
#> [1] "ABCA1"   "ACTN1"   "JAG1"    "ALOX5AP" "APBB2"   "ARNTL"  
#> 
#> $FOROUTAN_PRODRANK_TGFB_EMT_DN
#> [1] "ADORA2B" "ALDH3A2" "ANK3"    "BIRC3"   "AREG"    "DST"    
#> 
#> $FOROUTAN_INTEGRATED_TGFB_EMT_UP
#> [1] "ALOX5AP" "BMPR2"   "CALD1"   "CDH2"    "CDH11"   "COL1A1"

5 Session information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] GSVA_1.53.20         GSEABase_1.67.0      graph_1.83.0        
#>  [4] annotate_1.83.0      XML_3.99-0.17        AnnotationDbi_1.67.0
#>  [7] IRanges_2.39.2       S4Vectors_0.43.2     Biobase_2.65.1      
#> [10] BiocGenerics_0.51.1 
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.35.1 KEGGREST_1.45.1            
#>  [3] rjson_0.2.22                xfun_0.47                  
#>  [5] bslib_0.8.0                 htmlwidgets_1.6.4          
#>  [7] rhdf5_2.49.0                lattice_0.22-6             
#>  [9] rhdf5filters_1.17.0         vctrs_0.6.5                
#> [11] tools_4.4.1                 parallel_4.4.1             
#> [13] RSQLite_2.3.7               blob_1.2.4                 
#> [15] Matrix_1.7-0                sparseMatrixStats_1.17.2   
#> [17] lifecycle_1.0.4             GenomeInfoDbData_1.2.12    
#> [19] compiler_4.4.1              Biostrings_2.73.1          
#> [21] codetools_0.2-20            GenomeInfoDb_1.41.1        
#> [23] htmltools_0.5.8.1           sass_0.4.9                 
#> [25] yaml_2.3.10                 crayon_1.5.3               
#> [27] jquerylib_0.1.4             SingleCellExperiment_1.27.2
#> [29] BiocParallel_1.39.0         DelayedArray_0.31.11       
#> [31] cachem_1.1.0                magick_2.8.4               
#> [33] abind_1.4-5                 rsvd_1.0.5                 
#> [35] digest_0.6.37               BiocSingular_1.21.2        
#> [37] bookdown_0.40               fastmap_1.2.0              
#> [39] grid_4.4.1                  cli_3.6.3                  
#> [41] SparseArray_1.5.31          magrittr_2.0.3             
#> [43] S4Arrays_1.5.7              UCSC.utils_1.1.0           
#> [45] bit64_4.0.5                 rmarkdown_2.28             
#> [47] XVector_0.45.0              httr_1.4.7                 
#> [49] matrixStats_1.4.0           bit_4.0.5                  
#> [51] png_0.1-8                   SpatialExperiment_1.15.1   
#> [53] ScaledMatrix_1.13.0         beachmat_2.21.6            
#> [55] memoise_2.0.1               HDF5Array_1.33.6           
#> [57] evaluate_0.24.0             knitr_1.48                 
#> [59] GenomicRanges_1.57.1        irlba_2.3.5.1              
#> [61] rlang_1.1.4                 Rcpp_1.0.13                
#> [63] xtable_1.8-4                DBI_1.2.3                  
#> [65] jsonlite_1.8.8              Rhdf5lib_1.27.0            
#> [67] R6_2.5.1                    MatrixGenerics_1.17.0      
#> [69] zlibbioc_1.51.1

References

Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.