Sylvain Mareschal, Ph.D.
Bioinformatics engineer
October 21, 2012 at 19:31
Querying Ensembl in R via BioMart
Here is an example I met during a SNP priorization project, for which I needed to collect various annotation data on SNP from the Ensembl Variation database. As it is quite long to query, we will start playing with Ensembl Gene, and come back to it later.


A few words about BioMart

Aside from its european origins, greatly appreciated in a field where most valuable tools are made by our over-Atlantic colleagues, Ensembl has an invaluable asset : BioMart. To make it quick, BioMart is a database scheme providing various ways to extract biological data, via a Perl API, an URL/XML web service or a (relatively) user-friendly interface named MartView. It is intended to be as generic as possible, thus allowing consistent means to grab distinct biological data types (UniProt proteins, HapMap polymorphism frequencies, Ensembl genes, variations, regulations ...).

And if you are the same "R-addict" i am, at this point of the reading you are already convinced an R package is available to make the connection between such databases and our favourite scripting language. And you are right ! You will find as a part of the Bioconductor project a biomaRt package to handle your queries.


Getting the tools : the straight way

Obviously, the first thing to do is to get the R package (or to get R itself, but as it is a tedious language to learn, i will suppose you already survived this step). The simplest way is to follow Bioconductor recommandations on the package's web page, and run in a fresh R session :

source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")


It will download the last version of the package and its dependencies suiting your operating system, and everything should be fine in a couple of seconds. If you are running Linux, notice two out-of-R dependencies have to be installed first : libxml2-devel (for the XML package) and libcurl-devel (for the RCurl package), as these packages are essentially wrappers for these stand alone libraries. Your package manager (yum, apt-get ...) should get them for you easily.


Getting the tools : the warrior's way

If like me you are enchained by a tricky IT department, you can try the manual way : download the package and its dependencies ("Depends" and "Imports" lines, so here XML and RCurl which can be found in the CRAN repository) from the same web page, and install them manually.
- Windows users should download the "Windows binary" version, and install it via the "install from zip archive" button in the GUI.
- Linux users have to get the "Package Source" version and use the shell command R CMD INSTALL [package.tar.gz].

The manual way has the advantage over the automatic one to let you keep the exact software you develop your script with, so if you need to re-install it in a couple of months you will have better chances to find it still working, as package contents and behavior vary with the versions. Its major pitfall is met here : when you have dependencies, you have to manage them yourself, and if you are working on Windows, you will find that getting a "Windows binary" version of the XML and RCurl package is a pain (from a quite small list of such packages containing binaries that the CRAN won't provide, see this if you want to learn more).



Establishing a contact

The biomaRt package is quite simple to use, all you have to do is to set up a connection with useMart(), and use the object it returns in getBM() calls to get your data in R data.frames. The difficulty resides in choosing proper arguments to these functions.

First you have to choose an host to deal with, i.e. a web server to collect data from. biomaRt default behavior is to use www.biomart.org, which mirrors a large variety of BioMart-using databases (such as Ensembl ones). Obviously these mirrors are not up-to-date, so if you want to work on fresh data, the best way is to deal directly with its provider. A little guess is needed as it is hard to find which URL is suitable for a BioMart "host" usage, but if you know a website is providing a "MartView" tool, you can begin by trying its bare URL. As we are planning on querying Ensembl here, let's begin with www.ensembl.org. To check our guess, let's use the listMarts() function, and see what happen :

listMarts(host="www.ensembl.org")

It does not raise an error, so we are in ! This should return a data.frame, listing all the BioMart databases that can be queried at this host location. This output will help you determine a second important argument : the biomart one. As each host can handle multiple independent databases, it is up to you to define which one you want to query. As we are working on genes here, let's use the "ENSEMBL_MART_ENSEMBL" biomart, referring to Ensembl Genes. If you have doubts, the MartView interface of the website can help you : just try one and see what filters and data it offers. When our choice is made, we can set up a connection :

mart <- useMart(host="www.ensembl.org", biomart="ENSEMBL_MART_ENSEMBL")

This mart variable we define here is crucial, as it will be requested by all the other functions to let them know throughout which connection they have to work. Let's talk now about the dataset. Each biomart is actually a database collection following the same scheme, but for each of them you can have several versions or species, and here again you are asked to make your choice. To do so, biomaRt provides a listDatasets() function, so let's use it :

listDatasets(mart)

This should return a data.frame, describing the versions and species available. As we are working on human genes, we will use the "hsapiens_gene_ensembl" dataset. Let's explain our intentions to biomaRt, and update our mart connection :

mart <- useDataset(dataset="hsapiens_gene_ensembl", mart=mart)

Now we are connected to the database, and we are ready to collect. Notice the useMart() function has also a dataset argument, so in your scripts or future uses you will be able to establish the connection in a single command line :

mart <- useMart(host="www.ensembl.org", biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl")


Browsing the data

To collect data from this connection, we need to use the getBM() function. It takes 3 main arguments : attributes (the data you want to extract for each record), filters (the data on which you want to select records) and values (for each filter the values that make a record to be selected). To know the available attributes and filters, simply use the listAttributes() and listFilters() functions. Notice they differ, as some filters are established sharing data with over databases.

listAttributes(mart)
listFilters(mart)


As you can see, you have a large choice ... Maybe to much in fact, as it provides cross linking with other species databases that are not really interesting. To help you crawl into this flow of data, biomaRt provides the page argument, to use in conjunction with the attributePages() function. Let's see how to refine this attribute list :

attributePages(mart)

Not so much choice in here, but "feature_page" seems interesting for general purpose data (at least it is the first page, we can suppose it is the most generic). Let's give it a try :

listAttributes(mart, page="feature_page")

That is more handleable, so supposing we are interested in the chromosomal location of genes we will start with "chromosome_name", "start_position" and "end_position". Suppose now we are interested in the BCL6 gene, how to get it ? You will soon learn that Ensembl is fond of its own cryptic identifiers (it is pretty clear that we are in fact talking about ENSG00000113916 isn't it ?), so if you do not know the ENSG number of your gene you will be needed to query manually the Ensembl database to get it before writing your script. Or to waste a full day in researches to realize that the "BCL6" identifier that almost everybody use is in fact a symbol attributed by the HUGO Gene Nomenclature Committee, corresponding to the "hgnc_symbol" filter hidden in the list between useless ones like "HAVANA transcript" or "IPI ID". So here is our first query :

dat <- getBM(
   mart = mart,
   attributes = c("chromosome_name", "start_position", "end_position"),
   filters = "hgnc_symbol",
   values = "BCL6"
)
print(dat)


Here we are !


Getting back to SNP

To finish this tutorial with a more concrete example, let's go back to my initial goal : getting SNP. The idea here was to get all SNP referenced as linked to a candidate gene, in order to process and priorize them (the ID and the alleles will be enough here). As you can see using the listFilters() function in the "ENSEMBL_MART_SNP" biomart and "hsapiens_snp" dataset, an "ensembl_gene" filter exists to handle such requests, as long as you have an ENSG identifier for your gene (what we tried to obtain before). So this could seem a good idea :

mart <- useMart(host="www.ensembl.org", biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl")
ENSG <- getBM(mart=mart, attributes="ensembl_gene_id", filters="hgnc_symbol", values="BCL6")

mart <- useMart(host="www.ensembl.org", biomart="ENSEMBL_MART_SNP", dataset="hsapiens_snp")
snp <- getBM(mart=mart, attributes=c("refsnp_id","allele"), filters="ensembl_gene", values=ENSG)

In fact it isn't, and at rush hours you have great chances to get a "Time out" error. I would like to end this tutorial with a remark i was made by a member of the Ensembl Help Desk (never hesitate to contact them, they rocks) : querying this way forces BioMart to check the transcript location, then the gene location of each SNP in order to determine if it is located in the gene we are looking for. Actually they are roughly 30 millions ... As i was said, this query is more "Gene centric" than "Variation centric", so the proper way to do this is to query the Gene mart to get the SNP list, and then the Variation one to get full data on them :

mart <- useMart(host="www.ensembl.org", biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl")
snp <- getBM(mart=mart, attributes="external_id", filters="hgnc_symbol", values="BCL6")

mart <- useMart(host="www.ensembl.org", biomart="ENSEMBL_MART_SNP", dataset="hsapiens_snp")
snp <- getBM(mart=mart, attributes=c("refsnp_id","allele"), filters="snp_filter", values=snp)