Sylvain Mareschal, Ph.D.
Bioinformatics postdoc
March 8, 2013 at 14:56
rawFasta 1.0.0
This package implements memory-efficient storage of letter sequences (DNA, RNA, protein ...) in R, coding each sequence element on 8 bits or fewer (1, 2, 3, 4, 5, 6 or 8). It was mainly developed as a showcase of R's ability to handle binary data, since more fully featured classes in Biostrings serve the same purpose. However, its pure R implementation and lack of dependencies can make it easier to install and use. It can be downloaded from CRAN.

The package is built around a main rawFasta interface shared by several S4 classes, each handling sequences coded on a given number of bits (8 or fewer). Objects are instantiated from FASTA files by the rawFasta parser, which chooses the appropriate implementation. A common extract method then subsets the sequence by coordinates.
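
The general shape of such a design (a virtual S4 class, one concrete class per bit width, and a shared generic) can be sketched as follows. This is only an illustration: the names PackedSeq, PackedSeq8 and extractSeq are hypothetical and are not the classes or methods of rawFasta, and returning a single character string is an assumption.

```r
library(methods)

# Hypothetical virtual class standing in for the common interface
setClass("PackedSeq", representation("VIRTUAL", name = "character", size = "integer"))

# Hypothetical concrete class: one letter per byte (8 bits), stored as raw
setClass("PackedSeq8", contains = "PackedSeq", representation(data = "raw"))

# Shared generic: every concrete class provides its own decoding method
setGeneric("extractSeq", function(object, from, to) standardGeneric("extractSeq"))

setMethod("extractSeq", "PackedSeq8", function(object, from, to) {
  # With 8 bits per letter, letter i is simply byte i
  rawToChar(object@data[from:to])
})

obj <- new("PackedSeq8", name = "demo", size = 4L, data = charToRaw("ACGT"))
print(extractSeq(obj, 2, 3))
```

Callers only ever use the generic, so a parser is free to return whichever concrete class fits the alphabet.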

It is intended to hold very large sequences (such as whole chromosomes) in memory so that they can be subset by coordinates. The default 3-bit implementation handles the 4 DNA letters, "N" ambiguities and "-" gaps in a memory space about 3 times smaller than a standard character vector requires.
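
To make the 3-bit idea concrete, here is a minimal sketch in base R of how six symbols could be packed into a raw vector and read back by coordinates. The helpers pack3 and extract3, the symbol order and the bit layout are all assumptions made for illustration; the package's actual internal encoding may differ.

```r
# Hypothetical alphabet order: 6 symbols fit in 3 bits (8 possible codes)
alphabet <- c("A", "C", "G", "T", "N", "-")

# Pack a vector of single letters into a raw vector, 3 bits per letter
pack3 <- function(x, alpha = alphabet) {
  codes <- match(x, alpha) - 1L                 # 0-based integer codes
  if (anyNA(codes)) stop("letter outside the alphabet")
  # intToBits() yields 32 bits per code (least significant first); keep 3
  bits <- as.vector(matrix(intToBits(codes), nrow = 32L)[1:3, , drop = FALSE])
  pad <- raw((-length(bits)) %% 8L)             # zero bits up to a full byte
  packBits(c(bits, pad), type = "raw")
}

# Decode letters from..to (1-based, inclusive), unpacking only the bytes needed
extract3 <- function(packed, from, to, alpha = alphabet) {
  first_bit <- 3 * (from - 1)                   # 0-based offset of the first bit
  last_bit <- 3 * to - 1                        # 0-based offset of the last bit
  bytes <- packed[(first_bit %/% 8 + 1):(last_bit %/% 8 + 1)]
  bits <- rawToBits(bytes)[first_bit %% 8 + seq_len(3 * (to - from + 1))]
  m <- matrix(as.integer(bits), nrow = 3L)      # one column per letter
  alpha[colSums(m * c(1L, 2L, 4L)) + 1L]        # rebuild codes, map to letters
}

x <- c("A", "C", "G", "T", "N", "-", "A", "A", "T", "G")
packed <- pack3(x)
print(length(packed))        # number of bytes used for 10 letters
print(extract3(packed, 4, 7))
```

Because each letter occupies a fixed number of bits, the position of any letter is computed directly from its coordinate, so a subset can be read without decoding the whole sequence.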

Typical use

# Generate a dummy FASTA file
seq <- sample(c("A","C","G","T"), size=1000, replace=TRUE)
cat(">Random DNA sequence\n", file="test.fa")
write(seq, ncolumns=100, sep="", file="test.fa", append=TRUE)

# Default (DNA allowing ambiguities and gaps)
object <- rawFasta("test.fa")
print(object)
print(extract(object, 1, 10))

# Unambiguous DNA
object <- rawFasta("test.fa", alpha="ACGT")
print(object)
print(extract(object, 1, 10))