Sylvain Mareschal, Ph.D.
Bioinformatics postdoc
August 31, 2014 at 21:51
The script can be downloaded from here.

As working with the IonTorrent technology, we were once confronted with a curious read length distribution :
Read distribution
As it seemed, most of the reads were less than 15 nucleotide long. A bit disappointing from a machine supposed to sequence hundreds of nucleotides, isn't it ? Apparently the chip wells were filled and the sequencing was achieved as we obtained many reads, but for some reason they ended prematurely. As a result 90% of our reads were discarded, because they were too short to properly map the reference genome.

We had two hypothesis to explain this: the DNA template was too short (blame the biologists !), or the software was not able to interpret the whole sequence (blame the bioinformati... blame no one !). To answer that critical question, I peaked into Life technologies documentation to extract the more I could from the BAM files related to this experiment. What a treasure I found there ! It appeared that while abandoning the SFF format, Life technologies switched to the BAM format to handle every piece of data generated, including the ionograms on which they rely to call bases.

For those unfamiliar with the IonTorrent technologies, it is very similar to pyrosequencing: bases (A, C, G and T) are flowed one by one in a predetermined order, and when one hybridizes the template and extends the sequencing primer, a signal (small pH shift) is observed. It can become very tricky when homopolymers (repetitions of the same base) are sequenced, as the signal is supposed to linearly increase according to the length of the repetition (and you know that in biology things are never happening as they are supposed to ...). As if it was not enough, templates are rarely 100% pure, and when you try to sequence multiple templates in diverse proportions, a signal of variable intensity is recorded on each flow. Each template is sequenced as its own speed, and what you end with is just a solid headache.

So I wrote a small R script to visualize this data, to get an idea on what was happening. It can be piped to SAMtools, to generate image files for every single read in a BAM file:
samtools view -h file.bam | head | ./plotIonogram.r

Coupled with a small AWK script to filter reads on size, it was enough to get an idea on the problem:
samtools view -h file.bam | awk '$0 ~ /^@/ || length($10) < 15 { print }' | head | ./plotIonogram.r

Unlike normal sized reads, there were no 3' adapter visible in those reads, so it was clear that software trimming was involved. And as far as we can judge from the high variance in the template part comparing to the key sequence part of the ionogram, it was already pretty imaginative to return a sequence of 11 bases from this data. This is typically what can be expected from a multi-template (or "polyclonal") sequencing: good quality as long as the templates are identical (key sequence and barcode in front of the template itself), followed by a noisy mess as each template contributes to each signal in variable proportion.