Sylvain Mareschal, Ph.D.
Bioinformatics engineer
December 9, 2013 at 19:51
blast2sam
Updated on 2014-01-17 to comply with multi-line BLAST alignments.
Updated on 2014-06-18 to comply with BLASTN 2.2.26 in addition to NCBI's BLASTN 2.2.29+.
Updated on 2014-09-04 to force sequence uppercase, following Nathan Brown's suggestion.
Up-to-date script can be found here


As my lab is currently experimenting Life's AmpliSeq technology with the PGM sequencer, I tried a few week ago to compare the alignments of the resulting reads with the primers used for the target enrichment, especially in regions where enrichment was not expected. BLASTN is a wonderful tool to align such short sequences in a single batch allowing mismatches and gaps, and IGV's my new favorite to visualize BAMs and annotations. All I needed was a suitable format for the alignments produced by BLASTN, as its native one is more attended to human readers than automatic parsers.

Despite the relative flexibility of IGV and BLASTN in terms of handled formats, there were no way to content the two of them. A BED file built from BLASTN's tabular output could have made the deal, but as I was also interested in a suitable representation of mismatches and gaps in IGV, i needed something more detailed. After a quick internet research, i found an obscure blast2sam.pl script distributed with SAMtools, but it was quite unsatisfactory : the base position extraction feature is currently broken, sequences are not stored and Picard almost had an heart attack when I asked him if the produced SAM file was valid. Here is a short sample :

cat samtools.out | head | column -t
AMPL4575434462_F 0 ref|NC_000017.10| 0 255 3H0M187755651374H * 0 0 * * AS:i:32 EV:Z:4.4 AMPL4575434462_F 0 ref|NC_000017.10| 0 255 4H0M187755651374H * 0 0 * * AS:i:30 EV:Z:17 AMPL4575434462_F 0 ref|NC_000017.10| 0 255 9H0M187755651369H * 0 0 * * AS:i:30 EV:Z:17 AMPL4575434462_F 16 ref|NC_000017.10| 0 255 187755651375H0M3H * 0 0 * * AS:i:30 EV:Z:17 AMPL4575434462_F 16 ref|NC_000017.10| 0 255 187755651369H0M10H * 0 0 * * AS:i:28 EV:Z:68 AMPL4575434462_F 16 ref|NC_000017.10| 0 255 187755651370H0M5H * 0 0 * * AS:i:28 EV:Z:68 AMPL4575434462_F 0 ref|NC_000017.10| 0 255 8H0M187755651371H * 0 0 * * AS:i:28 EV:Z:68 AMPL4575434462_F 0 ref|NC_000017.10| 0 255 1H0M187755651374H * 0 0 * * AS:i:28 EV:Z:68 AMPL4575434462_F 0 ref|NC_000017.10| 0 255 10H0M187755651369H * 0 0 * * AS:i:28 EV:Z:68 AMPL4575434462_F 0 ref|NC_000017.10| 0 255 5H0M187755651374H * 0 0 * * AS:i:28 EV:Z:68
However the idea was interesting, so i overcame my visceral reluctance against Perl and made the decision to rewrite it from scratch. The resulting script can be found here, and its usage is straightforward :

# Produce a sorted / indexed BAM file
./blast2sam.pl BLASTN.out | samtools view -Sb - | samtools sort - BLASTN
samtools index BLASTN.bam

# Check with Picard
java -jar ValidateSamFile.jar I=BLASTN.bam

As you can see, resulting files are valid 1.4 SAM files with suitable headers :

samtools view -h BLASTN.bam | head -50 | column -t
@HD VN:1.4 @SQ SN:chr1 LN:249250621 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 [...] @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566 @RG ID:6FRXJA1201R SM:6FRXJA1201R PG:BLASTN 2.2.28+ AMPL4575434462_F 0 chr17 7573839 42 24= * 0 0 CATGAAGGCAGGATGAGAATGGAA * RG:Z:6FRXJA1201R XB:f:48.1 XE:f:7e-05 NM:i:0 AMPL4575434462_F 256 chr17 67686383 0 3H16=5H * 0 0 GAAGGCAGGATGAGAA * RG:Z:6FRXJA1201R XB:f:32.2 XE:f:4.4 NM:i:0 AMPL4575434462_F 256 chr17 21947201 0 4H15=5H * 0 0 AAGGCAGGATGAGAA * RG:Z:6FRXJA1201R XB:f:30.2 XE:f:17 NM:i:0 AMPL4575434462_F 256 chr17 41629877 0 9H15= * 0 0 AGGATGAGAATGGAA * RG:Z:6FRXJA1201R XB:f:30.2 XE:f:17 NM:i:0 AMPL4575434462_F 272 chr17 48645634 0 6H15=3H * 0 0 TCTCATCCTGCCTTC * RG:Z:6FRXJA1201R XB:f:30.2 XE:f:17 NM:i:0 AMPL4575434462_F 272 chr17 4077210 0 14=10H * 0 0 TTCCATTCTCATCC * RG:Z:6FRXJA1201R XB:f:28.2 XE:f:68 NM:i:0 AMPL4575434462_F 272 chr17 9115648 0 1H12=1X5=5H * 0 0 TCCATTCTCATCCTGCCT * RG:Z:6FRXJA1201R XB:f:28.2 XE:f:68 NM:i:1 AMPL4575434462_F 256 chr17 10650458 0 8H14=2H * 0 0 CAGGATGAGAATGG * RG:Z:6FRXJA1201R XB:f:28.2 XE:f:68 NM:i:0 AMPL4575434462_F 256 chr17 13047619 0 1H10=1X7=5H * 0 0 ATGAAGGCAGGATGAGAA * RG:Z:6FRXJA1201R XB:f:28.2 XE:f:68 NM:i:1 AMPL4575434462_F 256 chr17 30541636 0 10H14= * 0 0 GGATGAGAATGGAA * RG:Z:6FRXJA1201R XB:f:28.2 XE:f:68 NM:i:0
With such a BAM file in hands, I was ready for my IGV exploration. For the curious ones, unexpected enrichment did occurred and was explained by primers from distinct couples aligning in suitable distances and ways. However as these "off-target" reads represented less than 1% of the experiment reads, it was no big deal ... This can be explained by the fact that unexpected alignments were weaker than designed ones, as at least one the two primers was aligned on a shorter length or with a mismatch. We were lucky in this design, but the more amplicons we will add, the more vigilant i will become ...

A few caveats to end this article : currently long alignments outputted on several lines are not handled, and reference sequences are supposed to be NCBI's human chromosomes (however it should work with other reference sequences, but chromosome name translation won't occur). It may be slightly slower than SAMtool's version, as the header reconstruction needs the file to be crawled twice. In my opinion BLASTN output will never be large enough for you to notice, unless you are working on a Raspberry Pi (if so, good luck with your career as a bioinformatician !)