I downloaded some datasets from the SILVA96 database. These are structurally aligned SSU rDNA sequences. I browsed through the taxonomic groups and chose annelids (N=1050) and nematodes (N=5048) as smallish tests. I downloaded these as fasta files.
I started with the annelids file. The file contains a LOT of gaps, because it comes from an alignment of hundreds of thousands of sequences from all three domains of life.
I haven’t yet found a good way to process large files to remove columns that are all gaps. It can be done in Clustal and Mesquite, but these are bad choices for very large alignments. There are some online resources, but my fasta files are >50-250MB, so online is not the place even if I could persuade a server to accept my uploads. I should really have used BioPerl SimpleAlign to remove gap columns; it’s probably the most flexible option and able to deal with big files. But I was temporarily having trouble installing BioPerl on my desktop (a future post) and ran out of time and patience.
I ran it through Gblocks instead, which does more than just remove blank columns: it also trims areas of poor alignment, judged by various criteria. This reduced the file considerably.
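For what it's worth, dropping all-gap columns does not strictly need an alignment library. Below is a minimal Python sketch (not the BioPerl SimpleAlign route mentioned above) that makes two passes over the file, so only one sequence is held in memory at a time. The function names are made up for illustration, and it assumes that gaps are written as `-` or `.` and that every sequence in the file has the same aligned length.

```python
# Assumption: '-' and '.' are the only gap characters in the alignment.
GAP_CHARS = set("-.")


def read_fasta(path):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def occupied_columns(path):
    """First pass: flag every column that has a non-gap character somewhere."""
    keep = None
    for _, seq in read_fasta(path):
        if keep is None:
            keep = bytearray(len(seq))
        elif len(seq) != len(keep):
            raise ValueError("sequences differ in length; not an alignment")
        for i, ch in enumerate(seq):
            if ch not in GAP_CHARS:
                keep[i] = 1
    return keep if keep is not None else bytearray()


def strip_gap_columns(in_path, out_path):
    """Second pass: write each sequence with the all-gap columns removed.

    Returns the number of columns that were kept.
    """
    keep = occupied_columns(in_path)
    kept_idx = [i for i, flag in enumerate(keep) if flag]
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            out.write(f">{header}\n")
            out.write("".join(seq[i] for i in kept_idx) + "\n")
    return len(kept_idx)
```

Something like `strip_gap_columns("annelids.fasta", "annelids.nogaps.fasta")` would be the call (the output filename is hypothetical). Each sequence is written on a single line, which most tools accept. I have not timed this on files of this size, and pure Python loops will not be fast.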
I had previously installed FastTree, so I ran it with the command
fasttree -nt annelids.fasta >annelids.tree
It ran quite nicely and produced a viable tree.
Something strange with the timings though.
Topology done after 1242.20 sec -- computing support values
Unique: 3137/5048 Bad splits: 37/3134 Hill-climb: 259 Update-best: 11335 NNI: 4149
Top hits: close neighbors 2510/3137 refreshes 176
Time 1577.05 Distances per N*N: by-profile 0.220 (out 0.065) by-leaf 0.291
END: 2008-10-28 18:23:32
----------------------------------------------
Runtime: 5886 seconds
Runtime: 01:38:06 h:m:s
----------------------------------------------
The text starting with “END:” is the output of my Perl script; everything before that is from FastTree. So FastTree claims to have taken 1577 seconds (26 minutes), but my script times it at 1 hour 38 minutes. I actually noted the time it started, and it did take 1 hour 38 minutes. I repeated the run with identical results. A strange discrepancy.
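I don't know what FastTree's own "Time" figure measures, so I won't guess at the cause. One way to narrow it down would be to record both the wall-clock time and the CPU time used by the child process. Here is a rough Python sketch of such a wrapper. It is not the Perl script I used, and the helper name `time_command` is made up for illustration. On Windows, `os.times()` reports zero for child CPU time, so this is only meaningful on Unix-like systems.

```python
import os
import subprocess
import time


def time_command(argv, stdout_path=None):
    """Run argv and return (wall_seconds, child_cpu_seconds).

    If stdout_path is given, the command's standard output is written there,
    mimicking a shell redirect such as '>annelids.tree'.
    """
    before = os.times()
    start = time.monotonic()
    if stdout_path is not None:
        with open(stdout_path, "w") as out:
            subprocess.run(argv, stdout=out, check=True)
    else:
        subprocess.run(argv, check=True)
    wall = time.monotonic() - start
    after = os.times()
    # CPU time (user + system) accumulated by finished child processes.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = time_command(["fasttree", "-nt", "annelids.fasta"], "annelids.tree")
    print(f"wall-clock: {wall:.1f} s, child CPU: {cpu:.1f} s")
```

If the wall-clock figure is much larger than the CPU figure, the process spent a lot of its time waiting rather than computing, which would at least point in a direction.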
For removing alignment columns that are all gaps, I typically use BELVU from Erik Sonnhammer’s group. I am not sure if it can handle your giant files, though. BELVU is mainly an alignment viewer, but it can also be run in command line mode and is reasonably fast.
Good luck,
Kay