I have a project going at the moment to examine changes in intron diversity, size and location in animal genomes. I am always a bit frustrated with the way introns are treated in many genome characterisation papers- “the genome contained Y introns with mean intron size Xbp” is usually all we get. This sort of summary stat can hide all manner of interesting trends. One measure that is often useful is intron density but unfortunately there doesn’t seem to be any standardised way to use this measure. Density is often measured as ‘introns per gene’, which is a reasonable shorthand, although since genes vary in length very considerable both within and between genomes it makes quantitative analysis very difficult indeed. I have seen ‘introns per 10kb’! This is OK, but what number to choose? What if each study chooses a different number? ‘Introns per nucleotide’ will standardise this better, and although the number will be very small, we seem to manage just fine with small mutation rate numbers and the like. But the more I think about it the less simple this seems.
Introns per bond
Something that is often overlooked when calculating the number of introns per nucleotide is that introns do not insert into nucleotides but rather the phosphodiester bonds between them. I would suggest therefore that the most accurate and effective way to specify density would be introns per bond. It seems reasonable that counting nucleotides is a convenient shorthand for this, but actually this shorthand leads to small but persistent errors. This is an unfortunate consequence of genome annotation restricting itself to nucleotides but genomic processes sometimes targeting bonds.
In the cartoon of a gene above CDS represents the protein coding region and UTR stands for the 5′ and 3′ untranslated regions. The dashes between nucleotides represent phosphodiester bonds joining the nucleotides. There are 6 nucleototides in the 5′-UTR, 9 nucleotides in the CDS and 5 nucleotides in the 3′-UTR. What would happen if we were to insert an intron in this sequence at the boundary of one of the gene regions? In which gene region would it be counted?
Coding regions almost always begin with a codon specifying a methionine residue- the start codon ATG. Nothing preceding this A nucleotide is counted as part of the CDS. Coding regions finish with a termination codon, TGA in the example above. This A nucleotide is the end of the CDS. By usual practice therefore any intron inserting into the bond between the T and the A at the 5′ end of the CDS would not be counted as part of the CDS, nor would any intron inserting into the bond between the A and the A at the 3′ end of the CDS. This is quite reasonable in many ways, but defining UTRs by reference to the CDS (after the last nucleotide, before the first nucleotide) means that the CDS has one less bond per nucleotide than do the UTRs! The 5′-UTR here has 6 nucleotides and 6 bonds, the 3′-UTR has 5 nucleotides and 5 bonds, but the CDS has 9 nucleotides and only 8 bonds where an inserting intron would be labelled as a ‘CDS intron’.
Does this matter?
Both yes and no. Counts using introns/nucleotide will be very similar to introns/bond. I am not claiming that work needs to be repeated or that substantial errors put into question previous work. But there are two issues here
- We should do it right. Understanding the actual insertion process requires us to use the right language. We should label introns as inserting between nucleotides to avoid confusion. You may not be confused, but try writing a script that counts introns when everything is labelled by nucleotide position.
- We can’t yet be sure what difference correct counts make in large data sets. The age of genomics is here. We can study hundreds of thousands of introns from lots of species and treat this mass of data statistically. The numbers of introns/nucleotide and introns/bond may look similar to our eyes, but trusting our savannah ape brains to make the right call is a risky strategy with big numbers.