We can’t fault C. elegans on grounds of utility, but it is clearly a much less complex organism than our good selves. Why are we so much more sophisticated? Given the importance of proteins in cellular function, the original assumption was that complex organisms like mammals have more protein-coding genes than simple creatures like C. elegans. This was a perfectly reasonable hypothesis but it has fallen foul of a phenomenon described by Thomas Henry Huxley. He was Darwin’s great champion in the 19th century and it was Huxley who first described ‘the slaying of a beautiful hypothesis by an ugly fact’.

As DNA sequencing technologies became cheaper and more efficient, laboratories throughout the world sequenced the genomes of many different organisms. They were able to use various software tools to identify the likely protein-coding genes in these different genomes. What they found was really surprising: there were far fewer protein-coding genes than expected. Before the human genome was decoded, scientists had predicted there would be over 100,000 such genes. We now know the real number is between 20,000 and 25,000 genes[128]. Even more oddly, C. elegans contains about 20,200 genes[129], not so very different a number from our own.

Not only do we and C. elegans have about the same number of genes, these genes tend to code for pretty much the same proteins. By this we mean that if we analyse the sequence of a gene in human cells, we can find a gene of broadly similar sequence in the nematode worm. So the phenotypic differences between worms and humans aren’t caused by Homo sapiens having more, different or ‘better’ genes.

Admittedly, more complicated organisms tend to splice their genes in more ways than simpler creatures. Using our CARDIGAN example from Chapter 3 as an analogy once again, C. elegans might only be able to make the proteins DIG and DAN whereas mammals would be able to make those two proteins and also CARD, RIGA, CAIN and CARDIGAN.
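
For readers who like to see the combinatorics spelled out, here is a small Python sketch. It is purely a toy illustration of the analogy (nothing from the genome studies themselves) that treats each letter of CARDIGAN as an exon and generates every splice product that keeps the exons in their original order. All six of the 'proteins' mentioned above turn up among the possibilities.

```python
from itertools import combinations

GENE = "CARDIGAN"  # each letter stands in for one exon

def splice_products(gene):
    """Yield every splice product: exons kept in genomic order,
    with any subset skipped (i.e. the ordered subsequences)."""
    for r in range(1, len(gene) + 1):
        for combo in combinations(range(len(gene)), r):
            yield "".join(gene[i] for i in combo)

# The 'proteins' each organism can make, per the analogy above.
WORM_PROTEINS = {"DIG", "DAN"}
MAMMAL_PROTEINS = WORM_PROTEINS | {"CARD", "RIGA", "CAIN", "CARDIGAN"}

products = set(splice_products(GENE))
print(MAMMAL_PROTEINS <= products)  # True: all six are valid splice products
print(len(products))                # distinct products possible from one gene
```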

This certainly would allow humans to generate a much greater repertoire of proteins than the 1mm worm, but it introduces a new problem. How do more complicated organisms regulate their more complicated splicing patterns? This regulation could in theory be controlled solely by proteins, but this in turn has difficulties. The more proteins a cell needs to regulate in a complicated network, the more proteins it needs to do the regulation. Mathematical models have shown that this rapidly leads to a situation where the number of proteins that we need begins to outstrip the number of proteins that we actually possess – clearly a non-starter.
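
To get a feel for why protein-only regulation fails to scale, consider a toy calculation. It is a deliberately crude stand-in for the actual mathematical models, built on the made-up assumption that each pair of proteins needing co-ordination requires its own dedicated regulator:

```python
# A deliberately crude illustration, not the published models the text
# alludes to: assume every pair of proteins whose activity must be
# co-ordinated needs its own dedicated regulatory protein.

def regulators_needed(n_proteins):
    # one hypothetical regulator per pair of regulated proteins
    return n_proteins * (n_proteins - 1) // 2

for n in (10, 100, 1_000, 20_000):
    print(f"{n:>6} proteins need {regulators_needed(n):>12,} regulators")

# The last line shows ~200 million regulators for a 20,000-protein
# proteome: the regulators outstrip the proteins being regulated.
```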

Do we have an alternative? We do, and it’s indicated in Figure 10.1.

Figure 10.1 This graph demonstrates that the complexity of living organisms scales much better with the percentage of the genome that doesn’t code for protein (black columns) than it does with the number of base-pairs coding for protein in a genome (white columns). The data are adapted from Mattick, J. (2007), J Exp Biol 210: 1526–1547.

At one extreme we have the bacteria. Bacteria have very small, highly compacted genomes. Their protein-coding genes cover about 4,000,000 base-pairs, which is about 90 per cent of their genome. Bacteria are very simple organisms and fairly rigid in the way they control their gene expression. But things change as we move further up the evolutionary tree.

The protein-coding genes of C. elegans cover about 24,000,000 base-pairs, but that only accounts for about 25 per cent of their genome. The remaining 75 per cent doesn’t code for protein. By the time we reach humans, the protein-coding regions cover about 32,000,000 base-pairs, but this only represents about 2 per cent of the total genome. There are various ways that we can calculate the protein-coding regions, but they make relatively little difference to the astonishing bottom line. Over 98 per cent of the human genome doesn’t code for protein. All but 2 per cent of our genome is ‘junk’.
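
The arithmetic behind these percentages is easy to check. The genome sizes in the snippet below are round approximations chosen purely for illustration, and, as noted above, the human figure shifts depending on how generously the protein-coding regions are defined:

```python
# Back-of-envelope check of the percentages quoted above, using
# round-number genome sizes (approximations for illustration only).
genomes = {
    # organism: (protein-coding base-pairs, total genome base-pairs)
    "bacterium":  (4_000_000, 4_400_000),
    "C. elegans": (24_000_000, 100_000_000),
    "human":      (32_000_000, 3_200_000_000),
}

for name, (coding, total) in genomes.items():
    print(f"{name:>10}: {100 * coding / total:5.1f}% protein-coding")

# bacterium :  90.9% protein-coding
# C. elegans:  24.0% protein-coding
# human     :   1.0% protein-coding (the text's 'about 2 per cent'
# reflects a more generous definition of the coding regions)
```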

In other words, the numbers of genes, or the sizes of these genes, don’t scale with complexity. The only feature of a genome that really seems to get bigger as organisms get more complicated is the section that doesn’t code for protein.

The tyranny of language

So what are these non-coding regions of the genome doing, and why are they so important? It’s when we start to consider this that we begin to notice what a strong effect language and terminology have on human thought processes. These regions are called non-coding, but what we mean is that they don’t code for protein. This isn’t the same as not coding at all.

There is a well-known scientific proverb: absence of evidence is not the same as evidence of absence. For example, in astronomy, once scientists had developed telescopes that could detect infrared radiation, they were able to detect thousands of stars that had never been ‘seen’ before. The stars had always been there, but we couldn’t detect them conclusively until we had an instrument for doing so. A more everyday example might be a mobile phone signal. Such signals are all around us, but we cannot detect them unless we have a mobile phone. In other words, what we find depends very much on how we are looking.

Scientists identify the genes that are expressed in a specific cell type by analysing the RNA molecules. This is done by extracting all the RNA from cells and then analysing it with a range of techniques, building up a database of all the RNA molecules that are present. When researchers in the 1980s first began investigating which genes were expressed in a given cell type, the techniques available were relatively insensitive. They were also designed to detect only mRNA molecules, as these were the ones that were assumed to be important. These methods tended to be good at detecting highly expressed mRNAs and quite poor at detecting the less well-expressed sequences. Another confounding factor was that the software used to analyse mRNA was set to ignore signals originally generated from repetitive, i.e. ‘junk’, DNA.

These techniques served us very well for profiling the mRNA that we were already interested in – the mRNA molecules that coded for proteins. But as we have seen, this only represents about 2 per cent of the genome. It wasn’t until new detection technologies were coupled with hugely increased computing power that we began to realise that something very interesting was happening in the remaining 98 per cent – the non-coding part of our genome.

With these improved methodologies, the scientific world began to appreciate that there was actually a huge amount of transcription going on in the parts of the genome that didn’t code for proteins. Initially this was dismissed as ‘transcriptional noise’. It was suggested that there was a baseline murmur of expression from all over the genome, as if these regions of DNA occasionally produced an RNA molecule that got above a detection threshold. The concept was that although we could detect these molecules with our new, more sensitive equipment, they weren’t really biologically meaningful.

The phrase ‘transcriptional noise’ implies a basically random event. However, the patterns of expression of these non-protein-coding RNAs were different for different cell types, which suggested that their transcription was far from random[130]. For example, there was a lot of this expression in the brain. It’s now become clear that the patterns of expression are different in different brain regions[131]. This effect is reproducible when the various brain regions are compared from different individuals. This isn’t what we would expect if this low-level transcription of RNA was a purely random process.

128. http://genome.wellcome.ac.uk/node30006.html

129. http://wiki.wormbase.org/index.php/WS205

130. For a useful review see Qureshi et al. (2010), Brain Research 1338: 20–35.

131. Clark and Mattick (2011), Seminars in Cell and Developmental Biology, in press at time of publication.