On the Astonishing Improbability of Protein Families

In 2007 I published a paper in the journal Theoretical Biology and Medical Modelling¹ in which I presented a method to estimate the amount of information required to encode the instructions for a protein into a genome. The results were disturbing to those committed to the belief that nature can easily “find” the digital information required to code for thousands of different proteins. Given the levels of functional information I estimated from actual data in the online Pfam database², when we solve for their probability using an equation published by Hazen et al.³, it turns out that the level of required information encoded in DNA is so improbable that we can never expect nature to achieve the code for even one protein family anywhere in the universe, over the lifespan of the universe. Never mind thousands of them.
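For readers who want to see the arithmetic behind "solving for probability," the relation published by Hazen et al. is I(Ex) = -log2 F(Ex), where F(Ex) is the fraction of all possible sequences that achieve the function. The short Python sketch below simply inverts that relation. The function names are hypothetical, and the 100-bit input is an arbitrary illustrative value, not a figure taken from Pfam data.

```python
import math


def bits_from_fraction(functional_fraction: float) -> float:
    """Hazen et al.: I(Ex) = -log2 F(Ex), with F(Ex) the functional fraction."""
    return -math.log2(functional_fraction)


def fraction_from_bits(functional_bits: float) -> float:
    """Invert the relation: F(Ex) = 2 ** -I(Ex)."""
    return 2.0 ** (-functional_bits)


def log10_fraction_from_bits(functional_bits: float) -> float:
    """Same inversion in log10 form, which avoids underflow for large bit counts."""
    return -functional_bits * math.log10(2)


if __name__ == "__main__":
    # 100 bits is an arbitrary illustrative value, not a measured result.
    print(f"F(Ex) for 100 bits is about 10^{log10_fraction_from_bits(100):.1f}")
```

Working in log10 is a convenience only: for estimates of several hundred bits or more, the direct power underflows ordinary floating point.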

Elsewhere, I have proposed the following testable, falsifiable and verifiable hypothesis:

Hypothesis: The ability to produce statistically significant levels of functional information is a unique attribute of intelligent minds.

The key word here is “unique”; nothing else can do it. In other words, natural processes cannot write computer programs, which poses a problem for those who are committed to the belief that nature created life. As Craig Venter, the first person to decode the human genome, put it …

“All living cells that we know of on this planet are ‘DNA software’-driven biological machines comprised of hundreds of thousands of protein robots, coded for by the DNA, that carry out precise functions.”⁴

Just a month ago, Matthew Matlock and S. Joshua Swamidass uploaded a paper to a public archive, bioRxiv⁵. Using the results of a very simple simulation they had designed, they claimed that my published method fantastically overestimated the functional information required to code for a protein family, and that they therefore had falsified my testable hypothesis stated above. Since it had not been peer reviewed, I wrote a formal review of their paper⁶. For the layperson, there are three major flaws:

The authors’ simulations are grossly inadequate to justify their conclusion that real-life data for protein families suffer from the same problem as the data generated by their simple simulation.

There is a complete absence of any data to support their conclusion that they have falsified the hypothesis mentioned above. It seems that they do not understand the concept of functional information as defined by Hazen et al., and are confused about the difference between an estimate and the actual value. As I stated in my review, “One does not actually produce functional information by ‘fantastically’ overestimating it from a badly skewed sample.”

Finally, and most devastating, a more realistic simulation completely falsifies their two major conclusions (available here).

In their simulation, they began with a perfectly ordered repeating sequence and then mutated it to see if the estimated functional information for non-functional sequences would converge on the actual value of zero bits of information. It did not, producing estimates that were significantly in error from the known value of zero bits. They provided no analysis as to why their results were so badly off.
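To make concrete what "estimated functional information" means for a set of aligned sequences, here is a minimal Python sketch of a column-entropy estimate in the spirit of the method in reference 1. It is a simplification under stated assumptions, not the published implementation: the null state is taken as a uniform distribution over 20 amino acids (log2 20, about 4.32 bits per site), gaps are not handled, and no sample-size correction is applied. All function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: a uniform null state over the 20 amino acids.
NULL_BITS_PER_SITE = math.log2(len(AMINO_ACIDS))


def site_entropy(column):
    """Shannon entropy (bits) of the residues observed in one aligned column."""
    total = len(column)
    counts = Counter(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def estimate_functional_bits(alignment):
    """Sum over sites of (null-state entropy minus observed column entropy)."""
    columns = zip(*alignment)  # transpose equal-length rows into columns
    return sum(NULL_BITS_PER_SITE - site_entropy(col) for col in columns)


def information_density(alignment):
    """Estimated bits per site."""
    return estimate_functional_bits(alignment) / len(alignment[0])


if __name__ == "__main__":
    # Copies that have not diverged from an ordered seed look fully conserved.
    undiverged = ["ACACACACAC"] * 50
    print(round(information_density(undiverged), 2), "bits/site")
```

The sketch shows why sampling matters: sequences that still resemble a shared ordered seed have low column entropy, so the estimate is high even though the true value for non-functional sequences is zero bits. The estimate approaches zero only once every column has had the chance to sample the amino acids roughly evenly.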

I wrote a more realistic simulation that began with the same highly ordered repeating sequence. From that seed sequence the program produces a universal common ancestral population from which numerous, independently evolving populations can be produced. The user can vary the length of the sequences, the size of the populations, the number of populations that descend from the universal ancestral population, the percentage of sequences in each generation that produce progeny, the mutation rate, and the number of generations through which the populations evolve. The members of each generation that produce progeny are chosen at random.
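The process just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the program referred to in this article. Several details are assumptions made for the example: the two-letter repeating seed, the number of generations used to form the ancestral population, the per-site mutation model, and the choice to sample one sequence from each descendant population to build the alignment. All names and default values are hypothetical and deliberately small.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, mutation_rate, rng):
    """Each site is replaced by a random amino acid with probability mutation_rate.
    The replacement may happen to equal the original residue."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < mutation_rate else aa
        for aa in seq
    )


def next_generation(population, replication_fraction, mutation_rate, rng):
    """Randomly choose the members that produce progeny, then refill the population."""
    n_parents = max(1, int(len(population) * replication_fraction))
    parents = rng.sample(population, n_parents)
    return [mutate(rng.choice(parents), mutation_rate, rng) for _ in population]


def evolve(population, generations, replication_fraction, mutation_rate, rng):
    for _ in range(generations):
        population = next_generation(
            population, replication_fraction, mutation_rate, rng
        )
    return population


def simulate(seq_len=20, pop_size=30, n_populations=10, generations=50,
             replication_fraction=0.5, mutation_rate=0.01,
             ancestral_generations=10, seed=0):
    """Return one sampled sequence per independently evolving population."""
    rng = random.Random(seed)
    seed_sequence = ("AC" * seq_len)[:seq_len]  # ordered repeating seed (assumed)
    ancestral = evolve([seed_sequence] * pop_size, ancestral_generations,
                       replication_fraction, mutation_rate, rng)
    alignment = []
    for _ in range(n_populations):
        descendants = evolve(list(ancestral), generations,
                             replication_fraction, mutation_rate, rng)
        alignment.append(rng.choice(descendants))  # sampling rule is an assumption
    return alignment


if __name__ == "__main__":
    for row in simulate():
        print(row)
```

Here a mutation rate of 0.01 corresponds to 1 mutation per 100 amino acids per generation. Run time grows with the product of population size, number of populations, number of generations, and sequence length, which is why small values are the sensible starting point.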

Try it for yourself: The program and modules are available here. One can try a range of values to see their effect. Caution: it is best to start small, as certain combinations of large values can easily result in run times of many hours, days, or even weeks.

Results:

The more realistic simulation produces estimates of functional information that converge on the actual value of zero bits as the various populations evolve. The rate of convergence depends upon the values the user inputs for the variables. Various inputs produced estimates that were equal to, or close to, zero bits, which is the actual functional information content of non-functional sequences. Two examples that falsify the conclusions of Matlock and Swamidass are as follows:

For sequences of 100 amino acids, in 1,000 independently evolving populations of 1,000 members each, over 500 generations, and with a replication rate of 0.1 and a mutation rate of 1 mutation per 100 amino acids, the estimated functional information for the resulting multiple sequence alignment was 0 bits, with an information density of 0.00 bits/site.

For sequences of 100 amino acids, in 1,000 independently evolving populations of 100 members each, evolving over 100,000 generations with a replication rate of 0.5 and a mutation rate of only 0.003 mutations per 100 amino acids, the estimated functional information of the resulting multiple sequence alignment was 1 bit, with an information density of 0.01 bits/site.

Discussion:

It should be pointed out that for universal protein families in the Pfam database, or protein families that are common across phyla, the universal ancestral population that gave rise to them would go back to the earliest discovered evidence for life which, for an evolutionary scenario, is more than a billion years (i.e., more than a billion generations). Not only would there likely be multiple, independently evolving populations within the same species, but across genera, orders and phyla. Thus, for a universal protein family, there would be easily tens of thousands of independently evolving populations, if not hundreds of thousands or even millions, when we consider all the different taxa containing that protein family evolving over hundreds of millions of generations. Thus, the data we have in the Pfam database today is the outcome of not merely one population as Matlock and Swamidass simulated, but of a process modelled by the much more realistic simulation I have provided.

Conclusions:

A more realistic simulation falsifies the conclusions in the Matlock and Swamidass paper. At present, their paper falls substantially short of the standard we should expect for science, and requires significant revision if it is to be salvaged. More likely, it should be retracted from the public archive bioRxiv. Furthermore, a more realistic simulation not only falsifies their conclusions, but provides reason to believe that the method I presented in my original paper yields an estimate that is more reliable than previously thought, at least for protein families that span numerous, independently evolving taxa. Finally, Matlock and Swamidass have not provided any data whatsoever to falsify my hypothesis stated earlier. The scientific evidence, therefore, from actual data available on Pfam, suggests that the information required to code for protein families is statistically very significant and, thus, tests positive for an intelligent source.

References:

(1) “Measuring the functional sequence complexity of proteins,” K.K. Durston, D.K.Y. Chiu, D.L. Abel, J.T. Trevors, Theoretical Biology and Medical Modelling (2007), 4:47, DOI: 10.1186/1742-4682-4-47

(2) “The Pfam protein families database: towards a more sustainable future,” R.D. Finn, P. Coggill, R.Y. Eberhardt, S.R. Eddy, J. Mistry, A.L. Mitchell, S.C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, G.A. Salazar, J. Tate, A. Bateman, Nucleic Acids Research (2016), Database Issue 44:D279-D285

(3) “Functional information and the emergence of biocomplexity,” Robert M. Hazen, Patrick L. Griffin, James M. Carothers, and Jack W. Szostak, PNAS 2007 104 (suppl 1) 8574–8581; published ahead of print May 9, 2007

(4) “Passing the Baton of Life – from Schrodinger to Venter,” New Scientist, 13 July, 2012

(5) “Evolution and Functional Information,” bioRxiv, 2017.

(6) “Review of ‘Evolution and functional information,’” April 2017.

