## How many proteins can be made using the entire mass of the earth?

The mass of the earth is given by my physics book (Halliday 6th Ed.) as 6 x 10^27 grams. If we made just one molecule of each protein containing n amino acids linked together, when would we run out of material? Make a guess. I found the results surprising.

Assume the earth is made of nothing but hydrogen, oxygen, nitrogen, carbon and sulfur. Clearly not true, but we’re going for what mathematicians call an upper bound. If mathematicians can get away with things like “consider a spherical cow,” I can get away with this. (The cognoscenti may wish to go for a least upper bound.) Proteins are linear chains of 20 different amino acids ranging in mass from glycine at 75 Daltons to tryptophan at 204. When two are linked together by an amide (peptide) bond, 18 Daltons of mass is lost (water is split out). So figure the average amino acid at 100 Daltons (roughly).

So there are 20 x 20 = 400 distinct proteins of 2 amino acids, 8000 with 3, 160,000 with 4, and 3,200,000 with just 5. Shorties like this are called peptides (or polypeptides), and just when you start calling them proteins seems to be a matter of taste.

We’re figuring the mass of the typical amino acid at 100 Daltons, but a Dalton doesn’t have much mass. It is 1/12 the mass of a single atom of carbon-12, Avogadro’s number (about 6 x 10^23) of which have a mass of 12 grams. So one Dalton has a mass of 10^-24 grams (roughly).

The number of distinct proteins containing n amino acids is 20^n. The mass of each protein (in Daltons) is roughly 100 x n, depending on the amino acids chosen. The mass of the collection of distinct proteins of length n, in grams, is (20^n) x (100 x n) x (10^-24). It’s clear that we’re over 1 gram for the collection at only 24 amino acids, since 20^24 = 2^24 x 10^24, and the 10^24 cancels the 10^-24. How far over? 2^24 x 100 x 24 = 40,265,318,400, or about 4 x 10^10 grams.

As noted, the mass of the earth is 6 x 10^27 grams. So at 24 amino acids we’re not too far away: a factor of about 10^17 short. Certainly no farther away than another 17 amino acids, since each additional amino acid multiplies the number of proteins by 20, and 20^17 is much greater than 10^17.

So, the mass of the earth (which isn’t all carbon, hydrogen, etc.) isn’t enough to make just one molecule of each of the possible proteins 41 amino acids long. 41 amino acids is a very small protein (some would call it a polypeptide). Just about every protein of biological interest is much larger. The champ is a muscle protein called titin, which has over 27,000 amino acids.
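The arithmetic above is easy to check with a few lines of Python (a sketch using the post’s round numbers: 100 Daltons per amino acid, 10^-24 grams per Dalton, 6 x 10^27 grams for the earth; the variable names are mine). Run exactly, the same round numbers actually cross the earth’s mass a little before 41, at 38 amino acids; the 24 + 17 step above is deliberately loose, so the situation is, if anything, slightly worse than stated.

```python
# Mass of one molecule of each distinct n-amino-acid protein,
# using the post's round numbers (all constants are approximations).
EARTH_MASS_G = 6e27      # grams, per Halliday
DALTON_G = 1e-24         # grams per Dalton (roughly)
AVG_AA_DALTONS = 100     # average amino acid mass after water is split out

def collection_mass_g(n):
    """Grams needed for one molecule of every one of the 20**n proteins of length n."""
    return (20 ** n) * (AVG_AA_DALTONS * n) * DALTON_G

# Smallest chain length whose full collection outweighs the earth.
n = 1
while collection_mass_g(n) <= EARTH_MASS_G:
    n += 1
print(n, f"{collection_mass_g(n):.2e} g")   # prints: 38 1.04e+29 g
```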

So what? It means that chemists will never be able to explore more than a tiny morsel of the space of possible proteins. Perhaps computationally we will (I doubt it), but that’s the subject of a future post.

The above is a post I wrote for “The Skeptical Chymist” back in April of 2008 (using the nom de plume Retread). I hoped for a lot of comments (particularly ones showing how I was wrong, as being correct has a lot of implications). I did get the following interesting comment from Param Priya Singh.

Really good! However, this may not be true, because the situation discussed is only valid if all possible polypeptides are made all at once. But in biological reality that may not be the case. What if the sequence space has been explored (by nature) gradually over millions of years? In that case, at any particular instant, not all but a limited (though still very large) subset is being explored and evolved under selective pressure. From Param Priya Singh

to which I replied

Param — thanks for your comments. Consider the following: suppose there is a super-industrious post-doc who can make a new protein every nanosecond (reusing the atoms). There are 60 * 60 * 24 * 365 = 31,536,000, or roughly 10^7, seconds in a year, and about 10^10 years since the big bang. At 10^9 proteins per second, that is 10^9 * 10^7 * 10^10 = 10^26 different 41 amino acid proteins he could have made since the dawn of time. But there are 20^41 = 2^41 * 10^41 proteins of length 41 amino acids, and 2^41 = 2,199,023,255,552, about 10^12, so there are roughly 10^53 of them. In all this time he has tested only 10^26 of the 10^53 possible 41 amino acid proteins.

As per your suggestion, this is making one protein at a time. However, even if the hapless post-doc were able to use the entire mass of the earth (6 x 10^27 grams) every nanosecond to make a different set of proteins (one molecule of each), he would never have made all the possibilities for a protein the length of either of the two chains of hemoglobin (141 or 146 amino acids) since time began. Hemoglobin just isn’t that big as proteins go (the protein mutated in cystic fibrosis has well over 1,000 amino acids).
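The reply’s numbers can be sketched the same way (rounding as the reply does; the variable names are mine):

```python
# One new protein per nanosecond since the big bang, versus
# the 20**41 distinct proteins of length 41 (round numbers from the reply).
PROTEINS_PER_SECOND = 1e9                 # one per nanosecond
SECONDS_PER_YEAR = 60 * 60 * 24 * 365     # 31,536,000, about 10**7
YEARS_SINCE_BIG_BANG = 1e10               # order of magnitude

tried = PROTEINS_PER_SECOND * SECONDS_PER_YEAR * YEARS_SINCE_BIG_BANG
possible = 20 ** 41                       # = 2**41 * 10**41, about 2e53

print(f"tried ~{tried:.1e} of ~{possible:.1e} ({tried / possible:.1e} of the space)")
# prints: tried ~3.2e+26 of ~2.2e+53 (1.4e-27 of the space)
```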

So write in and show me the mistakes in all this. If it stands, this back of the envelope calculation poses severe problems for a very popular theory.

• Another O-chemist  On December 21, 2009 at 10:05 am

Protein synthesis in living things is certainly not random; it’s guided and catalyzed by molecules already in place. So the accessible portions of protein space are severely limited by what parts of it have already been explored.

But what if that were also the case in abiotic situations? RNA, if I remember correctly, has been shown to catalyze its own synthesis. Proteins probably do so, too, to some extent, so that the possible protein space available even under abiotic conditions would be dependent on what polypeptides had formed previously.

There would be some randomizing caused by the fact that amide bonds in water have a half-life of about seven years. But ultimately that wouldn’t matter because, after the formation of a certain number of polypeptides, most random abiotic proteins would result from recombination of existing polypeptide fragments rather than being built from scratch.

• Another O-chemist  On December 21, 2009 at 10:06 am

Sorry! I forgot to close the <em> tag after “even under abiotic conditions”.

• Yggdrasil  On December 26, 2009 at 9:12 pm

One issue here is that most proteins consist of smaller functional domains that are shared across many proteins (e.g. almost all tyrosine kinases share the same two-lobe construction with the active site at the cleft between the two lobes). Through duplication and recombination events, nature has been able to mix and match these domains to evolve multidomain proteins with a variety of functions. Src kinase is a notable example of a protein with such a modular construction. Indeed, titin, which you reference in the post, is actually just a bunch of small ~100 amino acid immunoglobulin-like domains strung together in a linear array.

Furthermore, there are some common structural scaffolds upon which proteins with multiple functions are built, and at a structural level below that, bioinformaticians have identified numerous protein folds that are conserved among different families of proteins. So perhaps nature could evolve a number of protein folds through the brute force combinatorial synthesis approach in the primordial soup, then evolve more complex structures by recombining these folds into scaffolds, domains and proteins.

Thus, when considering the evolution of new proteins, you are perhaps placing too much focus on the mutation aspect of evolution. Surely much of evolution has occurred through nature randomly substituting amino acids and finding that these substitutions produce improved or new functions. However, recombination, that is, rearranging existing protein folds, scaffolds, and domains into new combinations, has surely played just as important a role.

Perhaps there is a difference of philosophy here. Evolution is an astoundingly good mechanism for solving local optimization problems. Building all possible permutations of amino acids of the average length of a protein and selecting the best sequence for a particular function would be horribly inefficient if one were seeking to solve a local optimization problem. However, this approach solves a different problem that nature cannot solve so easily: it can find the global optimum solution. So, perhaps this fact should offer some comfort to chemists. While nature may have evolved extremely efficient solutions to many chemical problems, a better solution likely exists outside of the local minimum that evolution can sample.

• luysii  On January 5, 2010 at 9:37 pm

Another O-chemist — RNA-dependent RNA polymerases are known, but as far as I know they’re all proteins. While the ribosome is basically an RNA enzyme (ribozyme) with proteins hanging around on the periphery, I don’t think RNA polymerases made out of RNA have been found. Also, I don’t think a protein has been found which can catalyze its own synthesis. There is something called an intein, which can splice itself out.

Yggdrasil — certainly agree that huge proteins like titin, fibronectin, and von Willebrand factor are made of smaller units mushed together. But even the repetitive 100 amino acid immunoglobulin domain of titin is too large to have been found by random exploration of protein space. Agree that recombination has played a role. But the natural selection I was brought up on argues that it works on mutations which occur in a basically random fashion. My point was that there hasn’t been enough material or time for anything approaching a thorough search, so our existence, and that of all living things, must be put down to an amazing stroke of good luck (or something else).

Your idea of evolution as the solution to an optimization problem is intriguing and probably correct, but how did it ever get anything to optimize?

Both of you — thanks for the thoughtful comments.

• Galaxy Rising  On July 30, 2010 at 7:45 pm

http://en.wikipedia.org/wiki/Ribozymes#Known_ribozymes

There are numerous identified ribozymes. RNA is more than capable of catalytic action, just like proteins; it’s simply not as efficient, both in catalytic properties (due to its reduced ability to bend) and in building properties (since the same molecule is also being used to store data).

Now, as a computer scientist and biologist, I find your statements quite off. Natural selection is all about optimization; we’ve got statistical models about how this occurs, and the probabilities rendered. We even use genetic and evolutionary algorithms to develop things which humans are incapable of understanding! The problem with your basic idea is that it ignores the back story to each protein: time and numbers.

Each organism, especially among the earliest organisms, is a laboratory, a single iteration of a computational experiment. This was even more true if we go all the way back to the first population (note that evolution says nothing about abiogenesis, that is, how the first population occurred), when the environment was most likely anoxic (without oxygen) and probably reducing (judging by the toolkit all organisms on Earth share), and there was very little competition (due to the lack of, well, life).

And therein lies your largest error: each generation is a cycle in the optimization; evolution is iterative. Proteins are not generated in the way you appear to think they are, fully formed and functional, in an exhaustive manner. They are tested, then improved or removed. Also note that your proteins have no way to reproduce or iteratively improve, which basically means this isn’t natural selection; it is a form of brute force search. This is, notably, the worst case possible, and your hypothetical post-doc might as well be a chimp.

So, allow me to modify your experiment:
Instead of the hapless post-doc, allow us to use my hypothetical “chimp” post-doc.

Our chimp post-doc has an assignment: search the space of proteins for useful ones. His higher ups also don’t care about finding EVERY useful chemical; they are fine as long as he gets results. (Probabilistic not exhaustive)

The chimp post-doc, while quite fast, is also rather messy (what with the feces), and therefore makes mistakes. Let us assume he is quite competent for a chimp, and therefore makes mistakes only 5% of the time, recognizing and correcting them the other 95% of the time. (Mutation)

However, let us assume this chimp is also not completely incompetent, and therefore tests each protein for reactions. Not useful reactions, mind you; he has some sort of assay, and it returns a number scoring how reactive the protein is. (Testing)

Now, our chimp post-doc is also rather childlike, and the assay happens to make a noise he likes when the protein scores above a certain number. He makes sure to mark those down to play with later, especially if they are particularly loud (the noise’s volume depends on how reactive the protein is). He tends to dump the ones that don’t make a noise because their number is too low, except when he gets distracted. When he gets distracted (which happens at random intervals), he writes the protein down no matter what. (Directed Selection, Randomized Sampling)

Now, the assay doesn’t really depend on how big the protein is; it can test irrespective of length, dependent only on reactivity. However, the assay breaks down the proteins to do this, so every time our fair chimp post-doc runs it, he loses anything he hasn’t written down. Luckily for him, he has a rather large (but finite) database to write in; it has space for one million proteins, irrespective of their size. (This is, by the way, around the size of a DVD: at 4.7 gigabytes, with each amino acid or metal taking one byte, there is room for one million proteins of up to 4,000 amino acids each, and the chimp would probably have a hard drive in the terabytes; he’s very good at getting grant money.)

Finally, the chimp follows a strict procedure:

First, he generates, one by one, proteins of random length and random content, testing every single one as indicated previously, until his database is completely full, at which point he can do this no longer. (Population Generation)

This, of course, infuriates him (as he loves hitting buttons at random), causing him to go back and test all his favorites again (he loves that sound). But, as previously indicated, he isn’t very good at typing, so he makes single errors 5% of the time. These can be insertions, deletions, or simple changes. (Mutation)

He also occasionally loses his place, usually when the noise is very loud, and tries the same protein multiple times, with the same error rate as before. But he quickly recovers, doing this a number of times proportional to how loud the noise is (the louder the noise, the more retries). (Duplication of Selected)

Now, take note, as previously indicated, sometimes he gets distracted, and writes down the attempt no matter what. He also occasionally gets distracted and leans on the buttons, generating a completely random protein, which he then tests anyway, as above. (Diversity Maintenance)

Now, as the assay gets louder and louder, the threshold it takes for him to note the noise also increases. He also tends to get less distracted the longer this goes on (though he will always be distracted eventually). (Goal Shifting)

When he gets done testing his million proteins, he does it again. Rinse, repeat. (Iteration)

However, our fair chimp post-doc will eventually get bored if the noise generated by the assay stays at a constant volume for too long, at which point he will refuse to work. When this happens, his hard drive will back up his current million to some central location and then be cleared, and he will start all the way from the beginning, making a new random million and selecting through them (with his assay making a new sound to entice him). (Champion Preservation)

When he accumulates 1 billion proteins in this manner, the same procedure will occur, but instead of his hard drive being cleared, it will be filled with random examples from the central location (which now holds 1 billion proteins). The procedure is then repeated until he once again refuses to work because the sound hasn’t changed in a while. Rinse, repeat. (Hill Climbing)

When he finally gets bored of the 1 billion, he will go back to zero and work his way back up through the ranks until he generates a new billion. The 2 billion will be filtered together: the top billion preserved, the bottom billion discarded. Rinse, repeat, until the heat death of the universe. (Class Cutting)

Now, this procedure may be a bit complex compared to your original thought experiment, but it can be broken down quite quickly using my parenthetical labels, which I will do now:

Our protein generation procedure is:
(Probabilistic not exhaustive)
includes (Mutation), (Testing)
and is an example of (Directed Selection, Randomized Sampling)

The procedure
(Population Generation) – Generates a randomized population
(Mutation) – Mutates parts of it for diversity
(Duplication of Selected) – Duplicates good results
(Diversity Maintenance) – Makes sure it doesn’t narrow too much
(Goal Shifting) – Doesn’t stop optimizing
(Iteration) – Optimizes at every turn

It also has a higher level:
(Champion Preservation) – Keep great results
(Hill Climbing) – Use great results to get almost optimal results
(Class Cutting) – Save space by only allowing the best to proceed

This is a relatively simple asexual genetic algorithm with a few important, and natural, heuristics. It is not guaranteed to find optimal, great, or even OK results faster than brute force; however, it has a high probability of doing so, and will improve even abysmal results to near-optimal results extremely fast. It also needs no intelligence to do so.
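For concreteness, here is a toy Python version of the asexual genetic algorithm sketched above. Everything in it is a stand-in of mine, not from the comment: the “assay” just scores similarity to a hypothetical 12-residue motif, and the population size, survivor fraction, and mutation rate are picked for a fast demonstration.

```python
import random

random.seed(0)  # reproducible toy run

AAS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
TARGET = "MVLSPADKTNVK"        # hypothetical motif standing in for "what the assay rewards"

def fitness(seq):
    # Toy assay: count positions matching the motif (a real assay would score reactivity).
    return sum(a == b for a, b in zip(seq, TARGET))

def random_protein():
    return "".join(random.choice(AAS) for _ in TARGET)

def mutate(seq, rate=0.05):
    # Point errors at ~5% per residue, like the chimp's typing mistakes. (Mutation)
    return "".join(random.choice(AAS) if random.random() < rate else c for c in seq)

def evolve(pop_size=200, generations=100):
    pop = [random_protein() for _ in range(pop_size)]           # (Population Generation)
    for _ in range(generations):                                # (Iteration)
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 5]                        # (Class Cutting)
        children = [mutate(random.choice(survivors))            # (Duplication of Selected)
                    for _ in range(pop_size - len(survivors) - 5)]
        newcomers = [random_protein() for _ in range(5)]        # (Diversity Maintenance)
        pop = survivors + children + newcomers
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```

Because the survivors are carried over unchanged, the best score never decreases from one generation to the next, which is why even this crude search converges in a few hundred generations, while brute force enumeration of all 20^12 (about 4 x 10^15) sequences would take vastly longer.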

In my simplest computational runs, using a small bit of C code and assuming he never goes over 1,000 amino acids, he finds hemoglobin (with 99% probability) by the end of this procedure.

How long will this procedure take? Using one “cycle” per million tries (a unit of time you can pick for yourself; I would use 1 second (the time it took my computer to run 4 of these), 5 minutes (RNA decay time), 20 minutes (bacterial reproduction time), or 1 hour (a nice round number)), it is at least a trillion cycles. Since it is a probabilistic process, I would go with at least an order of magnitude above that. However, that is the time it would take the procedure to run; at the end, you would have 1 billion highly reactive materials (and if you chose to spend the space, you would actually have many more than that, by saving every protein above a certain cutoff level of reactivity).

If, instead, we are simply looking for hemoglobin (or even simply a protein that binds to oxygen reversibly, of which there are many), it would take significantly less than this. We most likely would find it before the billion protein hard drive is full, along with a whole mess of toxins, heme complexes, and metal-organic protein complexes.

However, unlike your procedure, I cannot definitively tell you how long it would take. It must be noted that life, much like the executives above our chimp post-doc, is not concerned with finding specifics; it is concerned with a certain end result. Notably, for hemoglobin, the act of binding oxygen reversibly. Is hemoglobin the best at doing this? Of course not. It’s just good enough for life. Natural selection does not seek optimal proteins, only ones better than the previous ones.

More than that, my procedure (and natural selection) is directed; it can use previous information to find results, by improving or modifying old designs that have been proven. Your procedure is completely undirected, and indeed not even random (which would give you a much better payoff, by the way); it’s entirely brute force. That’s not how natural selection, or indeed ANY optimization technique, works; they all beat brute force in the average case.

Please, before attempting to attack natural selection, make sure you actually understand the math and principles behind it. It is an optimization process; look up how optimization works in math and statistics. It is also a probabilistic process; not every step works out, but these are much faster at finding a good result than those that guarantee an optimal result. It is also iterative; every step forward is the start of what could be a great leap, and every step is definitely followed. It is also, above all else, multi-threaded; every organism is a lab, finding new proteins and spreading them.

It is not brute force; it doesn’t try everything. It isn’t optimal finding; it doesn’t find the best option. And most of all, for crying out loud, it isn’t single threaded; there is not one organism, trying desperately to find proteins for everyone else.

• luysii  On July 31, 2010 at 9:29 am

Galaxy Rising: thanks for taking what must have been a lot of time to write your comment. I’m about to leave for some family affairs, so I won’t be able to properly respond until at least the middle of next week. In the meantime have a look at http://luysii.wordpress.com/2009/11/29/time-for-the-glass-eye-test-to-be-inserted-into-casp/

• layman  On December 25, 2010 at 4:31 pm

luysii, I found myself on your blog today via a fark.com link to another article. This blog entry and Galaxy Rising’s thoughtful response were particularly interesting to me. Being fundamentally ignorant of all the subject matter discussed here, I resign myself to merely weighing the semantic logic of each viewpoint. I’m just curious whether you ever took the time to digest and respond to the above comment, as you hinted you might.

• CoolJoe  On November 29, 2012 at 12:40 am

Your argument presented here, in regard to the improbability of abiogenesis and/or evolution, is wrong. Rather than explain why myself, I’ll borrow the words of Richard Feynman:

“You know, the most amazing thing happened to me tonight. I was coming here, on the way to the lecture, and I came in through the parking lot. And you won’t believe what happened. I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance that I would see that particular one tonight? Amazing!”

• luysii  On November 29, 2012 at 8:02 pm

Cool Joe — have a look at another post — http://luysii.wordpress.com/2010/08/08/a-chemical-gedanken-experiment/

and let me know what you think. I’m amazed that the proteins making us up have relatively few shapes, e.g. have just a few potential energy minima much lower than the gazillions of other ones, AND that the barriers between these few minima and the other gazillions are high enough that the gazillions aren’t visited often at body temperature. How common do you think this is?

• Aspirin  On April 16, 2013 at 10:36 am

“How common do you think this is?”

Very common. The shapes of proteins have been optimized by evolution, so it does not matter much that they are separated from others by low energy barriers. The people who do protein design (both experimental and computational) constantly come up with new solutions, picked from countless possible ones with amazing speed and efficiency, thanks to natural selection. By your logic, directed (as well as natural) evolution would never work, since the process would have to parse a practically infinite number of combinations and would never find a solution; yet both directed and natural evolution of new proteins are accomplished by nature and by scientists every single day.

You seem to be amazed by things that are quite uncontroversial and well understood. Life is amazing, but it’s not so amazing as to be unbelievable. Several commenters above (especially Galaxy Rising) have explained that your logic assumes a brute force approach to a problem that isn’t approached by brute force at all, and it’s quite well known that this is the case. Perhaps you should familiarize yourself more with the literature on protein folding. Yet, with due respect, you simply link to other posts when challenged rather than answering detailed comments with an equal level of detail. Why don’t you fix the fundamental flaw in your argument rather than just offering more versions of it?

• luysii  On April 16, 2013 at 10:57 am

Fair enough. I had planned to start reading Dill’s book on molecular driving forces, and then go seriously into the PChem of protein folding. Other interests supervened (differential geometry, relativity) for details see — http://luysii.wordpress.com/2011/12/31/some-new-years-resolutions/

I certainly accept the descent with modification idea of protein optimization, but how the first things to descend from arose amazes me.

Saying proteins fold to their final shape by sliding down an energy funnel is a description, not an explanation.