Unfolding the Protein Folding Solution

Amidst the tumult of 2020, in which the world often seemed to stand still (and worse, at times to regress), the march of scientific progress pushed ever onward. First and foremost, we saw a new standard set for the development, testing, and validation of vaccines, culminating in the approval of multiple COVID-19 vaccines. At the Large Hadron Collider (LHC), we saw the first evidence of a new type of particle. And SpaceX’s Crew Dragon became the first private vehicle to carry astronauts to the International Space Station (ISS).

In November, we witnessed another remarkable scientific feat, as Google DeepMind’s AlphaFold2 “solved” the protein folding problem – an achievement with far-reaching implications for biology and medicine. This feat absolutely deserves to be celebrated. But it is also important to recognize the potential limitations of their approach, and to retain a healthy dose of skepticism. 

In this blog post, I plan to describe the protein folding “problem”, and then to explain why I believe it is best to exercise caution, rather than to immediately regard AlphaFold’s performance as a “solution” to this problem.

First however, I want to acknowledge that I am a physicist, not a biologist. Make of that what you will. I also want to disclose that last year I interned at Google X, formerly known as The Moonshot Factory. The opinions I espouse here are entirely my own.

What is a Protein?

To most people, proteins mainly connote biology. Many – like myself – remember learning about proteins as biological molecules, or biomolecules, which have distinct biological functions. In reality, proteins sit at a unique intersection of biology, chemistry, and physics. This makes them fascinating objects of study, but also makes them particularly unyielding to established scientific methods. 

At a basic level, proteins are chains made from amino acids. The amino acids serve as the building blocks for proteins, in much the same way as letters in an alphabet can be strung together to form words. Just as the order of the letters in a word affects the meaning of the sequence, so too does the order of amino acids in the chain affect the biology of the resulting protein.
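This letters-and-words analogy is quite literal in practice: each of the 20 standard amino acids has a one-letter code, so a protein sequence can be written down as a plain string. A minimal sketch in Python (the toy sequence below is made up for illustration, not a real protein):

```python
# The 20 standard amino acids, by their conventional one-letter codes
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_sequence(seq):
    """Check that a string uses only standard amino acid letters."""
    return len(seq) > 0 and set(seq) <= AMINO_ACIDS

# A made-up toy sequence; the order matters, just like letters in a word
toy_protein = "MKTAYIAKQR"
print(is_valid_sequence(toy_protein))  # True
```

Real sequence databases use exactly this kind of representation, which is part of what makes the problem so tempting for machine learning: the input is just a string.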

Unlike words however, the protein chains exist natively in the physical world. When we write a word on the page, the space between letters is fixed. The previous letters in the word don’t dictate how much space we should leave before the next letter. 

For proteins, space matters. Chemically, the amino acids are strung together via covalent bonds, in which pairs of electrons are shared between neighboring atoms. Going a level deeper, the amino acids themselves are organic compounds made up of atoms, and are as a result substantially influenced by chemical and physical forces. These forces constantly push and pull the constituents in different directions, driving a series of twists and turns in three-dimensional space as the protein moves toward a stable configuration, or conformation. This intricate dance is the process of protein folding. Because the protein gradually moves from less stable, higher-energy configurations to more stable, lower-energy states, the folding is said to be spontaneous.
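As a cartoon of that spontaneous descent, picture a single coordinate sliding down a one-dimensional energy landscape. The sketch below is a toy model of my own, not real protein physics: a double-well potential with stable states at x = ±1, and plain gradient descent standing in for the chemical forces.

```python
# Toy energy landscape: a double well with stable minima at x = +1 and x = -1
def energy(x):
    return (x**2 - 1)**2

def energy_gradient(x):
    return 4 * x * (x**2 - 1)

# Start in an unstable, higher-energy configuration and let the
# "forces" (the negative gradient) pull the system downhill
x = 0.3
for _ in range(1000):
    x -= 0.01 * energy_gradient(x)

# x has settled into the stable, lower-energy state near x = 1
print(round(x, 3))  # 1.0
```

The real landscape is of course astronomically higher-dimensional and rugged, which is precisely what makes the problem hard.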

And here’s the thing: the protein’s function depends directly on this conformation. In other words, identifying a protein’s stable shape is crucial to understanding the roles it plays in biology.

The Protein Folding Problem

One of the most remarkable things about protein folding is that for a given chain, many distinct paths – each with their own twists and turns – can lead to the same final shape. The intermediate configurations can at times seem completely random; and yet the result is somehow predestined. Observations of this kind led Nobel laureate Christian Anfinsen to postulate that a protein’s structure is entirely determined by its sequence of amino acids. This hypothesis, known as Anfinsen’s dogma, essentially defines the protein folding problem: to predict a protein’s shape (and consequently its function) given only the protein’s sequence of amino acids.

Solving this problem has been an outstanding challenge for half a century, evading the tools of biology, of chemistry, and of physics. 

Physically, the problem is typically framed in terms of minimizing the energy of the collection of atoms and molecules in the protein chain. Despite their success in areas such as biophysics and drug design, techniques like molecular dynamics, which are based in classical mechanics, fall spectacularly short. And the proteins, often consisting of hundreds or even thousands of amino acids, are far too large to be treated quantum mechanically. Some physical models for the problem, which treat the protein chain as randomly choosing junctures at which to fold (so long as the chain doesn’t fold in on itself) lead to the conclusion that the problem is NP-Hard: a fancy way of saying that solving the general case is VERY HARD.
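The combinatorial flavor of those lattice models is easy to demonstrate. The sketch below is my own illustration, not any published model: it counts self-avoiding walks on a 2-D grid, that is, the number of distinct shapes a chain of n links can take without passing through itself. The count explodes exponentially with chain length.

```python
def count_foldings(n, pos=(0, 0), visited=None):
    """Count self-avoiding walks of length n on a 2-D square lattice."""
    if visited is None:
        visited = {(0, 0)}
    if n == 0:
        return 1
    x, y = pos
    total = 0
    for step in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if step not in visited:
            visited.add(step)
            total += count_foldings(n - 1, step, visited)
            visited.remove(step)
    return total

# The number of possible chain shapes grows exponentially with length:
# 4, 12, 36, 100, 284, 780, 2172, 5916, ...
for n in range(1, 9):
    print(n, count_foldings(n))
```

Brute-force enumeration is hopeless for chains of hundreds of amino acids, which is the intuition behind the hardness results.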

Typically when a problem gets too large in scale to be succinctly stated in the language of one theory, another theory emerges, and with it comes a more suitable language. We need not analyze the quantum mechanical wave function of every proton, neutron and electron to understand that noble gases are stable because their valence shells of electrons are full. And we need not look at every chemical bond in a cell to understand that the mitochondria is the powerhouse. To quote Nobel laureate Phil Anderson in his essay More is Different, “The constructionist hypothesis breaks down when confronted with the twin difficulties of scale and complexity…at each level of complexity entirely new properties appear”.

In the case of protein folding, progress has indeed been made toward finding a more suitable language. In fact, there is a general structural hierarchy within folded proteins: the primary structure comprises the amino acid sequence; in the secondary structure, the amino acids form stable patterns of helices and sheets; in the tertiary structure, these helices and sheets are folded into further formations; and finally, the quaternary structure captures the interplay between multiple chains. Biologists have even identified structural motifs, three-dimensional structures which frequently appear as segments within folded proteins.

The frustrating thing about the protein folding problem is that we are fairly positive such a language should exist. Why? Because nature solves the problem all the time. Typical proteins fold in seconds or minutes; some fold on the scale of microseconds. Yet our theoretical models for protein folding – models based in physics – tell us that it should take proteins astronomically long times to fold. Even under the most lenient assumptions, the predicted timescales are longer than the Universe is old! This apparent discrepancy between the complexity of modeling protein folding on one hand, and the ease with which proteins actually fold on the other, is known as Levinthal’s paradox.
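Levinthal’s back-of-the-envelope argument is simple enough to reproduce. Suppose each amino acid can adopt roughly 3 distinct local configurations (the exact numbers vary by telling; these are purely illustrative), and grant the protein a blistering sampling rate:

```python
# Back-of-the-envelope version of Levinthal's paradox (illustrative numbers)
residues = 100             # a modest-sized protein
options_per_residue = 3    # rough count of local configurations per amino acid
samples_per_second = 1e13  # a very generous sampling rate

conformations = options_per_residue ** residues   # ~5e47 possibilities
seconds = conformations / samples_per_second
years = seconds / 3.15e7                          # seconds per year

print(f"{years:.1e} years")  # ~1.6e+27 years, vastly longer than the
                             # ~1.4e10-year age of the Universe
```

Exhaustive search is clearly not what nature is doing; the folding must be guided.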

The Test

Every two years, the worldwide protein folding community comes together to assess the state of progress in the field. More than one hundred research groups from around the globe come armed with their newest and most sophisticated algorithms for predicting the structure of proteins. These algorithms are then evaluated on a set of roughly 100 never-before-measured proteins. 

Adding to the challenge, the competitors (the different research groups) are not told anything about the proteins prior to the assessment. In this way, the biennial test, known as the Critical Assessment of protein Structure Prediction (CASP), is designed to test protein structure prediction solely on the basis of amino acid sequence. In other words, CASP is designed so that ‘solution’ implies solving the protein folding problem.

Given the vast space of possible ‘predictions’ for each protein, CASP evaluates the quality of a prediction, or how closely it approximates the actual measured protein, on a variety of metrics. The primary evaluation metric, the global distance test (GDT), involves comparing the actual and predicted positions of molecules known as alpha-carbons, which tag the approximate locations of the amino acids. In essence, this is a way of quantifying how well the measured and predicted proteins overlap in three-dimensional space, with GDT scores of 0 implying no overlap, and 100 signifying perfect overlap. 
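The core of the GDT metric can be sketched in a few lines. The official scoring maximizes over superpositions of the two structures; the simplified version below assumes the coordinates are already optimally aligned, so treat it as an illustration rather than the real CASP implementation.

```python
import numpy as np

def gdt_ts(predicted, measured):
    """Simplified GDT_TS: for cutoffs of 1, 2, 4, and 8 angstroms, compute the
    fraction of alpha-carbons whose predicted position lies within that cutoff
    of the measured one, then average and scale to 0-100.
    Assumes both (N, 3) coordinate arrays are already superimposed."""
    distances = np.linalg.norm(predicted - measured, axis=1)
    fractions = [(distances <= cutoff).mean() for cutoff in (1.0, 2.0, 4.0, 8.0)]
    return 100 * np.mean(fractions)

# A perfect prediction scores 100; a wildly wrong one scores 0
actual = np.random.rand(50, 3) * 30
print(gdt_ts(actual, actual))          # 100.0
print(gdt_ts(actual + 100.0, actual))  # 0.0
```

The multiple cutoffs are what make the score forgiving of small errors while still rewarding atomic-level accuracy.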

However, the experimental techniques used to measure the actual proteins are not perfect. This means that after a certain point, it isn’t clear whether the predicted or measured protein is more accurate, that is, which one is closer to ground truth. As a result, a score above 90 on the GDT is generally regarded as a ‘solution’. 

From the inception of CASP in 1994 up through 2016 (CASP12, the 12th competition), there had not been substantial improvement in performance on the GDT. In the intervening years, our understanding of proteins and protein folding had absolutely matured, but the results had not materialized in three-dimensional structure prediction. From 2006 to 2016, for instance, the median GDT of the best-performing algorithm on the free-modeling subset of test proteins remained above 30 and below 42 every year. Up through this point, no machine learning based approach had even come close to threatening the state of the art. 

The ‘Solution’

Enter DeepMind, on a mission to radically challenge our deepest-held beliefs about the power of AI. In 2017, DeepMind shocked the world when its artificial intelligence AlphaGo demonstrated mastery over the game of Go, convincingly beating the reigning world champion. Fresh off of its game-playing triumph, DeepMind unabashedly set its sights on protein folding.

In 2018, participating in CASP for the first time, DeepMind’s AI system AlphaFold handily beat the competition, scoring a median GDT of close to 60 on the free-modeling category, regarded as the most challenging category. This was properly recognized as a tremendous leap forward on the protein folding problem, albeit far from a solution. Already, AlphaFold had convinced many that machine learning could potentially be useful not just in games, but in pure scientific research. Indeed, AlphaFold was so convincing that about half of the entrants for the 2020 CASP competition used deep learning in their approaches.

Determined to build on this initial success, DeepMind went back to the drawing board and returned to CASP in 2020 with a new and improved AI; AlphaFold2. Once again, DeepMind shocked the world by shattering its own records and achieving a median GDT of 87 on the free-modeling category – and 92.4 GDT overall. On average, AlphaFold2’s predictions were within a single-atom’s width of the actual measurements. 

Almost immediately, AlphaFold2 was hailed as a ‘solution’ to the protein folding problem. (Since AlphaFold2 is simply an improved version of AlphaFold, we’ll drop the number ‘2’ from here on.)

DeepMind’s own blog claimed “AlphaFold: a solution to a 50-year-old grand challenge in biology”. News outlets followed suit, with sources including Science Magazine, Vox, CNBC, and MIT Tech Review using some variant of the word “solved” in their coverage, and sentiment to match.


With AlphaFold, like AlphaGo before it, DeepMind is forcing us to reimagine what artificial intelligence is capable of. This in and of itself is remarkable. That AI will likely play an integral role in the future of medical research and drug discovery is worth further celebrating. DeepMind deserves ample credit for these achievements. 

That being said, it is far too early to claim they have ‘solved’ the protein folding problem. I believe we should remain skeptical because of the relationship between machine learning on the one hand, and generalizability and interpretability on the other. These problems are not unique to AlphaFold. Rather, they are philosophical qualms with using machine learning to ‘solve’ scientific problems in the way that AlphaFold attempts to do.

The application of machine learning to scientific research is not new. At the Large Hadron Collider (LHC) at CERN for instance, machine learning was used to find the Higgs boson back in 2012. The difference lies in how machine learning is employed. 

At CERN, machine learning was used to facilitate the comparison of our scientific theories – in this case the standard model of particle physics – and experimental data. Physicists already had a theory for the ways in which elementary particles interact with each other; they set out to test that theory by colliding fast-moving particles together, and comparing post-collision measurements with the outputs of their theoretical models. The problem was that even on powerful computers, their model took a long time to generate predictions. Machine learning helped them to more quickly generate synthetic data to compare with experiments. Machine learning did not replace the physics-based model; it helped test the model.

With AlphaFold, DeepMind is effectively attempting to replace physical and biological models of protein folding with a machine learning model. Yes, AlphaFold performed far better than any previous models. But to what extent can we actually trust AlphaFold’s predictions on new proteins? In other words, how well does AlphaFold generalize? 

Well, we don’t really know. Of the millions of proteins we have already found, AlphaFold was trained on the tiny fraction whose structures have been measured, and the test set was smaller still. Even if AlphaFold had perfectly predicted every test protein – which it didn’t – I’d still bet on nature’s ingenuity. 

Of course, any theory, when faced with new observations, must wrestle with the same questions. If theory and observations disagree, the theory must be modified or replaced entirely. But with machine learning models, where the assumptions are hidden, it often isn’t clear where the model is breaking down.


By playing against AlphaGo time after time, researchers have begun to gain insights into how the AI “thinks”. And human Go players have taken inspiration from AlphaGo in their own strategies. In just a few years, the AI has already given us tremendous insights into the game of Go, strategy more generally, and what it means to be creative.

“Solving” a scientific theory is a far higher bar than besting the best human at a game. 

It’s quite possible that artificial intelligence helps us to achieve this goal; to find the right language. But we need to work on extracting insights from our machine learning models, and interpreting the models we build.

Even if AlphaFold never improves beyond its current state, it will still prove useful in medical research; at the bare minimum, it will allow biologists to take coarser measurements in the lab (reducing time and money spent) and use AlphaFold to iron out the fine structure. More optimistically, we can envision a future in which humans work with AlphaFold to discover the rules of protein folding.

AlphaFold is not a solution to the protein folding problem, but it is absolutely a breakthrough. Any machine learning based approach to science will need to address practical and philosophical challenges. For now, we should appreciate DeepMind’s colossal step forward, and we should prepare for unprecedented progress in the near future. This is only the beginning.