What AlphaFold Means for Structural Bioinformatics
By understanding how biological macromolecules such as proteins, DNA, and RNA fold into three-dimensional shapes, researchers can study how changes to their structure influence their function, the part they play in diseases, and their interactions with other biomacromolecules. Structural bioinformatics is dedicated to just that: creating new computational methods to analyze biomacromolecular data in order to solve problems and bring insight.
Proteins play a large part in organisms, and because their structure is directly related to their function, researchers, scientists, and bioinformaticians have taken a particular interest in understanding them. The mystery surrounding the folding and structure of proteins prompted them to develop computational programs capable of predicting the 3D shape of a protein from its amino acid sequence. AlphaFold2, DeepMind’s successor to AlphaFold, produces highly accurate results on protein structure prediction tasks and is a breakthrough for biology, artificial intelligence, and data science.
What is the impact of AlphaFold2 on structural bioinformatics?
WHAT IS STRUCTURAL BIOINFORMATICS?
Structural bioinformatics is concerned with the analysis, prediction, and visualization of the 3D structure of biological macromolecules, which are large biomolecules with high molecular weights (typically thousands of daltons or more) and complex structures. Structural biology focuses on the molecular structure of proteins, nucleic acids, and membranes. These biomacromolecules perform most of the functions in cells, so it’s important to understand the structures that enable them to do so.
By applying the principles of molecular folding, evolution, binding interactions, and structure/function, structural bioinformaticians can examine diverse sequences and find the patterns that code for a particular shape.
Protein structures are categorized into 4 levels: primary for the sequence of amino acids, secondary for the local geometry of the polypeptide chain (such as alpha-helices and beta-pleated sheets), tertiary for the three-dimensional structure of an entire polypeptide chain, and quaternary for the association of multiple polypeptide chains that form a protein complex.
Structural bioinformaticians also want to visualize protein structure and observe static or dynamic representations of the molecules. By detecting macromolecular interactions, they can infer molecular mechanisms and draw better-supported conclusions.
Biomacromolecular interactions occur when parts of molecules come into contact and have an effect upon one another. Proteins perform several types of interactions, such as protein-protein interactions, protein-peptide interactions, protein-ligand interactions, and protein-DNA interactions. Understanding these interactions has not been easy, but has immense potential for advancing the study and design of drugs.
INTRODUCING THE PROTEIN FOLDING PROBLEM
There are more ways of assembling 100 amino acids than there are atoms in the universe!
Levinthal's paradox suggests that by randomly searching a large number of possible structures, it wouldn’t be possible to reach the functional conformation of a protein in a reasonable amount of time. This statement is paradoxical considering the fact that proteins fold into their structures in a few milliseconds.
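The scale of these numbers can be checked with a quick back-of-the-envelope calculation. The sketch below is only illustrative: the estimate of 10^80 atoms in the universe, the 3 conformations per residue, and the sampling rate of 10^13 conformations per second are assumptions chosen for the example, not figures from this article.

```python
# Back-of-the-envelope numbers; every constant below is an illustrative assumption.
CHAIN_LENGTH = 100
AMINO_ACID_TYPES = 20             # the standard amino acids
ATOMS_IN_UNIVERSE = 10 ** 80      # common order-of-magnitude estimate

# Number of distinct sequences of 100 amino acids.
num_sequences = AMINO_ACID_TYPES ** CHAIN_LENGTH

# Levinthal-style estimate: assume 3 conformations per residue and
# 10^13 conformations sampled per second.
CONFORMATIONS_PER_RESIDUE = 3
SAMPLES_PER_SECOND = 10 ** 13
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

num_conformations = CONFORMATIONS_PER_RESIDUE ** CHAIN_LENGTH
years_to_search = num_conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR

print(f"sequences:         about 10^{len(str(num_sequences)) - 1}")
print(f"conformations:     about 10^{len(str(num_conformations)) - 1}")
print(f"exhaustive search: about {years_to_search:.1e} years")
print("more sequences than atoms:", num_sequences > ATOMS_IN_UNIVERSE)
```

Even under these generous assumptions, an exhaustive search would take far longer than the age of the universe, which is why a random search cannot explain how proteins fold.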
Proteins fold themselves spontaneously and when trying to predict how, there are many things to consider, such as the various interactions between amino acids. The study of protein folding deals with how proteins arrive at their native state, the state in which they are properly folded and in functional form.
Critical Assessment of protein Structure Prediction (CASP) is a biennial competition that presents predictor groups from industry and academia with about 100 protein sequences whose structures have been determined experimentally but not publicly released. Entrants (like AlphaFold) compute a structure for each sequence, and the predictions are then compared with the experimental structures. The first CASP competition took place in 1994 and the most recent one was in 2020.
The Z-score is used for the relative comparison of CASP predictions. It’s the difference between a sample’s value and the population mean, divided by the standard deviation. A high positive value represents a large deviation above the mean. In other words, the groups that are considerably better than the average will have larger Z-scores.
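As a minimal sketch of the definition above, the Z-score can be computed as follows. The scores are made-up numbers, not real CASP results, and the helper name `z_scores` is hypothetical. The population standard deviation is used here as an assumption; CASP's exact procedure may differ.

```python
import numpy as np

def z_scores(values):
    """Return (value - mean) / standard deviation for each entry."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical scores for five groups on one target (not real CASP data).
scores = [40.0, 45.0, 50.0, 55.0, 90.0]
print(np.round(z_scores(scores), 2))
```

The group scoring 90 stands far above the mean, so it receives by far the largest Z-score.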
The root-mean-square deviation (RMSD) is a quantitative measure of similarity between superimposed structures: the square root of the average squared distance between corresponding atoms. CASP uses RMSD to assess how well a proposed structure matches the target structure. This metric is sensitive to outlier regions, since a few badly placed atoms can dominate the score. The lower the RMSD, the better the model is.
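For N pairs of corresponding atoms separated by distances d_i, the RMSD is sqrt((1/N) * sum of d_i^2). The sketch below assumes the two structures are already superimposed and that atoms are listed in the same order. The coordinates are invented for illustration, and `rmsd` is a hypothetical helper rather than a function from any particular library.

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """RMSD between two equally sized, already superimposed sets of 3D points."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both structures need the same number of atoms")
    squared_distances = np.sum((a - b) ** 2, axis=1)
    return float(np.sqrt(squared_distances.mean()))

# Made-up reference with four atoms on a line.
reference = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0],
                      [7.6, 0.0, 0.0], [11.4, 0.0, 0.0]])

# Model A: every atom is off by 1 unit.
model_a = reference + np.array([0.0, 1.0, 0.0])

# Model B: three atoms are perfect, one is off by 10 units.
model_b = reference.copy()
model_b[3] += np.array([0.0, 10.0, 0.0])

print(rmsd(reference, model_a))  # 1.0
print(rmsd(reference, model_b))  # 5.0
```

Model B matches the reference exactly on three of four atoms yet scores far worse than model A, which shows how a single outlier can dominate the metric.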
The global distance test (GDT_TS), where TS stands for “total score”, is a measure of similarity between 2 protein structures with corresponding or identical amino acid sequences, but differing tertiary structures. It’s reported as a percentage and the higher the GDT_TS score, the more similar the model is to the reference structure. I like to think of it as the fraction of the protein structure that is correctly predicted.
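GDT_TS is commonly computed as the average, over distance cutoffs of 1, 2, 4, and 8 Å, of the percentage of alpha-carbon atoms in the model that lie within that cutoff of their position in the reference. The sketch below is a simplification: it assumes a single fixed superposition, whereas the real procedure searches over many superpositions to maximize each percentage. The coordinates are invented and `gdt_ts` is a hypothetical helper name.

```python
import numpy as np

def gdt_ts(model_ca, reference_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for already superimposed alpha-carbon coordinates."""
    model = np.asarray(model_ca, dtype=float)
    reference = np.asarray(reference_ca, dtype=float)
    distances = np.linalg.norm(model - reference, axis=1)
    # Fraction of residues within each cutoff, averaged over the cutoffs.
    fractions = [(distances <= cutoff).mean() for cutoff in cutoffs]
    return 100.0 * float(np.mean(fractions))

# Made-up reference and a model whose residues deviate by 0.5, 1.5, 3 and 10 Å.
reference = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0],
                      [7.6, 0.0, 0.0], [11.4, 0.0, 0.0]])
offsets = np.array([[0.0, 0.5, 0.0], [0.0, 1.5, 0.0],
                    [0.0, 3.0, 0.0], [0.0, 10.0, 0.0]])
model = reference + offsets

print(gdt_ts(model, reference))  # 56.25
```

Unlike RMSD, the residue that is off by 10 Å only counts as a miss at every cutoff; it cannot drag the score down any further.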
CASP assesses predictions in several modeling categories, including:
- The High Accuracy Modeling category is concerned with domains where the majority of submitted models are sufficiently accurate for detailed analysis.
- The Topology category is concerned with the ability of methods to predict contacts and inter-residue distances.
- The Refinement category is concerned with how successfully methods improve the accuracy of initial models.
- The Assembly category is concerned with the assessment of how well methods can predict various interactions.
- The Accuracy Estimation category is concerned with the ability to provide useful estimates for the overall accuracy of models at the domain and residue level.
- The Data Assisted category is concerned with how much the accuracy of models is improved by adding sparse data.
- The Biological Relevance category is concerned with the answers to biological questions that the models provide.
WHAT IS ALPHAFOLD?
Like most modern prediction algorithms, AlphaFold builds its input features from a multiple sequence alignment (MSA). In short, an MSA lines up the target amino acid sequence with similar sequences from sequence databases to show where they agree and where they differ. The aligned sequences often share an evolutionary relationship, having descended from a common ancestor. It’s then possible to infer sequence homology (similarity due to shared ancestry) and conduct phylogenetic analysis (the study of the evolutionary development and diversification of species). MSA is also used to assess sequence conservation (sequences that remain similar in nucleic acids and proteins across species), which hints at secondary and tertiary structure. The idea is that if 2 amino acids are in close contact in 3D space, mutations in one will tend to be accompanied by compensating mutations in the other. Positions that co-vary in this way are likely to be close together in the folded structure even when they are far apart in the sequence, so an MSA can provide valuable hints about the shape of a protein. For AlphaFold and AlphaFold2, the key information drawn from this input concerns pairs of amino acids that end up close together in folded structures.
In the graph below, the CASP14 participants are ranked by the sum of the Z-scores of their predictions, counting only Z-scores greater than 0. The first bar represents group 427, AlphaFold2, and the second bar represents group 473, BAKER. AlphaFold2 has a far higher score than the other groups.
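To make the co-variation idea concrete, here is a minimal sketch that scores how strongly two alignment columns vary together using mutual information. This is a classic textbook measure chosen for illustration, not the method AlphaFold uses, and the four-sequence alignment is invented.

```python
from collections import Counter
from math import log2

def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(msa)
    column_i = Counter(seq[i] for seq in msa)
    column_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    total = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        p_a = column_i[a] / n
        p_b = column_j[b] / n
        total += p_ab * log2(p_ab / (p_a * p_b))
    return total

# Toy alignment (made up): columns 1 and 3 always change together.
msa = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
]

print(mutual_information(msa, 1, 3))  # 1.0, the columns co-vary perfectly
print(mutual_information(msa, 0, 1))  # 0.0, column 0 never changes
```

A high score between two columns suggests that the corresponding residues may be in contact in the folded protein. Real alignments contain thousands of sequences and need corrections for noise and shared ancestry, which this sketch ignores.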
HOW DOES ALPHAFOLD WORK?
From the MSA input, AlphaFold used a deep ResNet (a deep convolutional neural network with residual connections) to predict a distance distribution matrix (distances between pairs of amino acids) and a torsion angle matrix (angles of rotation around the chemical bonds that connect those amino acids). ResNets are residual neural networks, whose skip connections are often described as inspired by the pyramidal cells in the brain.
Next, the program applies gradient descent optimization to the matrices (distance distribution matrix and torsion angle matrix) to predict the three-dimensional structure. How? Starting with a 3D structure as a model, the algorithm repeatedly updates it until the distogram (matrix of distances between different parts of a protein) of the predicted structure gets as close as possible to the distogram predicted by the deep ResNet.
DeepMind’s code for AlphaFold is available on GitHub.
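The update loop described above can be sketched in a few lines. This is a heavily simplified illustration and not DeepMind's implementation: it moves raw 3D points rather than torsion angles, and it matches a single target distance per pair rather than a predicted distribution. The helix used as the target and all parameter values are assumptions made for the example.

```python
import numpy as np

def pairwise_distances(coords):
    """Matrix of Euclidean distances between all pairs of points."""
    diff = coords[:, None, :] - coords[None, :, :]
    # The small constant avoids dividing by zero on the diagonal later on.
    return np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)

def fit_coordinates(target, steps=3000, learning_rate=0.01, seed=0):
    """Gradient descent on 3D coordinates to match a target distance matrix."""
    rng = np.random.default_rng(seed)
    coords = rng.normal(size=(target.shape[0], 3))  # random starting model
    losses = []
    for _ in range(steps):
        diff = coords[:, None, :] - coords[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)
        error = dist - target
        np.fill_diagonal(error, 0.0)
        losses.append(float((error ** 2).sum()))
        # Gradient of sum_ij (dist_ij - target_ij)^2 with respect to coords[i].
        grad = 4.0 * ((error / dist)[:, :, None] * diff).sum(axis=1)
        coords -= learning_rate * grad
    return coords, losses

# Hypothetical "true" structure: 8 points along a helix (arbitrary units).
t = np.linspace(0.0, 4.0, 8)
true_coords = np.stack([np.cos(t), np.sin(t), 0.5 * t], axis=1)
target = pairwise_distances(true_coords)
np.fill_diagonal(target, 0.0)

coords, losses = fit_coordinates(target)
print(f"loss at start: {losses[0]:.3f}, loss at end: {losses[-1]:.6f}")
```

Because distances are unchanged by rotation, translation, and mirroring, the recovered points can only match the original structure up to those transformations.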
HOW DOES ALPHAFOLD2 WORK?
Instead of using a 2-step approach (as in AlphaFold), DeepMind took an end-to-end approach with AlphaFold2, taking the MSA as input and providing the full structure as output. For CASP14, the program is based on an attention-based neural network, in the style of a deep learning architecture called the Transformer. Also popular for natural language processing and computer vision, attention mechanisms enable neural networks to weigh whichever parts of their inputs or features are most relevant. The network builds up an interpretation of the structure of the folded protein, iteratively refining it using the MSA and a representation of amino acid residue pairs. The system then outputs a prediction of the physical 3D structure.
AlphaFold2 was trained on approximately 170 000 proteins with known structures from the Protein Data Bank, together with large databases of protein sequences with unknown structures.
AlphaFold2’s GDT_TS score exceeded 90 for around ⅔ of the proteins at CASP14. At that level, any remaining differences between the predicted structure and the experimental structure could be caused by experimental error rather than by faults in the prediction.
As of now (December 2020), DeepMind has not yet released the paper detailing how AlphaFold2 works. However, it’s possible to speculate from the information disclosed during the CASP14 conference and the paper on the original AlphaFold.
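Since the details of AlphaFold2 are not public at the time of writing, the sketch below shows only the generic scaled dot-product attention used in Transformers, not AlphaFold2's actual network. The "residue embeddings" and projection matrices are random placeholders; in a real model they would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    """Each output row is a weighted average of the value rows."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)   # similarity of every pair
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))           # 5 hypothetical residues, 8 features

# Random stand-ins for the learned query, key and value projections.
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

output, weights = scaled_dot_product_attention(
    embeddings @ w_q, embeddings @ w_k, embeddings @ w_v
)
print(output.shape)            # (5, 8)
print(weights.sum(axis=1))     # each row of attention weights sums to 1
```

Row i of `weights` shows how much residue i draws on every other residue, which is what lets the network relate positions that are far apart in the sequence.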
WHAT ABOUT STRUCTURAL BIOINFORMATICS?
Thanks to AlphaFold2, structural bioinformaticians can now focus on problems other than structure prediction. The program doesn’t reveal how an amino acid chain assembles into its structure within milliseconds; it only predicts the final folded structure. Because the neural networks might be difficult to interpret and might poorly represent the dynamic folding process, understanding the way AlphaFold2 infers the folded structure could provide either a lot of insight or very little.
If DeepMind makes AlphaFold2’s code available, bioinformaticians (with the means to run the program) could replicate the program, deploy it for practical applications, and improve it. If the code were to be kept private and if Alphabet (DeepMind’s “grandparent” company) were to decide to only commercialize AlphaFold2, progress in protein informatics could be stalled.
Using AlphaFold2 models, one research group was already able to correct a mistake in its interpretation of a protein, and another group was able to solve in a few hours the structure of a protein it had been working on for about 2 years. AlphaFold2 could join the array of tools that structural biologists already use, such as mass spectrometry, crystallography, and cryo-EM, to determine protein structures and corroborate results. With the appropriate infrastructure and hardware, scientists and computational research groups could effectively rely on AlphaFold2, which is faster and cheaper than experimental methods.
According to the Wikipedia page on structural bioinformatics,
“Structural bioinformatics main objectives are the creation of new methods to deal with biological macromolecules data to solve problems in biology and generate new knowledge.”
There is no doubt that AlphaFold2 has met these goals, and now the question is: How will the scientific community build upon this breakthrough?