Using quantum annealing methods to simulate the protein folding problem within a 2D lattice
Believe it or not, the drug discovery process takes around 10 years for a new discovery to be made and costs around $2.6 billion. The drug development process itself results in a little over 8 failed attempts out of 9 tries, giving a success ratio of around only 10%.
We really can’t afford to have a success rate this low. This huge price tag in both money and time takes a huge toll, almost 10 million people die of cancer each year, which is around 19 people each minute.
With such a huge problem like this, it doesn’t make any sense and it’s just unacceptable. And it’s so surprising that despite the tremendous number of people who are affected by this and the amount of time we’ve spent researching, we still haven’t found a cure.
(This is Pt. 2 of my quantum protein folding series, if you want a more in-depth intro to this topic, check out my article here)
But what if drug discovery and the designing of new protein-based drugs were able to happen more efficiently and at a much faster rate? Not only would this help the millions of people currently diagnosed with these diseases and illnesses but it would help prevent countless more.
Conventional Ways and The Protein Folding Problem
Today, the conventional way of testing out these protein sequences are with the “take and test” method, by trying out each candidate until you generate a list of possible solutions, or “maybe candidates” which may not even work. This method is extremely inefficient, which is why drug development is so slow and tedious.
Because of this, there is something called the protein folding problem. With these proteins, you are given this one giant 20-letter word where you are then assigned a task to predict the global, three-dimensional structure that these proteins fold into.
So with these 20 available amino acids, each having varying molecular properties, when expanding to the human protein ranging to around 400 amino acids long, the computational complexity is huge.
Other methods, including NMR (nuclear magnetic resonance), cryo-electron microscopy and X-ray crystallography are also being carried out for novel drug discovery. These techniques are used to determine the shapes of the proteins, but all the methods depend a lot on trial and error. Years of work is spent while also costing hundreds of thousands of dollars for each protein structure.
Recently, advances in classical machine learning are also being carried out today, including the work that’s done with AlphaFold. Deep neural networks are able to predict properties of the protein, such as the distances between the pairs of the amino acids and the angles between the chemical bonds, solely from its genetic sequence.
They are however, restricted due to computational size limits and there is also the question of efficiency. The Levinthal Paradox states that it takes more than the age of the universe, 14 billion years to try all the possible combinations of protein folding based on amino acid chains while the body only takes seconds to achieve the folding pattern. At this rate, we can never effectively map out these proteins. In addition to that, proteins over 150 residues are unable to be computed classically.
Quantum Annealing
But even as protein folding continues to be a problem, it can potentially be solved with the development through quantum computing. This approach utilizes quantum-based algorithms to fold long protein sequences, where lattice models are then used to examine the specific properties of protein folding and design.
More specifically we can use quantum annealing which helps us find the lowest energy path when creating a protein, much more efficiently without being limited to the amount of classical computational power.
Simulated annealing is a probabilistic technique for approximating a global optimum of a given function within a large search space. It’s used for optimization problems (ex. The Traveling Salesman Problem).
Annealing a material itself involves
- Heating above the recrystallization temperature
- Maintaining a specific suitable temperature for an appropriate amount of time
- Cooling
The general steps of the simulated annealing algorithm are:
- The temperature decreases from an initial value to 0
- (At each time step) The algorithm selects a solution close to its current one
- Measures the quality of that solution
- Moves it according to the temperature-dependent probabilities of selecting better/worse solutions.
The algorithm accepts “worse” solutions compared to the current one as the solution space is explored. This allows for a more extensive search for the global optimum solution.
For each step, the simulated annealing algorithm considers a neighboring state S to the current state s and decides probabilistically to either move the system to state S or stay in s.
A move, or the way the states are changed in order to produce the neighboring states, result in small changes in the previous state in order to improve the solution by constantly improving its parts.
The probabilities lead the system to move to states of lower energy, and this state is repeated until the final state is considered “good enough.”
The whole point of it is to search for a maximum, which is found by cooling the temperature slowly until it reaches 0. The goal is bringing the system from the initial state to a state with the minimum possible energy.
Project: 2D Lattice Protein Folding with DWave’s Annealer
What I did was implement the protein folding problem on a 2D lattice model. There are three parts to this:
- Simulated Annealing with conventional Monte Carlo
- Turn Ancilla Encoding on a CPU
- Turn Ancilla Encoding on a QPU (Dwave’s Annealer)
We would first perform a simulation based on simulated annealing with Monte Carlo where the amino acid positions are simulated as random walks that are relative to the previous residue. We would then compare the results with a qubit model based on turn ancilla encoding, running the simulated annealing on a conventional CPU. This would also be running the quantum annealing on Dwave’s QPU, where all three results would be compared together.
Simulated Annealing with Monte Carlo
The purpose of using simulated annealing with Monte Carlo is to approximate the global optimum of a given function within a large search space (with many variables).
We would have to define a schedule for the annealing temperature, randomly choose a residue then perform a random walk with respect to that residue. To implement this, we would
- Input the protein sequence. In this case I used 6 residues for a sequence of “YYDPET.”
- Create an annealing schedule, which outputs the temperature to use.
For the algorithm, the temperature decreases from an initial value to 0.
At each time step, the algorithm selects a solution close to the current one, or the neighboring state.
It measures the quality of the solution then moves it according to the probabilities of selecting better or worse solutions.
3. The probability is displayed on this graph representation:
The moves result in changes of the previous state and its parts in order to improve the solution. The probabilities lead the system to move to states of lower energy and this step is repeated.
4. The folding of this sequence based on the annealing schedule that we created is outputted:
As a result of the folding in this protein sequence, 67 our of 360 iterations were in the lowest energy conformation (so around 18.6%).
Conventional Turn Ancilla Encoding
With the next method of turn ancilla encoding, we first implement this on a regular CPU then do a second simulation on a QPU.
Turn ancilla encoding basically puts these ancillary qubits into a Hamiltonian in order to encode information about the amino acid reactions.
As displayed here is the construction of the binary representation of a particular six unit lattice protein in turn encoding for a self-avoiding walk.
There is a requirement of two qubits per bond and the turn or bond directions are shown as 00 which means downwards, 01 which means rightwards, 10 which means left, and 11 which means upwards.
In the case of lattice folding, the chain has to be self-avoiding meaning a penalty is given if two amino acids overlap. This penalty is given by the energy function overlap returning a really high value if these amino acids overlap.
There is another energy function, back, which gives a penalty in a special case of overlaps that happens if the chain goes directly backwards on itself. This occurs when the y-component of one edge is the same as the x-component as another edge.
There is a third energy function used which is the pair, which marks the interaction between non-bonded acids that are adjacent on a lattice. The new energy function is constructed and our task now is to minimize this objective function.
The method for implementing this is basically the same as the previous one, instead of performing random walks and accepting the steps to find the optimum, we randomly choose a qubit, perform the flip, and accept the flip.
The annealing schedule is outputted based on our given energy function that we are using.
With the pre-processing, our energy expressions turns out to be like this:
We also need a helper function in order to clean and refine out the resulting expressions. Rounding is necessary for the zero coefficients who don’t evaluate to 0.
The result with the folding of the 6 residue protein is then outputted:
As a result with the 6 residue protein, 3 out of 720 were in the lowest energy conformation so around 0.4% with this energy function.
Turn Ancilla Encoding with Dwave Simulation
With the third method, we implement turn ancilla encoding onto a quantum processing unit using an annealer from dwave.
- We would first have to define the solver and get its correspondent adjacent graph for embedding other sampler parameters as well as importing the various packages.
The solver is then defined in order to get a corresponding adjacency graph to embed other sampler parameters.
2. We would then have to fix the first 3 qubits in order to avoid any redundant conformations for multiple problems within the same graph.
3. The embedding is verified and we would have to find the energy of the optimal solution as well as corroborating the energy in the sample.
3. After the process of embedding, we would have to find the energy of the optimal solution which would then be scaled to be the target energy for the true minimum.
4. The folding of this sequence is outputted:
As a result, with a 6 residue protein sequence we would have 7 instances in a single embedding with 10000 reads to have a time of 3170 ms. There are 6 out of 70,000 (around 0.00009%) in the lowest energy conformation.
Implications
Overall, the use of quantum annealing in protein folding for implications in drug discovery is still relatively new, but as we’re able to gain more knowledge about these protein shapes and how they operate through these simulations and models, it can then lead to efficient drug discovery.
So far, especially with misfiled proteins, we’re unable to accurately model the folding patterns based on the amino acid chain sequence. We’re becoming more familiar with the structure of the protein, but there is still a lot of uncertainty with achieving the exact formation through folding.
We’re currently able to model the folding of proteins in their native state, but we’re still not able to find the folding pattern of the misfolded proteins. Once we’re able to achieve this, we’d be able to find all the different kinds of combinations in order to efficiently design more effective protein-based drugs to eventually cure a huge variety of different diseases.
The designing of these protein-based therapeutic drugs ranging from anemia, pulmonary treatment, strokes, will save tens of millions of lives. This only marks the beginning, but we can only imagine what the future implications are for this field in the upcoming years.
If you liked this article, add a clap and stay in tuned for more articles coming soon! Reach out to me at aliceliu2004@gmail.com or on LinkedIn