As part of the Theory and Modelling in Chemical Sciences MSc at Oxford, there is a module focused on basic software development. It consists of two days of tutorials followed by two days of hands-on programming – the hackathon!
The purpose of this module is to expose students, who may have only seen basic Fortran, to more modern programming tools. The first two days introduce a variety of topics, such as how to use an IDE, the basic features of object-oriented programming, using version control, and building tests for your code (you can find the full schedule here).
This year the hackathon was focused on machine learning, as the Glowacki research group is increasingly interested in this topic. The three of us who participated in the hackathon designed the following projects:
- Using deep learning for avatar aesthetics in virtual reality – Mike O’Connor
- Machine learning of transfer operators for biomolecules – Rob Arbon
- Fitting potential energy surfaces with neural networks
I led the last project, which is described below. You can click on the links to read more about the first two projects!
Fitting potential energy surfaces with neural networks
One of the most widely used techniques for simulating the behaviour of molecules is Molecular Dynamics. The time evolution of the atoms is computed classically by propagating Newton’s equations of motion. The accuracy of the simulation depends on how the forces acting on the atoms are estimated. Generally, they are obtained as the gradients of a potential energy surface whose parameters are fitted to experimental data, but this approach doesn’t take into account the relationship between the electronic properties of the atoms and their dynamics!
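Concretely, propagating Newton’s equations usually means stepping a symplectic integrator such as velocity Verlet. A minimal sketch (the harmonic force function and the units below are purely illustrative, not those of any real simulation):

```python
import numpy as np

def velocity_verlet_step(positions, velocities, masses, force_fn, dt):
    """One velocity-Verlet step of Newton's equations of motion.

    force_fn(positions) returns the forces, e.g. the negative
    gradient of a parametrised potential energy surface.
    """
    accel = force_fn(positions) / masses[:, None]
    positions = positions + velocities * dt + 0.5 * accel * dt**2
    new_accel = force_fn(positions) / masses[:, None]
    velocities = velocities + 0.5 * (accel + new_accel) * dt
    return positions, velocities

# Toy example: one particle in a harmonic well (F = -x, so x(t) = cos t)
pos = np.array([[1.0, 0.0, 0.0]])
vel = np.zeros((1, 3))
m = np.array([1.0])
harmonic = lambda r: -r
for _ in range(100):
    pos, vel = velocity_verlet_step(pos, vel, m, harmonic, dt=0.01)
```

Swapping `harmonic` for the gradient of any fitted surface gives the same loop used in production codes.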
Because of this limitation, “ab initio Molecular Dynamics” was introduced, where the forces are estimated through quantum mechanical calculations. However, the improved accuracy comes at a significant computational cost. Wouldn’t it be good if one could combine the speed of parametrised potential energy surfaces with quantum accuracy? This is where neural networks come into play.
During a simulation, similar molecular configurations are often re-visited, but a quantum mechanical calculation is done from scratch at every time step. The idea is to collect all the important configurations and their respective energies in a dataset.
An artificial neural network is then used to interpolate between all the data points. In this way, a fit of the potential energy surface is obtained. This can then be used in simulations to obtain the forces – at the speed of a parametrised surface, but with the accuracy of the quantum calculations!
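Once a surface has been fitted, the forces are its negative gradients. A minimal sketch of recovering them by numerical differentiation of a model’s energy prediction (the toy quadratic “surface” is illustrative; a real model such as scikit-learn’s MLPRegressor, used later in this post, does not expose analytic gradients, so central differences are one simple workaround):

```python
import numpy as np

def forces_from_model(predict_energy, coords, eps=1e-4):
    """Estimate forces as the negative numerical gradient of a fitted
    energy model. `predict_energy` maps a flat coordinate vector to a
    scalar energy; central differences are taken along each coordinate."""
    coords = np.asarray(coords, dtype=float).ravel()
    grad = np.zeros_like(coords)
    for i in range(coords.size):
        step = np.zeros_like(coords)
        step[i] = eps
        grad[i] = (predict_energy(coords + step)
                   - predict_energy(coords - step)) / (2.0 * eps)
    return -grad.reshape(-1, 3)  # one force vector per atom

# Toy quadratic surface E = |r|^2, whose exact force is F = -2r
toy_energy = lambda x: float(np.dot(x, x))
F = forces_from_model(toy_energy, np.array([[1.0, 0.0, 0.0]]))
```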
When using neural networks to fit some data, the most tedious part is definitely obtaining the data. In our case, we managed to make it ‘fun’. Using a virtual reality framework that enables users to interact directly with molecular simulations, we had kind volunteers in the Centre for Computational Chemistry at Bristol make a methane molecule react with a cyanide radical (in virtual reality!). This is basically enhanced sampling guided by human intuition. The video below shows some of the configurations that were generated. See how often a hydrogen is exchanged between the methane and the cyanide radical!
This generated in total about 50000 configurations of the methane and cyanide radical system. Great! But then, the energies of these configurations had to be calculated at a quantum mechanical level of theory. Not so great.
The 5% of the data with the highest energies was removed from the data set. About 47000 configurations were kept, and their energies were calculated with one of the less computationally expensive quantum methods (DFT with the PBE functional and the STO-3G basis set). These energies are shown in the figures below as a function of the carbon-carbon distance. It is interesting how even a cheap quantum calculation removes the spurious minimum at ~2.7 Å!
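The filtering step can be sketched as follows. The array shapes, the synthetic energies, and the use of NumPy are assumptions for illustration; only the 5% cutoff comes from the project:

```python
import numpy as np

def drop_highest_energy(coords, energies, fraction=0.05):
    """Discard the given fraction of configurations with the highest
    energies, keeping coordinates and energies aligned."""
    cutoff = np.quantile(energies, 1.0 - fraction)
    keep = energies <= cutoff
    return coords[keep], energies[keep]

# Synthetic stand-in: 1000 configurations of a 6-atom system
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 18))   # 6 atoms x 3 Cartesian coordinates
E = rng.normal(size=1000)         # fake energies
X_kept, E_kept = drop_highest_energy(X, E)
```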
Then, for about 7000 of the configurations, the energies were also calculated with a more expensive quantum method (DFT with the B3LYP functional and the aug-cc-pVTZ basis set). The configurations for which both a PBE and a B3LYP energy were available were collected into a data set, which can be found here.
Dividing the tasks among the students
In order to write all the bits of code in only two days, the following tasks were split among the students:
- Write a script that imports the raw data into the program.
- Turn the raw data into a format that can be processed by the neural network, since Cartesian coordinates are not suitable for this purpose. The formats chosen were three different flavours of the Coulomb matrix, which are described in more detail here.
- Construct a pipeline in Scikit learn to automate the steps required to train the neural network. This procedure was followed to set up the pipeline.
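As a reference for the second task, the plain and sorted flavours of the Coulomb matrix can be sketched as below (the third, randomly sorted flavour instead sorts after adding noise to the row norms). Distances are assumed to be in atomic units; the H2 example is illustrative:

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: M_ii = 0.5 * Z_i**2.4 and
    M_ij = Z_i * Z_j / |R_i - R_j| (distances in bohr)."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

def sorted_coulomb_matrix(Z, R):
    """Sorted flavour: permute atoms so row norms are descending,
    which makes the descriptor invariant to atom ordering."""
    M = coulomb_matrix(Z, R)
    order = np.argsort(-np.linalg.norm(M, axis=1))
    return M[np.ix_(order, order)]

# Toy example: H2 with a 1.4 bohr bond length
M = sorted_coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [1.4, 0.0, 0.0]])
```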
Scikit learn – some technical details…
Since this is a regression problem, the scikit-learn estimator used was MLPRegressor (Multi-Layer Perceptron Regressor). The data was preprocessed with StandardScaler and then randomly split into a training set containing 90% of the samples and a test set with the remaining ones. Using GridSearchCV, different values of the hyperparameters (learning rate, regularisation parameter, and number of units in the single hidden layer) were tried and evaluated through k-fold cross-validation. Because the potential energy surface can have regions with large gradients, where a higher density of training points would be needed for fitting, and because the data set is small, it was decided to fit the difference between the more and less accurate energies: this difference varies more slowly, so it should be easier to fit1. Table 1 shows the R2 values calculated on the test set for the three different types of input used.
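A pipeline of this kind can be sketched as follows. The synthetic data and the particular grid values are illustrative assumptions; only the overall structure (StandardScaler, MLPRegressor, a 90/10 split, and a cross-validated grid search over the named hyperparameters) mirrors the text:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

# Toy stand-in for the descriptor features and energy differences
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))
y = X @ rng.normal(size=15)  # synthetic regression target

# 90% training set, 10% held-out test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPRegressor(max_iter=2000, random_state=0)),
])

# Grid over the hyperparameters mentioned in the text,
# evaluated with k-fold cross-validation
grid = GridSearchCV(
    pipe,
    param_grid={
        "mlp__hidden_layer_sizes": [(20,), (50,)],  # single hidden layer
        "mlp__alpha": [1e-4, 1e-2],                 # regularisation
        "mlp__learning_rate_init": [1e-3],          # learning rate
    },
    cv=3,
)
grid.fit(X_train, y_train)
r2 = grid.score(X_test, y_test)  # R^2 on the test set
```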
The best results are obtained with the randomly sorted Coulomb matrix. As described in this paper, using a set of random Coulomb matrices for each molecule helps to overcome the high dimensionality of the input space, which leads to great performance improvements.
As can be seen from the graphs below, for the randomly sorted Coulomb matrix the worst predictions have an error of about 0.3 Hartree, corresponding to 787.5 kJ/mol. This is still a very large error. It will be interesting to see whether training on a larger data set improves the predictions. In addition, the grid search for optimising the hyperparameters is very slow, so only a few combinations of values were tried. In the future, TensorFlow (a software library developed by Google for numerical computation) will be tested for this system. TensorFlow code can be run on GPUs, which will considerably accelerate the training process!