10. Delta-learning
10.1. Slides
10.2. Delta-learning operating principle
One of the simplest and most robust approaches to combining ML and QM, which does not require tampering with the QM method itself, is delta-learning. In delta-learning, instead of training the model directly on the target QM level (e.g., FCI), we only learn the difference (delta) between the target and baseline QM levels. The baseline QM level is a computationally faster but less accurate method. The rationale is that learning differences is easier and that, if some PES regions are insufficiently sampled, the predictions fall back to the baseline method, which serves as a ‘fail-safe’.
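The two steps (learn the difference, then add it back to the baseline) can be sketched as follows. This is a minimal illustration with synthetic stand-in curves, not real HF or FCI energies, and sklearn's RBF-kernel `KernelRidge` plays the role of the KREG model:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Synthetic stand-ins for the two QM levels (NOT real HF or FCI values):
def e_baseline(r):  # cheap, less accurate level
    return (1 - np.exp(-1.5 * (r - 0.9)))**2
def e_target(r):    # expensive, accurate level
    return (1 - np.exp(-1.8 * (r - 0.74)))**2

r_train = np.linspace(0.6, 4.0, 15).reshape(-1, 1)

# Step 1: train only on the difference (delta) between the two levels.
delta_model = KernelRidge(kernel='rbf', gamma=1.0, alpha=1e-8)
delta_model.fit(r_train, e_target(r_train) - e_baseline(r_train))

# Step 2: at prediction time, add the learned correction to the baseline.
r_new = np.array([[1.2]])
e_estimate = e_baseline(r_new) + delta_model.predict(r_new)
```

Note that every prediction still requires one baseline calculation, which is the main cost of this scheme.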
The operating principle of delta-learning can best be seen in the following exercise:
Example ML-delta.
Train the KREG model on the differences between the FCI and HF/STO-3G energies of H2 for interatomic distances larger than or equal to 0.8 Angstrom. These differences are the residual errors of HF/STO-3G with respect to FCI.
Estimate the residual errors with this KREG delta-learning model for all internuclear distances from 0.5 to 5 Angstrom. Add them back to the HF/STO-3G energies to obtain an estimate of the FCI energies with such a delta-learning model.
See the previous tasks for how to train and predict.
Plot the potential energy curve:
How does this delta-learning model perform compared to the KREG model you previously trained directly on the FCI data with the same training set (e.g., in the previous task)?
Required data files (full potential energy curve from 0.5 to 5 Angstrom; you need to cut them as needed):
- E_FCI_451.dat - full CI/aug-cc-pV6Z energies.
- H2_HF.en - HF/STO-3G energies.
- h2.xyz - geometries.
- R_451.dat - interatomic distances for your convenience (they can be extracted from the h2.xyz file).
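If you want to prototype the workflow before touching the real data files, a sketch along these lines may help. It uses synthetic stand-in curves (placeholders, not the actual E_FCI_451.dat and H2_HF.en values) and sklearn's RBF-kernel `KernelRidge` as a KREG analogue; with the real files you would load the distances and energies with `np.loadtxt` instead:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Synthetic stand-ins for the full 451-point curve from 0.5 to 5 Angstrom
# (placeholders for the real E_FCI_451.dat / H2_HF.en data).
r_all = np.linspace(0.5, 5.0, 451).reshape(-1, 1)
e_hf  = (1 - np.exp(-1.5 * (r_all - 0.9)))**2 - 0.05   # baseline-like curve
e_fci = (1 - np.exp(-1.8 * (r_all - 0.74)))**2         # target-like curve

# Cut the training set to r >= 0.8 Angstrom (subsampled here to keep the
# example small) and train on the residual errors of the baseline.
idx = np.where(r_all.ravel() >= 0.8)[0][::20]
delta_model = KernelRidge(kernel='rbf', gamma=0.5, alpha=1e-8)
delta_model.fit(r_all[idx], (e_fci - e_hf)[idx])

# Estimate the residuals for ALL distances from 0.5 to 5 Angstrom and add
# them back to the baseline energies.
e_delta_learned = e_hf + delta_model.predict(r_all)

# Direct model trained on the same points, for comparison.
direct_model = KernelRidge(kernel='rbf', gamma=0.5, alpha=1e-8)
direct_model.fit(r_all[idx], e_fci[idx])
e_direct = direct_model.predict(r_all)
```

Plotting `e_delta_learned` and `e_direct` against the target curve shows the characteristic behavior: below 0.8 Angstrom, where no training data exist, the delta model falls back toward the baseline curve while the direct model has nothing to fall back on.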
As you can see, delta-learning is not a silver bullet and does not resolve all the problems, but it does help to mitigate the problems seen in direct learning to a good extent. At least it is often still able to capture the most important features of the PES at more or less the right locations where direct learning fails. Some of the failures of direct learning can be completely unphysical, such as missing the minimum on the PES or having too low a dissociation energy, leading to molecular explosion during MD.
You have already used AIQM1, which is a delta-learning model that uses an ANI neural network to correct a semi-empirical QM baseline. You also saw that although both AIQM1 and ANI-1ccx were trained on the same data (and use similar neural networks!), AIQM1 is usually more accurate and robust, while ANI-1ccx sometimes leads to unphysical failures (remember H2 and its MD!).
What disadvantages of delta-learning can you think of?
10.3. Getting rid of a QM baseline: hierarchical ML
One of the disadvantages of delta-learning is that you have to calculate the properties with the baseline QM method whenever you make predictions. The baseline QM method is faster than the target QM method but typically much slower than a pure ML model trained directly on the target QM data. There are several ways to speed up the simulations by getting rid of the slow QM baseline. The simplest one is to train another ML model directly on the baseline QM data and use this ML baseline model instead. This only works if you train the baseline model on more baseline data than you have target data, e.g., you can train an ML model on the entire H2 data set of 451 points with HF/STO-3G energies along the potential energy curve from 0.5 to 5 Angstrom while using only a fraction of these points with energies at the FCI level. This approach can be generalized to any number of QM levels and is known as hierarchical ML.
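A minimal two-level sketch of this hierarchy, again with synthetic stand-in curves (not real HF/FCI data) and sklearn's `KernelRidge` in place of KREG:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Synthetic stand-ins for the two QM levels (NOT real HF or FCI values):
def cheap(r):   return (1 - np.exp(-1.5 * (r - 0.9)))**2    # baseline-like
def target(r):  return (1 - np.exp(-1.8 * (r - 0.74)))**2   # target-like

r_many = np.linspace(0.5, 5.0, 451).reshape(-1, 1)  # abundant baseline points
r_few  = np.linspace(0.8, 5.0, 15).reshape(-1, 1)   # scarce target points

# Level 1: an ML model replacing the baseline QM method entirely,
# trained on the plentiful cheap data.
base_ml = KernelRidge(kernel='rbf', gamma=0.5, alpha=1e-6)
base_ml.fit(r_many, cheap(r_many))

# Level 2: a delta model trained on the few expensive target points.
delta_ml = KernelRidge(kernel='rbf', gamma=0.5, alpha=1e-8)
delta_ml.fit(r_few, target(r_few) - cheap(r_few))

# Prediction needs no QM call at all: ML baseline + ML correction.
r_new = np.array([[1.5]])
e_new = base_ml.predict(r_new) + delta_ml.predict(r_new)
```

The design choice is the same as in delta-learning; only the slow QM baseline call at prediction time has been replaced by a second, cheap ML evaluation.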
A similar and more popular approach is transfer learning. Here you first train an ML model on more data obtained with the baseline QM method and then fine-tune it (i.e., train only some or all of its parameters) on the fewer data points at the target QM level. This can also be done, e.g., by using experimental data as the target. It is a powerful and easy-to-use approach and was also used in both AIQM1 and ANI-1ccx. In ANI-1ccx, the model was first trained on 4.5 million DFT data points and then fine-tuned on 0.5 million coupled-cluster data points.
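The pre-train/fine-tune cycle can be sketched with a small sklearn network. This is a crude analogue under stated assumptions: the curves are synthetic stand-ins (not real DFT or coupled-cluster data), and because sklearn cannot freeze individual layers, `warm_start=True` simply makes the second `fit()` call continue training all parameters from the pre-trained weights:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-ins for the two data sets (NOT real QM data):
def e_base(r):    return (1 - np.exp(-1.5 * (r - 0.9)))**2 - 0.05
def e_accurate(r): return (1 - np.exp(-1.8 * (r - 0.74)))**2

r_base     = np.linspace(0.5, 5.0, 91).reshape(-1, 1)  # plentiful cheap data
r_accurate = np.linspace(0.8, 5.0, 12).reshape(-1, 1)  # scarce accurate data

# warm_start=True lets a later fit() call start from the current weights.
net = MLPRegressor(hidden_layer_sizes=(32, 32), activation='tanh',
                   solver='lbfgs', max_iter=5000, warm_start=True,
                   random_state=0)

# Step 1: pre-train on the abundant baseline-level data.
net.fit(r_base, e_base(r_base).ravel())
rmse_before = np.sqrt(np.mean(
    (net.predict(r_accurate) - e_accurate(r_accurate).ravel())**2))

# Step 2: fine-tune the same weights on the few target-level points.
net.fit(r_accurate, e_accurate(r_accurate).ravel())
rmse_after = np.sqrt(np.mean(
    (net.predict(r_accurate) - e_accurate(r_accurate).ravel())**2))
```

After fine-tuning, the error on the target-level points drops, while the pre-training has already placed the weights in a sensible region so that only a little accurate data is needed.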
The problem with all these approaches without a baseline QM method is that there is no guarantee that the baseline ML model will always work as well as the baseline QM method.