.. _ml-pes:

ML for PES
==========

Slides
------

:download:`Slides <_static/mlatom/3-XACSW2024_20240704_Dral_wm.pdf>`

Machine learning potentials
---------------------------

You have already used the :ref:`KREG models `. The operating principle of these models is the same: for a given geometry, they predict the energy. Hence, such models are called machine learning (interatomic) potentials (abbreviated in the literature as both MLP and MLIP), as they represent potential energy surfaces (PES) of molecules, i.e., the energy :math:`E` as a function of the nuclear coordinates :math:`\mathbf{R}` in the Born--Oppenheimer approximation (see :ref:`literature on quantum chemistry `):

.. math:: E = f(\mathbf{R}).

As we have already seen, QM methods provide the first-principles way to calculate this energy, but they are slow and, hence, there is a huge incentive to use ML to learn the dependence of the energy on the nuclear coordinates from data. ML functions are much faster to evaluate because they are relatively simple compared to the numerically intensive QM algorithms. There is a whole zoo of MLPs, with typical representatives shown below (the potentials in bold are supported by MLatom):

.. image:: _static/lecture4/mlp.png
   :width: 800
   :align: center
   :alt: Zoo of MLP potentials

If this zoo seems intimidating -- it is. But do not worry: we will start with simple hands-on examples to get a feel for different MLPs and explain along the way what we should pay attention to when choosing an MLP. Let's first get a taste of different flavors of MLPs by extending our :ref:`first example of ML calculations `. There we trained the :ref:`KREG model ` on data for the H\ :sub:`2` molecule and used this MLP to :ref:`optimize its geometry `. As usual, for training MLPs we need the corresponding data -- energies and coordinates.
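To make the idea of :math:`E = f(\mathbf{R})` concrete, here is a minimal from-scratch sketch of how a kernel-based model of the KREG type can represent a 1D PES: Gaussian-kernel ridge regression fitted to the energies of a diatomic, with a Morse potential standing in for the QM reference data. All functions and parameters below are illustrative assumptions, not MLatom's actual implementation.

.. code-block:: python

    import numpy as np

    def morse(r, De=4.7, a=1.9, re=0.74):
        # Stand-in "QM" reference energy for a diatomic (illustrative parameters)
        return De * (1.0 - np.exp(-a * (r - re)))**2

    def kernel(x, y, sigma):
        # Gaussian kernel between two sets of bond lengths
        return np.exp(-(x[:, None] - y[None, :])**2 / (2.0 * sigma**2))

    # "Training data": bond lengths and their reference energies
    r_train = np.linspace(0.5, 2.0, 20)
    E_train = morse(r_train)

    # Kernel ridge regression: solve (K + lambda*I) alpha = E
    sigma, lam = 0.3, 1e-8
    K = kernel(r_train, r_train, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(r_train)), E_train)

    def predict(r):
        # The trained MLP: a weighted sum of Gaussians centered on training geometries
        return kernel(np.atleast_1d(r), r_train, sigma) @ alpha

    r_test = np.linspace(0.55, 1.95, 50)
    mae = np.mean(np.abs(predict(r_test) - morse(r_test)))
    print(f"test MAE: {mae:.1e}")

The trained model is just a weighted sum of simple analytic functions centered on the training geometries, which is why evaluating it is so much cheaper than running a QM calculation.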
In the following task, we will try another model of a completely different type and see what happens:

.. include:: materials/lecture4/task4.1_ani_h2/task4.1_ani_h2.inc

After you run these simulations, it is useful to compare the results with other people around you. Or you can run them several times... What you will notice is that:

- ANI training is definitely much slower than KREG training,
- ANI results are all over the place, while KREG always produces the same result.

To understand why, you need to :ref:`understand the underlying differences between the ML algorithms `.

.. _lecture4_ml_testing:

Testing ML models
-----------------

Once you have trained a model, you should always test how well it performs. Let's first see what can go wrong when you train a model and leave it unchecked:

.. include:: materials/lecture4/task4.2_kreg_overfit/task4.2_kreg_overfit.inc

As you can see from the above task, modern ML models are so flexible that they can easily memorize the training data (a severe case of *overfitting*), while our goal is to make good predictions for points outside the training set. Hence, the standard practice is to evaluate the error of ML models on an independent test set. Let's do it in the next task.

.. include:: materials/lecture4/task4.3_kreg_test/task4.3_kreg_test.inc

All these are just initial tests. **The ultimate test is the performance in the required application!**

.. _lecture4_ml_selection:

Choosing ML model settings to make it work
------------------------------------------

We saw that it is easy to overfit the model. However, we do not want to *underfit* it either, i.e., to obtain a bad model that does not even learn the known data. How well the model fits the known data and generalizes to unknown data strongly depends on the settings of the model. These settings, known as *hyperparameters*, are not model parameters optimized during training, but they influence the training outcome.
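The overfitting and testing tasks above can be reproduced in a few lines of from-scratch kernel ridge regression (a toy 1D sketch with made-up data, not the KREG model itself): an almost unregularized model with a very narrow kernel memorizes the noisy training points, while a regularized one has a larger training error but generalizes much better to an independent test set.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    def target(x):
        # Made-up "true" 1D potential
        return np.sin(3.0 * x)

    x_train = rng.uniform(0.0, 3.0, 15)
    y_train = target(x_train) + rng.normal(0.0, 0.2, 15)  # noisy reference data
    x_test = np.linspace(0.1, 2.9, 100)

    def fit_and_score(sigma, lam):
        # Train KRR, return (training RMSE, test RMSE)
        K = np.exp(-(x_train[:, None] - x_train[None, :])**2 / (2 * sigma**2))
        alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)
        k_test = np.exp(-(x_test[:, None] - x_train[None, :])**2 / (2 * sigma**2))
        rmse = lambda pred, ref: np.sqrt(np.mean((pred - ref)**2))
        return rmse(K @ alpha, y_train), rmse(k_test @ alpha, target(x_test))

    # Narrow kernel, almost no regularization: memorizes the training set
    train_overfit, test_overfit = fit_and_score(sigma=0.05, lam=1e-12)
    # Wider kernel with regularization: worse training fit, better generalization
    train_reg, test_reg = fit_and_score(sigma=0.4, lam=1e-3)
    print(f"overfit:     train {train_overfit:.3f}, test {test_overfit:.3f}")
    print(f"regularized: train {train_reg:.3f}, test {test_reg:.3f}")

The near-zero training error of the first model says nothing about its quality; only the error on points it has never seen does.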
For an NN, such hyperparameters include the number of layers, the number of neurons in each layer, the choice of activation function, the batch size, and the learning rate. In the case of KRR, there may be hyperparameters entering the kernel function (e.g., the Gaussian width), and all KRR models have a regularization hyperparameter that imposes a penalty on the magnitude of the regression coefficients so that the model's variance around the known data is not too large -- larger regularization usually leads to a more generalizable model but with a higher fitting error. In general, there is usually a sweet spot in the choice of hyperparameters, a manifestation of the *bias-variance tradeoff*. Practically, hyperparameter optimization is done by splitting the training set into a sub-training set (often, confusingly, called the training set too) and a validation set. The hyperparameters are optimized so that the error on the validation set is the lowest. In the case of NNs, the model is often trained until the error on the validation set has not decreased for some number of epochs known as the *patience*; when this point is reached, the training is stopped. This approach is called *early stopping*. Note that since the validation set is indirectly used to obtain a better model, it cannot be used to judge the generalization error, which must be evaluated on the independent test set. Luckily, many packages, MLatom included, provide automatic routines for the choice of hyperparameters; see the next exercise.

.. include:: materials/lecture4/task4.4_kreg_hyperopt/task4.4_kreg_hyperopt.inc

As for choosing the best MLP, this is not an easy question, as it depends on the task at hand. Practically speaking, NNs tend to be used more, and among them, options like ANI provide a good balance between speed (of both training and evaluation) and accuracy, but if you can afford equivariant networks such as MACE, you should try them.
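The sub-training/validation/test protocol described above can be sketched from scratch (again a toy 1D KRR example with made-up data and an illustrative hyperparameter grid, not MLatom's automatic routines): the hyperparameters with the lowest validation error are selected, and only the untouched test set is used to judge the final model.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(1)

    def target(x):
        # Made-up "true" 1D potential
        return np.sin(3.0 * x)

    x = rng.uniform(0.0, 3.0, 60)
    y = target(x) + rng.normal(0.0, 0.1, 60)

    # Split: 40 sub-training, 10 validation, 10 independent test points
    x_sub, y_sub = x[:40], y[:40]
    x_val, y_val = x[40:50], y[40:50]
    x_tst, y_tst = x[50:], y[50:]

    def krr_predict(x_tr, y_tr, x_new, sigma, lam):
        K = np.exp(-(x_tr[:, None] - x_tr[None, :])**2 / (2 * sigma**2))
        alpha = np.linalg.solve(K + lam * np.eye(len(x_tr)), y_tr)
        k = np.exp(-(x_new[:, None] - x_tr[None, :])**2 / (2 * sigma**2))
        return k @ alpha

    def rmse(pred, ref):
        return np.sqrt(np.mean((pred - ref)**2))

    # Grid search: keep the hyperparameters with the lowest validation error
    best = None
    for sigma in (0.05, 0.2, 0.5, 1.0):
        for lam in (1e-8, 1e-4, 1e-1):
            err = rmse(krr_predict(x_sub, y_sub, x_val, sigma, lam), y_val)
            if best is None or err < best[0]:
                best = (err, sigma, lam)

    val_rmse, sigma_opt, lam_opt = best
    # The validation set helped pick the model, so judge it on the untouched test set
    test_rmse = rmse(krr_predict(x_sub, y_sub, x_tst, sigma_opt, lam_opt), y_tst)
    print(f"best sigma={sigma_opt}, lambda={lam_opt}: "
          f"validation RMSE {val_rmse:.3f}, test RMSE {test_rmse:.3f}")
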
MACE is definitely slower, but there is growing evidence that it is becoming an algorithm of choice when accurate and robust results are needed.

.. _lecture4_ml_forces:

Importance of including forces in training MLPs
-----------------------------------------------

Last but not least, when training MLPs, you should always utilize derivative information if it is available. The first-order derivative of the energy with respect to the nuclear coordinates is the energy gradient (the negative of the force), and it is available in many but not all QM methods. When trained on both energies and forces, the models have much higher accuracy, and even though the training is slower, it is worth doing. MLatom provides an easy way to train all of the supported MLPs on both energies and forces. You can check how the results improve when you train the model with energy gradients in addition to energies.

.. include:: materials/lecture4/homework4.3/homework4.3.inc

This example reflects the well-recognized advice that you must include forces during training when they are available!

Splitting
---------

You might have been disappointed that the model trained in the :ref:`previous task ` sometimes was still not good enough. One of the reasons is that the validation set is quite large with the default settings in MLatom (20% of the training set). While this is a safe option for many applications, practice shows that 10% is sufficient (i.e., a 9:1 splitting of the training set). Some people may argue that cross-validation is required, and this is also supported in MLatom, but again, practice shows that it is overkill. Hence, in the next homework you have a chance to get a more robust model by playing with different splittings.

.. include:: materials/lecture4/homework4.1/homework4.1.inc

The results should become substantially more stable. To be honest, 9:1 splitting is sufficient, and some people even go to extremes and use something like 50 out of 1k training points for validation.
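Returning to the advice on forces above: why gradients help so much can be seen in a tiny from-scratch 1D example (a sketch with made-up functions, not MLatom's actual kernel machinery). Each geometry contributes one energy equation but also one gradient equation per degree of freedom, so the same number of reference calculations constrains the model far better.

.. code-block:: python

    import numpy as np

    def f(x):
        # Stand-in 1D potential
        return np.sin(2.0 * x)

    def df(x):
        # Its analytic gradient (what a QM code returns as forces, up to sign)
        return 2.0 * np.cos(2.0 * x)

    x_tr = np.array([0.4, 1.6])  # only two "QM" calculations

    # Cubic model E(x) = c0 + c1*x + c2*x^2 + c3*x^3
    V = np.vander(x_tr, 4, increasing=True)                                 # energy equations
    dV = np.column_stack([np.zeros(2), np.ones(2), 2 * x_tr, 3 * x_tr**2])  # gradient equations

    # Energies only: 2 equations for 4 coefficients (minimum-norm least squares)
    c_E = np.linalg.lstsq(V, f(x_tr), rcond=None)[0]
    # Energies + gradients: 4 equations for 4 coefficients
    c_EG = np.linalg.solve(np.vstack([V, dV]), np.concatenate([f(x_tr), df(x_tr)]))

    x_te = np.linspace(0.4, 1.6, 200)
    Vt = np.vander(x_te, 4, increasing=True)
    rmse_E = np.sqrt(np.mean((Vt @ c_E - f(x_te))**2))
    rmse_EG = np.sqrt(np.mean((Vt @ c_EG - f(x_te))**2))
    print(f"energies only: RMSE {rmse_E:.3f}; energies + gradients: RMSE {rmse_EG:.3f}")

With energies alone, the two reference points leave the model badly underdetermined; adding the gradients pins down both the values and the slopes, and the test error drops sharply.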
MLatom's default setting of 8:2 splitting is probably too conservative and might be changed in the future to 9:1, as that is what we now mostly use in practice. You need to check for yourself what kind of splitting works best for your problem.

Learning curves
---------------

.. include:: materials/lecture4/homework4.2/homework4.2.inc

This example shows you how to perform the type of analysis that we did some time ago (in 2020--2021), which became one of the `highly cited articles `__ in the prestigious journal Chemical Science:

.. image:: _static/lecture5/ChemSci_certificate.jpeg
   :width: 800
   :align: center
   :alt: Highly cited paper in Chemical Science

Note, however, that the field is moving forward fast, and the conclusions we drew in 2021 do not reflect all the developments since then. In particular, I would really recommend having a look at equivariant NNs like MACE! Luckily, with MLatom you can easily perform the same type of analysis for the new MLPs to draw your own conclusions.
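The kind of analysis done in the learning-curve homework can be sketched from scratch as well (a toy 1D KRR example with made-up data, not the MLatom workflow): train the same model on growing subsets of a data pool and record the test error for each training-set size.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(2)

    def target(x):
        # Made-up "true" 1D potential
        return np.sin(3.0 * x)

    x_pool = rng.uniform(0.0, 3.0, 200)  # pool of reference calculations
    x_test = np.linspace(0.1, 2.9, 100)  # fixed test grid

    def test_rmse(n, sigma=0.4, lam=1e-8):
        # Train KRR on the first n pool points, evaluate on the test grid
        x_tr = x_pool[:n]
        K = np.exp(-(x_tr[:, None] - x_tr[None, :])**2 / (2 * sigma**2))
        alpha = np.linalg.solve(K + lam * np.eye(n), target(x_tr))
        k = np.exp(-(x_test[:, None] - x_tr[None, :])**2 / (2 * sigma**2))
        return np.sqrt(np.mean((k @ alpha - target(x_test))**2))

    sizes = [5, 10, 20, 40, 80]
    errors = [test_rmse(n) for n in sizes]
    for n, e in zip(sizes, errors):
        print(f"N = {n:3d}: test RMSE {e:.2e}")

Plotting the error against the training-set size (usually on a log-log scale) gives the learning curve; its slope tells you how much additional data is likely to buy you for a given model.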