.. _tutorial_user-defined-models:

Building models
=================================

Here we show how to create user-defined models with different functions (prediction and training) and of different complexity. See also :ref:`manual `.

Prediction
----------

A typical model is a Python class with a ``predict`` method that accepts a :class:`molecule ` and/or a :class:`molecular_database ` and has the keyword arguments ``calculate_energy`` and, often, ``calculate_energy_gradients``. Such a model then needs to add the required properties (energies and gradients) to the molecule or to all molecules in the database; it can then be used for different types of simulations such as :ref:`single-point calculations `, :ref:`geometry optimization `, and :ref:`molecular dynamics `. Here is a Python code snippet to give a better idea:

.. code-block:: python

    class mymodel():
        def __init__(self):
            pass

        def predict(self,
                    molecule=None,
                    molecular_database=None,
                    calculate_energy=True,
                    calculate_energy_gradients=True):
            if molecule is not None:
                if calculate_energy:
                    molecule.energy = ... # do required calculations for energy, in Hartree!
                if calculate_energy_gradients:
                    molecule.energy_gradients = ... # in Hartree/Angstrom!
                    # Note that energy_gradients = -forces!
            if molecular_database is not None:
                for mol in molecular_database:
                    if calculate_energy:
                        mol.energy = ...
                    if calculate_energy_gradients:
                        mol.energy_gradients = ...

    model = mymodel()
    model.predict(molecule=mymol)
    print(mymol.energy)

.. note::

    The model has to use Hartree for energies and Hartree/Angstrom for gradients if it is to be used in many of the simulations, as these are the default units in MLatom.

Model trees
+++++++++++

.. note::

    Text and figures of this section are adapted from the paper on MLatom 3 in *J. Chem. Theory Comput.* **2024**, DOI: `10.1021/acs.jctc.3c01203 <https://doi.org/10.1021/acs.jctc.3c01203>`_ (published under the CC-BY 4.0 license).

Often, it is beneficial to combine several models (not necessarily ML!). One example of such composite models is based on Δ-learning, where a low-level QM method is used as a baseline that is corrected by an ML model to approach the accuracy of the target higher-level QM method. Another example is ensemble learning, where multiple ML models are created and their predictions are averaged during the simulations to obtain more robust results and to use the query-by-committee strategy of :ref:`active learning `. Both of these concepts can also be combined in more complex workflows, as exemplified by the :ref:`AIQM1 ` method, which uses an NN ensemble as the correcting Δ-learning model and a semiempirical QM method as the baseline. To easily implement such workflows, MLatom allows the construction of composite models as model trees consisting of :class:`model_tree_node ` instances; see the example for AIQM1:

.. image:: _static/mlatom3paper/image5.png
    :width: 600
    :align: center
    :alt: Composite models can be constructed as a model tree in MLatom; here, an example is shown for the AIQM1 method.

AIQM1's root parent node comprises three children: the semiempirical QM method ODM2*, the NN ensemble, and the additional D4 dispersion correction. The NN ensemble is in turn a parent of eight ANI-type NN children. Predictions of parents are obtained by applying an operation ("average" or "sum") to the children's predictions. The corresponding code snippets are shown in the figure, too.
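In code, such a tree is assembled from :class:`model_tree_node ` instances. The snippet below is a minimal sketch of a Δ-learning-style composite model; it assumes that the constructor takes ``name``, ``model``, ``children``, and ``operator`` keyword arguments (with ``'predict'`` for leaf nodes and ``'sum'``/``'average'`` for parents, as in the figure's code snippets), and the leaf models are hypothetical placeholders:

.. code-block:: python

    import mlatom as ml

    # hypothetical leaf models: a low-level QM baseline and an ML correction
    baseline = baseline_model()      # e.g., a low-level QM method
    correction = correcting_model()  # e.g., an ML model trained on the QM difference

    # leaf nodes wrap individual models and call their predict() methods
    baseline_node = ml.models.model_tree_node(name='baseline',
                                              model=baseline,
                                              operator='predict')
    correction_node = ml.models.model_tree_node(name='correction',
                                                model=correction,
                                                operator='predict')

    # the parent node sums the children's predictions (Δ-learning)
    delta_model = ml.models.model_tree_node(name='delta_model',
                                            children=[baseline_node, correction_node],
                                            operator='sum')

    # the composite model is then used like any other model
    delta_model.predict(molecule=mymol, calculate_energy=True)
    print(mymol.energy)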
Other examples of possible composite models are hierarchical ML, which combines several (correcting) ML models trained on (differences between) QM levels, and self-correction, where each next ML model corrects the prediction of the previous one.

.. include:: tutorial_geomopt_delta-learn.inc

Training
-----------

If you want to write your own ML model architecture and train it on data, you can add a ``train`` method to your model class. To utilize MLatom's features, you might want to make use of MLatom's :ref:`superclasses <udm_super>`, which help with, e.g., hyperparameter optimization. You would then also have to follow a minimum set of requirements for such a model. First of all, it should accept as keyword arguments a :class:`molecular_database ` (the training set) and ``property_to_learn=[some string]``. Models often require a validation set, which needs to be provided via the ``validation_molecular_database`` keyword. ``property_to_learn`` can be, e.g., ``'energy'`` or ``'y'``. Similarly, in :ref:`ML potentials `, we want to learn forces if they are available: for this, ``xyz_derivative_property_to_learn='energy_gradients'`` would be needed. Also, after training the model, you probably want to save it somewhere to reuse later. For this, you can specify ``model_file`` while initializing the model class. A code snippet putting the above considerations together:

.. code-block:: python

    import mlatom as ml

    class mymodel(ml.models.ml_model):
        def __init__(self, model_file=None):
            self.model_file = model_file

        def train(self,
                  molecular_database=None,
                  validation_molecular_database=None,
                  property_to_learn=None,
                  xyz_derivative_property_to_learn=None):
            # do the training on the molecular database and the requested properties
            ...
            # dump the model to disk as specified in self.model_file

Models usually have hyperparameters; they can be added to the model using the special class :class:`hyperparameters `, which consists of instances of :class:`hyperparameter `. For example:

.. code-block:: python

    import mlatom as ml

    class mymodel(ml.models.ml_model):
        def __init__(self, model_file=None):
            self.hyperparameters = ml.models.hyperparameters({
                'myhyperparam': ml.models.hyperparameter(value=2**-35,
                                                         minval=2**-35,
                                                         maxval=1.0,
                                                         optimization_space='log',
                                                         name='myhyperparam'),
                # ... add more hyperparameters as needed
            })

In the end, your model training may look like:

.. code-block:: python

    ...
    newmodel = mymodel(model_file='newmodel.npz')
    newmodel.train(molecular_database=trainDB,
                   validation_molecular_database=validateDB,
                   property_to_learn='energy',
                   xyz_derivative_property_to_learn='energy_gradients')

See the examples in the :ref:`tutorial for ML potentials ` for how to do, e.g., hyperparameter optimization.
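For orientation, such an optimization might look roughly like the sketch below. It assumes the ``optimize_hyperparameters`` method provided by the :class:`ml_model ` superclass (see the next section), together with hypothetical sub-training and validation databases ``subtrainDB`` and ``validateDB``; refer to the tutorial for ML potentials for the definitive usage:

.. code-block:: python

    # a hedged sketch: grid search over the hyperparameter defined above
    newmodel = mymodel(model_file='newmodel.npz')
    newmodel.optimize_hyperparameters(
        subtraining_molecular_database=subtrainDB,
        validation_molecular_database=validateDB,
        optimization_algorithm='grid',
        hyperparameters=['myhyperparam'],
        training_kwargs={'property_to_learn': 'energy'},
        prediction_kwargs={'property_to_predict': 'estimated_energy'})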
.. _udm_super:

Superclasses
------------

In practice, it might be useful for the model to perform other operations (e.g., setting the number of threads, optimizing hyperparameters, etc.). Many handy functions like this can be inherited from several common superclasses, the two most important ones being:

- :class:`mlatom.models.model ` -- helps to configure multiprocessing, special handling of predictions during geometry optimization (logging the optimization trajectory), etc. This class is strongly recommended as a superclass for your model, e.g., to enable all features in geometry optimizations.
- :class:`mlatom.models.ml_model ` -- use it if your model is an ML model which also needs to be trained. This superclass imparts your model with such useful features as hyperparameter optimization, calculation of the validation loss, etc. Note that its superclass is :class:`mlatom.models.model `, so you do not need to set both of them.

Hence, in practice, you might want to use one of the above superclasses when defining your model:

.. code-block:: python

    import mlatom as ml

    # for any model
    class mymodel(ml.models.model):
        ...

    # or for ML models that need to be trained
    class mymodel(ml.models.ml_model):
        ...
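To make this concrete, below is a self-contained toy example of a predictor built on :class:`mlatom.models.model `. The physics (a harmonic restraint of each atom to the origin) is purely illustrative, and it is assumed that ``molecule.xyz_coordinates`` holds the Cartesian coordinates in Angstrom as a NumPy-compatible array:

.. code-block:: python

    import numpy as np
    import mlatom as ml

    class harmonic_model(ml.models.model):
        # toy model: harmonic restraint of each atom to the origin (illustration only)
        def __init__(self, force_constant=0.1):
            self.force_constant = force_constant  # in Hartree/Angstrom^2

        def predict(self, molecule=None, molecular_database=None,
                    calculate_energy=True, calculate_energy_gradients=False):
            # collect all molecules to be processed
            molecules = []
            if molecule is not None:
                molecules.append(molecule)
            if molecular_database is not None:
                molecules.extend(molecular_database)
            for mol in molecules:
                xyz = np.asarray(mol.xyz_coordinates)  # Angstrom
                if calculate_energy:
                    mol.energy = 0.5 * self.force_constant * float((xyz**2).sum())  # Hartree
                if calculate_energy_gradients:
                    mol.energy_gradients = self.force_constant * xyz  # Hartree/Angstrom

    # mymol is a molecule object prepared as in the snippets above
    model = harmonic_model()
    model.predict(molecule=mymol, calculate_energy=True, calculate_energy_gradients=True)
    print(mymol.energy)

Because the class inherits from :class:`mlatom.models.model `, it can be passed to MLatom simulations (e.g., geometry optimizations) in the same way as the built-in models.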