.. _tutorial_data:

Data
=====

Here we show how to use MLatom to deal with data. There are three major data types in MLatom: :ref:`molecules <tutorial_data_molecule>`, :ref:`databases <tutorial_db>` of molecules, and MD :ref:`trajectories <tutorial_data_trajs>`, which are shown below (figure from `MLatom 3 paper <https://doi.org/10.1021/acs.jctc.3c01203>`__ under CC-BY license):

.. image:: _static/mlatom3paper/image3.png
    :width: 600
    :align: center
    :alt: Overview of different data types in MLatom.

Before running examples below in Python/Jupyter notebook, you need to add and run the line ``import mlatom as ml`` as usual. We also provide a short video introduction about data objects used in MLatom:

.. raw:: html

    <iframe src="https://player.bilibili.com/player.html?isOutside=true&bvid=BV189eEeEE9R&p=1&autoplay=0" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true" height=500 width=800></iframe>

.. _tutorial_data_molecule:

Molecules
---------

Molecule is a central concept in MLatom as most of operations are usually done on :class:`molecule <mlatom.data.molecule>` class objects. This objects store the information about the constituents :class:`atom <mlatom.data.atom>` objects, their coordinates, and any property we want to learn or calculate. Molecules can be loaded and dumped in various formats.

Creating and loading molecule
+++++++++++++++++++++++++++++

A molecule object can be created from scratch. A very explicit way (**not used in practice!**) relecting how the molecule is actually constructed is the following:

.. code-block:: python

    mol = ml.molecule()
    atom1 = ml.data.atom(atomic_number=1, xyz_coordinates=[0.0, 0.0, 0.0])
    mol.atoms.append(atom1)
    ...

In practice, you can create the molecule using many useful functions. E.g., you can load it from SMILES string and the geometry will be generated and optimized with `Pybel's <https://open-babel.readthedocs.io/en/latest/UseTheLibrary/Python_Pybel.html>`_ ``make3D()`` method:

.. code-block:: python

    mol = ml.molecule()
    mol.read_from_smiles_string('CCCCCCC')
    print(mol.xyz_coordinates)
    ...

In above example, you can see how to access the XYZ coordinates stored in a molecule object.

Very common way is to load the geometry from XYZ coordinates. There are several ways to do it.
You can either provide XYZ coordinates as a string, read them from file or from numpy array:

.. code-block:: python

    mol = ml.molecule()
    mol.read_from_xyz_string('''2

    H 0 0 0
    H 0 0 0.74
    ''')
    # or
    mol.read_from_xyz_file('h2.xyz')
    # or
    import numpy as np
    xyz = np.array([[0, 0, 0], [0, 0, 0.7]])
    mol.read_from_numpy(coordinates=xyz, species=np.array([1, 1]))
    ...

A small nuance is that if you remove ``read_`` from above methods, you can directly get the molecule object, i.e., compare:

.. code-block:: python

    mol = ml.molecule()
    mol.read_from_xyz_file('h2.xyz')
    # or
    mol = ml.molecule.from_xyz_file('h2.xyz')

    # also works:
    mol = ml.molecule.from_smiles_string('CCCCCCC')
    mol = ml.molecule.from_numpy(coordinates=xyz, species=np.array([1, 1]))

Also, you can read XYZ files in different formats. The default one is:

- first line: number of atoms
- second line: blank or comments
- following lines: element symbol or atomic number followed by x, y, z coordinates in Angstrom

MLatom also supports reading XYZ files in Columbus, Newton-X, and Turbomole formats, just add ``format``, e.g., as ``mol = ml.molecule.from_xyz_file('h2.xyz', format='Newton-X')``.

Dumping and loading
+++++++++++++++++++

Ultimately, molecule is a data format. Hence, it is convenient to dump and load it as such, e.g., in json format:

.. code-block:: python

    mol.dump(filename='h2.json')
    mol2.load(filename='h2.json')
    print(mol, mol2)

You can also save the coordinates in text file:

.. code-block:: python

    mol.write_file_with_xyz_coordinates('h2.xyz')

or load the molecule from Gaussian output file and dump it later:

.. code-block:: python

    mol = ml.molecule.load('gaussian_output_file.log', format='gaussian')
    mol.dump('parsed_molecule.json', format='json')

Properties
++++++++++

You can access, modify, and add different properties.

Some of the most common examples are charge and multiplicity:

.. code-block:: python

    mol.charge = 1
    mol.multiplicity = 2

We have already saw how to access xyz coordinates. In the same way, we can access other properties (if they are availble, of course...), e.g.:

.. code-block:: python

    # examples of accessing properties
    xyz = mol.xyz_coordinates
    print(mol.energy)
    print(mol.energy_gradients)
    mol.hessian
    # modifying properties
    mol.energy = 3.00
    mol.energy_gradients = np.array(...)

A bit more complicated are properties corresponding to different electronic states (see a dedicated tutorial on :ref:`excited-state data format <tutorial_es_data>`). They are stored in ``molecule.electronic_states`` which is a list of :class:`molecule <mlatom.data.molecule>` objects themselves, corresponding to each state (note that by default this list is empty -- it is only filled either after the excited-state calculations or by the user). This allows to access a property of any state the same way as accessing a property of a 'normal' molecule object. For example:

.. code-block:: python

    # let's get S1 state
    S1 = mol.electronic_states[1]
    print(f'Energy of S1 state is {S1.energy}')
    print(f'Its forces are')
    print(-S1.energy_gradients)
    print(f'checking multiplicity: {S1.multiplicity}')

Other useful methods
++++++++++++++++++++

Above overview is just to explain the key features when dealing with molcules. There is much more to it, which you can check out in the :class:`manual <mlatom.data.molecule>`:

- getting xyz string for printing it
- calculating and rescaling kinetic energy
- getting internuclear distances (for a pair and all pairs of atoms)
- copying molecules with or without indicated properties
- getting smiles (``molecule.smiles``)
- rotating, translating, scaling, and aligning
- printing detailed info (with ``molecule.info()``)
- getting numpy array with electronic state energies, gradients, energy gaps, and excitation energies
- dealing with generic vectorial properties (e.g., dipole moments)
- and some others more technical methods.

.. _tutorial_db:

Molecular databases
-------------------

We often have to deal with a collection of molecular objects, which are stored in the
:class:`molecular_database <mlatom.data.molecular_database>` class instances. The molecules are stored in the list ``molecular_database.molecules``. 

Molecular databases are in many respects extension of the molecule class, e.g., they can be loaded and dumped in a similar manner:

.. code-block:: python

    # prepare H2 geometries with bond lengths ranging from 0.5 to 5.0 Å
    xyz = np.zeros((451, 2, 3))
    xyz[:, 1, 2] = np.arange(0.5, 5.01, 0.01)
    z = np.ones((451, 2)).astype(int)
    molDB = ml.molecular_database.from_numpy(coordinates=xyz, species=z)

    molDB.dump(filename='h2.json')
    molDB2 = ml.molecular_database.load(filename='h2.json')

The useful feature of molecular databases is that they can be manipulated as lists/numpy arrays:

.. code-block:: python

    print(len(molDB))           # 451
    print(len(molDB[:100:4]))   # 25

A related feature is that a database can be split into several other databases (useful, e.g., for splitting into training and test sets):

.. code-block:: python

    # let's change the splitting to 9:1 instead of default 8:2
    train, test = molDB.split(fraction_of_points_in_splits=[0.9, 0.1])

Molecular databases also provide an easy acccess to the properties of all molecules, e.g., if we want to get the list of some property and save them in a text file, we can use:

.. code-block:: 

    molDB.get_properties(property_name='energy')
    molDB.write_file_with_properties('h2.energies', property_name='energy')

There are also options to dump gradients and other vectorial properties as well as other useful technical features, see the :class:`manual <mlatom.data.molecular_database>`.

.. _tutorial_data_trajs:

MD trajectories
-------------------

MD trajectories are a common data type and are handled with the :class:`molecular_trajectory <mlatom.data.molecular_trajectory>` class in MLatom. They consist of :class:`molecular_trajectory_step <mlatom.data.molecular_trajectory_step>` objects which contain information about molecule and other information such as step number, time, energies, etc.:

.. code-block:: python

    traj = ml.data.molecular_trajectory()
    ...
    print(f'Info for MD trajectory step {traj.steps[5].step}'')
    print(f'Molecule: {traj.steps[5].molecule}')
    print(f'Time: {traj.steps[5].time}')
    print(f'Kinetic energy: {traj.steps[5].molecule.kinetic_energy}')

The trajectories can be dumped and loaded as different formats: json (easier to open and browse, but heavy and slow, *dumping overwrites the file*), h5md (faster, smaller, **dumping appends, not overwrites**), plain text (xyz coordinates, velocities, energies, temperature, etc., dumping overwrites):

.. code-block:: python

    traj.dump(filename='mytraj.h5md', format='h5md')
    traj2.load(filename='mytraj.h5md', format='h5md')

To unlock many of the features available in molecular databases, it is often convenient to convert the trajectory into :class:`molecular_database <mlatom.data.molecular_database>`:

.. code-block:: python

    moldb = traj.to_database()

Visualizing data objects
------------------------

You can use the common tools for visualizing XYZ files, etc.
Additionally, MLatom provides several handy routines for quick checks of data objects in Jupyter notebooks (currently, only on the `XACS cloud <https://XACScloud.com>`__):

.. code-block:: python

    my_molecule.view()
    my_molecular_database.view()
    my_trajectory.view()

and also for vibrations, see below.

.. include:: tutorial_vibrview.inc