.. _tutorial_data: Data ===== Here we show how to use MLatom to deal with data. There are three major data types in MLatom: :ref:`molecules `, :ref:`databases ` of molecules, and MD :ref:`trajectories `, which are shown below (figure from `MLatom 3 paper `__ under CC-BY license): .. image:: _static/mlatom3paper/image3.png :width: 600 :align: center :alt: Overview of different data types in MLatom. Before running examples below in Python/Jupyter notebook, you need to add and run the line ``import mlatom as ml`` as usual. .. _tutorial_data_molecule: Molecules --------- Molecule is a central concept in MLatom as most of operations are usually done on :class:`molecule ` class objects. This objects store the information about the constituents :class:`atom ` objects, their coordinates, and any property we want to learn or calculate. Molecules can be loaded and dumped in various formats. Creating and loading molecule +++++++++++++++++++++++++++++ A molecule object can be created from scratch. A very explicit way (**not used in practice!**) relecting how the molecule is actually constructed is the following: .. code-block:: python mol = ml.molecule() atom1 = ml.data.atom(atomic_number=1, xyz_coordinates=[0.0, 0.0, 0.0]) mol.atoms.append(atom1) ... In practice, you can create the molecule using many useful functions. E.g., you can load it from SMILES string and the geometry will be generated and optimized with `Pybel's `_ ``make3D()`` method: .. code-block:: python mol = ml.molecule() mol.read_from_smiles_string('CCCCCCC') print(mol.xyz_coordinates) ... In above example, you can see how to access the XYZ coordinates stored in a molecule object. Very common way is to load the geometry from XYZ coordinates. There are several ways to do it. You can either provide XYZ coordinates as a string, read them from file or from numpy array: .. code-block:: python mol = ml.molecule() mol.read_from_xyz_string('''2 H 0 0 0 H 0 0 0.74 ''') # or mol.read_from_xyz_file('h2.xyz') # or import numpy as np xyz = np.array([[0, 0, 0], [0, 0, 0.7]]) mol.read_from_numpy(coordinates=xyz, species=np.array([1, 1])) ... A small nuance is that if you remove ``read_`` from above methods, you can directly get the molecule object, i.e., compare: .. code-block:: python mol = ml.molecule() mol.read_from_xyz_file('h2.xyz') # or mol = ml.molecule.from_xyz_file('h2.xyz') # also works: mol = ml.molecule.from_smiles_string('CCCCCCC') mol = ml.molecule.from_numpy(coordinates=xyz, species=np.array([1, 1])) Also, you can read XYZ files in different formats. The default one is: - first line: number of atoms - second line: blank or comments - following lines: element symbol or atomic number followed by x, y, z coordinates in Angstrom MLatom also supports reading XYZ files in Columbus, Newton-X, and Turbomole formats, just add ``format``, e.g., as ``mol = ml.molecule.from_xyz_file('h2.xyz', format='Newton-X')``. Dumping and loading +++++++++++++++++++ Ultimately, molecule is a data format. Hence, it is convenient to dump and load it as such, e.g., in json format: .. code-block:: python mol.dump(filename='h2.json') mol2.load(filename='h2.json') print(mol, mol2) You can also save the coordinates in text file: .. code-block:: python mol.write_file_with_xyz_coordinates('h2.xyz') or load the molecule from Gaussian output file and dump it later: .. code-block:: python mol = ml.molecule.load('gaussian_output_file.log', format='gaussian') mol.dump('parsed_molecule.json', format='json') Properties ++++++++++ You can access, modify, and add different properties. Some of the most common examples are charge and multiplicity: .. code-block:: python mol.charge = 1 mol.multiplicity = 2 We have already saw how to access xyz coordinates. In the same way, we can access other properties (if they are availble, of course...), e.g.: .. code-block:: python # examples of accessing properties xyz = mol.xyz_coordinates print(mol.energy) print(mol.energy_gradients) mol.hessian # modifying properties mol.energy = 3.00 mol.energy_gradients = np.array(...) A bit more complicated are properties corresponding to different electronic states (see a dedicated tutorial on :ref:`excited-state data format `). They are stored in ``molecule.electronic_states`` which is a list of :class:`molecule ` objects themselves, corresponding to each state (note that by default this list is empty -- it is only filled either after the excited-state calculations or by the user). This allows to access a property of any state the same way as accessing a property of a 'normal' molecule object. For example: .. code-block:: python # let's get S1 state S1 = mol.electronic_states[1] print(f'Energy of S1 state is {S1.energy}') print(f'Its forces are') print(-S1.energy_gradients) print(f'checking multiplicity: {S1.multiplicity}') Other useful methods ++++++++++++++++++++ Above overview is just to explain the key features when dealing with molcules. There is much more to it, which you can check out in the :class:`manual `: - getting xyz string for printing it - calculating and rescaling kinetic energy - getting internuclear distances (for a pair and all pairs of atoms) - copying molecules with or without indicated properties - getting smiles (``molecule.smiles``) - rotating, translating, scaling, and aligning - printing detailed info (with ``molecule.info()``) - getting numpy array with electronic state energies, gradients, energy gaps, and excitation energies - dealing with generic vectorial properties (e.g., dipole moments) - and some others more technical methods. .. _tutorial_db: Molecular databases ------------------- We often have to deal with a collection of molecular objects, which are stored in the :class:`molecular_database ` class instances. The molecules are stored in the list ``molecular_database.molecules``. Molecular databases are in many respects extension of the molecule class, e.g., they can be loaded and dumped in a similar manner: .. code-block:: python # prepare H2 geometries with bond lengths ranging from 0.5 to 5.0 Å xyz = np.zeros((451, 2, 3)) xyz[:, 1, 2] = np.arange(0.5, 5.01, 0.01) z = np.ones((451, 2)).astype(int) molDB = ml.molecular_database.from_numpy(coordinates=xyz, species=z) molDB.dump(filename='h2.json') molDB2 = ml.molecular_database.load(filename='h2.json') The useful feature of molecular databases is that they can be manipulated as lists/numpy arrays: .. code-block:: python print(len(molDB)) # 451 print(len(molDB[:100:4])) # 25 A related feature is that a database can be split into several other databases (useful, e.g., for splitting into training and test sets): .. code-block:: python # let's change the splitting to 9:1 instead of default 8:2 train, test = molDB.split(fraction_of_points_in_splits=[0.9, 0.1]) Molecular databases also provide an easy acccess to the properties of all molecules, e.g., if we want to get the list of some property and save them in a text file, we can use: .. code-block:: molDB.get_properties(property_name='energy') molDB.write_file_with_properties('h2.energies', property_name='energy') There are also options to dump gradients and other vectorial properties as well as other useful technical features, see the :class:`manual `. .. _tutorial_data_trajs: MD trajectories ------------------- MD trajectories are a common data type and are handled with the :class:`molecular_trajectory ` class in MLatom. They consist of :class:`molecular_trajectory_step ` objects which contain information about molecule and other information such as step number, time, energies, etc.: .. code-block:: python traj = ml.data.molecular_trajectory() ... print(f'Info for MD trajectory step {traj.steps[5].step}'') print(f'Molecule: {traj.steps[5].molecule}') print(f'Time: {traj.steps[5].time}') print(f'Kinetic energy: {traj.steps[5].molecule.kinetic_energy}') The trajectories can be dumped and loaded as different formats: json (easier to open and browse, but heavy and slow, *dumping overwrites the file*), h5md (faster, smaller, **dumping appends, not overwrites**), plain text (xyz coordinates, velocities, energies, temperature, etc., dumping overwrites): .. code-block:: python traj.dump(filename='mytraj.h5md', format='h5md') traj2.load(filename='mytraj.h5md', format='h5md') To unlock many of the features available in molecular databases, it is often convenient to convert the trajectory into :class:`molecular_database `: .. code-block:: python moldb = traj.to_database() Visualizing data objects ------------------------ You can use the common tools for visualizing XYZ files, etc. Additionally, MLatom provides several handy routines for quick checks of data objects in Jupyter notebooks (currently, only on the `XACS cloud `__): .. code-block:: python my_molecule.view() my_molecular_database.view() my_trajectory.view() and also for vibrations, see below. .. include:: tutorial_vibrview.inc