Data

Here we show how to use MLatom to deal with data. There are three major data types in MLatom: molecules, databases of molecules, and MD trajectories, which are shown below (figure from MLatom 3 paper under CC-BY license):

Overview of different data types in MLatom.

Before running examples below in Python/Jupyter notebook, you need to add and run the line import mlatom as ml as usual.

Molecules

Molecule is a central concept in MLatom as most of operations are usually done on molecule class objects. This objects store the information about the constituents atom objects, their coordinates, and any property we want to learn or calculate. Molecules can be loaded and dumped in various formats.

Creating and loading molecule

A molecule object can be created from scratch. A very explicit way (not used in practice!) relecting how the molecule is actually constructed is the following:

mol = ml.molecule()
atom1 = ml.data.atom(atomic_number=1, xyz_coordinates=[0.0, 0.0, 0.0])
mol.atoms.append(atom1)
...

In practice, you can create the molecule using many useful functions. E.g., you can load it from SMILES string and the geometry will be generated and optimized with Pybel’s make3D() method:

mol = ml.molecule()
mol.read_from_smiles_string('CCCCCCC')
print(mol.xyz_coordinates)
...

In above example, you can see how to access the XYZ coordinates stored in a molecule object.

Very common way is to load the geometry from XYZ coordinates. There are several ways to do it. You can either provide XYZ coordinates as a string, read them from file or from numpy array:

mol = ml.molecule()
mol.read_from_xyz_string('''2

H 0 0 0
H 0 0 0.74
''')
# or
mol.read_from_xyz_file('h2.xyz')
# or
import numpy as np
xyz = np.array([[0, 0, 0], [0, 0, 0.7]])
mol.read_from_numpy(coordinates=xyz, species=np.array([1, 1]))
...

A small nuance is that if you remove read_ from above methods, you can directly get the molecule object, i.e., compare:

mol = ml.molecule()
mol.read_from_xyz_file('h2.xyz')
# or
mol = ml.molecule.from_xyz_file('h2.xyz')

# also works:
mol = ml.molecule.from_smiles_string('CCCCCCC')
mol = ml.molecule.from_numpy(coordinates=xyz, species=np.array([1, 1]))

Also, you can read XYZ files in different formats. The default one is:

  • first line: number of atoms

  • second line: blank or comments

  • following lines: element symbol or atomic number followed by x, y, z coordinates in Angstrom

MLatom also supports reading XYZ files in Columbus, Newton-X, and Turbomole formats, just add format, e.g., as mol = ml.molecule.from_xyz_file('h2.xyz', format='Newton-X').

Dumping and loading

Ultimately, molecule is a data format. Hence, it is convenient to dump and load it as such, e.g., in json format:

mol.dump(filename='h2.json')
mol2.load(filename='h2.json')
print(mol, mol2)

You can also save the coordinates in text file:

mol.write_file_with_xyz_coordinates('h2.xyz')

or load the molecule from Gaussian output file and dump it later:

mol = ml.molecule.load('gaussian_output_file.log', format='gaussian')
mol.dump('parsed_molecule.json', format='json')

Properties

You can access, modify, and add different properties.

Some of the most common examples are charge and multiplicity:

mol.charge = 1
mol.multiplicity = 2

We have already saw how to access xyz coordinates. In the same way, we can access other properties (if they are availble, of course…), e.g.:

# examples of accessing properties
xyz = mol.xyz_coordinates
print(mol.energy)
print(mol.energy_gradients)
mol.hessian
# modifying properties
mol.energy = 3.00
mol.energy_gradients = np.array(...)

A bit more complicated are properties corresponding to different electronic states (see a dedicated tutorial on excited-state data format). They are stored in molecule.electronic_states which is a list of molecule objects themselves, corresponding to each state (note that by default this list is empty – it is only filled either after the excited-state calculations or by the user). This allows to access a property of any state the same way as accessing a property of a ‘normal’ molecule object. For example:

# let's get S1 state
S1 = mol.electronic_states[1]
print(f'Energy of S1 state is {S1.energy}')
print(f'Its forces are')
print(-S1.energy_gradients)
print(f'checking multiplicity: {S1.multiplicity}')

Other useful methods

Above overview is just to explain the key features when dealing with molcules. There is much more to it, which you can check out in the manual:

  • getting xyz string for printing it

  • calculating and rescaling kinetic energy

  • getting internuclear distances (for a pair and all pairs of atoms)

  • copying molecules with or without indicated properties

  • getting smiles (molecule.smiles)

  • rotating, translating, scaling, and aligning

  • printing detailed info (with molecule.info())

  • getting numpy array with electronic state energies, gradients, energy gaps, and excitation energies

  • dealing with generic vectorial properties (e.g., dipole moments)

  • and some others more technical methods.

Molecular databases

We often have to deal with a collection of molecular objects, which are stored in the molecular_database class instances. The molecules are stored in the list molecular_database.molecules.

Molecular databases are in many respects extension of the molecule class, e.g., they can be loaded and dumped in a similar manner:

# prepare H2 geometries with bond lengths ranging from 0.5 to 5.0 Å
xyz = np.zeros((451, 2, 3))
xyz[:, 1, 2] = np.arange(0.5, 5.01, 0.01)
z = np.ones((451, 2)).astype(int)
molDB = ml.molecular_database.from_numpy(coordinates=xyz, species=z)

molDB.dump(filename='h2.json')
molDB2 = ml.molecular_database.load(filename='h2.json')

The useful feature of molecular databases is that they can be manipulated as lists/numpy arrays:

print(len(molDB))           # 451
print(len(molDB[:100:4]))   # 25

A related feature is that a database can be split into several other databases (useful, e.g., for splitting into training and test sets):

# let's change the splitting to 9:1 instead of default 8:2
train, test = molDB.split(fraction_of_points_in_splits=[0.9, 0.1])

Molecular databases also provide an easy acccess to the properties of all molecules, e.g., if we want to get the list of some property and save them in a text file, we can use:

molDB.get_properties(property_name='energy')
molDB.write_file_with_properties('h2.energies', property_name='energy')

There are also options to dump gradients and other vectorial properties as well as other useful technical features, see the manual.

MD trajectories

MD trajectories are a common data type and are handled with the molecular_trajectory class in MLatom. They consist of molecular_trajectory_step objects which contain information about molecule and other information such as step number, time, energies, etc.:

traj = ml.data.molecular_trajectory()
...
print(f'Info for MD trajectory step {traj.steps[5].step}'')
print(f'Molecule: {traj.steps[5].molecule}')
print(f'Time: {traj.steps[5].time}')
print(f'Kinetic energy: {traj.steps[5].molecule.kinetic_energy}')

The trajectories can be dumped and loaded as different formats: json (easier to open and browse, but heavy and slow, dumping overwrites the file), h5md (faster, smaller, dumping appends, not overwrites), plain text (xyz coordinates, velocities, energies, temperature, etc., dumping overwrites):

traj.dump(filename='mytraj.h5md', format='h5md')
traj2.load(filename='mytraj.h5md', format='h5md')

To unlock many of the features available in molecular databases, it is often convenient to convert the trajectory into molecular_database:

moldb = traj.to_database()

Visualizing data objects

You can use the common tools for visualizing XYZ files, etc. Additionally, MLatom provides several handy routines for quick checks of data objects in Jupyter notebooks (currently, only on the XACS cloud):

my_molecule.view()
my_molecular_database.view()
my_trajectory.view()

and also for vibrations, see below.

Visualizing vibrations

There are several ways to visualize the vibrations:

  • Currently only on the XACS cloud computing:

    • At the end of the calculations via input file/command line using the options freq, ir, or raman, MLatom will dump the freq_gaussian{mol_index}.log file, which uses Gaussian-style output format for frequencies. You can open it in ChemCraft.

    • You can also dump text output file using Gaussian-style format for frequencies from the Python API: mymolecule.dump(filename='gauss.log', format='gaussian').

  • NEW: Now infrared spectrum and vibrations can be visualized at the same time. We offer a slider to choose which normal model to show and at the same time highlight the corresponding IR band in the spectrum. You can download the Jupyter notebook for an example of calc.json and exp.txt files:

  • You can directly view the vibrations in Jupyter notebook by using mymolecule.view(normal_mode=...), as shown below for an example molvibr.json file:

freq_view