Data

Converting XYZ coordinates to a molecular descriptor

Arguments

XYZ2X
XYZfile=[input file S with XYZ coordinates]
XfileOut=[output file S with X values]

Molecular descriptor arguments

If the user provides only an XYZ file with the XYZfile argument, the XYZ coordinates must first be converted into a molecular descriptor.

molDescriptor=[molecular descriptor]
- RE [default] (relative-to-equilibrium)
  
  CM (Coulomb matrix)
  
  ID (inverse internuclear distances)
- The RE descriptor is well suited for an accurate description of a single-molecule PES.
  
  CM is a popular (but somewhat outdated) descriptor which can in principle also be applied to different molecules. In MLatom, the full (vectorized) CM is used, not its eigenvalues as in the original publication.
  
  ID is a popular descriptor made of inverse internuclear distances. It is used in many ML models, is applicable to a single-molecule PES, and is similar to the RE descriptor.
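
The three descriptors above can be illustrated with a minimal sketch. It is not MLatom's code: the function names, the three-atom geometry, and the row-by-row flattening of the Coulomb matrix are assumptions made for illustration. The formulas are the commonly used definitions (1/r for ID, r_eq/r for RE, and Z_i Z_j / r_ij with 0.5 Z_i^2.4 on the diagonal for CM).

```python
import math
from itertools import combinations


def inverse_distances(coords):
    """ID-style vector: 1/r_ij for every unique atom pair i < j."""
    pairs = combinations(range(len(coords)), 2)
    return [1.0 / math.dist(coords[i], coords[j]) for i, j in pairs]


def relative_to_equilibrium(coords, eq_coords):
    """RE-style vector: r_ij(equilibrium) / r_ij for every pair i < j."""
    pairs = combinations(range(len(coords)), 2)
    return [math.dist(eq_coords[i], eq_coords[j]) / math.dist(coords[i], coords[j])
            for i, j in pairs]


def coulomb_matrix(charges, coords):
    """Full Coulomb matrix, flattened row by row (flattening order is assumed)."""
    n = len(charges)
    flat = []
    for i in range(n):
        for j in range(n):
            if i == j:
                flat.append(0.5 * charges[i] ** 2.4)
            else:
                flat.append(charges[i] * charges[j] / math.dist(coords[i], coords[j]))
    return flat


# Made-up three-atom geometry in arbitrary units, for illustration only.
charges = [8, 1, 1]
eq_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]

print(inverse_distances(coords))
print(relative_to_equilibrium(coords, eq_coords))
print(coulomb_matrix(charges, coords))
```

At the equilibrium geometry every RE element equals 1, which is why this descriptor is convenient for describing distortions of a single molecule.
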
molDescrType=[type of molecular descriptor]
- unsorted [default for RE]
  
  sorted [default for CM]
  
  permuted (optional, can be used for both RE and CM)
- unsorted descriptors are the original descriptors; they do not ensure permutational invariance with respect to homonuclear atoms.
  
  sorted descriptors ensure permutational invariance and are typically used for the CM descriptor (where the CM is sorted by its norms). For the RE descriptor, sorting is done by nuclear repulsions. Sorting can be used for structure-based sampling, but it introduces discontinuities in the interpolant and should not be used for simulations. Related options: XYZsortedFileOut, permInvGroups, permInvNuclei. See also related tutorial.
  
  permuted augments the descriptor with the permutations of user-defined atoms. Related arguments: permInvKernel, permInvGroups, permInvNuclei. See also related tutorial.
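
As a rough illustration of what sorting does for the CM descriptor, the sketch below reorders the rows and columns of a Coulomb matrix by decreasing row norm, so that two different atom orderings of the same geometry give the same matrix. The function name, the descending order, and the example matrix are assumptions for illustration; MLatom's own implementation may differ in detail.

```python
import math


def sort_coulomb_matrix(cm):
    """Reorder rows and columns of a square matrix by decreasing row norm."""
    n = len(cm)
    norms = [math.sqrt(sum(v * v for v in row)) for row in cm]
    order = sorted(range(n), key=lambda i: norms[i], reverse=True)
    return [[cm[i][j] for j in order] for i in order]


# Made-up symmetric matrix and the same matrix with atoms listed in another order.
cm = [[0.5, 8.0, 4.0],
      [8.0, 36.9, 2.0],
      [4.0, 2.0, 0.5]]
perm = [2, 0, 1]
cm_reordered = [[cm[i][j] for j in perm] for i in perm]

print(sort_coulomb_matrix(cm))
print(sort_coulomb_matrix(cm_reordered))  # same matrix as the line above
```

The reordering changes abruptly whenever two row norms cross, which is one way to see why sorted descriptors introduce discontinuities.
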
XYZsortedFileOut=[output file with sorted XYZ coordinates]
- optional.
  
  Only works with molDescriptor=RE molDescrType=sorted.
- saves a file with XYZ coordinates after sorting the chosen atoms by nuclear repulsions.
permInvNuclei=[permutationally invariant nuclei]
- optional.
  
  Should be used with molDescrType=permuted (and often with permInvKernel)
- E.g. permInvNuclei=2-3.5-6 will permute atoms 2,3 and 5,6. See also related tutorial.
permInvGroups=[permutationally invariant groups]
- optional.
  
  Should be used with molDescrType=permuted (and often with permInvKernel)
- E.g. for water dimer permInvGroups=1,2,3-4,5,6 generates permuted atom indices by flipping the monomers in a dimer.
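
The water dimer example can be sketched as follows. This is only an illustration of the idea of swapping whole groups: the helper names are hypothetical, the parsing of the argument string is an assumption based on the example above, and all groups are assumed to have the same size.

```python
from itertools import permutations


def parse_groups(spec):
    """'1,2,3-4,5,6' -> [[1, 2, 3], [4, 5, 6]] (assumed reading of the syntax)."""
    return [[int(atom) for atom in group.split(",")] for group in spec.split("-")]


def permuted_atom_orders(spec, n_atoms):
    """Return all atom orderings obtained by permuting equally sized groups."""
    groups = parse_groups(spec)
    slots = [atom for group in groups for atom in group]  # positions to fill
    orders = []
    for perm in permutations(groups):
        order = list(range(1, n_atoms + 1))  # atoms outside groups stay in place
        new_atoms = [atom for group in perm for atom in group]
        for slot, atom in zip(slots, new_atoms):
            order[slot - 1] = atom
        orders.append(order)
    return orders


print(permuted_atom_orders("1,2,3-4,5,6", 6))
```

For the water dimer this gives two orderings: the original one and the one with the two monomers exchanged.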

Example

MLatom.py XYZ2X XYZfile=CH3Cl.xyz XfileOut=CH3Cl.x

Analyzing data sets

MLatom can analyze data sets by comparing them, most commonly by calculating the errors of ML-predicted values with respect to available reference values. All files are input files; the MLatom output is a statistical analysis.

Arguments

analyze
Yfile=[input file with values]
YgradXYZfile=[input file with gradients in XYZ coordinates]
YestFile=[input file with estimated Y values]
YgradXYZestFile=[input file with estimated XYZ gradients]

Example

MLatom.py analyze Yfile=en.dat YestFile=enest.dat
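
To make the kind of comparison concrete, here is a minimal sketch of common error measures between reference and estimated values. It is not MLatom's analysis routine and does not reproduce its output: the function name, the chosen statistics, and the sample numbers are assumptions for illustration.

```python
import math


def error_statistics(reference, estimated):
    """Basic error measures of estimated values with respect to reference values."""
    if len(reference) != len(estimated):
        raise ValueError("both data sets must have the same number of entries")
    errors = [est - ref for ref, est in zip(reference, estimated)]
    n = len(errors)
    return {
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "mean_signed_error": sum(errors) / n,
        "root_mean_squared_error": math.sqrt(sum(e * e for e in errors) / n),
        "largest_absolute_error": max(abs(e) for e in errors),
    }


# Made-up values standing in for the contents of Yfile and YestFile.
y_ref = [1.0, 2.0, 3.0]
y_est = [1.5, 2.0, 2.5]
print(error_statistics(y_ref, y_est))
```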

Sampling and splitting

Arguments for sampling and splitting

sample
- it requires at least one of iTrainOut, CVtest, LOOtest, CVopt, LOOopt
- see also tutorial.
XYZfile=[file with XYZ coordinates] or XfileIn=[file with input vectors X]
- required.
iTrainOut=[file with indices of training points]
- no default file names.
- generates indices for the training set.
iTestOut=[file with indices of test points]
- no default file names.
- generates indices for the test set.
iSubtrainOut=[file with indices of sub-training points]
- no default file names.
- generates indices for the sub-training set.
iValidateOut=[file with indices of validation points]
- no default file names.
- generates indices for the validation set.
CVtest
- optional.
  
  Related option NcvTestFolds.
- generates indices for splits in N-fold cross-validation. By default, 5-fold cross-validation is used.
NcvTestFolds=[number of CV folds]
- 5 [default].
  
  Can be used only with CVtest.
- if this number is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation.
LOOtest
- optional.
- leave-one-out cross-validation. Only random or no sampling can be used.
iCVtestPrefOut=[prefix of files with indices for CVtest]
- no default prefixes.
- file names will include the required prefix.
CVopt
- optional.
  
  Related option NcvOptFolds.
- generates indices for N-fold cross-validation for hyperparameter optimization. By default, 5-fold cross-validation is used.
NcvOptFolds=[number of CV folds]
- 5 [default].
  
  Can be used only with CVopt.
- If this number is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation.
LOOopt
- optional.
- Leave-one-out cross-validation. Only random or no sampling can be used.
iCVoptPrefOut=[prefix of files with indices for CVopt]
- no default prefixes.
- file names will include the required prefix.
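
The splitting behind N-fold cross-validation with random sampling can be sketched as below. This is a generic illustration rather than MLatom's procedure: the function name, the 1-based indices, and the way leftover points are distributed over folds are assumptions.

```python
import random


def cv_fold_indices(n_points, n_folds=5, seed=None):
    """Randomly split 1-based point indices into n_folds nearly equal folds."""
    if not 1 <= n_folds <= n_points:
        raise ValueError("n_folds must be between 1 and the number of points")
    indices = list(range(1, n_points + 1))
    random.Random(seed).shuffle(indices)
    # Taking every n_folds-th index keeps fold sizes within one point of each other.
    return [indices[k::n_folds] for k in range(n_folds)]


folds = cv_fold_indices(23, n_folds=5, seed=0)
for number, fold in enumerate(folds, start=1):
    print(number, sorted(fold))

# With as many folds as points, every fold holds one point (leave-one-out).
loo_folds = cv_fold_indices(7, n_folds=7, seed=0)
```

Each fold serves once as the held-out set while the remaining folds are used for training.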

Additional optional arguments for sampling

These arguments are used together with the sample argument.

sampling=[type of data set sampling into splits]
- random [default] random sampling
- none simply splits the unshuffled data set into the training and test sets, in this order (and the training set into the sub-training and validation sets)
- structure-based structure-based sampling
- farthest-point farthest-point traversal iterative procedure
Nuse=[N first entries of the data set file to be used]

100% [default]

optional.
Ntrain=[number of the training points or a fraction of the total set]

80% of the total set by default.

If a parameter is a decimal number less than 1, then it is considered to be a fraction of the total set.
Ntest=[number of the test points or a fraction of the total set]

By default, the remaining points of the total set after subtracting the training points.

If a parameter is a decimal number less than 1, then it is considered to be a fraction of the total set.
Nsubtrain=[number of the sub-training points or a fraction of the training points]

80% of the training set by default.

If a parameter is a decimal number less than 1, then it is considered to be a fraction of the training set.
Nvalidate=[number of the validation points or a fraction of the training points]

By default, the remaining points of the training set after subtracting the sub-training points.

If a parameter is a decimal number less than 1, then it is considered to be a fraction of the training set.
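
The defaults and the fraction rule described above can be summarized in a short sketch. The function names and the rounding of fractional sizes to the nearest integer are assumptions for illustration; how MLatom rounds is not stated here.

```python
def resolve_count(value, reference_size):
    """Interpret a value below 1 as a fraction of reference_size, else as a count."""
    if 0 < value < 1:
        return round(value * reference_size)  # rounding rule is an assumption
    return int(value)


def split_sizes(n_total, ntrain=0.8, ntest=None, nsubtrain=0.8, nvalidate=None):
    """Sizes of the training, test, sub-training and validation sets."""
    n_train = resolve_count(ntrain, n_total)
    n_test = n_total - n_train if ntest is None else resolve_count(ntest, n_total)
    n_subtrain = resolve_count(nsubtrain, n_train)
    if nvalidate is None:
        n_validate = n_train - n_subtrain
    else:
        n_validate = resolve_count(nvalidate, n_train)
    return {"train": n_train, "test": n_test,
            "subtrain": n_subtrain, "validate": n_validate}


print(split_sizes(1000))
print(split_sizes(11000, ntrain=1000, ntest=10000))
```

Note that Ntrain and Ntest refer to the total set, while Nsubtrain and Nvalidate refer to the training set.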

Example

Structure-based sampling:

mlatom sample sampling=structure-based XYZfile=CH3Cl.xyz Ntrain=1000 Ntest=10000 iTrainOut=itrain.dat iTestOut=itest.dat
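
For orientation, the farthest-point option listed among the sampling types follows the general idea of farthest-point traversal, sketched below on plain descriptor vectors. This is a generic textbook version, not MLatom's implementation and not its structure-based sampling: the function name, the choice of the starting point, and the 0-based indices are assumptions.

```python
import math


def farthest_point_sampling(points, n_select, start=0):
    """Greedily pick points that are farthest from everything picked so far."""
    if not 1 <= n_select <= len(points):
        raise ValueError("n_select must be between 1 and the number of points")
    selected = [start]
    # Distance from every point to its nearest already selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_select:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Made-up one-dimensional descriptors.
x = [(0.0,), (1.0,), (2.0,), (10.0,), (5.0,)]
print(farthest_point_sampling(x, 3))
```

Starting from the first point, the sketch selects the most distant point next and then the point that lies farthest from both.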

Slicing

Sometimes it is useful to slice data by the Euclidean distance of their descriptors to the equilibrium descriptor. See tutorial.

Arguments for slicing:

`slice`	required.
`XfileIn=[file with input vectors X]`	required.
`eqXfileIn=[file with input vector for equilibrium geometry]`	required.
`Nslices=[number of slices]`	`3` [default]. optional.
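
The idea of slicing can be sketched as follows: compute the Euclidean distance of every descriptor vector to the equilibrium descriptor, order the points by that distance, and cut the ordered list into slices. The function name, the 1-based indices, and the choice of slices with nearly equal numbers of points are assumptions for illustration; MLatom may define slice boundaries differently.

```python
import math


def slice_by_distance(x_vectors, eq_vector, n_slices=3):
    """Group 1-based point indices into slices of increasing distance to eq_vector."""
    order = sorted(range(len(x_vectors)),
                   key=lambda i: math.dist(x_vectors[i], eq_vector))
    n = len(order)
    # Slice boundaries chosen so that slices hold nearly equal numbers of points.
    bounds = [round(k * n / n_slices) for k in range(n_slices + 1)]
    return [[i + 1 for i in order[bounds[k]:bounds[k + 1]]]
            for k in range(n_slices)]


# Made-up one-dimensional descriptors and equilibrium descriptor.
x_vectors = [(3.0,), (0.1,), (1.0,), (2.0,), (0.5,), (5.0,)]
eq_vector = (0.0,)
print(slice_by_distance(x_vectors, eq_vector, n_slices=3))
```

The first slice then contains the points closest to the equilibrium geometry and the last slice the most distorted ones.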

Arguments for sampling from slices:

`sampleFromSlices`
`Ntrain=[total integer number N of training points from all slices]`	required.
`Nslices=[number of slices]`	`3` [default]. optional.

Arguments for merging indices from slices:

`mergeSlices`
`Ntrain=[total integer number N of training points from all slices]`	required.
`Nslices=[number of slices]`	`3` [default]. optional.

Examples

See tutorial.

MLatom.py slice Nslices=3 XfileIn=x_sorted.dat eqXfileIn=eq.x

mlatom sampleFromSlices Nslices=3 sampling=structure-based Ntrain=4480

mlatom mergeSlices Nslices=3 Ntrain=4480