Data
Converting XYZ coordinates to molecular descriptor
Arguments
XYZ2X
XYZfile=[input file S with XYZ coordinates]
XfileOut=[output file S with X values]
Molecular descriptor arguments
If the user provides only an XYZ file via the XYZfile argument, the XYZ coordinates must first be converted into a molecular descriptor.
molDescriptor=[molecular descriptor]
RE [default] (relative-to-equilibrium)
CM (Coulomb matrix)
ID (inverse internuclear distances)
The RE descriptor is well suited for an accurate description of a single-molecule PES.
The CM descriptor is popular (but somewhat outdated) and can in principle also be applied to different molecules. MLatom uses the full (vectorized) CM, not its eigenvalues as in the original publication.
The ID descriptor is a popular inverse internuclear distances descriptor used in many ML models; it is applicable to a single-molecule PES and is similar to the RE descriptor.
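The three descriptors can be pictured with a minimal Python sketch. It is not MLatom's implementation: the function names are hypothetical, the RE descriptor is assumed to be the ratio of equilibrium to current internuclear distances, the CM uses the commonly quoted definition (0.5 Z^2.4 on the diagonal, Z_i Z_j / r_ij off the diagonal), and the row-by-row flattening order is an assumption.

```python
import math
from itertools import combinations


def id_descriptor(xyz):
    """Inverse internuclear distances 1/r_ij for all atom pairs i < j."""
    pairs = combinations(range(len(xyz)), 2)
    return [1.0 / math.dist(xyz[i], xyz[j]) for i, j in pairs]


def re_descriptor(xyz, xyz_eq):
    """Assumed RE form: r_ij(equilibrium) / r_ij for all atom pairs i < j."""
    pairs = combinations(range(len(xyz)), 2)
    return [
        math.dist(xyz_eq[i], xyz_eq[j]) / math.dist(xyz[i], xyz[j])
        for i, j in pairs
    ]


def coulomb_matrix(charges, xyz):
    """Full Coulomb matrix as a list of rows (commonly quoted definition)."""
    n = len(charges)
    cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                cm[i][j] = 0.5 * charges[i] ** 2.4
            else:
                cm[i][j] = charges[i] * charges[j] / math.dist(xyz[i], xyz[j])
    return cm


def vectorize(matrix):
    """Flatten the full matrix row by row (ordering is an assumption)."""
    return [value for row in matrix for value in row]


# Toy diatomic: two unit charges one length unit apart.
h2 = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
cm_vector = vectorize(coulomb_matrix([1, 1], h2))  # [0.5, 1.0, 1.0, 0.5]
```

By construction, the RE descriptor of the equilibrium geometry itself is a vector of ones, which is what makes it a natural reference point for a single-molecule PES.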
molDescrType=[type of molecular descriptor]
unsorted [default for RE]
sorted [default for CM]
permuted (optional, can be used for both RE and CM)
unsorted descriptors are the original descriptors; they do not ensure permutational invariance with respect to homonuclear atoms.
sorted descriptors ensure permutational invariance and are typically used with the CM descriptor (where the CM is sorted by its norms). For the RE descriptor, sorting is done by nuclear repulsions. Sorting can be used for structure-based sampling, but it introduces discontinuities in the interpolant and should not be used for simulations. Related options: XYZsortedFileOut, permInvGroups, permInvNuclei. See also related tutorial.
permuted augments the descriptor with the permutations of user-defined atoms. Related arguments: permInvKernel, permInvGroups, permInvNuclei. See also related tutorial.
XYZsortedFileOut=[output file with sorted XYZ coordinates]
optional.
Only works with molDescriptor=RE molDescrType=sorted.
Sorts the chosen atoms by nuclear repulsion and saves a file with the XYZ coordinates after sorting.
permInvNuclei=[permutationally invariant nuclei]
optional.
Should be used with molDescrType=permuted (and often with permInvKernel).
E.g. permInvNuclei=2-3.5-6 will permute atoms 2,3 and 5,6. See also related tutorial.
permInvGroups=[permutationally invariant groups]
optional.
Should be used with molDescrType=permuted (and often with permInvKernel).
E.g. for water dimer permInvGroups=1,2,3-4,5,6 generates permuted atom indices by flipping the monomers in a dimer.
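The water dimer example can be illustrated with a short Python sketch that turns a group specification into permuted atom orders. This is only an illustration under assumptions: the helper names are hypothetical, groups are assumed to be separated by "-" with atoms inside a group separated by ",", and atom indices are assumed to be 1-based.

```python
from itertools import permutations


def parse_groups(spec):
    """Parse e.g. '1,2,3-4,5,6' into [[1, 2, 3], [4, 5, 6]] (assumed syntax)."""
    return [[int(atom) for atom in group.split(",")] for group in spec.split("-")]


def permuted_orders(n_atoms, spec):
    """Return every atom order obtained by exchanging whole groups."""
    groups = parse_groups(spec)
    orders = []
    for perm in permutations(groups):
        order = list(range(1, n_atoms + 1))  # identity order, 1-based
        # Put the atoms of the permuted group into the slots of the original one.
        for original, replacement in zip(groups, perm):
            for slot, atom in zip(original, replacement):
                order[slot - 1] = atom
        orders.append(order)
    return orders


dimer_orders = permuted_orders(6, "1,2,3-4,5,6")
# [[1, 2, 3, 4, 5, 6], [4, 5, 6, 1, 2, 3]]
```

Each returned order could then be used to rebuild the descriptor for the permuted geometry, so that a model sees both labelings of the two monomers.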
Example
MLatom.py XYZ2X XYZfile=CH3Cl.xyz XfileOut=CH3Cl.x
Analyzing data sets
MLatom can analyze data sets by comparing them, typically by calculating the errors of ML-predicted values with respect to available reference values. All files are input files, and the MLatom output is a statistical analysis.
Arguments
analyze
Yfile=[input file with values]
YgradXYZfile=[input file with gradients in XYZ coordinates]
YestFile=[input file with estimated Y values]
YgradXYZestFile=[input file with estimated XYZ gradients]
Example
MLatom.py analyze Yfile=en.dat YestFile=enest.dat
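As a rough picture of what such an analysis involves, the sketch below computes a few common error measures from reference and estimated values. It does not reproduce MLatom's actual output: the function name and the choice of statistics (MAE, mean signed error, RMSE) are assumptions made for illustration.

```python
import math


def error_statistics(y_ref, y_est):
    """Common error measures of estimated values against reference values."""
    if len(y_ref) != len(y_est) or not y_ref:
        raise ValueError("both lists must be non-empty and of equal length")
    errors = [est - ref for ref, est in zip(y_ref, y_est)]
    n = len(errors)
    return {
        "MAE": sum(abs(e) for e in errors) / n,           # mean absolute error
        "MSE": sum(errors) / n,                           # mean signed error
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
    }


stats = error_statistics([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
```

The same function applies to gradients if they are first flattened into one list of components per file.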
Sampling and splitting
Arguments for sampling and splitting
sample
requires at least one of iTrainOut, CVtest, LOOtest, CVopt, LOOopt. See also tutorial.
XYZfile=[file with XYZ coordinates] or XfileIn=[file with input vectors X]
required.
iTrainOut=[file with indices of training points]
no default file names.
generates indices for the training set.
iTestOut=[file with indices of test points]
no default file names.
generates indices for the test set.
iSubtrainOut=[file with indices of sub-training points]
no default file names.
generates indices for the sub-training set.
iValidateOut=[file with indices of validation points]
no default file names.
generates indices for the validation set.
CVtest
optional.
Related option: NcvTestFolds.
Generates indices for splits in N-fold cross-validation. By default, 5-fold cross-validation is used.
NcvTestFolds=[number of CV folds]
5 [default].
Can be used only with CVtest.
If this number is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation.
LOOtest
optional.
leave-one-out cross-validation. Only random or no sampling can be used.
iCVtestPrefOut=[prefix of files with indices for CVtest]
no default prefixes.
file names will include the required prefix.
CVopt
optional.
Related option: NcvOptFolds.
Generates indices for N-fold cross-validation for hyperparameter optimization. By default, 5-fold cross-validation is used.
NcvOptFolds=[number of CV folds]
5 [default].
Can be used only with CVopt.
If this number is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation.
LOOopt
optional.
Leave-one-out cross-validation. Only random or no sampling can be used.
iCVoptPrefOut=[prefix of files with indices for CVopt]
no default prefixes.
file names will include the required prefix.
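The cross-validation options above amount to partitioning the point indices into folds. The following Python sketch shows one way to do that; it is not MLatom's code, the function names are hypothetical, and the 1-based indexing and the fixed random seed are assumptions. Setting the number of folds equal to the number of points gives leave-one-out cross-validation, as described for NcvTestFolds and NcvOptFolds.

```python
import random


def kfold_indices(n_points, n_folds=5, shuffle=True, seed=0):
    """Partition indices 1..n_points into n_folds disjoint folds."""
    indices = list(range(1, n_points + 1))  # 1-based indices (assumption)
    if shuffle:  # random sampling; shuffle=False corresponds to no sampling
        random.Random(seed).shuffle(indices)
    # Deal the indices out round-robin so fold sizes differ by at most one.
    return [indices[k::n_folds] for k in range(n_folds)]


def cv_splits(folds):
    """Yield (training indices, held-out indices) for every fold."""
    for k, held_out in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, held_out


folds = kfold_indices(10, n_folds=5)
loo_folds = kfold_indices(10, n_folds=10)  # leave-one-out
```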
Additional optional arguments for sampling
Arguments used with the sample argument.
sampling=[type of data set sampling into splits]
random [default] random sampling
none simply splits the unshuffled data set into the training and test sets (in this order) (and sub-training and validation sets)
structure-based structure-based sampling
farthest-point farthest-point traversal iterative procedure
Nuse=[N first entries of the data set file to be used]
100% [default]
optional.
Ntrain=[number of the training points or a fraction of the total set]
80% of the total set by default.
If a parameter is a decimal number less than 1, then it is considered to be a fraction of the total set.
Ntest=[number of the test points or a fraction of the total set]
By default, the remaining points of the total set after subtracting the training points.
If a parameter is a decimal number less than 1, then it is considered to be a fraction of the total set.
Nsubtrain=[number of the sub-training points or a fraction of the training points]
80% of the training set by default.
If a parameter is a decimal number less than 1, then it is considered to be a fraction of the training set.
Nvalidate=[number of the validation points or a fraction of the training points]
By default, the remaining points of the training set after subtracting the sub-training points.
If a parameter is a decimal number less than 1, then it is considered to be a fraction of the training set.
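The defaults and the fraction rule described above can be summarized in a small Python sketch. It is an illustration only: the function names are hypothetical, and rounding fractional sizes to the nearest integer is an assumption, since the text does not say how fractions are converted to counts.

```python
def resolve_count(value, parent_size, default):
    """Turn an absolute count or a fraction (< 1) of parent_size into a count."""
    if value is None:
        return default
    if 0 < value < 1:
        return round(value * parent_size)  # rounding rule is an assumption
    return int(value)


def split_sizes(n_total, ntrain=None, ntest=None, nsubtrain=None, nvalidate=None):
    """Resolve the four set sizes using the defaults described in the text."""
    n_train = resolve_count(ntrain, n_total, round(0.8 * n_total))
    n_test = resolve_count(ntest, n_total, n_total - n_train)
    # Sub-training and validation sizes are relative to the training set.
    n_subtrain = resolve_count(nsubtrain, n_train, round(0.8 * n_train))
    n_validate = resolve_count(nvalidate, n_train, n_train - n_subtrain)
    return {
        "Ntrain": n_train,
        "Ntest": n_test,
        "Nsubtrain": n_subtrain,
        "Nvalidate": n_validate,
    }


defaults = split_sizes(1000)
# {'Ntrain': 800, 'Ntest': 200, 'Nsubtrain': 640, 'Nvalidate': 160}
```

For instance, `split_sizes(1000, ntrain=0.5, ntest=100)` resolves to 500 training points, 100 test points, 400 sub-training points, and 100 validation points.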
Example
Structure-based sampling:
mlatom sample sampling=structure-based XYZfile=CH3Cl.xyz Ntrain=1000 Ntest=10000 iTrainOut=itrain.dat iTestOut=itest.dat
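The farthest-point option listed among the sampling types refers to a generic iterative procedure that can be sketched as follows. This is the textbook farthest-point traversal, not necessarily the exact variant MLatom uses: the function name, the choice of starting point, and the use of the Euclidean distance between descriptor vectors are assumptions.

```python
import math


def farthest_point_sampling(points, n_select, start=0):
    """Greedy farthest-point traversal; returns 0-based indices in pick order."""
    n_select = min(n_select, len(points))
    selected = [start]
    # Distance from every point to the nearest already selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_select:
        # Pick the point that is farthest from everything selected so far.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


points = [(0.0,), (1.0,), (2.0,), (10.0,)]
picked = farthest_point_sampling(points, 3)  # [0, 3, 2]
```

Each iteration costs one pass over the data set, so selecting M points out of N requires on the order of M times N distance evaluations.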
Slicing
Sometimes it is useful to slice data by the Euclidean distance of their descriptors to the equilibrium descriptor. See tutorial.
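A minimal Python sketch of the idea: compute the Euclidean distance of each descriptor to the equilibrium descriptor, order the points by that distance, and cut the ordered list into slices. How MLatom actually defines the slice boundaries is not stated here, so the equal-population split below, the function name, and the 0-based indices are assumptions.

```python
import math


def slice_by_distance(descriptors, eq_descriptor, n_slices=3):
    """Split 0-based point indices into slices ordered by distance to equilibrium."""
    distances = [math.dist(x, eq_descriptor) for x in descriptors]
    order = sorted(range(len(descriptors)), key=lambda i: distances[i])
    # Equal-population slices (assumption); earlier slices get any remainder.
    base, extra = divmod(len(order), n_slices)
    slices, start = [], 0
    for k in range(n_slices):
        size = base + (1 if k < extra else 0)
        slices.append(order[start:start + size])
        start += size
    return slices


descriptors = [(3.0,), (0.0,), (1.0,), (5.0,), (2.0,), (4.0,)]
slices = slice_by_distance(descriptors, (0.0,), n_slices=3)
# [[1, 2], [4, 0], [5, 3]]
```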
Arguments for slicing:
slice
XfileIn=[file with input vectors X]
eqXfileIn=[file with the input vector of the equilibrium geometry]
Nslices=[number of slices]
Arguments for sampling from slices:
sampleFromSlices
Nslices=[number of slices]
sampling=[type of data set sampling]
Ntrain=[number of training points]
Arguments for merging indices from slices:
mergeSlices
Nslices=[number of slices]
Ntrain=[number of training points]
Examples
See tutorial.
MLatom.py slice Nslices=3 XfileIn=x_sorted.dat eqXfileIn=eq.x
mlatom sampleFromSlices Nslices=3 sampling=structure-based Ntrain=4480
mlatom mergeSlices Nslices=3 Ntrain=4480