Learning

Training popular ML models

The models can be either native to MLatom or provided through interfaces to third-party programs for popular ML model types:

MACE (through MACE)

ANI (through TorchANI)

(p)KREG (native). See tutorial. Can only be used for single-molecule PES

KRR-CM (KRR with Coulomb matrix, native).

DeepPot-SE and DPMD (through DeePMD-kit)

GAP - SOAP (through GAP suite and QUIP)

PhysNet (through PhysNet)

sGDML (through sGDML). Can only be used for single-molecule PES

Input arguments

createMLmodel
- requests training an ML model.
XYZfile=[input file with XYZ coordinates]
- no default file names.
- requests training on a data set with many molecules provided in a file with their XYZ coordinates. The units of the coordinates are arbitrary, but Å is recommended because many simulations with MLatom require it.
Yfile=[input file with reference values] and/or YgradXYZfile=[input file with reference XYZ gradients]
- either Yfile alone or both of these arguments can be provided.
  
  No default file names.
- Yfile often contains energies; Hartree is recommended if the model is intended for further simulations. YgradXYZfile often contains energy gradients; Hartree/Å is recommended. Note that gradients are negative forces, so the appropriate sign should be used. Sparse gradients can also be provided: for geometries without gradients, YgradXYZfile should contain ‘0’ followed by a blank line (see tutorial).
MLmodelOut=[output file with trained model]
- no default file name.
- saves the model to a user-defined file. If the file already exists, MLatom will stop without overwriting it.
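The sign convention and the sparse-gradient marker described above can be sketched in a few lines of Python. This is only an illustration: the function names and the force values are hypothetical, and the XYZ-like layout used for geometries that do have gradients (atom count, blank line, one row per atom) is an assumption that should be checked against the tutorial. Only the ‘0’ followed by a blank line for missing gradients is taken from the text.

```python
from typing import Optional, Sequence

Vector = Sequence[float]


def forces_to_gradients(forces: Sequence[Vector]) -> list:
    """Energy gradients are the negative of the forces: g = -F."""
    # "0.0 - c" instead of "-c" avoids writing a negative zero ("-0.0").
    return [[0.0 - c for c in atom] for atom in forces]


def format_gradient_block(gradients: Optional[Sequence[Vector]]) -> str:
    """Return the text block for one geometry of a gradient file."""
    if gradients is None:
        # From the text above: '0' followed by a blank line marks a geometry
        # for which no gradients are available.
        return "0\n\n"
    # ASSUMPTION: XYZ-like layout (atom count, blank line, one row per atom).
    rows = "\n".join(" ".join(f"{c:20.12f}" for c in atom) for atom in gradients)
    return f"{len(gradients)}\n\n{rows}\n"


# Hypothetical forces (Hartree/Å) for a diatomic; the second geometry has none.
forces_per_geometry = [[[0.0, 0.0, 0.01], [0.0, 0.0, -0.01]], None]

blocks = []
for forces in forces_per_geometry:
    gradients = None if forces is None else forces_to_gradients(forces)
    blocks.append(format_gradient_block(gradients))

print("".join(blocks), end="")
```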

MLmodelType=[supported ML model type]

KREG [default];

Available model types and corresponding programs (MLatomF is a native program):
+-------------+----------------+
| MLmodelType | default MLprog |
+-------------+----------------+
| KREG        | MLatomF        |
+-------------+----------------+
| sGDML       | sGDML          |
+-------------+----------------+
| GAP-SOAP    | GAP            |
+-------------+----------------+
| PhysNet     | PhysNet        |
+-------------+----------------+
| DeepPot-SE  | DeePMD-kit     |
+-------------+----------------+
| ANI         | TorchANI       |
+-------------+----------------+
Calculations with native implementations do not require this argument. For third-party models the user should provide the MLmodelType and/or MLprog argument (see also installation instructions). Note that to request the KRR-CM model, the descriptor and algorithm details have to be chosen manually.

MLprog=[supported ML program]

It is recommended to use MLmodelType instead of this option.

Supported interfaces with default and tested ML model types:

+------------+----------------------+
| MLprog     | MLmodelType          |
+------------+----------------------+
| MLatomF    | KREG [default]       |
|            | see                  |
|            | MLatom.py KRR help   |
+------------+----------------------+
| sGDML      | sGDML [default]      |
|            | GDML                 |
+------------+----------------------+
| GAP        | GAP-SOAP             |
+------------+----------------------+
| PhysNet    | PhysNet              |
+------------+----------------------+
| DeePMD-kit | DeepPot-SE [default] |
|            | DPMD                 |
+------------+----------------------+
| TorchANI   | ANI [default]        |
+------------+----------------------+

Calculations with native implementations do not require this argument. For third-party models the user should provide the MLmodelType and/or MLprog argument (see also installation instructions). Note that to request the KRR-CM model, the descriptor and algorithm details have to be chosen manually.

eqXYZfileIn=[file with XYZ coordinates of equilibrium geometry]
- optional.
  
  By default, MLatom looks for the eq.xyz file; if it is not found, the minimum-energy structure in the data set is used.
- can only be used for the KREG model to construct the RE descriptor.

Additional output arguments

YestFile=[output file with estimated Y values]
- this argument is optional and no default parameters are provided.
- makes predictions Y for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, program will terminate and not overwrite it.
YgradXYZestFile=[output file with estimated XYZ gradients]
- this argument is optional and no default parameters are provided.
- should be used only with XYZfile option. Calculates first XYZ derivatives for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, program will terminate and not overwrite it.
YgradEstFile=[output file with estimated gradients]
- this argument is optional and no default parameters are provided.
- should be used only with XfileIn option. Calculates first derivatives for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, program will terminate and not overwrite it.

Note

Calculations with third-party programs may also generate additional output files.

Additional options for TorchANI interface

Arguments with their default values:

`ani.batch_size=8`	batch size
`ani.max_epochs=10000000`	max epochs
`ani.early_stopping_learning_rate=0.00001`	learning rate that triggers early-stopping
`ani.force_coefficient=0.1`	weight for force
`ani.Rcr=5.2`	radial cutoff radius
`ani.Rca=3.5`	angular cutoff radius
`ani.EtaR=1.6`	radial smoothness in radial part
`ani.ShfR=0.9,1.16875,1.4375,1.70625,1.975,2.24375,2.5125,2.78125,3.05,3.31875,3.5875,3.85625,4.125,4.39375,4.6625,4.93125`	radial shifts in radial part
`ani.Zeta=32`	angular smoothness
`ani.ShfZ=0.19634954,0.58904862,0.9817477,1.3744468,1.7671459,2.1598449,2.552544,2.9452431`	angular shifts
`ani.EtaA=8`	radial smoothness in angular part
`ani.ShfA=0.9,1.55,2.2,2.85`	radial shifts in angular part
`ani.Neuron_l1=160`	number of neurons in layer 1
`ani.Neuron_l2=128`	number of neurons in layer 2
`ani.Neuron_l3=96`	number of neurons in layer 3
`ani.AF1='CELU'`	activation function for layer 1
`ani.AF2='CELU'`	activation function for layer 2
`ani.AF3='CELU'`	activation function for layer 3
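The default angular shifts listed above appear to be regular grids: the ani.ShfZ values match the midpoints of eight equal sections of [0, π], and the ani.ShfA values are spaced by 0.65 starting from 0.9. The following Python sketch regenerates such comma-separated option strings, which may help when changing the number of shifts. The helper names are hypothetical and are not part of MLatom or TorchANI.

```python
import math


def angular_shifts(n_sections: int = 8) -> list:
    """Midpoints of n equal sections of [0, pi]: (2k + 1) * pi / (2n)."""
    return [(2 * k + 1) * math.pi / (2 * n_sections) for k in range(n_sections)]


def even_grid(start: float, step: float, count: int) -> list:
    """Evenly spaced values start, start + step, ..."""
    return [start + i * step for i in range(count)]


def as_option(name: str, values: list, digits: int = 8) -> str:
    """Format values as a comma-separated option string, name=v1,v2,..."""
    return f"{name}=" + ",".join(f"{v:.{digits}g}" for v in values)


shf_z = angular_shifts(8)          # compare with the ani.ShfZ default
shf_a = even_grid(0.9, 0.65, 4)    # compare with the ani.ShfA default

print(as_option("ani.ShfZ", shf_z))
print(as_option("ani.ShfA", shf_a, digits=4))
```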

Additional options for sGDML

Arguments with their default values:

`sgdml.gdml=False`	use GDML instead of sGDML
`sgdml.cprsn=False`	compress kernel matrix along symmetric degrees of freedom
`sgdml.no_E=False`	do not predict energies
`sgdml.E_cstr=False`	include the energy constraints in the kernel
`sgdml.s=<s1>[,<s2>[,...]] or <start>:[<step>:]<stop>`	set hyperparameter sigma, see sgdml create -h for details.

Additional options for PhysNet

Arguments with their default values:

`physnet.num_features=128`	number of input features
`physnet.num_basis=64`	number of radial basis functions
`physnet.num_blocks=5`	number of stacked modular building blocks
`physnet.num_residual_atomic=2`	number of residual blocks for atom-wise refinements
`physnet.num_residual_interaction=3`	number of residual blocks for refinements of proto-message
`physnet.num_residual_output=1`	number of residual blocks in output blocks
`physnet.cutoff=10.0`	cutoff radius for interactions in the neural network
`physnet.seed=42`	random seed
`physnet.learning_rate=0.0008`	starting learning rate
`physnet.decay_steps=10000000`	decay steps
`physnet.decay_rate=0.1`	decay rate for learning rate
`physnet.batch_size=12`	training batch size
`physnet.valid_batch_size=2`	validation batch size
`physnet.force_weight=52.91772105638412`	weight for force
`physnet.summary_interval=5`	interval for summary
`physnet.validation_interval=5`	interval for validation
`physnet.save_interval=10`	interval for model saving

Additional options for GAP and QUIP

gapfit.xxx=x sets option xxx of gap_fit to x, where xxx can be any gap_fit option (e.g., default_sigma). Note that there is no need to set at_file and gp_file.
gapfit.gap.xxx=x sets option xxx of gap to x, where xxx can be any gap option.

Arguments with their default values:

`gapfit.default_sigma={0.0005,0.001,0,0}`	hyperparameter sigmas for energies, forces, virials and hessians
`gapfit.e0_method=average`	method for determining e0
`gapfit.gap.type=soap`	descriptor type
`gapfit.gap.l_max=6`	max number of angular basis functions
`gapfit.gap.n_max=6`	max number of radial basis functions
`gapfit.gap.atom_sigma=0.5`	hyperparameter for Gaussian smearing of atom density
`gapfit.gap.zeta=4`	hyperparameter for kernel sensitivity
`gapfit.gap.cutoff=6.0`	cutoff radius of local environment
`gapfit.gap.cutoff_transition_width=0.5`	cutoff transition width
`gapfit.gap.delta=1`	hyperparameter delta for kernel scaling

Additional options for DeePMD-kit

Expressions like deepmd.xxx.xxx=X specify arguments for DeePMD-kit and follow the structure of DeePMD-kit’s JSON input file.

For example:

deepmd.training.stop_batch=N is an equivalent of

{
    ...
    "training": {
        ...
        "stop_batch": N
        ...
    }
    ...
}

in DeePMD-kit’s JSON input. In addition, the option deepmd.input=S takes an input JSON file S as a template. The final input file is generated from it together with any deepmd.xxx.xxx=X options. Check the default template file bin/interfaces/DeePMDkit/template.json for default values.
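The mapping from dotted options to nested JSON keys can be illustrated with a short Python sketch. It is not MLatom's actual implementation: the function name, the minimal template, and the way values are parsed (as JSON literals when possible, otherwise as strings) are assumptions made for illustration only.

```python
import json
from copy import deepcopy


def apply_deepmd_options(template: dict, options: list) -> dict:
    """Merge deepmd.a.b=X style options into a copy of a template dict."""
    result = deepcopy(template)
    for option in options:
        key, _, raw = option.partition("=")
        parts = key.split(".")
        if parts[0] != "deepmd" or len(parts) < 2:
            continue  # not a DeePMD-kit option
        if parts[1:] == ["input"]:
            continue  # deepmd.input names the template file, not a JSON key
        try:
            value = json.loads(raw)  # numbers, booleans, lists, ...
        except json.JSONDecodeError:
            value = raw              # keep anything else as a plain string
        node = result
        for name in parts[1:-1]:
            node = node.setdefault(name, {})  # create missing levels
        node[parts[-1]] = value
    return result


# Hypothetical minimal template; the real one is template.json (see above).
template = {"training": {"stop_batch": 1000, "seed": 1}}
options = ["deepmd.training.stop_batch=4000000", "deepmd.input=my_template.json"]

merged = apply_deepmd_options(template, options)
print(json.dumps(merged, indent=4))
```

Here deepmd.training.stop_batch=4000000 overrides only the "stop_batch" entry inside "training" and leaves the other template values untouched.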

Example

See tutorial for training the KREG models.

Here we show how to train an ANI-type model on the ethanol PES (training only on energies), using the files ethanol_geometries.xyz and ethanol_energies.txt.

In MLatom, except for the KREG model, we need to specify MLmodelType. The input is very simple:

createMLmodel                            # Specify the task for MLatom
MLmodelType=ANI                          # Specify the model type
XYZfile=ethanol_geometries.xyz           # File with XYZ geometries
Yfile=ethanol_energies.txt               # File with reference energies

Training generic ML models

MLatom can train kernel ridge regression (KRR) models for any generic data set with input vectors X and reference labels Y. A range of kernel functions is supported. Instead of using this option, it may be more convenient to use one of the popular ML models available in MLatom.

Required arguments

Below are required arguments but typically more options are needed, e.g., for choosing a molecular descriptor and algorithm hyperparameters, as shown later.

createMLmodel
- requests training an ML model.
  
  Currently only KRR models are supported.
XYZfile=[input file with XYZ coordinates] or XfileIn=[input file with input vectors X]
- one and only one of these two options can be chosen.
  
  No default file names.
- XYZfile: requests training on a data set with many molecules provided in a file with their XYZ coordinates. The units of the coordinates are arbitrary, but Å is recommended because many simulations with MLatom require it.
  
  XfileIn: requests to train on a data set with many input vectors (one input vector per line in text file), which are typically molecular descriptors.
Yfile=[input file with reference values] and/or YgradXYZfile=[input file with reference XYZ gradients]
- one or both of these two options can be chosen.
  
  No default file names.
- Yfile often contains energies; Hartree is recommended if the model is intended for further simulations.
  
  YgradXYZfile often contains energy gradients; Hartree/Å is recommended. Note that gradients are negative forces, so the appropriate sign should be used. Sparse gradients can also be provided: for geometries without gradients, YgradXYZfile should contain ‘0’ followed by a blank line (see tutorial).
MLmodelOut=[output file with trained model]
- no default file name.
- saves the model to a user-defined file, commonly with the .unf extension. If the file already exists, MLatom will stop without overwriting it.

KRR-related arguments

prior=[offset of reference values]
- 0.0 [default].
  
  mean: use the average of the reference scalar values.
  
  Any other user-defined decimal/integer number.
- It is often useful to offset reference values, e.g., by removing average value. This may improve stability of the model and make learning easier.
KRRtask=[one of tasks]
- learnVal learns reference values [default if only Yfile provided].
  
  learnGradXYZ explicitly learns only XYZ gradients (should be requested for correct simulations). Works only with the KREG model (RE descriptor and Gaussian kernel).
  
  learnValGradXYZ explicitly learns both scalar values and XYZ gradients [default if both Yfile and YgradXYZfile are provided]. Works only with the KREG model (RE descriptor and Gaussian kernel).
- specifies what to learn: scalar values and/or XYZ gradients.
lambda=[regularization hyperparameter]
- 0.0 [default]. opt optimize hyperparameter, see dedicated manual.
  
  Any other user-defined nonnegative decimal/integer number.
- It is recommended to always optimize this hyperparameter. Usually, lambda parameter should be rather small but larger than zero, e.g., 10^-6.
lambdaGradXYZ=[regularization hyperparameter for XYZ gradients part]
- similar to lambda.
  
  Can be used for KRRtask=learnGradXYZ and KRRtask=learnValGradXYZ.
  
  For KRRtask=learnGradXYZ, both lambda and lambdaGradXYZ are equivalent.
- similar to lambda, may be helpful if it is hard to learn both scalar values and XYZ gradients with a single lambda.
kernel=[kernel function]
- Gaussian [default]. Its hyperparameter: sigma.
  Modifications of Gaussian kernel:
  
  periodKernel. Its hyperparameters: sigma, period.
  
  decayKernel. Its hyperparameters: sigma, sigmap, period.
  
  Laplacian. Its hyperparameter: sigma.
  
  exponential. Its hyperparameter: sigma.
  
  Matern is the most flexible but relatively slow, hyperparameters: nn, sigma. nn = 0 makes Matern kernel equivalent to exponential kernel, very large nn makes it equivalent to Gaussian kernel.
  
  linear. No hyperparameters.
- Many of these kernel functions have hyperparameters, which should be set with the indicated arguments. The linear kernel makes KRR equivalent to ridge regression, i.e., kernelized multiple linear regression (MLR), and MLatom prints out the coefficients of the equivalent MLR model.
sigma=[length scale hyperparameter]
- 100.0 [default for kernel=Gaussian and kernel=Matern].
  
  800.0 [default for kernel=Laplacian and kernel=exponential]
  
  opt optimize hyperparameter, see dedicated manual.
  
  Any other user-defined positive decimal/integer number.
- length-scale hyperparameter present in most kernel functions. It is recommended to always optimize this hyperparameter; no good general default value can be recommended.
sigmap=[length scale hyperparameter of a periodic part]
- 100.0 [default, can be used only with kernel=decayKernel]
  
  opt optimize hyperparameter, see dedicated manual.
  
  Any other user-defined positive decimal/integer number.
- It is recommended to always optimize this hyperparameter, no good default general value can be recommended.
period=[length scale hyperparameter]
- 1.0 [default, can be used in both kernel=periodKernel and kernel=decayKernel]
  
  opt optimize hyperparameter, see dedicated manual.
  
  Any other user-defined positive decimal/integer number.
- It is recommended to always optimize this hyperparameter, no good default general value can be recommended.
nn=[length scale hyperparameter]
- 2 [default, can only be used for kernel=Matern]
  
  opt optimize hyperparameter, see dedicated manual.
  
  Any other user-defined positive integer number.
- Since it is an integer hyperparameter, it is usually easy to check several values from 1 to 5 manually, because 0 corresponds to the exponential kernel and values above 5 are already close to the Gaussian kernel.
permInvKernel
- optional. Related options: molDescrType=permuted, permInvGroups, permInvNuclei, Nperm, selectperm, permIndIn, permlen.
- requests calculations with permutationally invariant kernel. Recommended for small data sets to ensure that permutation of homonuclear atoms will not change ML predictions.
Nperm=[number of permutations]
- optional, can only be used with permInvKernel and XfileIn.
- defines the number of permutations in the user-provided file with reference values. Each line of the input vector file must contain the input vectors with molecular descriptors concatenated for all atomic permutations of a single geometry. See also related tutorial.
selectperm
- optional, can only be used with permInvKernel and molDescrType=permuted.
- may be useful to find the most relevant permutations and reduce the number of permutations by minimizing the distance RMSD to an equilibrium structure. Prints out the list of selected permutations. See also related tutorial.
permIndIn=[file with permutations list]
- optional, can only be used with permInvKernel and molDescrType=permuted and permlen.
- See also related tutorial.
permlen=[number of permutations in permIndIn]
- optional, can only be used with permInvKernel and molDescrType=permuted and permIndIn.
- See also related tutorial.
matDecomp=[type of matrix decomposition]
- Cholesky [default]
  
  LU
  
  Bunch-Kaufman
- Cholesky is the most efficient, but for very difficult cases (e.g., too small hyperparameter lambda), other types can be used. MLatom first tries to do Cholesky decomposition, if it fails, MLatom tries to do Bunch-Kaufman and, finally, LU. Thus, usually, the user does not need to worry about this option.
invMatrix
- not used by default.
  
  Optional.
- requests inverting kernel matrix to train the model. Not recommended because it is much slower than the default option.
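To make the relationship between the kernels listed above concrete, here is a small Python sketch of commonly used functional forms. These forms are assumptions (a Gaussian kernel with 2σ² in the denominator, a Laplacian kernel based on the L1 norm, and the finite-sum form of the Matérn kernel); MLatom's exact definitions and normalization may differ, so the code only illustrates why nn = 0 reduces the Matérn kernel to the exponential one.

```python
import math
from typing import Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def gaussian_kernel(x, y, sigma: float) -> float:
    # ASSUMPTION: exp(-|x - y|^2 / (2 sigma^2))
    return math.exp(-euclidean(x, y) ** 2 / (2.0 * sigma ** 2))


def exponential_kernel(x, y, sigma: float) -> float:
    # ASSUMPTION: exp(-|x - y|_2 / sigma)
    return math.exp(-euclidean(x, y) / sigma)


def laplacian_kernel(x, y, sigma: float) -> float:
    # ASSUMPTION: same as exponential, but with the L1 (Manhattan) norm
    return math.exp(-sum(abs(a - b) for a, b in zip(x, y)) / sigma)


def matern_kernel(x, y, sigma: float, nn: int) -> float:
    # ASSUMPTION: finite-sum Matern form with integer parameter nn:
    # exp(-d/sigma) * sum_k (nn+k)!/(2nn)! * C(nn,k) * (2d/sigma)^(nn-k)
    d = euclidean(x, y)
    total = sum(
        math.factorial(nn + k) / math.factorial(2 * nn)
        * math.comb(nn, k) * (2.0 * d / sigma) ** (nn - k)
        for k in range(nn + 1)
    )
    return math.exp(-d / sigma) * total


x, y = [0.0, 0.0], [3.0, 4.0]  # two toy input vectors, distance 5
print(matern_kernel(x, y, 10.0, 0), exponential_kernel(x, y, 10.0))
print(matern_kernel(x, y, 10.0, 2), gaussian_kernel(x, y, 10.0))
```

With nn = 0 the sum contains a single term equal to 1, so only the exponential factor remains.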

Molecular descriptor arguments

If the user only provides XYZ file with XYZfile argument, XYZ coordinates need to be first converted into the molecular descriptor.

molDescriptor=[molecular descriptor]
- RE [default] (relative-to-equilibrium)
  
  CM (Coulomb matrix)
  
  ID (inverse internuclear distances)
- RE descriptor is well-suited for accurate description of single-molecule PES.
  
  CM is a popular (but somewhat outdated) descriptor which can in principle be also applied to different molecules. In MLatom, full CM (vectorized) is used, not its eigenvalues as in original publication.
  
  ID is a popular inverse internuclear distances descriptor used in many ML models, applicable to a single-molecule PES and similar to the RE descriptor.
molDescrType=[type of molecular descriptor]
- unsorted [default for RE]
  
  sorted [default for CM]
  
  permuted (optional, can be used for both RE and CM)
- unsorted descriptors are original descriptors, but they do not ensure permutational invariance of homonuclear atoms.
  
  sorted descriptors ensure permutational invariance and are typically used for the CM descriptor (where the CM is sorted by its norms). In the case of the RE descriptor, sorting is done by nuclear repulsions. It can be used for structure-based sampling, but introduces discontinuities in the interpolant and should not be used for simulations. Related options: XYZsortedFileOut, permInvGroups, permInvNuclei. See also related tutorial.
  
  permuted augments the descriptor with the permutations of user-defined atoms. Related arguments: permInvKernel, permInvGroups, permInvNuclei. See also related tutorial.
XYZsortedFileOut=[output file with sorted XYZ coordinates]
- optional.
  
  Only works with molDescriptor=RE molDescrType=sorted.
- saves a file with XYZ coordinates after sorting the chosen atoms by their nuclear repulsions.
permInvNuclei=[permutationally invariant nuclei]
- optional.
  
  Should be used with molDescrType=permuted (and often with permInvKernel)
- E.g. permInvNuclei=2-3.5-6 will permute atoms 2,3 and 5,6. See also related tutorial.
permInvGroups=[permutationally invariant groups]
- optional.
  
  Should be used with molDescrType=permuted (and often with permInvKernel)
- E.g. for water dimer permInvGroups=1,2,3-4,5,6 generates permuted atom indices by flipping the monomers in a dimer.
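The three descriptors can be sketched in Python as follows. This is an illustration under stated assumptions rather than MLatom's implementation: RE is taken as the vector of ratios r_eq/r over all atom pairs, ID as the vector of 1/r, and CM as the full matrix with 0.5·Z^2.4 on the diagonal and Z_i·Z_j/r_ij off the diagonal. The water-like coordinates (in Å) are hypothetical.

```python
import math
from itertools import combinations


def pair_distances(coords):
    """Distances for all unique atom pairs (0,1), (0,2), ..., in fixed order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def id_descriptor(coords):
    """Inverse internuclear distances, 1/r."""
    return [1.0 / r for r in pair_distances(coords)]


def re_descriptor(coords, eq_coords):
    """Relative-to-equilibrium descriptor, r_eq/r for every atom pair."""
    return [r_eq / r
            for r, r_eq in zip(pair_distances(coords), pair_distances(eq_coords))]


def coulomb_matrix(coords, charges):
    """Full (unsorted) Coulomb matrix as a list of rows."""
    n = len(coords)
    cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                cm[i][j] = 0.5 * charges[i] ** 2.4
            else:
                cm[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return cm


# Hypothetical water-like geometries: O, H, H.
eq_geometry = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
stretched = [(0.0, 0.0, 0.0), (1.20, 0.0, 0.0), (-0.24, 0.93, 0.0)]

print(re_descriptor(eq_geometry, eq_geometry))  # all ones at equilibrium
print(re_descriptor(stretched, eq_geometry))    # below one for the stretched bond
print(coulomb_matrix(eq_geometry, [8, 1, 1])[0])
```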

Additional output arguments

YestFile=[output file with estimated Y values]
- this argument is optional and no default parameters are provided.
- makes predictions Y for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, program will terminate and not overwrite it.
YgradXYZestFile=[output file with estimated XYZ gradients]
- this argument is optional and no default parameters are provided.
- should be used only with XYZfile option. Calculates first XYZ derivatives for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, program will terminate and not overwrite it.
YgradEstFile=[output file with estimated gradients]
- this argument is optional and no default parameters are provided.
- should be used only with XfileIn option. Calculates first derivatives for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, program will terminate and not overwrite it.

Example

Here we show how to train a simple model for the H₂ dissociation curve with kernel ridge regression.

Download R_20.dat file with 20 points corresponding to internuclear distances in the H₂ molecule in Å.

Download E_FCI_20.dat file with full CI energies (calculated with the aug-cc-pV6Z basis set, in Hartree) for above 20 points.

Train (option createMLmodel) ML model and save it to a file (option MLmodelOut=mlmod_E_FCI_20_overfit.unf) using above data (training set) and the following command requesting fitting with the kernel ridge regression, and Gaussian kernel function and the hyperparameters σ=10^-11 and λ=0:

mlatom createMLmodel MLmodelOut=mlmod_E_FCI_20_overfit.unf XfileIn=R_20.dat Yfile=E_FCI_20.dat kernel=Gaussian sigma=0.00000000001 lambda=0.0 sampling=none > create_E_FCI_20_overfit.out

In the output file create_E_FCI_20_overfit.out you can see that the error for the created ML model is essentially zero for the training set. Option sampling=none ensures that the order of training points remains the same as in the original data set (it does not matter for creating this ML model, but will be useful later). You can use the created ML model (options useMLmodel MLmodelIn) for calculating energies for its own training set and save them to E_ML_20_overfit.dat file:

mlatom useMLmodel MLmodelIn=mlmod_E_FCI_20_overfit.unf XfileIn=R_20.dat YestFile=E_ML_20_overfit.dat debug > use_E_FCI_20_overfit.out

Now you can compare the reference FCI values with the ML predicted values and see that they are the same. Option debug also prints the values of the regression coefficients alpha to the output file use_E_FCI_20_overfit.out. You can compare them with the reference FCI energies and see that they are exactly the same (they are given in the same order as the training points).

Now try to calculate energy with the ML model for any other internuclear distance not present in the training set and see that predictions are zero. It means that the ML model is overfitted and cannot generalize well to new situations, because of the hyperparameter choice. Thus, optimization of hyperparameters is strongly recommended.
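The behavior described in this example can be reproduced with a few lines of plain Python. The sketch below is not MLatom code: it assumes a Gaussian kernel of the form exp(-(x-x')²/(2σ²)), solves the KRR equations (K + λI)α = y with a simple Gaussian elimination, and uses a made-up Morse-like toy curve instead of the FCI data. With σ=10^-11 the kernel matrix is the identity matrix, so the coefficients alpha equal the reference values and predictions away from the training points are zero.

```python
import math


def gaussian_kernel(x1: float, x2: float, sigma: float) -> float:
    # ASSUMPTION: Gaussian kernel in the form exp(-(x1 - x2)^2 / (2 sigma^2)).
    return math.exp(-((x1 - x2) ** 2) / (2.0 * sigma ** 2))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def train(xs, ys, sigma, lam):
    """Regression coefficients alpha from (K + lambda I) alpha = y."""
    k = [[gaussian_kernel(a, b, sigma) + (lam if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    return solve(k, ys)


def predict(x, xs, alphas, sigma):
    return sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alphas, xs))


def toy_energy(r):
    # Made-up Morse-like curve used only for illustration, NOT the FCI data.
    return 0.17 * (1.0 - math.exp(-1.9 * (r - 0.74))) ** 2 - 0.17


r_train = [0.5, 0.7, 0.9, 1.1, 1.3]
e_train = [toy_energy(r) for r in r_train]

alphas_overfit = train(r_train, e_train, sigma=1e-11, lam=0.0)
alphas_smooth = train(r_train, e_train, sigma=0.3, lam=1e-6)

print(alphas_overfit == e_train)                     # alphas equal the labels
print(predict(0.8, r_train, alphas_overfit, 1e-11))  # prediction off the grid
print(predict(0.8, r_train, alphas_smooth, 0.3), toy_energy(0.8))
```

The last line contrasts the overfitted model with one trained using a wider kernel and a small nonzero λ; the specific values 0.3 and 10^-6 are arbitrary illustrative choices, not recommendations.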

Optimizing hyperparameters

It is often desirable/necessary to optimize hyperparameters, although many models may have reasonable hyperparameters and/or by default optimize their hyperparameters. There are two main different ways to optimize hyperparameters with MLatom described below:

grid search for KRR models (including KREG & KRR-CM),

optimization with hyperopt.

Grid search is applicable to a small number of hyperparameters (one or two) and is very robust; optimization with hyperopt never guarantees finding good hyperparameters but is more flexible.

Arguments

The optimization objective is to minimize the validation error. For this, the training data set has to be split into the sub-training and validation sets.

minimizeError=[type of validation error to minimize]
- RMSE [default];
- MAE
Nsubtrain=[number of the sub-training points or a fraction of the training points]
- 80% of the training set by default. If a parameter is a decimal number less than 1, then it is considered to be a fraction of the training set.
- points can be sampled in one of the usual ways using sampling argument. By default, randomly.
Nvalidate=[number of the validation points or a fraction of the training points]
- By default, the remaining points of the training set after subtracting the sub-training points. If a parameter is a decimal number less than 1, then it is considered to be a fraction of the training set.
- points can be sampled in one of the usual ways using sampling argument. By default, randomly.
CVopt
- optional.
  
  Related option NcvOptFolds
- N-fold cross-validation error. By default, 5-fold cross-validation is used.
NcvOptFolds=[number of CV folds]
- 5 [default].
  
  Can be used only with CVopt.
- If this number is equal to the number of data points, leave-one-out cross-validation is performed.
  
  Only random or no sampling can be used for cross-validation.
LOOopt
- optional.
- Leave-one-out cross-validation.
  
  Only random or no sampling can be used.
iCVoptPrefOut=[prefix of files with indices for CVopt]
- optional.
  
  No default prefixes.
- file names will include the required prefix.
Nuse=[N first entries of the data set file to be used]
- 100% [default];
  
  optional.
- for tests, it is sometimes useful to use only a part of a data set.
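The way Nsubtrain and Nvalidate are interpreted (a decimal below 1 as a fraction of the training set, otherwise an absolute number, with the validation set defaulting to the remaining points) can be sketched as follows. The function names and the 0-based indices are illustrative assumptions, not MLatom's internal conventions.

```python
import random


def resolve_count(value, n_total: int) -> int:
    """A decimal number below 1 is a fraction of n_total, otherwise a count."""
    if 0 < value < 1:
        return int(round(value * n_total))
    return int(value)


def split_training_set(n_train: int, nsubtrain=0.8, nvalidate=None, seed=0):
    """Randomly split 0-based training indices into sub-training/validation."""
    n_sub = resolve_count(nsubtrain, n_train)
    # Default: all points left after taking the sub-training points.
    n_val = n_train - n_sub if nvalidate is None else resolve_count(nvalidate, n_train)
    if n_sub + n_val > n_train:
        raise ValueError("sub-training and validation sets exceed the training set")
    indices = list(range(n_train))
    random.Random(seed).shuffle(indices)  # random sampling
    return indices[:n_sub], indices[n_sub:n_sub + n_val]


subtrain, validate = split_training_set(100)                    # defaults: 80/20
small_sub, small_val = split_training_set(100, nsubtrain=50, nvalidate=0.1)
print(len(subtrain), len(validate), len(small_sub), len(small_val))
```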

Grid search for kernel ridge regression models

Grid search is performed on a logarithmic grid. After the best parameters are found in the first iteration, MLatom can perform more iterations of a logarithmic grid search. This option is used only for λ and/or σ hyperparameters of KRR.

lgOptDepth=[depth of log search]
- 3 [default]
- often, depth of one or two suffices and is much faster. 3 is a safer option.
NlgLambda=[number of points on the logarithmic grid (base 2) for optimization of lambda]
- 6 [default]
- used with kernel ridge regression and lambda=opt argument.
lgLambdaL=[lowest value of log2 λ for a logarithmic grid optimization of lambda]
- -35.0 [default]
- used with kernel ridge regression and lambda=opt argument.
lgLambdaH=[highest value of log2 λ for a logarithmic grid optimization of lambda]
- -6.0 [default]
- used with kernel ridge regression and lambda=opt argument.
NlgSigma=[number of points on the logarithmic grid (base 2) for optimization of sigma]
- 6 [default]
- used with kernel ridge regression and sigma=opt argument.
lgSigmaL=[lowest value of log2 σ for a logarithmic grid optimization of sigma]
- 6.0 [default for kernel=Gaussian and kernel=Matern];
  
  5.0 [default for kernel=Laplacian and kernel=exponential]
- used with kernel ridge regression and sigma=opt argument.
lgSigmaH=[highest value of log2 σ for a logarithmic grid optimization of sigma]
- 9.0 [default for kernel=Gaussian and kernel=Matern];
  
  12.0 [default for kernel=Laplacian and kernel=exponential]
on-the-fly
- not used by default.
  
  Optional.
- requests on-the-fly calculation of kernel matrix elements for validation. By default this is off and those elements are stored, which makes calculations faster.
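A nested logarithmic grid search of this kind can be sketched in Python as below. The sketch makes several assumptions that are not stated in this manual: the grid includes both end points, each further iteration searches between the neighbors of the best point found so far, and the validation error is replaced by a hypothetical toy function with a minimum at log2 σ = 7.3.

```python
import math


def log2_grid(low: float, high: float, n_points: int) -> list:
    """Evenly spaced exponents between low and high (both included)."""
    step = (high - low) / (n_points - 1)
    return [low + i * step for i in range(n_points)]


def nested_log_search(objective, low, high, n_points=6, depth=3):
    """Minimize objective(2**exponent) on successively finer log2 grids."""
    best_exp, best_err = None, float("inf")
    for _ in range(depth):
        for exponent in log2_grid(low, high, n_points):
            err = objective(2.0 ** exponent)
            if err < best_err:
                best_exp, best_err = exponent, err
        # ASSUMPTION: refine between the grid neighbors of the best point.
        step = (high - low) / (n_points - 1)
        low, high = best_exp - step, best_exp + step
    return 2.0 ** best_exp, best_err


def toy_validation_error(sigma: float) -> float:
    # Hypothetical stand-in for a validation RMSE.
    return (math.log2(sigma) - 7.3) ** 2


# Defaults quoted above for the Gaussian kernel: 6 points, log2(sigma) in [6, 9].
first_grid = [2.0 ** e for e in log2_grid(6.0, 9.0, 6)]
best_sigma, best_error = nested_log_search(toy_validation_error, 6.0, 9.0,
                                           n_points=6, depth=3)
print(first_grid)
print(best_sigma, best_error)
```

Each additional level of depth costs another n_points model evaluations per hyperparameter, which is why a depth of one or two is much faster.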

Optimization with hyperopt

Optimization with hyperopt requires installation of the hyperopt package. This package provides a general solution to the optimization problem using Bayesian methods with Tree-structured Parzen Estimator (TPE).

[argument name of hyperparameter to optimize, e.g., sigma]=hyperopt.uniform(lb,ub) or [argument name of hyperparameter to optimize, e.g., sigma]=hyperopt.loguniform(lb,ub) or [argument name of hyperparameter to optimize, e.g., sigma]=hyperopt.quniform(lb,ub,q)
- No default values.
- lower bound lb, and upper bound ub.
  
  hyperopt.uniform(lb,ub): linear search space.
  
  hyperopt.loguniform(lb,ub): logarithmic search space, base 2.
  
  hyperopt.quniform(lb,ub,q): discrete linear space, rounded by q.
hyperopt.max_evals=[maximum number of attempts]
- 8 [default]
- often, several hundreds or even thousands of evaluations are required.
hyperopt.losstype=[type of loss for several reference properties]
- geomean [default];
  
  weighted (used with hyperopt.w_grad)
- geomean uses the geometric mean of losses for different properties (typically, energies and forces).
  
  weighted currently only needs to define weight for forces (negative XYZ gradients)
hyperopt.w_grad=[weight for XYZ gradients]
- 0.1 [default].
  
  Should be used with hyperopt.losstype=weighted
hyperopt.points_to_evaluate=[xx,xx,...],[xx,xx,...],...
- optional, no default parameters.
- specifies initial guesses before auto-searching. Each point, enclosed in a pair of square brackets, should list all values to be optimized, in order. These evaluations are NOT counted in max_evals.
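The three search spaces and the two loss types can be illustrated with plain random sampling in Python. This sketch deliberately does not use the hyperopt package or TPE. The function names are hypothetical, the base-2 reading of loguniform (bounds given as log2 of the hyperparameter) follows the description above, and the weighted loss is assumed to be the loss for values plus w_grad times the loss for gradients.

```python
import math
import random


def sample_uniform(rng: random.Random, lb: float, ub: float) -> float:
    """Linear search space between lb and ub."""
    return rng.uniform(lb, ub)


def sample_loguniform(rng: random.Random, lb: float, ub: float) -> float:
    """Logarithmic search space, base 2: lb and ub bound log2 of the value."""
    return 2.0 ** rng.uniform(lb, ub)


def sample_quniform(rng: random.Random, lb: float, ub: float, q: float) -> float:
    """Discrete linear space: a uniform draw rounded to a multiple of q."""
    return round(rng.uniform(lb, ub) / q) * q


def combined_loss(loss_y: float, loss_grad: float,
                  losstype: str = "geomean", w_grad: float = 0.1) -> float:
    if losstype == "geomean":
        return math.sqrt(loss_y * loss_grad)  # geometric mean of the two losses
    # ASSUMPTION: weighted loss = loss for values + w_grad * loss for gradients
    return loss_y + w_grad * loss_grad


rng = random.Random(0)
print(sample_uniform(rng, 0.0, 1.0))
print(sample_loguniform(rng, 4, 20))   # a sigma between 2**4 and 2**20
print(sample_quniform(rng, 1, 5, 1))   # e.g., an integer hyperparameter
print(combined_loss(4.0, 9.0), combined_loss(1.0, 2.0, "weighted", 0.1))
```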

Examples

Two typical examples:

mlatom createMLmodel XYZfile=CH3Cl.xyz Yfile=en.dat MLmodelOut=CH3Cl.unf sigma=opt kernel=Matern

mlatom estAccMLmodel XYZfile=CH3Cl.xyz Yfile=en.dat sigma=hyperopt.loguniform(4,20)

Evaluating ML models

MLatom can evaluate the ML model, i.e., estimate its generalization error. For this, the total data set should be split into the training and test sets. ML model can be either trained as usual (with a generic model or a popular model) or provided with MLmodelIn argument. If the model is trained, the user can choose the required arguments to train the model. Below, only arguments unique to this feature are given.

Also, MLatom can calculate learning curves (test error vs. the training set size).

Arguments

estAccMLmodel
- required.
- requests estimating generalization error on the test set. This argument cannot be used together with createMLmodel or useMLmodel. ML model can be either trained as usual with a generic model or a popular model. Or ML model can be provided with MLmodelIn argument.
Ntrain=[number of the training points or a fraction of the total set]
- 80% of the total set by default. If a parameter is a decimal number less than 1, then it is considered to be a fraction of the total set.
- points can be sampled in one of the usual ways using sampling argument. By default, randomly.
Ntest=[number of the test points or a fraction of the total set]
- By default, the remaining points of the total set after subtracting the training points. If a parameter is a decimal number less than 1, then it is considered to be a fraction of the total set.
- points can be sampled in one of the usual ways using sampling argument. By default, randomly.
CVtest
- optional.
  
  Related option NcvTestFolds.
- N-fold cross-validation error. By default, 5-fold cross-validation is used.
NcvTestFolds=[number of CV folds]
- 5 [default].
  
  Can be used only with CVtest.
- if this number is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation.
LOOtest
- optional.
- leave-one-out cross-validation. Only random or no sampling can be used.
learningCurve
- should be used with lcNtrains argument
- produces learning curves. This option produces the following output files in directory learningCurve:
  results.json JSON database file with all results
  
  lcy.csv CSV database file with results for values
  
  lcygradxyz.csv CSV database file with results for XYZ gradients
  
  lctimetrain.csv CSV database file with training timings
  
  lctimepredict.csv CSV database file with prediction timings
lcNtrains=[N,N,N,...,N training set sizes]
- required argument if learningCurve is requested
lcNrepeats=[N,N,N,...,N numbers of repeats for each Ntrain] or lcNrepeats=[N,N,N,...,N number of repeats for all Ntrains]
- 3 [default]
- necessary to get error bars.
Nuse=[N first entries of the data set file to be used]
- 100% [default];
  
  optional.
- for tests, it is sometimes useful to use only a part of a data set.
sampling=user-defined
- optional.
  
  Requires arguments iTrainIn, iTestIn, and/or iCVtestPrefIn.
iTrainIn=[file with indices of training points]
- optional.
  
  No default file names.
iTestIn=[file with indices of test points]
- optional.
  
  No default file names.
iCVtestPrefIn=[prefix of files with indices for CVtest]
- optional.
  
  No default file names.
MLmodelIn=[file with ML model]
- optional.
  
  No default file names.
- requests to read a file with ML model.
iTrainOut=[file with indices of training points]
- optional.
  
  No default file names.
- generates indices for the training set.
iTestOut=[file with indices of test points]
- optional.
  
  No default file names.
- generates indices for the test set.
iSubtrainOut=[file with indices of sub-training points]
- optional.
  
  No default file names.
- generates indices for the sub-training set.
iValidateOut=[file with indices of validation points]
- optional.
  
  No default file names.
- generates indices for the validation set.
iCVtestPrefOut=[prefix of files with indices for CVtest]
- optional.
  
  No default file names.
- file names will include the required prefix.
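How N-fold cross-validation partitions a data set, and why it becomes leave-one-out when the number of folds equals the number of points, can be sketched in Python. The function name and the 0-based indices are illustrative assumptions and do not reflect the format of the index files written by MLatom.

```python
import random


def cv_folds(n_points: int, n_folds: int = 5, seed: int = 0) -> list:
    """Return (train_indices, test_indices) pairs for N-fold cross-validation."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)  # random sampling of the folds
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


five_fold = cv_folds(23, n_folds=5)
leave_one_out = cv_folds(6, n_folds=6)  # folds == points: leave-one-out

for train, test in five_fold:
    print(len(train), len(test))
```

Every point appears in exactly one test fold, so the cross-validation error uses each data point for testing once.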

Examples

Simple example:

mlatom estAccMLmodel XYZfile=CH3Cl.xyz Yfile=en.dat sigma=opt lambda=opt

Example of learning curve:

mlatom learningCurve Yfile=y.dat XYZfile=xyz.dat kernel=Gaussian sigma=opt lambda=opt lcNtrains=100,250,500,1000,2500,5000,10000 lcNrepeats=64,32,16,8,4,2,1

With this command training set sizes listed in lcNtrains will be tested repeatedly for 64, 32, 16, 8, 4, 2, 1 time(s), respectively. All data generated (including csv reports) will be stored in the folder learningCurve under the current directory.

Δ-learning

Δ-machine learning can be used with one of the usual options. Below, arguments unique to delta-learning are described. See also a tutorial.

`deltaLearn`	required. Should be used with one of: `createMLmodel` `useMLmodel MLmodelIn` `estAccMLmodel`
`Yb=[file with data obtained with baseline method]`	required for both training and predictions.
`Yt=[file with data obtained with target method]`	required only for training.
`YestT=[file with ML estimations of target method]`	required for predictions.
`YestFile=[file with ML corrections to baseline method]`	required for predictions.
`YgradXYZb=[file with baseline XYZ gradients]`	optional.
`YgradXYZt=[file with target XYZ gradients]`	optional.
`YgradXYZestT=[file with ML estimations of target XYZ gradients]`	optional.
`YgradXYZestFile=[file with ML corrections to baseline XYZ gradients]`	optional.

Example

mlatom estAccMLmodel deltaLearn XfileIn=x.dat Yb=UHF.dat Yt=FCI.dat YestT=D-ML.dat YestFile=corr_ML.dat
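The arithmetic behind these files can be summarized in a short Python sketch: the ML model is trained on the differences between target and baseline values, and the estimate of the target method is the baseline value plus the ML correction. The function names and numbers are hypothetical, and the "perfect" corrections used below simply stand in for the output of a trained model.

```python
def delta_targets(y_baseline: list, y_target: list) -> list:
    """Training labels for the ML model: target minus baseline (Yt - Yb)."""
    return [t - b for b, t in zip(y_baseline, y_target)]


def delta_predictions(y_baseline: list, corrections: list) -> list:
    """Estimates of the target method: baseline plus ML correction."""
    return [b + c for b, c in zip(y_baseline, corrections)]


# Hypothetical baseline and target energies (Hartree) for three geometries.
y_b = [-1.00, -1.10, -1.05]   # would come from the Yb file
y_t = [-1.05, -1.18, -1.11]   # would come from the Yt file

labels = delta_targets(y_b, y_t)   # what the ML model learns

# Stand-in for model output: pretend the corrections are predicted exactly.
ml_corrections = labels                             # would go to YestFile
y_est_t = delta_predictions(y_b, ml_corrections)    # would go to YestT

print(labels)
print(y_est_t)
```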

Self-correction

Self-correction as described here. Can be used with one of the usual options. Below, arguments unique to self-correction are described. See also a tutorial.

selfCorrect

required. Should be used with one of: createMLmodel useMLmodel MLmodelIn estAccMLmodel

Example

mlatom estAccMLmodel selfCorrect XYZfile=xyz.dat Yfile=y.dat