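Universal machine learning models
New: we highly recommend checking out UAIQM, our ultimate solution for universal ML models.
MLatom supports a wide range of universal machine learning (ML) models, including machine learning potentials and hybrid ML-enhanced quantum mechanical (QM) methods. They require no training and work out of the box. The table below lists the available methods and what each supports: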
Method | Model type | Keyword in MLatom | Supported elements | Gradients | Hessian | Charged species | Radicals | Excited states |
---|---|---|---|---|---|---|---|---|
 | ML & ML/QM | | all elements | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | not available yet |
 | ML/QM | | H, C, N, O | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
 | ML/QM | | H, C, N, O | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
 | ML/QM | | H, C, N, O | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
 | ML/QM | | H, C, N, O | \(\times\) | \(\times\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
 | ML/QM | | H, C, N, O | \(\times\) | \(\times\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
 | ML/QM | | H, C, N, O | \(\times\) | \(\times\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
 | ML/QM | | main-group elements | \(\times\) | \(\times\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
 | ML/QM | | main-group elements | \(\times\) | \(\times\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
 | ML/QM | | main-group elements | \(\times\) | \(\times\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
 | ML/QM | | main-group elements | \(\times\) | \(\times\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
 | MLP | | H, C, N, O | \(\checkmark\) | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) |
 | MLP | | H, C, N, O | \(\checkmark\) | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) |
 | MLP | | H, C, N, O | \(\checkmark\) | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) |
 | MLP | | H, C, N, O, F, S, Cl | \(\checkmark\) | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) |
 | MLP | | H, C, N, O, F, S, Cl | \(\checkmark\) | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) |
 | MLP | | H, C, N, O | \(\checkmark\) | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) |
 | MLP | | H, B, C, N, O, F, Si, P, S, Cl, As, Se, Br, I | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\times\) | \(\times\) |
 | MLP | | H, B, C, N, O, F, Si, P, S, Cl, As, Se, Br, I | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\times\) | \(\times\) |
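In this tutorial, we first show how to use these universal methods to perform various tasks in MLatom, and then describe each method in detail.
Using universal ML models
A brief overview of how to use universal ML models in MLatom follows:
Single-point calculations
For single-point calculations, the MLatom input file needs only 3-5 lines and can use any of the methods listed above: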
AIMNet2@b973c
xyzfile=sp.xyz
yestfile=energy.dat
where sp.xyz is a file with the XYZ geometries of the molecules. The geometries can also be defined directly in the input file:
AIMNet2@b973c
xyzfile='
2
H 0.000000 0.000000 0.363008
H 0.000000 0.000000 -0.363008
5
C 0.000000 0.000000 0.000000
H 0.627580 0.627580 0.627580
H -0.627580 -0.627580 0.627580
H 0.627580 -0.627580 -0.627580
H -0.627580 0.627580 -0.627580
'
yestfile=energy.dat
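The inline xyzfile value above holds two molecules back to back, each introduced by its atom count. As a minimal sketch of how such concatenated XYZ text is laid out (this is not MLatom's own parser, and the function name split_xyz_blocks is hypothetical), the blocks can be separated in plain Python. The sketch assumes that an optional comment line after the atom count never has exactly four whitespace-separated fields.

```python
XYZ_TEXT = """
2
H 0.000000 0.000000 0.363008
H 0.000000 0.000000 -0.363008
5
C 0.000000 0.000000 0.000000
H 0.627580 0.627580 0.627580
H -0.627580 -0.627580 0.627580
H 0.627580 -0.627580 -0.627580
H -0.627580 0.627580 -0.627580
"""


def split_xyz_blocks(xyz_text):
    """Split concatenated XYZ text into molecules (hypothetical helper).

    Each molecule is returned as a list of (element, x, y, z) tuples.
    """
    lines = xyz_text.strip().splitlines()
    molecules = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():  # skip blank lines between blocks
            i += 1
            continue
        n_atoms = int(lines[i].strip())  # a block starts with the atom count
        i += 1
        # Assumption: a comment line (possibly blank) never has four fields.
        if i < len(lines) and len(lines[i].split()) != 4:
            i += 1
        atoms = []
        for line in lines[i:i + n_atoms]:
            element, x, y, z = line.split()
            atoms.append((element, float(x), float(y), float(z)))
        if len(atoms) != n_atoms:
            raise ValueError("block is shorter than its declared atom count")
        molecules.append(atoms)
        i += n_atoms
    return molecules


molecules = split_xyz_blocks(XYZ_TEXT)
for index, atoms in enumerate(molecules):
    print(f"molecule {index}: {len(atoms)} atoms, first element {atoms[0][0]}")
```

The molecule indices printed here correspond to the order in which the molecules appear in the text, which is also how the Python examples in this tutorial index the database (molDB[0], molDB[1]).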
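Use the keywords ygradxyzestfile and hessianestfile to save the gradients and Hessians to the specified files.
Because the DM21 functionals are provided through PySCF, users need to define the QM method in MLatom with the method and qmprog keywords in the input file. Below is an example input file using the DM21 functional with the 6-31G* basis set: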
method=DM21/6-31G*
qmprog=pyscf
xyzfile=sp.xyz
yestfile=energy.dat
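The Python API offers more flexibility in choosing methods in MLatom. In our example, users can define the method with mlatom.models.methods and pass the keywords mentioned above, for example: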
method = mlatom.models.methods(method='ANI-1xnr')
# method = mlatom.models.methods(method='DM21/6-31G*', program='pyscf')
Here is an example that calculates energies, gradients, and Hessians with ANI-1xnr.
import mlatom as ml

# read molecules from the .xyz file
molDB = ml.data.molecular_database.from_xyz_file('sp.xyz')
# define method
model = ml.models.methods(method='ANI-1xnr')
# predict energies, gradients, and Hessians for every molecule in the database
model.predict(
    molecular_database=molDB,
    calculate_energy=True,
    calculate_energy_gradients=True,
    calculate_hessian=True)
print(f'Energy in Hartree for molecule 0: {molDB[0].energy}')
print(f'Gradients in Hartree/Angstrom for molecule 1: {molDB[1].get_energy_gradients()}')
print(f'Hessian in Hartree/Angstrom^2 for molecule 1: {molDB[1].hessian}')
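For more details on single-point calculations with MLatom, see our tutorial.
Geometry optimization and frequency calculations
Geometry optimization is a common task when studying chemical systems. Frequency calculations on the optimized molecule, which also yield thermochemical properties, are an equally important part of the analysis.
To run a geometry optimization with an input file in MLatom, users only need to put the geomopt option in the first line, specify the method to use, and provide an initial guess geometry: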
geomopt # 1. requests geometry optimization
ANI-1ccx # 2. universal MLP
xyzfile=' # 3. initial geometry guess
9
C -1.691449880 -0.315985130 0.000000000
H -1.334777040 0.188413060 0.873651500
H -1.334777040 0.188413060 -0.873651500
H -2.761449880 -0.315971940 0.000000000
C -1.178134160 -1.767917280 0.000000000
H -1.534806620 -2.272315330 0.873651740
H -1.534807450 -2.272316160 -0.873650920
O 0.251865840 -1.767934180 -0.000001150
H 0.572301420 -2.672876720 0.000175020
'
optxyz=opt.xyz # 4. (optional) file with optimized geometry.
optprog=geometric # 5. request geometric optimizer
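Every optimization step is printed to the output file; in version 3.4.0 and later this can be controlled with the printall and printmin keywords. Users can also choose whether to dump the optimization trajectory with the dumpopttrajs keyword.
After the geometry optimization, a frequency calculation can be requested with the freq option in the input file, as follows: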
freq # 1. requests frequency calculation
ANI-1ccx # 2. universal MLP
xyzfile=' # 3. optimized geometry
9
C -1.672571 -0.341122 -0.000001
H -1.307766 0.181713 0.885095
H -1.307762 0.181707 -0.885099
H -2.764560 -0.305014 -0.000003
C -1.188732 -1.771664 0.000009
H -1.559124 -2.298647 0.885998
H -1.559099 -2.298653 -0.885987
O 0.237878 -1.729915 0.000028
H 0.575701 -2.626896 0.000135
'
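In the output file, users can find the vibrational analysis, including the frequencies, the reduced mass and force constant of each normal mode, and the results of the thermochemistry calculation. The output file of this example is available for download.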
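The three quantities reported per normal mode are linked by the harmonic-oscillator relation \(\tilde{\nu} = \frac{1}{2\pi c}\sqrt{k/\mu}\). The sketch below evaluates it for a diatomic molecule in plain Python. It is only an illustration of the formula, not MLatom code; the function name is hypothetical and the force constant of 575 N/m for H2 is an illustrative value, not a result from any of the methods above.

```python
import math

SPEED_OF_LIGHT_CM_PER_S = 2.99792458e10  # speed of light in cm/s
AMU_TO_KG = 1.66053906660e-27            # atomic mass unit in kg


def harmonic_wavenumber(force_constant_n_per_m, reduced_mass_amu):
    """Return the harmonic wavenumber in cm^-1 for one vibrational mode."""
    reduced_mass_kg = reduced_mass_amu * AMU_TO_KG
    # angular frequency omega = sqrt(k / mu), in rad/s
    omega = math.sqrt(force_constant_n_per_m / reduced_mass_kg)
    # convert angular frequency to wavenumber: omega / (2 * pi * c)
    return omega / (2.0 * math.pi * SPEED_OF_LIGHT_CM_PER_S)


mass_h = 1.00783                                       # approximate mass of 1H in amu
reduced_mass_h2 = mass_h * mass_h / (mass_h + mass_h)  # mu = m1*m2/(m1+m2)
wavenumber_h2 = harmonic_wavenumber(575.0, reduced_mass_h2)  # 575 N/m is illustrative
print(f"Harmonic wavenumber of H2: {wavenumber_h2:.0f} cm^-1")
```

For a polyatomic molecule the same relation holds mode by mode, with the reduced mass and force constant of each normal mode taken from the vibrational analysis.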
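Molecular dynamics
One advantage of machine learning potentials is their very high speed. Propagating thousands of trajectories with common DFT methods (other than DM21) takes weeks, whereas machine learning potentials need only a few hours. MLatom provides a simple way to run MD, as well as the quasi-classical MD that is popular in simulations of chemical reactions.
When using an input file in MLatom, the only difference here is the method keyword. For example, to run dynamics of the hydrogen molecule in the NVT ensemble (with the Nosé–Hoover thermostat) using AIMNet2 targeting RKS B97-3c, the input file looks as follows: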
MD # 1. requests molecular dynamics
AIMNet2@b973c # 2. use AIMNet2@B97-3c method
initConditions=user-defined # 3. use user-defined initial conditions
initXYZ=h2_init.xyz # 4. file with initial geometry; Unit: Angstrom
initVXYZ=h2_init.vxyz # 5. file with initial velocity; Unit: Angstrom/fs
dt=0.3 # 6. time step; Unit: fs
trun=30 # 7. total time; Unit: fs
thermostat=Nose-Hoover # 8. use Nose-Hoover thermostat
ensemble=NVT # 9. NVT ensemble
temperature=300 # 10. Run MD at 300 Kelvin
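The initial XYZ coordinates and velocities can be downloaded here: h2_init.xyz, h2_init.vxyz
Below we also provide a code snippet that runs the same task with the Python API. As before, only the code that defines the method changes.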
import mlatom as ml

# Use user-defined initial conditions
mol = ml.data.molecule.from_xyz_file('h2_init.xyz')
init_cond_db = ml.generate_initial_conditions(
    molecule=mol,
    generation_method='user-defined',
    file_with_initial_xyz_coordinates='h2_init.xyz',
    file_with_initial_xyz_velocities='h2_init.vxyz')
init_mol = init_cond_db[0]
# Initializing model
model = ml.models.methods(method='AIMNet2@b973c')
# Initializing thermostat
nose_hoover = ml.md.Nose_Hoover_thermostat(temperature=300, molecule=init_mol)
# Run molecular dynamics (time step and total time in fs)
dyn = ml.md(
    model=model,
    molecule_with_initial_conditions=init_mol,
    thermostat=nose_hoover,
    ensemble='NVT',
    time_step=0.3,
    maximum_propagation_time=30.0)
# Dump trajectory
traj = dyn.molecular_trajectory
traj.dump(filename='traj', format='plain_text')
traj.dump(filename='traj.h5', format='h5md')
print(f"Number of steps in the trajectory: {len(traj.steps)}")
Fine-tuning universal models
Tutorial on transfer learning from universal models
In this tutorial, we fine-tune the universal machine learning potential ANI-1ccx-gelu specifically for the CH3NO2 molecule, which is reflected in improved harmonic frequencies compared with experiment.
import mlatom as ml
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Overview of the dataset
We take the dataset from https://doi.org/10.1021/acs.jctc.1c00249. It contains 9001 training points with energies and forces at the MP2/aug-cc-pVTZ level.
# prepare the dataset
training_data = ml.data.molecular_database.load('CH3NO2_MP2_avtz.json',format='json')
print('Number of training points: ', len(training_data))
Number of training points: 9001
# plot the energy distribution
energies = training_data.get_properties('energy')
plt.hist(energies)
plt.xlabel('Energy (Hartree)')
plt.ylabel('Number of entries')
Load universal model
ani1ccx_gelu = ml.models.methods(method='ANI-1ccx-gelu')
# ani = ml.models.methods(method='ANI-1ccx')
# ani = ml.models.methods(method='ANI-2x')
Fine tune
Several parameters matter for transfer learning. MLatom has default settings, but you may want to customize the procedure yourself.
Arguments to pass to train()
The available arguments are described in the API manual for the torchani interface. The key arguments are:
reset_energy_shifter [list, bool]: Controls how the self atomic energies are obtained. By default, those extracted from the training data are used. If set to False, those from the pretrained models are used. If a list is given, its values are used as the self atomic energies for scaling.
file_to_save_model [str]: The file name for saving the retrained models. Default: "{universal model name}_retrained.pt.cv{model index}".
verbose [bool or int]: Controls the information printed during training. 1 prints the training procedure and 2 prints the metrics from training and validation.
hyperparameters [dict]: Controls the training procedure; its parameters are listed below.
Parameters in hyperparameters
batch_size: default 8
max_epochs: default 100 for transfer learning
fixed_layers: by default, layers 1 and 3 are fixed.
ani1ccx_gelu_tl = ani1ccx_gelu.train(
    molecular_database=training_data,
    property_to_learn='energy',
    xyz_derivative_property_to_learn='energy_gradients',
    # verbose=2,
    hyperparameters={
        'max_epochs': 1000,
        'batch_size': 512,
        # 'fixed_layers': [[0,4],[0,4],[0,4],[0,4]]
    }
)
Start retraining on model 0... Start retraining on model 1... Start retraining on model 2... Start retraining on model 3... Start retraining on model 4... Start retraining on model 5... Start retraining on model 6... Start retraining on model 7...
After transfer learning, the models are saved in the current directory, named either by combining the universal model name with 'retrained' or with the user-defined file name. You can load them through the torchani interface to run simulations:
# for loading one model
ani_tl_cv0 = ml.models.ani(model_file='ani1ccxgelu_retrained.pt.cv0')
# for loading multiple models as an ensemble
children = [
    ml.models.model_tree_node(
        name=f'nn{ii}',
        model=ml.models.ani(model_file=f'ani1ccxgelu_retrained.pt.cv{ii}'),
        operator='predict')
    for ii in range(8)]
ani_tl_ensemble = ml.models.model_tree_node(
    name='nn',
    children=children,
    operator='average')
Compare harmonic frequencies
def calculate_harmonic_frequency(
        calculator=None,
        initmol=None,
        opt_program='geometric',  # do not use gaussian, which currently cannot recognize the retrained method
        freq_program='pyscf',
):
    # optimize the geometry first, then run the frequency calculation on the result
    geomopt = ml.simulations.optimize_geometry(
        model=calculator,
        initial_molecule=initmol,
        program=opt_program)
    optmol = geomopt.optimized_molecule
    ml.simulations.freq(
        model=calculator,
        molecule=optmol,
        program=freq_program)
    return optmol
# load initial molecule
initmol = ml.data.molecule.from_xyz_file('CH3NO2_init.xyz')
# load universal model
ani1ccx_gelu = ml.models.methods(method='ANI-1ccx-gelu')
# calculate harmonic frequency
freqmol_anigelu = calculate_harmonic_frequency(ani1ccx_gelu, initmol)
freqmol_anigelu_tl = calculate_harmonic_frequency(ani1ccx_gelu_tl, initmol)
# load performance table and check MAE
reference_table = pd.read_csv('CH3NO2_harmonic.csv')
reference_table['ANI-1ccx-gelu'] = freqmol_anigelu.frequencies
reference_table['ANI-1ccx-gelu-TL'] = freqmol_anigelu_tl.frequencies
print('MAE (cm-1) of harmonic frequencies compared to MP2:')
mae = abs(reference_table['ANI-1ccx-gelu'].astype(np.float32)-reference_table['MP2/avtz'].astype(np.float32)).mean()
print(f'ANI-1ccx-gelu: {mae}')
mae = abs(reference_table['ANI-1ccx-gelu-TL'].astype(np.float32)-reference_table['MP2/avtz'].astype(np.float32)).mean()
print(f'ANI-1ccx-gelu-tl: {mae}')
MAE (cm-1) of harmonic frequencies compared to MP2:
ANI-1ccx-gelu: 23.052539825439453
ANI-1ccx-gelu-tl: 2.7490692138671875
reference_table
 | Model | MP2/avtz | exp | ANI-1ccx-gelu | ANI-1ccx-gelu-TL |
---|---|---|---|---|---|
0 | 1 | 28.91 | - | 39.162458 | 27.883071 |
1 | 2 | 478.65 | 479 | 513.907740 | 481.061675 |
2 | 3 | 610.43 | 599 | 621.854552 | 612.820447 |
3 | 4 | 669.67 | 647 | 672.364872 | 674.836687 |
4 | 5 | 940.48 | 921 | 965.614276 | 941.793370 |
5 | 6 | 1127.28 | 1097 | 1118.814348 | 1125.883438 |
6 | 7 | 1148.99 | 1153 | 1119.440409 | 1148.615487 |
7 | 8 | 1412.12 | 1384 | 1417.415336 | 1410.780877 |
8 | 9 | 1430.54 | 1413 | 1462.593407 | 1431.172850 |
9 | 10 | 1491.90 | 1449 | 1476.140398 | 1490.940371 |
10 | 11 | 1502.67 | 1488 | 1477.055380 | 1496.736090 |
11 | 12 | 1745.72 | 1582 | 1649.087029 | 1757.541987 |
12 | 13 | 3115.24 | 2965 | 3096.572449 | 3114.823545 |
13 | 14 | 3221.29 | 3048 | 3215.946024 | 3215.934658 |
14 | 15 | 3247.61 | 3048 | 3223.968484 | 3246.913792 |
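The MAE values printed earlier can be checked directly from the table. The sketch below recomputes them in plain Python from the tabulated frequencies, without pandas or MLatom; the helper names are hypothetical, and the numbers are copied from the table above. It also reports which mode contributes the largest error for each model.

```python
# Harmonic frequencies in cm^-1, copied from the table above (modes 1-15)
mp2 = [28.91, 478.65, 610.43, 669.67, 940.48, 1127.28, 1148.99, 1412.12,
       1430.54, 1491.90, 1502.67, 1745.72, 3115.24, 3221.29, 3247.61]
ani_gelu = [39.162458, 513.907740, 621.854552, 672.364872, 965.614276,
            1118.814348, 1119.440409, 1417.415336, 1462.593407, 1476.140398,
            1477.055380, 1649.087029, 3096.572449, 3215.946024, 3223.968484]
ani_gelu_tl = [27.883071, 481.061675, 612.820447, 674.836687, 941.793370,
               1125.883438, 1148.615487, 1410.780877, 1431.172850, 1490.940371,
               1496.736090, 1757.541987, 3114.823545, 3215.934658, 3246.913792]


def mean_absolute_error(predicted, reference):
    """Mean of |predicted - reference| over all modes."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def worst_mode(predicted, reference):
    """Return the 1-based mode number with the largest absolute error."""
    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    return errors.index(max(errors)) + 1


mae_base = mean_absolute_error(ani_gelu, mp2)
mae_tl = mean_absolute_error(ani_gelu_tl, mp2)
print(f"ANI-1ccx-gelu MAE: {mae_base:.2f} cm^-1, worst mode: {worst_mode(ani_gelu, mp2)}")
print(f"ANI-1ccx-gelu-TL MAE: {mae_tl:.2f} cm^-1, worst mode: {worst_mode(ani_gelu_tl, mp2)}")
```

The tiny difference from the values printed by the notebook comes from its use of float32 arithmetic.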
Note
Fine-tuning for the universal models is supported for ANI-type models ANI-1x, ANI-1ccx, ANI-1ccx-gelu, and ANI-2x.
When using this feature, please cite:
Seyedeh Fatemeh Alavi, Yuxinxin Chen, Yi-Fan Hou, Fuchun Ge, Peikun Zheng, Pavlo O. Dral. Towards Accurate and Efficient Anharmonic Vibrational Frequencies with the Universal Interatomic Potential ANI-1ccx-gelu and Its Fine-Tuning. 2024. Preprint on ChemRxiv: https://doi.org/10.26434/chemrxiv-2024-c8s16 (2024-10-09).
AIQM1
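AIQM1 (artificial intelligence-quantum mechanical method 1) is a universal method that approaches the accuracy of the gold-standard coupled-cluster QM method for closed-shell species in the ground state, while being about as fast as low-level semiempirical QM methods. AIQM1 is also applicable to charged species, radicals, and excited states. See the AIQM1 paper for details. Please cite this paper together with the other required citations: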
Peikun Zheng, Roman Zubatyuk, Wei Wu, Olexandr Isayev, Pavlo O. Dral. Artificial Intelligence-Enhanced Quantum Chemical Method with Broad Applicability. Nat. Commun. 2021, 12, 7022, DOI: 10.1038/s41467-021-27340-2.
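Advantages: for closed-shell molecules, AIQM1 can be used for accurate and fast energy calculations and geometry optimizations.
Limitations: the method can currently only be used for compounds containing the elements H, C, N, and O.
See the detailed tutorial.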
DM21
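DM21 is an ML-enhanced DFT method published by DeepMind in Science (please cite that paper when using the method). It can be installed by following the instructions on its GitHub page. DM21 has four variants (DM21, the default; DM21m; DM21mc; DM21mu); see the GitHub page for details.
Using DM21 and its variants is similar to using common DFT functionals: users need to specify the functional and the basis set. Note that DM21 is not stable and does not always converge. Predictions take longer than with the methods above; by default, MLatom starts the calculation from the relatively cheap B3LYP functional to speed up the SCF. In the current implementation, DM21 can only be used for single-point calculations: the interfaced program provides no gradients or Hessians, and numerical derivatives have not yet been implemented for this method.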
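For readers unfamiliar with the term, a numerical derivative estimates the gradient from energy evaluations alone. The sketch below shows central finite differences on a toy one-dimensional harmonic bond. It is not MLatom functionality and is not applied to DM21 here; the function names, the toy potential, and its parameters are all hypothetical, chosen only so that the result can be compared with the analytical gradient.

```python
def numerical_gradient(energy, coordinates, step=1e-4):
    """Central finite-difference gradient of energy(coordinates)."""
    gradient = []
    for i in range(len(coordinates)):
        plus = list(coordinates)
        minus = list(coordinates)
        plus[i] += step
        minus[i] -= step
        # dE/dx_i is approximated by (E(x + h) - E(x - h)) / (2h)
        gradient.append((energy(plus) - energy(minus)) / (2.0 * step))
    return gradient


def toy_bond_energy(coordinates, force_constant=1.0, equilibrium=0.74):
    """Hypothetical harmonic bond between two atoms on the z axis."""
    z1, z2 = coordinates
    distance = abs(z2 - z1)
    return 0.5 * force_constant * (distance - equilibrium) ** 2


geometry = [0.0, 0.80]  # z coordinates of the two atoms (arbitrary units)
gradient = numerical_gradient(toy_bond_energy, geometry)
# The analytical gradient is k * (r - r0) on atom 2 and its negative on atom 1
analytical = [-(0.80 - 0.74), 0.80 - 0.74]
print("numerical: ", gradient)
print("analytical:", analytical)
```

Each gradient component costs two energy evaluations, so a molecule with N atoms needs 6N single-point calculations, which is expensive for a slow method.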
Example input file:
method=DM21/6-31G*
qmprog=pyscf
xyzfile='
2
H 0.000000 0.000000 0.363008
H 0.000000 0.000000 -0.363008
5
C 0.000000 0.000000 0.000000
H 0.627580 0.627580 0.627580
H -0.627580 -0.627580 0.627580
H 0.627580 -0.627580 -0.627580
H -0.627580 0.627580 -0.627580
'
yestfile=energy.dat
In Python:
import mlatom as ml

# read molecules from the .xyz file
molDB = ml.data.molecular_database.from_xyz_file('sp.xyz')
# define method
method = ml.models.methods(method='DM21/6-31G*', program='pyscf')
# DM21 supports single-point energies only (no gradients or Hessians)
method.predict(
    molecular_database=molDB,
    calculate_energy=True)
print(f'Energy in Hartree for molecule 0: {molDB[0].energy}')
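ANI model zoo
MLatom includes three public models from the ANI model zoo of TorchANI: ANI-1x, ANI-1ccx, and ANI-2x. In addition, MLatom supports the D4-dispersion-corrected methods ANI-1x-D4 and ANI-2x-D4. Points to note when using these methods in MLatom:
ANI-1x and ANI-2x are trained at the DFT level
ANI-1ccx has the highest accuracy, targeting CCSD(T)*/CBS
ANI-1ccx and ANI-1x are limited to the elements C, H, N, O, while ANI-2x can be used for C, H, N, O, F, Cl, S
The D4 dispersion correction in ANI-1x-D4 and ANI-2x-D4 corresponds to the ωB97X functional
These methods are limited to predicting energies and forces of neutral closed-shell compounds in the ground state
MLatom reports the uncertainty of calculations with these methods based on the standard deviation between the neural network (NN) predictions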
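The last point can be illustrated without MLatom. The sketch below takes the energies predicted by the individual networks of an ensemble, reports their mean as the prediction, and their standard deviation as the uncertainty. The eight energies are hypothetical numbers invented for the illustration, the function name is hypothetical, and the use of the population standard deviation is an assumption; MLatom's exact convention is not stated here.

```python
import statistics


def ensemble_prediction(network_energies):
    """Return (mean, standard deviation) over the ensemble members.

    Assumption: population standard deviation (statistics.pstdev);
    the convention used inside MLatom may differ.
    """
    mean_energy = statistics.fmean(network_energies)
    uncertainty = statistics.pstdev(network_energies)
    return mean_energy, uncertainty


# Hypothetical energies (Hartree) from an ensemble of 8 networks
network_energies = [-40.4981, -40.4979, -40.4983, -40.4980,
                    -40.4982, -40.4978, -40.4984, -40.4981]
mean_energy, uncertainty = ensemble_prediction(network_energies)
print(f"Ensemble energy: {mean_energy:.4f} Hartree")
print(f"NN standard deviation: {uncertainty:.2e} Hartree")
```

A large spread between the networks signals that the geometry lies far from the training data, so the prediction should be treated with caution.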
Example input file:
ANI-1ccx
geomopt
xyzfile='
2
H 0.000000 0.000000 0.363008
H 0.000000 0.000000 -0.363008
5
C 0.000000 0.000000 0.000000
H 0.627580 0.627580 0.627580
H -0.627580 -0.627580 0.627580
H 0.627580 -0.627580 -0.627580
H -0.627580 0.627580 -0.627580
'
In Python:
import mlatom as ml

# read molecules from the .xyz file
molDB = ml.data.molecular_database.from_xyz_file('sp.xyz')
# define method
method = ml.models.methods(method='ANI-1ccx')
# predict energies, gradients, and Hessians for every molecule in the database
method.predict(
    molecular_database=molDB,
    calculate_energy=True,
    calculate_energy_gradients=True,
    calculate_hessian=True)
print(f'Energy in Hartree for molecule 0: {molDB[0].energy}')
print(f'Gradients in Hartree/Angstrom for molecule 1: {molDB[1].get_energy_gradients()}')
print(f'Hessian in Hartree/Angstrom^2 for molecule 1: {molDB[1].hessian}')
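Reactive ANI: ANI-1xnr
ANI-1xnr is a universal reactive ANI-type neural network trained on condensed-phase reaction data. It can handle realistic reactive systems containing the elements C, H, N, and O; see the Nature Chemistry publication. The implementation interfaces to the model from the ani-1xnr GitHub repository.
Note
The first time any of these models is instantiated, the models are downloaded automatically from the ani-model-zoo repository to the local folder ./local. Users can also download them in advance.
The input is similar to that of the other ANI models.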
AIMNet2
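AIMNet2 aims to address ANI's poor handling of nonlocal interactions and of open-shell and charged species. Two pretrained models are available, targeting B97-3c and ωB97M-D3 accuracy (users need to use the keyword aimnet2@b973c or aimnet2@wb97m-d3). They are applicable to 14 elements: H, B, C, N, O, F, Si, P, S, Cl, As, Se, Br, and I. Hessians are currently not supported in MLatom.
Note
The first time any model is instantiated, it is downloaded automatically from the AIMNet2 GitHub repository to the local folder ./local. Users can also download it in advance.
Example input file: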
AIMNet2@wb97m-d3
geomopt
xyzfile='
2
H 0.000000 0.000000 0.363008
H 0.000000 0.000000 -0.363008
5
C 0.000000 0.000000 0.000000
H 0.627580 0.627580 0.627580
H -0.627580 -0.627580 0.627580
H 0.627580 -0.627580 -0.627580
H -0.627580 0.627580 -0.627580
'
In Python:
import mlatom as ml

# read molecules from the .xyz file
molDB = ml.data.molecular_database.from_xyz_file('sp.xyz')
# define method
method = ml.models.methods(method='AIMNet2@wb97m-d3')
# Hessians are not supported for AIMNet2, so request energies and gradients only
method.predict(
    molecular_database=molDB,
    calculate_energy=True,
    calculate_energy_gradients=True)
print(f'Energy in Hartree for molecule 0: {molDB[0].energy}')
print(f'Gradients in Hartree/Angstrom for molecule 1: {molDB[1].get_energy_gradients()}')