.. _lecture2: ===================================================================== Universal machine learning models ===================================================================== - :ref:`Slides ` - :ref:`What are the universal ML models? ` - :ref:`Frequencies and thermochemistry ` - :ref:`Uncertainty quantification ` .. _ml-uni-slides: Slides ------ :download:`Slides <_static/mlatom/4-XACSW2024_20240704_Dral_wm.pdf>`: .. raw:: html .. _lecture2_universal: What are the universal ML models? --------------------------------- .. raw:: html Before we saw that ML models can be very fast and accurate at the same time. However, we needed to use (and train) the model specifically for the H\ :sub:`2` example. This is not ideal and it would be nice to have ML models that can be applied to different systems of interest without needing specific training. Such models do exist and they are referred in literature as *universal* ML models. Example of the universal ML-based model is the `UAIQM `__ and its earliest incarnation `AIQM1 `__: .. image:: _static/lecture2/AIQM1.png :width: 400 :align: center :alt: AIQM1 method [Figure credit: reproduced from Pavlo O. Dral. `AI in computational chemistry through the lens of a decade-long journey `__. *Chem. Commun.* **2024**, *60*, 3240--3258. DOI: 10.1039/D4CC00010B under the CC-BY license.] AIQM1 is approaching the accuracy of CCSD(T)/CBS method and it is an ML-improved semi-empirical QM method, i.e., it is slower than a pure ML model. An example of a pure ML model is ANI-1ccx which also targets the accuracy of the CCSD(T)/CBS method (and, in fact, was `published `__ earlier than AIQM1). If both AIQM1 and ANI-1ccx methods are approaching the same CCSD(T)/CBS, then why bother using slower AIQM1? Let's run an experiment: .. include:: materials/lecture2/task2.1_h2_h2o_aiqm1_ani1ccx_opt/task2.1_h2_h2o_aiqm1_ani1ccx_opt.inc After you run these simulations, you should have obtained the following result (compared to other approaches): .. image:: _static/lecture2/h2_h2o_aiqm1_ani1ccx.png :width: 800 :align: center :alt: Hydrogen and water molecules optimized with AIQM1 and ANI-1ccx One thing that you have noticed is that calculations with both methods are indeed very fast, much faster than with CCSD(T) methods you tried in your homework or task in Lecture 1. The optimized geometry of water molecule is also very close to experimental structure, much closer than with a slower B3LYP/6-31G* method. ANI-1ccx is slightly worse but still provides very good agreement with experiment for H\ :sub:`2`\ O. .. note:: AIQM1 requires the semi-empirical program. On `MLatom\@XACS cloud computing service `__ we cannot use `MNDO `__ which supports analytical gradients and excited-state simulations but use the `development version `__ of `SCINE Sparrow `__ (see also a `paper `__ on Sparrow). This version does not support analytical gradients and no excited-state calculations with AIQM1. Hence, Sparrow can be used with numerical gradients for geometry optimizations and MD but might lead to numerical instabilities in, e.g., frequency calculations. There is no such problem with UAIQM. .. include:: outdating.rst However, you might be surprised how bad ANI-1ccx is for H\ :sub:`2`. AIQM1 is also far from CCSD(T) quality. What happened? AIQM1 and ANI-1ccx are both approaching CCSD(T)/CBS but AIQM1 ≠ ANI-1ccx ≠ CCSD(T)/CBS! They are ML-based models and trained on data which might or might not contain enough training points for H\ :sub:`2`. How the model was trained is also important. In general, you observe that AIQM1 is more robust than ANI-1ccx for the price of being slower. We will explore why AIQM1 is better in the future lectures. The important lesson to learn is that any ML model might have limitations (like QM models too). Hence, always pay attention and use your expertise to check the results, i.e., by following recommendations: - compare with other high-level QM calculations (which can be unavailable or prohibitively expensive), - compare to experiment (may be not reliable either!), - check uncertainty (easiest and often works, see the second part or this lecture). Those were just few molecules we checked. Overall, the quality of AIQM1 is excellent as evidenced by extensive benchmarks on many different molecules which show that AIQM1 optimized geometries are close to experiment and definetely better than many popular DFT methods which might even be qualitatively wrong (see example below for C\ :sub:`18` carbon): .. image:: _static/lecture2/geoms_aiqm1.png :width: 800 :align: center :alt: The quality of AIQM1-optimized geometries [Figure credit: reproduced from Pavlo O. Dral. `AI in computational chemistry through the lens of a decade-long journey `__. *Chem. Commun.* **2024**, *60*, 3240--3258. DOI: 10.1039/D4CC00010B under the CC-BY license.] Even more, AIQM1 was used to revise experimental geometries obtained via X-ray diffraction (see figure above). You can safely use AIQM1 for quite large systems, where it is faster and more accurate than DFT, i.e., no reason to spend more time with DFT to get worse result! .. image:: _static/lecture2/nanolassos_aiqm1.png :width: 600 :align: center :alt: AIQM1 is superior to DFT for geometry optimizations [Figure credit: reproduced from Pavlo O. Dral, *et al*. `MLatom 3: A Platform for Machine Learning-enhanced Computational Chemistry Simulations and Workflows `_, *J. Chem. Theory Comput.* **2024**, *20*, 1193--1213. DOI: 10.1021/acs.jctc.3c01203 under CC-BY 4.0 license.] You should be, however, aware of the current limitations of the universal models in general. See `up-to-date list `__. The models available in MLatom on July 3, 2024, are overviewed below (partly based on exerpt from the `MLatom 3 paper `__): * Our best offering: the `UAIQM models `__. * Universal potentials ANI-1ccx, ANI-1x, ANI-2x, ANI-1x-D4, ANI-2x-D4, AIMnet-2, ANI-1xnr. * ANI-1ccx is the most accurate and approaches gold-standard CCSD(T) accuracy. Other methods approach the density functional theory (DFT) level. * ANI-1ccx and ANI-1x are limited to CHNO elements, while ANI-2x can be used for CHNOFClS elements. * We allow the user to use D4 dispersion-corrected universal ANI potentials that might be useful for noncovalent complexes. * Hybrid QM/ML methods AIQM1, AIQM1\@DFT, AIQM1\@DFT*, and DM21 are more transferable and accurate than pretrained ML models but slower (the speed of semiempirical QM methods, which are still much faster than DFT). * AIQM1 is approaching gold-standard CCSD(T) accuracy, while AIQM1\@DFT and AIQM1\@DFT* target the DFT accuracy for neutral, closed-shell molecules in their ground state. All these methods are limited to the CHNO elements. AIQM1 and AIQM1\@DFT include explicit D4 dispersion corrections for the ωB97X functional, while AIQM1\@DFT* does not. * These methods can also be used to calculate charged species, radicals, excited states, and other QM properties such as dipole moments, charges, oscillator strengths, and nonadiabatic couplings. .. _lecture2_thermo: Frequencies and thermochemistry ------------------------------- After geometry optimization, the next step is to do frequency calculations (typically within rigid-rotor harmonic oscillator approximation) which achieves several goals: 1) obtain zero-point vibrational energy (molecules are always in motion and this should be taken into account!), 2) other thermochemical properties (enthalpies, Gibbs free energies, heats of formation, etc.), 3) perform normal mode analysis to check the nature of vibrations and their frequencies. These calculations also allow you to check whether your optimized geometries correspond to a true minima on potential energy surface (PES) - the minima have all real frequencies while you will see imaginary (shown as negative) frequencies for non-minima. In principle, AIQM1 and ANI-1ccx are able to deliver very `accurate thermochemical properties `__, particularly heats of formation with close-to chemical accuracy (errors not much larger than 1 kcal/mol). See it for yourself: .. include:: materials/lecture2/task2.2_freq/task2.2_freq.inc .. note:: Frequency calculations require the evaluation of the second-order energy derivatives. Many methods do not have analytical derivatives, while numerical derivatives are often noisy and may lead to high inaccuracies and lots of negative frequencies which are artificial. This is the case, e.g., if AIQM1 is used on the cloud where the default QM program is Sparrow which has no analytical derivatives for AIQM1. Check in the MLatom program output what program is used - Sparrow (not reliable yet) or MNDO (can be trusted). Calculations with the semiempirical MNDO program are possible but the reader has to install it locally as `described in MLatom's documentation `__. .. include:: outdating.rst Creation of the initial XYZ geometry of vinylacetylene takes time. We do provide you with the initial guess, but what is a faster way to get XYZ geometry for a new molecule? I have asked ChatGPT in 2023 to generate XYZ coordinates or an input file for Gaussian program and it generated some gibberish: .. image:: _static/lecture2/chatgpt.png :width: 600 :align: center :alt: Vineylacetylene input file for the Gaussian program suggested by ChatGPT. That might improve with time though. One more trick is to look up the coordinates in the internet as many molecules were already either calculated or have known experimental structures. Indeed, for vinylacetylene we can find `XYZ structure `__ optimized at a quite decent CCSD(T)/6-31G* level of theory and made available on a very useful `NIST website `__ (check it out -- it is a treasure trove for any chemist!): .. image:: _static/lecture2/nist.png :width: 500 :align: center :alt: Vineylacetylene geometry from the NIST website. Going back to our exercise, you can find that for water molecule, it has no negative frequencies, i.e., it is a true minimum. Its heat of formation is predicted to be -58.46 kcal/mol at AIQM1 which is quite close to the experimental value of -57.8 kcal/mol (in gas phase! [J.B.Pedley, R.D.Naylor, and S.P.Kirby, "Thermochemical Data of Organic Compounds", 2nd ed., Chapman and Hall, London, 1986]; the error is only 0.7 kcal/mol). And there is no warning about uncertainty, see below. The output file looks like this: .. code-block:: ============================================================================== Vibration analysis for molecule 1 ============================================================================== Multiplicity: 1 Rotational symmetry number: 1 This is a nonlinear molecule Mode Frequencies Reduced masses Force Constants (cm^-1) (AMU) (mDyne/A) 1 1734.1034 1.0833 1.9194 2 3809.1282 1.0819 9.2489 3 3829.5496 1.0449 9.0285 ============================================================================== Thermochemistry for molecule 1 ============================================================================== Standard deviation of NNs : 0.00028358 Hartree 0.17795 kcal/mol Energy : -76.38362520 Hartree ZPE-exclusive internal energy at 0 K: -76.38363 Hartree Zero-point vibrational energy : 0.02135 Hartree Internal energy at 0 K: -76.36227 Hartree Enthalpy at 298 K: -76.35849 Hartree Gibbs free energy at 298 K: -76.38056 Hartree Atomization enthalpy at 0 K: 0.35083 Hartree 220.14828 kcal/mol ZPE-exclusive atomization energy at 0 K: 0.37218 Hartree 233.54734 kcal/mol Heat of formation at 298 K: -0.09316 Hartree -58.45749 kcal/mol ============================================================================== Just for your reference, it is -58.33 kcal/mol at AIQM1 (the error of only 0.5 kcal/mol), see the output file: .. code-block:: ============================================================================== Vibration analysis for molecule 1 ============================================================================== Multiplicity: 1 This is a nonlinear molecule Mode Frequencies Reduced masses Force Constants (cm^-1) (AMU) (mDyne/A) 1 1654.6962 1.0829 1.7470 2 3850.4309 1.0449 9.1277 3 3931.4391 1.0817 9.8508 ============================================================================== Thermochemistry for molecule 1 ============================================================================== Standard deviation of NN contribution : 0.00009088 Hartree 0.05703 kcal/mol NN contribution : -0.00110508 Hartree Sum of atomic self energies : -63.04934947 Hartree ODM2* contribution : -13.33326549 Hartree D4 contribution : -0.00002289 Hartree Total energy : -76.38374293 Hartree ZPE-exclusive internal energy at 0 K: -76.38374 Hartree Zero-point vibrational energy : 0.02150 Hartree Internal energy at 0 K: -76.36225 Hartree Enthalpy at 298 K: -76.35847 Hartree Gibbs free energy at 298 K: -76.37988 Hartree Atomization enthalpy at 0 K: 0.35062 Hartree 220.01876 kcal/mol ZPE-exclusive atomization energy at 0 K: 0.37212 Hartree 233.50896 kcal/mol Heat of formation at 298 K: -0.09295 Hartree -58.32770 kcal/mol ============================================================================== Wall-clock time: 4.54 s (0.08 min, 0.00 hours) MLatom terminated on 25.04.2024 at 11:15:15 ============================================================================== .. _lecture2_uq: Uncertainty quantification -------------------------- For hydrogen molecule, we see that it is also a true minimum, its heat of formation is -23.09 kcal/mol. And as a chemist, you must know what the experimental value is and realize that the error is too large. However, you will notice that the output file contains a warning message ``* Warning * Heat of formation have high uncertainty!``: .. code-block:: ============================================================================== Vibration analysis for molecule 1 ============================================================================== Multiplicity: 1 Rotational symmetry number: 1 This is a linear molecule Mode Frequencies Reduced masses Force Constants (cm^-1) (AMU) (mDyne/A) 1 2866.6581 1.0080 4.8805 ============================================================================== Thermochemistry for molecule 1 ============================================================================== Standard deviation of NNs : 0.00705607 Hartree 4.42775 kcal/mol Energy : -1.20973475 Hartree ZPE-exclusive internal energy at 0 K: -1.20973 Hartree Zero-point vibrational energy : 0.00653 Hartree Internal energy at 0 K: -1.20320 Hartree Enthalpy at 298 K: -1.19990 Hartree Gibbs free energy at 298 K: -1.21434 Hartree Atomization enthalpy at 0 K: 0.20144 Hartree 126.40693 kcal/mol ZPE-exclusive atomization energy at 0 K: 0.20797 Hartree 130.50502 kcal/mol Heat of formation at 298 K: -0.03680 Hartree -23.09143 kcal/mol * Warning * Heat of formation have high uncertainty! ============================================================================== For your reference, the heat of formation at AIQM1 is not very accurate either: it is -2.86 kcal/mol and uncertain too: .. code-block:: ============================================================================== Vibration analysis for molecule 1 ============================================================================== Multiplicity: 1 This is a linear molecule Mode Frequencies Reduced masses Force Constants (cm^-1) (AMU) (mDyne/A) 1 3500.7003 1.0078 7.2769 ============================================================================== Thermochemistry for molecule 1 ============================================================================== Standard deviation of NN contribution : 0.00891919 Hartree 5.59688 kcal/mol NN contribution : -0.00210675 Hartree Sum of atomic self energies : -0.08587317 Hartree ODM2* contribution : -1.09094346 Hartree D4 contribution : -0.00000889 Hartree Total energy : -1.17893227 Hartree ZPE-exclusive internal energy at 0 K: -1.17893 Hartree Zero-point vibrational energy : 0.00797 Hartree Internal energy at 0 K: -1.17096 Hartree Enthalpy at 298 K: -1.16765 Hartree Gibbs free energy at 298 K: -1.18240 Hartree Atomization enthalpy at 0 K: 0.16920 Hartree 106.17224 kcal/mol ZPE-exclusive atomization energy at 0 K: 0.17717 Hartree 111.17663 kcal/mol Heat of formation at 298 K: -0.00455 Hartree -2.85652 kcal/mol * Warning * Heat of formation have high uncertainty! ============================================================================== Wall-clock time: 4.49 s (0.07 min, 0.00 hours) MLatom terminated on 25.04.2024 at 11:26:42 ============================================================================== Indeed, the calculations have high uncertainty as judged by the standard deviation of neural network (NN) contribution, it is quite large: 4.43 and 5.60 kcal/mol at ANI-1ccx and AIQM1, respectively, compared to only 0.18 and 0.06 kcal/mol for a water molecule. AIQM1 and ANI-1ccx both contain many (eight) NNs whose predictions are averaged. These NNs learned the same property and, hence, if they start to deviate from each other it means they are likely to stray away too far from the region they were trained on. The deviation between NNs is often used in ML to judge how bad the predictions are (e.g., used in query by committee and :ref:`active learning `, see next lectures). The question is how large deviation is still acceptable and hence `we calibrated them `__ on a standard data set and MLatom flags when deviation exceeds the calibrated thresholds. This is consistent with our previous observation that ANI-1ccx and AIQM1 are not particularly good for the H\ :sub:`2` molecule, i.e., the optimized bond lengths were quite off from the experimental value. In case of vinylacetylene, we observe that it is also true minimum, its heat of formation is 68.53 kcal/mol, and it is not uncertain: .. code-block:: ============================================================================== Vibration analysis for molecule 1 ============================================================================== Multiplicity: 1 Rotational symmetry number: 1 This is a nonlinear molecule Mode Frequencies Reduced masses Force Constants (cm^-1) (AMU) (mDyne/A) 1 234.3120 3.5401 0.1145 2 305.3683 2.6251 0.1442 3 542.0444 1.4458 0.2503 4 546.4789 1.2326 0.2169 5 559.3526 2.3867 0.4400 6 588.7133 1.5581 0.3182 7 900.8130 1.3032 0.6230 8 966.6489 2.6463 1.4569 9 991.1206 1.2769 0.7390 10 1163.7123 1.4019 1.1185 11 1332.3004 1.4117 1.4764 12 1499.1878 1.2277 1.6258 13 1750.8294 4.0546 7.3229 14 2206.4408 5.7716 16.5550 15 3205.7596 1.0906 6.6035 16 3262.3771 1.0654 6.6806 17 3342.5988 1.1208 7.3781 18 3388.8071 1.1735 7.9404 ============================================================================== Thermochemistry for molecule 1 ============================================================================== Standard deviation of NNs : 0.00057841 Hartree 0.36296 kcal/mol Energy : -154.53147078 Hartree ZPE-exclusive internal energy at 0 K: -154.53147 Hartree Zero-point vibrational energy : 0.06102 Hartree Internal energy at 0 K: -154.47045 Hartree Enthalpy at 298 K: -154.46484 Hartree Gibbs free energy at 298 K: -154.49658 Hartree Atomization enthalpy at 0 K: 1.29896 Hartree 815.10992 kcal/mol ZPE-exclusive atomization energy at 0 K: 1.35999 Hartree 853.40367 kcal/mol Heat of formation at 298 K: 0.10920 Hartree 68.52699 kcal/mol ============================================================================== At AIQM1, the heat of formation is 69.04 kcal/mol and is also confident: .. code-block:: ============================================================================== Vibration analysis for molecule 1 ============================================================================== Multiplicity: 1 This is a nonlinear molecule Mode Frequencies Reduced masses Force Constants (cm^-1) (AMU) (mDyne/A) 1 203.2675 3.0894 0.0752 2 293.6230 2.4319 0.1235 3 537.1646 3.5850 0.6095 4 671.2905 1.5015 0.3987 5 686.5656 1.2252 0.3403 6 720.9366 1.4999 0.4593 7 879.5200 2.4730 1.1271 8 931.0665 1.3842 0.7070 9 996.7269 1.0846 0.6348 10 1099.9357 1.6593 1.1828 11 1313.2985 1.3102 1.3315 12 1431.8700 1.2556 1.5167 13 1645.4021 3.8620 6.1603 14 2162.6121 6.3527 17.5052 15 3135.1808 1.0791 6.2496 16 3174.0848 1.0667 6.3318 17 3260.7611 1.1187 7.0083 18 3456.1498 1.1511 8.1013 ============================================================================== Thermochemistry for molecule 1 ============================================================================== Standard deviation of NN contribution : 0.00008761 Hartree 0.05498 kcal/mol NN contribution : 0.00715111 Hartree Sum of atomic self energies : -133.90358074 Hartree ODM2* contribution : -20.63402319 Hartree D4 contribution : -0.00056531 Hartree Total energy : -154.53101813 Hartree ZPE-exclusive internal energy at 0 K: -154.53102 Hartree Zero-point vibrational energy : 0.06060 Hartree Internal energy at 0 K: -154.47042 Hartree Enthalpy at 298 K: -154.46492 Hartree Gibbs free energy at 298 K: -154.49656 Hartree Atomization enthalpy at 0 K: 1.29803 Hartree 814.52617 kcal/mol ZPE-exclusive atomization energy at 0 K: 1.35863 Hartree 852.55199 kcal/mol Heat of formation at 298 K: 0.11002 Hartree 69.04119 kcal/mol ============================================================================== Wall-clock time: 6.03 s (0.10 min, 0.00 hours) MLatom terminated on 25.04.2024 at 11:31:47 ============================================================================== Nevertheless, the experimental value used in one of the databases for fitting semi-empirical methods, was 73 kcal/mol. This value deviates from both ANI-1ccx and AIQM1's predictions by 4 kcal/mol which is too much for confident calculations. Hence, we double checked (and you can too) by calculating the value with another QM method (G4 as implemented in the Gaussian program) and it also gave heat of formation of 69.1 kcal/mol, i.e., basically the same as AIQM1. Hence, I suspected that something might be wrong with that experimental value and, indeed, I found in the NIST database a value of 70.4 kcal/mol which is much closer to the ANI-1ccx and AIQM1 predictions. This is another example (`among many similar `__) where accurate AIQM1 calculations helped to revise the experimental values. Now let's see what we can get if we use DFT. .. include:: materials/lecture2/task2.3_b3lyp_hof/task2.3_b3lyp_hof.inc As you can see in the output file, the heat of formation at B3LYP is -42.30 kcal/mol which is very wrong compared to the experimental -57.8 kcal/mol! .. code-block:: ============================================================================== Vibration analysis for molecule 1 ============================================================================== Multiplicity: 1 Rotational symmetry number: 1 This is a nonlinear molecule Mode Frequencies Reduced masses Force Constants (cm^-1) (AMU) (mDyne/A) 1 1710.8681 1.0830 1.8677 2 3720.9628 1.0452 8.5264 3 3843.9202 1.0813 9.4132 ============================================================================== Thermochemistry for molecule 1 ============================================================================== ZPE-exclusive internal energy at 0 K: -76.36989 Hartree Zero-point vibrational energy : 0.02113 Hartree Internal energy at 0 K: -76.34876 Hartree Enthalpy at 298 K: -76.34498 Hartree Gibbs free energy at 298 K: -76.36708 Hartree Atomization enthalpy at 0 K: 0.32508 Hartree 203.98967 kcal/mol ZPE-exclusive atomization energy at 0 K: 0.34621 Hartree 217.25003 kcal/mol Heat of formation at 298 K: -0.06741 Hartree -42.29876 kcal/mol ============================================================================== Wall-clock time: 3.43 s (0.06 min, 0.00 hours) It is indeed well known that DFT methods struggle with getting heats of formation properly. We will discuss this issue in the :ref:`next lecture `. Another important observation is that DFT (or any other QM method) *never provides you any uncertainty of its prediction*! This is a big advantage of ML methods that they do provide uncertainty quantification. That's why when many people complain that they are not sure whether ML is good, they forget that they can never be sure whether their DFT method is good. In ML we at least get an uncertainty of predictions and can improve the ML model later with more data. .. _lecture2_homework: Qualitative problems with PES shape ----------------------------------- One of the problems with ML and universal ML is that they often do not describe correctly the dissociation curves of molecules. See it for yourself: .. include:: materials/lecture2/homework2.1/homework2.1.inc