Kernel ridge regression in Julia

Here we show how to train and use kernel ridge regression (KRR) models in MLatom. This functionality is based on our implementation in Julia.

Prerequisites

  • MLatom 3.18.0 or later

  • Julia 1.10.2 (Julia v1.4+ should work)

  • julia 0.6.2 (this is the PyJulia bridge, a Python package, not Julia itself)

Obviously, you need Julia installed in your local environment. You can download it from the official Julia website and install it following the instructions there.

After you have successfully downloaded and configured Julia, you can type julia on the command line to open an interactive session:

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.2 (2024-03-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia>

Note

As you can see, we are using Julia 1.10.2, but any Julia version from v1.4 onwards should work with PyJulia.

It is perfectly fine if you have no experience with Julia, because we call the Julia functions from Python. To do so, we first need to install the Python-Julia bridge (PyJulia):

pip install julia==0.6.2

Then, in Python, run the following code:

import julia
julia.install()

There is one more caveat: if your Python interpreter is statically linked to libpython (as is the case for many common Python installations), PyJulia does not fully support it. In that case, we recommend using python-jl to run your script:

python-jl your_script.py

If everything works, you should be able to run the following code:

from julia import Base
print(Base.sind(90)) # sine of 90 degrees; should print 1.0

Data and models

Here, you can download the dataset and models that will be used in the following tutorial.

ML database

We introduce a new data class called ml_database, in which you can store x and y values. ml.models.krr uses ml_database.x and ml_database.y to train the model.

Here is an example of how to work with this class.

import mlatom as ml

mlDB = ml.data.ml_database()
mlDB.read_from_x_file(filename="R_451.dat") # This will be saved in mlDB.x
mlDB.read_from_y_file(filename="E_FCI_451.dat",property_name="fci")
mlDB.read_from_y_file(filename="E_UHF_451.dat",property_name="uhf")
mlDB.y = mlDB.get_property("fci")
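
To sanity-check what has been loaded, you can inspect the arrays. This is a minimal check that assumes mlDB.x and mlDB.y behave like NumPy arrays (which is how they are treated in the multi-output example below).

print(mlDB.x.shape) # one row per data point
print(mlDB.y.shape) # one y value per data point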

Train a KRR model

After loading the ML database, we can train a model.

model = ml.models.krr(model_file="E_FCI.npz",kernel_function='Gaussian',ml_program='mlatom.jl') # The model will be saved in E_FCI.npz. Gaussian function is used as the kernel function.
model.hyperparameters['sigma'].value = 4.0
model.hyperparameters['lambda'].value = 0.00001
model.train(ml_database=mlDB,prior=-1.0) # Set the prior value. The prior will be subtracted from each y value before training (and added back after prediction).

# Predict values with the model
model.predict(ml_database=mlDB,property_to_predict='fci_yest')
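
Recall that KRR obtains the regression coefficients by solving \((\mathbf{K} + \lambda \mathbf{I})\boldsymbol{\alpha} = \mathbf{y}\), where \(\mathbf{K}\) is the kernel matrix, so lambda controls the strength of the regularization. The prior simply shifts the y values before training and shifts the predictions back afterwards. The following standalone NumPy sketch (not MLatom code; the numbers are made up) illustrates the idea:

import numpy as np

prior = -1.0
y = np.array([-1.10, -1.05, -0.98])  # hypothetical energies
y_shifted = y - prior                # what the model is actually fitted to: approximately [-0.10, -0.05, 0.02]
# ... the KRR model is trained on y_shifted ...
# ... and at prediction time the prior is added back:
y_back = y_shifted + prior           # recovers the original scale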

Several kernel functions are currently supported (a standalone sketch of the Gaussian kernel follows the list):

  • Gaussian kernel function: \(k(\textbf{x}_1,\textbf{x}_2) = \exp(-\lvert \textbf{x}_1-\textbf{x}_2 \rvert ^{2} / (2\sigma ^{2}))\)

  • Periodic Gaussian kernel function: \(k(\textbf{x}_1,\textbf{x}_2) = \exp(-2\sin ^2 (\pi \lvert \textbf{x}_1-\textbf{x}_2 \rvert / p) / \sigma ^2)\), where \(p\) is the period

  • Decaying periodic Gaussian kernel function: \(k(\textbf{x}_1,\textbf{x}_2) = \exp(-\lvert \textbf{x}_1-\textbf{x}_2 \rvert ^{2} / (2\sigma ^{2}) - 2\sin ^2 (\pi \lvert \textbf{x}_1-\textbf{x}_2 \rvert / p) / \sigma_{p} ^2)\)

  • Matern kernel function: \(k(\textbf{x}_1,\textbf{x}_2) = \exp(-\lvert \textbf{x}_1-\textbf{x}_2 \rvert / \sigma) \sum \limits _{k=0} ^{n} \frac {(n+k)!} {(2n)!} \binom{n}{k} (2\lvert \textbf{x}_1-\textbf{x}_2 \rvert / \sigma)^{n-k}\)
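
To make the Gaussian kernel concrete, here is a minimal standalone NumPy sketch of the formula above (for illustration only; it is not MLatom's actual implementation):

import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    # k(x1, x2) = exp(-|x1 - x2|^2 / (2 sigma^2))
    d2 = np.sum((np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))**2)
    return np.exp(-d2 / (2.0 * sigma**2))

print(gaussian_kernel([0.74], [0.80], sigma=4.0))  # nearby points: value close to 1
print(gaussian_kernel([0.74], [5.00], sigma=4.0))  # distant points: noticeably smaller value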

Hyperparameters of these kernel functions:

  kernel function                how to call in MLatom           hyperparameters
  Gaussian                       'Gaussian'                      sigma
  Periodic Gaussian              'periodic_Gaussian'             sigma, period
  Decaying periodic Gaussian     'decaying_periodic_Gaussian'    sigma, period, sigmap
  Matern                         'Matern'                        sigma, nn

Default values of hyperparameters:

  hyperparameter   default value   description
  sigma            1.0             kernel width sigma in the formulas above
  period           2E-35           period p of the periodic kernels
  sigmap           100.0           sigma_p in the decaying periodic Gaussian kernel
  nn               1.0             n in the Matern kernel
  lambda           2               regularization parameter of KRR
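
For example, to use the decaying periodic Gaussian kernel instead of the Gaussian one, select it in the constructor and set its hyperparameters through the same hyperparameters dictionary shown above. The sketch below reuses the ML database mlDB from the training example; the numeric values are placeholders, not recommendations:

model = ml.models.krr(model_file="E_FCI_periodic.npz",kernel_function='decaying_periodic_Gaussian',ml_program='mlatom.jl')
model.hyperparameters['sigma'].value = 4.0     # placeholder value
model.hyperparameters['period'].value = 1.0    # placeholder value
model.hyperparameters['sigmap'].value = 100.0  # placeholder value
model.hyperparameters['lambda'].value = 0.00001
model.train(ml_database=mlDB,prior=-1.0)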

Load a KRR model

If the model file exists, it will be loaded automatically during the initialization of the instance.

import mlatom as ml

mlDB = ml.data.ml_database()
mlDB.read_from_x_file(filename="R_451.dat")
model = ml.models.krr(model_file='E_FCI.npz',ml_program='mlatom.jl')
model.predict(ml_database=mlDB,property_to_predict='fci_yest')

Optimize hyperparameters

Below, we show how to optimize hyperparameters using grid search.

import mlatom as ml
mlDB = ml.data.ml_database()
mlDB.read_from_x_file(filename="R_451.dat") # This will be saved in mlDB.x
mlDB.read_from_y_file(filename="E_FCI_451.dat",property_name="fci")
mlDB.y = mlDB.get_property("fci")

# Optimize hyperparameters
model = ml.models.krr(model_file="E_FCI_grid_search.npz",kernel_function='Gaussian',ml_program='mlatom.jl')
model.hyperparameters['sigma'].minval = 2**-5              # Set the minimum value of sigma
model.hyperparameters['sigma'].maxval = 2**9               # Set the maximum value of sigma
model.hyperparameters['sigma'].optimization_space = 'log'
model.hyperparameters['lambda'].minval = 2**-35            # Set the minimum value of lambda
model.hyperparameters['lambda'].maxval = 1.0               # Set the maximum value of lambda
model.hyperparameters['lambda'].optimization_space = 'log'

sub,val = mlDB.split(number_of_splits=2, fraction_of_points_in_splits=[0.9, 0.1], sampling='none')
model.optimize_hyperparameters(subtraining_ml_database=sub,validation_ml_database=val,
                            optimization_algorithm='grid',
                            hyperparameters=['lambda','sigma'],
                            training_kwargs={'prior':-1.0},
                            prediction_kwargs={'property_to_predict':'yest'},debug=False)

model.predict(ml_database=mlDB,property_to_predict='fci_yest')
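
To check the quality of the optimized model, you can compare the predictions with the reference values, e.g., via the RMSE. The short sketch below assumes that the predicted property 'fci_yest' can be retrieved with get_property just like any other property:

import numpy as np
y_ref = mlDB.get_property('fci')
y_pred = mlDB.get_property('fci_yest')
print('RMSE:', np.sqrt(np.mean((y_pred - y_ref)**2)))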

Multiple outputs

The model can be trained on multiple y values at the same time. Below we show how to train a model on both FCI and UHF energies.

import mlatom as ml
import numpy as np

mlDB = ml.data.ml_database()
mlDB.read_from_x_file(filename="R_451.dat")
mlDB.read_from_y_file(filename="E_FCI_451.dat",property_name="fci")
mlDB.read_from_y_file(filename="E_UHF_451.dat",property_name="uhf")
Ntrain = len(mlDB.get_property('fci'))
# Stack the two properties into an (Ntrain, 2) array of training targets
mlDB.y = np.concatenate((mlDB.get_property('fci').reshape((Ntrain,1)),mlDB.get_property('uhf').reshape((Ntrain,1))),axis=1)

model = ml.models.krr(model_file="E_FCI_multi_output.npz",kernel_function='Gaussian',ml_program='mlatom.jl')

# Optimize hyperparameters
model.hyperparameters['sigma'].minval = 2**-5              # Set the minimum value of sigma
model.hyperparameters['sigma'].maxval = 2**9               # Set the maximum value of sigma
model.hyperparameters['sigma'].optimization_space = 'log'
model.hyperparameters['lambda'].minval = 2**-35            # Set the minimum value of lambda
model.hyperparameters['lambda'].maxval = 1.0               # Set the maximum value of lambda
model.hyperparameters['lambda'].optimization_space = 'log'

sub,val = mlDB.split(number_of_splits=2, fraction_of_points_in_splits=[0.9, 0.1], sampling='none')
model.optimize_hyperparameters(subtraining_ml_database=sub,validation_ml_database=val,
                            optimization_algorithm='grid',
                            hyperparameters=['lambda','sigma'],
                            training_kwargs={'prior':-1.0},
                            prediction_kwargs={'property_to_predict':'yest'},debug=False)

model.predict(ml_database=mlDB,property_to_predict='fci_uhf_yest')
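
Since the model was trained on a two-column y, the values stored under 'fci_uhf_yest' should contain one column per property. Below is a hedged sketch of how you might separate them, assuming the predictions are returned as an (Ntrain, 2) NumPy array that can be retrieved with get_property:

yest = mlDB.get_property('fci_uhf_yest')
fci_pred = yest[:, 0]  # first column: FCI estimates
uhf_pred = yest[:, 1]  # second column: UHF estimates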