Dataset preparation
Only HDF5 datasets are supported for training. The format is compatible with the HDF5 files used in MACE (see HDF5 Dataset for details). You can either use datasets from MACE or convert your own dataset to this format.
We provide a dataset conversion toolkit, dipm_cvt, for this purpose. We recommend installing it in
a separate environment from dipm to avoid conflicts. The toolkit includes a command-line
interface, dipm-cvt-cli.
To convert a dataset, run the following command (requires dipm_cvt to be installed):
dipm-cvt-cli -d <input_path> <output_path>
Options
The input path can be any of the following:
A single local dataset file.
A local directory containing multiple dataset files.
A compressed local dataset file (e.g. *.tar.gz, *.zip).
A URL to a dataset file (e.g. https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/rattled-300.tar.gz).
A Hugging Face Hub dataset identifier (e.g. hf://colabfit/rMD17).
An OpenQDC dataset identifier (e.g. openqdc://SpiceV2).
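The input kinds listed above can be told apart from the path string alone. The following is a minimal sketch of one way to do that; `classify_input` and `ARCHIVE_SUFFIXES` are hypothetical names used for illustration, and this is not necessarily how dipm-cvt-cli resolves its input internally.

```python
from pathlib import Path
from urllib.parse import urlparse

# Archive suffixes taken from the examples in the list above (assumption:
# the real tool may recognize more).
ARCHIVE_SUFFIXES = (".tar.gz", ".zip")


def classify_input(path: str) -> str:
    """Return a label for the kind of input path (hypothetical helper)."""
    scheme = urlparse(path).scheme
    if scheme == "hf":
        return "huggingface"
    if scheme == "openqdc":
        return "openqdc"
    if scheme in ("http", "https"):
        return "url"
    # Anything else is treated as a local path.
    if path.lower().endswith(ARCHIVE_SUFFIXES):
        return "archive"
    if Path(path).is_dir():
        return "directory"
    return "file"


if __name__ == "__main__":
    for example in ("hf://colabfit/rMD17", "openqdc://SpiceV2", "data/train.zip"):
        print(example, "->", classify_input(example))
```

Checking the scheme first means a URL that ends in .tar.gz is still treated as a download rather than as a local archive.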
The output path must be a directory unless the input is a single file.
Supported dataset formats:
LMDB (PyG format, extension .lmdb; see OC20 and OC22)
LMDB (compressed JSON format, extension .lmdbase; see OMOL25)
ExtXYZ (Extended XYZ, extension .xyz or .extxyz; see ASE extxyz format)
Available options:
| Options | Description |
|---|---|
| | Path to the reference energies file for OC20 dataset. See OC22 website for details. |
| | Whether to split every file in the dataset into multiple smaller files and save them in the subdirectory with the same name as the original file. Cannot be used with |
| | Whether to merge multiple dataset files into one, cannot be used with |
| | Approximate size (in MB) of each resulting file. When |
| | The original downloaded dataset file will be removed after conversion. If you want to keep it, specify |
| | Energy unit of the original dataset. Will be converted to |
| | Distance unit of the original dataset. Will be converted to |
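To make "approximate size" concrete, here is a small sketch of how records could be grouped into output files of roughly a target size. It is an illustration only: `plan_chunks` is a hypothetical helper, the greedy strategy is an assumption, and the toolkit's actual splitting logic may differ.

```python
from typing import List, Sequence

MB = 1024 * 1024  # bytes per megabyte (binary convention, an assumption)


def plan_chunks(record_sizes: Sequence[int], target_mb: int) -> List[List[int]]:
    """Greedily group record indices so each group stays near target_mb.

    record_sizes holds the size of each record in bytes. A group is closed
    as soon as adding the next record would exceed the target, so a single
    oversized record still gets a group of its own.
    """
    target = target_mb * MB
    chunks: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(record_sizes):
        if current and current_size + size > target:
            chunks.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Five 200 MB records with a 512 MB target.
    print(plan_chunks([200 * MB] * 5, 512))
```

Because records are not cut in the middle, file sizes under this scheme only approximate the target, which matches the "approximate" wording of the option.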
Examples
To convert local datasets, split them into 512 MB files, and convert the energy unit from kcal/mol to eV:
dipm-cvt-cli -d /home/user/datasets /home/user/converted_datasets --split_file --split_size 512 --energy_unit kcal/mol
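For reference, the kcal/mol to eV conversion requested in the command above amounts to a single division. The sketch below uses the standard factor 1 eV ≈ 23.060548 kcal/mol (thermochemical calorie); the function name is hypothetical, and the constant used inside the toolkit may differ in its trailing digits.

```python
# 1 eV expressed in kcal/mol (96.485332 kJ/mol divided by 4.184 kJ/kcal).
KCAL_PER_MOL_PER_EV = 23.060548


def kcal_per_mol_to_ev(energy_kcal_per_mol: float) -> float:
    """Convert an energy from kcal/mol to eV (illustrative helper)."""
    return energy_kcal_per_mol / KCAL_PER_MOL_PER_EV


if __name__ == "__main__":
    # 1 kcal/mol is roughly 0.0434 eV.
    print(f"{kcal_per_mol_to_ev(1.0):.7f} eV")
```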
To use an OpenQDC dataset and cache the downloaded OpenQDC files:
dipm-cvt-cli -d openqdc://SpiceV2 /home/user/converted_datasets --download_dir /home/user/download_dir
To use a URL to a compressed dataset file:
dipm-cvt-cli -d https://dl.fbaipublicfiles.com/opencatalystproject/data/omat/241018/omat/train/rattled-300.tar.gz /home/user/converted_datasets
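When converting several datasets from a script, it can help to assemble the command programmatically. The following sketch builds the argument list using only the flags that appear in the examples above; `build_command` is a hypothetical helper and not part of dipm_cvt.

```python
import shlex
from typing import List, Optional


def build_command(
    input_path: str,
    output_path: str,
    *,
    split_size_mb: Optional[int] = None,
    energy_unit: Optional[str] = None,
    download_dir: Optional[str] = None,
) -> List[str]:
    """Assemble a dipm-cvt-cli argument list (illustrative helper)."""
    cmd = ["dipm-cvt-cli", "-d", input_path, output_path]
    if split_size_mb is not None:
        cmd += ["--split_file", "--split_size", str(split_size_mb)]
    if energy_unit is not None:
        cmd += ["--energy_unit", energy_unit]
    if download_dir is not None:
        cmd += ["--download_dir", download_dir]
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "/home/user/datasets",
        "/home/user/converted_datasets",
        split_size_mb=512,
        energy_unit="kcal/mol",
    )
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to run it.
    print(shlex.join(cmd))
```

Building a list rather than a single string avoids shell quoting problems when paths contain spaces.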