HDF5 Dataset

This class reads data from a MACE-compatible HDF5 file organized as follows. The data must be grouped into batches that each contain the same number of data points, with each batch group named config_batch_{idx}. It is fine to put all data into a single batch group. Every data point must be defined as a group named config_{idx}. The example below reads one data point from a compliant HDF5 file to show how the data is organized:

import h5py

def unpack_value(value):
    '''If the value is the byte string b"None", return None, otherwise return the value.'''
    if isinstance(value, bytes) and value == b"None":
        return None
    return value

def get_value(group, name):
    '''If the attribute exists, unpack and return its value, otherwise return None.'''
    if name in group:
        return unpack_value(group[name][()])
    return None

hdf5_dataset_path = "dataset.h5"  # placeholder path, replace with your own file

with h5py.File(hdf5_dataset_path, "r") as h5file:
    # Deciding the batch index and data point index to load
    batch_index = 0
    data_index = 0

    # Get the group containing the data point
    batch_group = h5file[f"config_batch_{batch_index}"]
    data_group = batch_group[f"config_{data_index}"]

    # Attributes that must exist
    positions = data_group["positions"][()]
    atomic_numbers = data_group["atomic_numbers"][()]

    # Attributes that can be None
    pbc = unpack_value(data_group["pbc"][()])
    cell = unpack_value(data_group["cell"][()])

    # Attributes contained in the "properties" group
    forces = get_value(data_group["properties"], "forces")
    energy = get_value(data_group["properties"], "energy")
    stress = get_value(data_group["properties"], "stress")
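The same layout can also be illustrated from the writing side. The following is a minimal sketch, not the library's own writer: the helper name write_config, the temporary file path, and all numeric values are hypothetical, and storing a missing optional value as the string "None" is an assumption inferred from the b"None" check in the reader above.

```python
import os
import tempfile

import h5py
import numpy as np


def write_config(batch_group, idx, positions, atomic_numbers,
                 pbc=None, cell=None, properties=None):
    """Write one data point as the group config_{idx} inside a batch group."""
    group = batch_group.create_group(f"config_{idx}")
    group["positions"] = np.asarray(positions, dtype=float)
    group["atomic_numbers"] = np.asarray(atomic_numbers, dtype=int)
    # Assumption: a missing optional value is stored as the string "None".
    group["pbc"] = "None" if pbc is None else np.asarray(pbc)
    group["cell"] = "None" if cell is None else np.asarray(cell, dtype=float)
    # Only the properties that are available are written at all.
    props = group.create_group("properties")
    for name, value in (properties or {}).items():
        props[name] = value


# Placeholder values for a three-atom system, not real reference data.
positions = [[0.0, 0.0, 0.0], [0.0, 0.8, 0.6], [0.0, -0.8, 0.6]]
atomic_numbers = [8, 1, 1]

path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as h5file:
    batch = h5file.create_group("config_batch_0")
    write_config(batch, 0, positions, atomic_numbers,
                 pbc=[True, True, True], cell=np.eye(3) * 10.0,
                 properties={"energy": 0.0, "forces": np.zeros((3, 3))})
    # The second data point has no pbc, cell, forces, or stress.
    write_config(batch, 1, positions, atomic_numbers,
                 properties={"energy": 0.0})
```

With h5py 3 or later, variable-length strings are read back as bytes, which is why the reading example compares against b"None" and not "None".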

See below for the API reference to the associated loader class.

class dipm.data.chemical_datasets.hdf5_dataset.Hdf5Dataset(file_path: PathLike, shuffle: bool = False)

Loads data from a single HDF5 file.

__init__(file_path: PathLike, shuffle: bool = False)
__getitem__(index)
__len__() → int
release()

Release dataset file handles.
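How __getitem__(index) resolves a flat index to a group is not documented here. Because every batch must hold the same number of data points, one plausible scheme is a simple division, sketched below. The function name locate and the batch_size parameter are hypothetical, and the sketch assumes that config_{idx} restarts at 0 inside each batch. It is not confirmed to be what Hdf5Dataset does.

```python
def locate(index: int, batch_size: int) -> tuple[str, str]:
    """Map a flat data point index to (batch group name, data point group name)."""
    if index < 0:
        raise IndexError("index must be non-negative")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Assumption: data point indices are local to their batch.
    batch_index, data_index = divmod(index, batch_size)
    return f"config_batch_{batch_index}", f"config_{data_index}"


batch_name, data_name = locate(5, batch_size=4)
print(batch_name, data_name)
```

Equal batch sizes are what make this lookup possible without scanning the file: the batch is the quotient and the position inside it is the remainder.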

dipm.data.chemical_datasets.hdf5_dataset.create_datasets(config: ChemicalDatasetsConfig) → tuple[ConcatDataset, ConcatDataset | None, ConcatDataset | None]

Create training, validation and test datasets based on the provided configuration.

Parameters:

config (ChemicalDatasetsConfig) – Configuration object containing the dataset paths etc.

Returns:

A tuple of the training, validation, and test datasets. The validation and test entries are None if they are not provided.