HDF5 Dataset

This class reads data from a MACE compatible HDF5 format file organized in the following way. The data must be grouped into batches containing the same number of data points, with each batch name being config_batch_{idx}. It’s fine to put all data into a single batch group. Every data point must defined as a group named config_{idx}. Below, we provide an example of how to read the data from such a compliant HDF5 file to demonstrate how the data is organized:

def unpack_value(value):
    '''If the value is a string of "None", return None, otherwise return the value.'''
    if isinstance(value, bytes) and value == b"None":
        return None
    return value

def get_value(group, name):
    '''If the attribute exists, unpack and return its value, otherwise return None.'''
    if name in group:
        return unpack_value(group[name][()])
    return None

with h5py.File(hdf5_dataset_path, "r") as h5file:
    # Deciding the batch index and data point index to load
    batch_index = 0
    data_index = 0

    # Get the group containing the data point
    batch_group = h5file[f"config_batch_{batch_index}"]
    data_group = batch_group[f"config_{data_index}"]

    # Attributes that must exist
    positions = data_group["positions"][()]
    atomic_numbers = data_group["atomic_numbers"][()]

    # Attributes that can be None
    pbc = unpack_value(data_group["pbc"][()])
    cell = unpack_value(data_group["cell"][()])

    # Attributes contained in the "properties" group
    forces = get_value(data_group["properties"], "forces")
    energy = get_value(data_group["properties"], "energy")
    stress = get_value(data_group["properties"], "stress")

See below for the API reference to the associated loader class.

class dipm.data.chemical_datasets.hdf5_dataset.Hdf5Dataset(path: PathLike, exclude_ids: ndarray | None = None, task: int | None = None, post_process_fn: Callable[[ChemicalSystem], _T_co] | None = None)

Loads data from a single hdf5 file.

__init__(path: PathLike, exclude_ids: ndarray | None = None, task: int | None = None, post_process_fn: Callable[[ChemicalSystem], _T_co] | None = None)
__getitem__(index)
__len__() int
release()

Release dataset file handles.