HDF5 Dataset
The Hdf5Dataset class reads data from a MACE-compatible
HDF5 file organized as follows. The data
must be grouped into batches that each contain the same number of data points, with each batch
group named config_batch_{idx}. It is fine to put all data into a single batch group.
Every data point must be defined as a group named config_{idx}. The example below
reads one data point from a compliant HDF5 file to demonstrate
how the data is organized:
import h5py


def unpack_value(value):
    '''If the value is the byte string b"None", return None, otherwise return the value.'''
    if isinstance(value, bytes) and value == b"None":
        return None
    return value


def get_value(group, name):
    '''If the dataset exists in the group, unpack and return its value, otherwise return None.'''
    if name in group:
        return unpack_value(group[name][()])
    return None


# hdf5_dataset_path is the path to the HDF5 file to read
with h5py.File(hdf5_dataset_path, "r") as h5file:
    # Choose the batch index and data point index to load
    batch_index = 0
    data_index = 0

    # Get the group containing the data point
    batch_group = h5file[f"config_batch_{batch_index}"]
    data_group = batch_group[f"config_{data_index}"]

    # Attributes that must exist
    positions = data_group["positions"][()]
    atomic_numbers = data_group["atomic_numbers"][()]

    # Attributes that can be None
    pbc = unpack_value(data_group["pbc"][()])
    cell = unpack_value(data_group["cell"][()])

    # Attributes contained in the "properties" group
    forces = get_value(data_group["properties"], "forces")
    energy = get_value(data_group["properties"], "energy")
    stress = get_value(data_group["properties"], "stress")
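
Going the other way, the following is a minimal sketch of how a file in this layout could be written with h5py. It is inferred from the read example above and is not the library's own writer: the helper names pack_value, write_config, and write_example_file are hypothetical, the geometry and energy are made-up numbers, and storing None as the string "None" is an assumption based on what unpack_value checks for. The loader may also expect fields that the read example does not show.

```python
def pack_value(value):
    """Encode None as the string "None"; pass other values through unchanged."""
    return "None" if value is None else value


def write_config(batch_group, data_index, positions, atomic_numbers,
                 pbc=None, cell=None, **properties):
    """Write one data point as the group config_{data_index} inside batch_group.

    batch_group is an h5py group (or any object offering create_group and
    create_dataset with the same call shape).
    """
    data_group = batch_group.create_group(f"config_{data_index}")

    # Datasets that must always hold real values
    data_group.create_dataset("positions", data=positions)
    data_group.create_dataset("atomic_numbers", data=atomic_numbers)

    # Datasets that are always present but may encode None
    data_group.create_dataset("pbc", data=pack_value(pbc))
    data_group.create_dataset("cell", data=pack_value(cell))

    # Optional properties: a dataset is created only for each value supplied
    props = data_group.create_group("properties")
    for name, value in properties.items():
        if value is not None:
            props.create_dataset(name, data=value)
    return data_group


def write_example_file(path):
    """Write a single-batch file holding one water-like molecule (made-up numbers)."""
    import h5py  # imported here so the helpers above have no dependencies

    with h5py.File(path, "w") as h5file:
        batch_group = h5file.create_group("config_batch_0")
        write_config(
            batch_group,
            0,
            positions=[[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
            atomic_numbers=[8, 1, 1],
            energy=-14.0,
        )
```

Calling write_example_file("example.hdf5") and then running the read example with hdf5_dataset_path = "example.hdf5" should return the same positions and atomic numbers, with pbc, cell, forces, and stress coming back as None (string datasets are read back as bytes in h5py 3).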
See below for the API reference to the associated loader class.
- class dipm.data.chemical_datasets.hdf5_dataset.Hdf5Dataset(path: PathLike, exclude_ids: ndarray | None = None, task: int | None = None, post_process_fn: Callable[[ChemicalSystem], _T_co] | None = None)
Loads data from a single HDF5 file.
- __init__(path: PathLike, exclude_ids: ndarray | None = None, task: int | None = None, post_process_fn: Callable[[ChemicalSystem], _T_co] | None = None)
- __getitem__(index)
- __len__() → int
- release()
Release dataset file handles.
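
If release() is meant to be called once a dataset is no longer needed, one option is to wrap the dataset in a small context manager so the file handles are released even when an error occurs. This is a sketch, not part of the documented API: the helper name releasing and the file name in the usage comment are hypothetical, and the helper relies only on the release() method listed above.

```python
from contextlib import contextmanager


@contextmanager
def releasing(dataset):
    """Yield the dataset, then call its release() method on exit, even on error."""
    try:
        yield dataset
    finally:
        dataset.release()


# Possible usage (the file name is a placeholder):
#
# from dipm.data.chemical_datasets.hdf5_dataset import Hdf5Dataset
#
# with releasing(Hdf5Dataset("dataset.hdf5")) as dataset:
#     print(len(dataset))   # number of data points
#     first = dataset[0]    # first data point
```

contextlib.closing is not a drop-in replacement here because it calls close() rather than release().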