HDF5 Dataset
This class reads data from a MACE-compatible
HDF5 file organized as follows. The data
must be grouped into batches, each containing the same number of data points and
named config_batch_{idx}. It is fine to put all data into a single batch group.
Every data point must be defined as a group named config_{idx}. The example below
reads one data point from a compliant HDF5 file to show
how the data is organized:
import h5py

def unpack_value(value):
    """If the value is the byte string b"None", return None; otherwise return the value."""
    if isinstance(value, bytes) and value == b"None":
        return None
    return value

def get_value(group, name):
    """If the dataset exists in the group, unpack and return its value; otherwise return None."""
    if name in group:
        return unpack_value(group[name][()])
    return None

# Placeholder path; point this at your own MACE-compatible file
hdf5_dataset_path = "dataset.hdf5"

with h5py.File(hdf5_dataset_path, "r") as h5file:
    # Choose the batch index and data point index to load
    batch_index = 0
    data_index = 0

    # Get the group containing the data point
    batch_group = h5file[f"config_batch_{batch_index}"]
    data_group = batch_group[f"config_{data_index}"]

    # Entries that must exist
    positions = data_group["positions"][()]
    atomic_numbers = data_group["atomic_numbers"][()]

    # Entries that can be None
    pbc = unpack_value(data_group["pbc"][()])
    cell = unpack_value(data_group["cell"][()])

    # Entries contained in the "properties" group
    forces = get_value(data_group["properties"], "forces")
    energy = get_value(data_group["properties"], "energy")
    stress = get_value(data_group["properties"], "stress")
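For testing, it can help to produce a small file in this layout. The following is a minimal sketch, not a reference writer: the helper names write_value and write_minimal_file are hypothetical, it covers only the fields read in the example above (a real MACE-compatible file may carry more), and it assumes that config_{idx} numbering restarts inside each batch group. Missing values are stored as the string "None", which is what unpack_value expects to find.

```python
import h5py
import numpy as np


def write_value(value):
    """Store a missing value as the string "None" (hypothetical helper)."""
    return "None" if value is None else value


def write_minimal_file(path, configs, batch_size=2):
    """Write a list of config dicts into config_batch_{i}/config_{j} groups.

    Assumption: len(configs) is a multiple of batch_size, so every batch
    holds the same number of data points.
    """
    with h5py.File(path, "w") as h5file:
        for start in range(0, len(configs), batch_size):
            batch = h5file.create_group(f"config_batch_{start // batch_size}")
            for j, config in enumerate(configs[start:start + batch_size]):
                group = batch.create_group(f"config_{j}")
                # Entries that must exist
                group.create_dataset("positions", data=config["positions"])
                group.create_dataset("atomic_numbers", data=config["atomic_numbers"])
                # Entries that can be None
                group.create_dataset("pbc", data=write_value(config.get("pbc")))
                group.create_dataset("cell", data=write_value(config.get("cell")))
                # Optional entries live in the "properties" subgroup
                properties = group.create_group("properties")
                for name in ("forces", "energy", "stress"):
                    if config.get(name) is not None:
                        properties.create_dataset(name, data=config[name])


def make_example_configs(count):
    """Build toy two-atom configurations with an energy but no forces or stress."""
    return [
        {
            "positions": np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.74 + 0.01 * i]]),
            "atomic_numbers": np.array([1, 1]),
            "pbc": None,
            "cell": None,
            "energy": -1.0 * i,
        }
        for i in range(count)
    ]
```

Calling write_minimal_file("example.hdf5", make_example_configs(4)) would produce two batch groups of two data points each, which the reading example above can then open.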
See below for the API reference to the associated loader class.
- class dipm.data.chemical_datasets.hdf5_dataset.Hdf5Dataset(file_path: PathLike, shuffle: bool = False)
Loads data from a single hdf5 file.
- __init__(file_path: PathLike, shuffle: bool = False)
- __getitem__(index)
- __len__() -> int
- release()
Release dataset file handles.
- dipm.data.chemical_datasets.hdf5_dataset.create_datasets(config: ChemicalDatasetsConfig) -> tuple[ConcatDataset, ConcatDataset | None, ConcatDataset | None]
Create training, validation and test datasets based on the provided configuration.
- Parameters:
config (ChemicalDatasetsConfig) – Configuration object containing the dataset paths etc.
- Returns:
A tuple of the training, validation, and test datasets. The validation and test entries are None if not provided.
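Because Hdf5Dataset holds file handles until release() is called, it may be convenient to wrap it so that the handles are released even if an error occurs. The sketch below relies only on the methods listed above (__len__, __getitem__, release); the helpers released and first_items are hypothetical and not part of dipm, and the path in the usage comment is a placeholder.

```python
from contextlib import contextmanager


@contextmanager
def released(dataset):
    """Yield the dataset and call its release() method on exit, even on error."""
    try:
        yield dataset
    finally:
        dataset.release()


def first_items(dataset, count):
    """Return up to `count` leading items using only __len__ and __getitem__."""
    return [dataset[i] for i in range(min(count, len(dataset)))]


# Usage with the real class (assumes dipm is installed and the file exists):
# from dipm.data.chemical_datasets.hdf5_dataset import Hdf5Dataset
# with released(Hdf5Dataset("train.hdf5", shuffle=False)) as dataset:
#     print(len(dataset), first_items(dataset, 2))
```

The wrapper works with any object exposing a release() method, so it can also be tried against a small stand-in class before touching real data.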