Dataset Configs¶
- class dipm.data.configs.DatasetCreationConfig(*, train_dataset_paths: list[Path] | dict[str, list[Path]], valid_dataset_paths: list[Path] | dict[str, list[Path]] | None = None, test_dataset_paths: list[Path] | dict[str, list[Path]] | None = None, dataset_splits: tuple[Annotated[int, Gt(gt=0)], Annotated[int, Gt(gt=0)], Annotated[int, Gt(gt=0)]] | tuple[Annotated[float, Ge(ge=0.0), Le(le=1.0)], Annotated[float, Ge(ge=0.0), Le(le=1.0)], Annotated[float, Ge(ge=0.0), Le(le=1.0)]] | None = None, train_num_to_load: Annotated[int, Gt(gt=0)] | Annotated[float, Ge(ge=0.0), Le(le=1.0)] | None = None, valid_num_to_load: Annotated[int, Gt(gt=0)] | Annotated[float, Ge(ge=0.0), Le(le=1.0)] | None = None, test_num_to_load: Annotated[int, Gt(gt=0)] | Annotated[float, Ge(ge=0.0), Le(le=1.0)] | None = None, random_subset: bool = True)¶
Pydantic-based config related to creating dataset classes.
When directories are given in `*_dataset_paths`, files ending with `.hdf5` or `.h5` in those directories will be automatically detected and added. If a dict is given, the keys will be used as task/dataset names and the values as paths.
In `dataset_splits` and `*_num_to_load`, if a float is given, it will be interpreted as a proportion. If an integer is given, it will be interpreted as a number of data points.
- train_dataset_paths¶
Path(s) to where the training set(s) are located. Cannot be empty. Will be converted to a list after validation.
- Type:
list[pathlib.Path] | dict[str, list[pathlib.Path]]
- valid_dataset_paths¶
Path(s) to where the validation set(s) are located. This can be empty. Will be converted to a list after validation.
- Type:
list[pathlib.Path] | dict[str, list[pathlib.Path]] | None
- test_dataset_paths¶
Path(s) to where the test set(s) are located. This can be empty. Will be converted to a list after validation.
- Type:
list[pathlib.Path] | dict[str, list[pathlib.Path]] | None
- dataset_splits¶
Split train dataset(s) into train, validation and test datasets. Cannot be provided if `valid_dataset_paths` or `test_dataset_paths` are not empty. If `None`, then no splitting will be done.
- Type:
tuple[int, int, int] | tuple[float, float, float] | None
- train_num_to_load¶
Number of training set data points to load from the given dataset. By default, this is `None`, which means all the data points are loaded. If multiple dataset paths are given, this limit will apply in total.
- Type:
int | float | None
- valid_num_to_load¶
Number of validation set data points to load from the given dataset. By default, this is `None`, which means all the data points are loaded. If multiple dataset paths are given, this limit will apply in total.
- Type:
int | float | None
- test_num_to_load¶
Number of test set data points to load from the given dataset. By default, this is `None`, which means all the data points are loaded. If multiple dataset paths are given, this limit will apply in total.
- Type:
int | float | None
- random_subset¶
Whether to randomly split dataset(s) when `dataset_splits` or `*_num_to_load` is provided. Default is `True`.
- Type:
bool
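Based on the signature above, a minimal construction sketch (the dataset paths are placeholders, and the import path is taken from the class name shown above):

```python
from pathlib import Path

from dipm.data.configs import DatasetCreationConfig

# Load everything from one training directory and split it 80/10/10;
# floats are interpreted as proportions, ints as absolute counts.
config = DatasetCreationConfig(
    train_dataset_paths=[Path("data/train")],  # .hdf5/.h5 files are auto-detected
    dataset_splits=(0.8, 0.1, 0.1),
    random_subset=True,
)

# Alternatively, name tasks explicitly via a dict and cap the number of
# training points loaded in total across all paths:
multi_task_config = DatasetCreationConfig(
    train_dataset_paths={"water": [Path("data/water")], "salts": [Path("data/salts")]},
    valid_dataset_paths={"water": [Path("data/water_val")], "salts": [Path("data/salts_val")]},
    train_num_to_load=10_000,  # applies in total over both paths
)
```

Note that `dataset_splits` cannot be combined with non-empty `valid_dataset_paths` or `test_dataset_paths`, so each config above uses only one of the two mechanisms.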
- class dipm.data.configs.DatasetManagerConfig(*, graph_cutoff_angstrom: Annotated[float, Gt(gt=0)] = 5.0, shuffle: bool = True, num_workers: Annotated[int, Ge(ge=0)] | None = None, use_shared_memory: bool | None = None, max_n_node: Annotated[int, Gt(gt=0)] | None = None, max_n_edge: Annotated[int, Gt(gt=0)] | None = None, max_neighbors_per_atom: Annotated[int, Gt(gt=0)] | None = None, batch_size: Annotated[int, Gt(gt=0)] = 16, num_batch_prefetch: Annotated[int, Ge(ge=0)] = 128, load_into_memory: bool = False, use_formation_energies: bool = False, drop_unseen_elements: bool = False, update_dataset_info: bool = False)¶
Pydantic-based config related to data preprocessing and loading into `ChemicalSystem`s.
- graph_cutoff_angstrom¶
Graph cutoff distance in Angstrom to apply when creating the graphs. Default is 5.0.
- Type:
float
- shuffle¶
Whether to shuffle the data before splitting and loading. Default is `True`.
- Type:
bool
- num_workers¶
Number of subprocesses to load data. If `None`, will use `min(num_files, num_cpus)` subprocesses. Default is `None`.
- Type:
int | None
- use_shared_memory¶
Whether to use shared memory for data loading. If `None`, shared memory will be used if multiprocessing is enabled. Default is `None`.
- Type:
bool | None
- max_n_node¶
This value will be multiplied with the batch size to determine the maximum number of nodes we allow in a batch. Note that a batch will always contain `max_n_node * batch_size` nodes, as the remaining ones are filled up with dummy nodes. If set to `None`, a reasonable value will be automatically computed. Default is `None`.
- Type:
int | None
- max_n_edge¶
This value will be multiplied with the batch size to determine the maximum number of edges we allow in a batch. Note that a batch will always contain `max_n_edge * batch_size` edges, as the remaining ones are filled up with dummy edges. If set to `None`, a reasonable value will be automatically computed. Default is `None`.
- Type:
int | None
- max_neighbors_per_atom¶
The maximum number of neighbors to consider for each atom. If `None`, all neighbors within the cutoff will be considered. Default is `None`.
- Type:
int | None
- batch_size¶
The number of graphs in a batch. Will be filled up with dummy graphs if either the maximum number of nodes or edges are reached before the number of graphs is reached. Default is 16.
- Type:
int
- num_batch_prefetch¶
Number of batched graphs to prefetch while iterating over batches. Default is 128.
- Type:
int
- load_into_memory¶
Whether to load the entire dataset into memory. Default is `False`.
- Type:
bool
- use_formation_energies¶
Whether the energies in the dataset should already be transformed to subtract the average atomic energies. Default is `False`. If you set this to `True`, make sure the models assume `"zero"` atomic energies, as can be set in the model hyperparameters.
- Type:
bool
- drop_unseen_elements¶
If `dataset_info` is provided, whether to drop elements unseen in the training dataset from the `dataset_info` atomic numbers table. If `True`, remember to remove unused embeddings from the model yourself. Default is `False`.
- Type:
bool
- update_dataset_info¶
If `dataset_info` is provided, whether to update the dataset-related information in the `dataset_info` object.
- Type:
bool
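A corresponding sketch for the manager config, with values chosen purely for illustration:

```python
from dipm.data.configs import DatasetManagerConfig

# Batches are padded with dummy nodes/edges, so each batch holds exactly
# max_n_node * batch_size nodes and max_n_edge * batch_size edges.
manager_config = DatasetManagerConfig(
    graph_cutoff_angstrom=5.0,
    batch_size=32,
    max_n_node=64,    # 64 * 32 = 2048 (padded) nodes per batch
    max_n_edge=512,   # 512 * 32 = 16384 (padded) edges per batch
    num_workers=4,
    load_into_memory=False,
)
```

Leaving `max_n_node` and `max_n_edge` as `None` lets reasonable values be computed automatically, which is usually the safer starting point.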