Dataset Configs¶
- class dipm.data.configs.ChemicalDatasetsConfig(*, train_dataset_paths: Path | list[Path], valid_dataset_paths: Path | list[Path] | None = None, test_dataset_paths: Path | list[Path] | None = None, dataset_splits: tuple[int, int, int] | tuple[float, float, float] | None = None, shuffle: bool = True, parallel: bool = True, train_num_to_load: int | float | None = None, valid_num_to_load: int | float | None = None, test_num_to_load: int | float | None = None)¶
Pydantic-based config related to data preprocessing and loading into `ChemicalSystem`s.
When directories are given in `*_dataset_paths`, files ending with `.hdf5` or `.h5` in those directories will be automatically detected and added. In `dataset_splits` and `*_num_to_load`, a float is interpreted as a proportion of the data, and an integer as an absolute number of data points.
- train_dataset_paths¶
Path(s) to where the training set(s) are located. Cannot be empty. Will be converted to a list after validation.
- Type:
pathlib.Path | list[pathlib.Path]
- valid_dataset_paths¶
Path(s) to where the validation set(s) are located. This can be empty. Will be converted to a list after validation.
- Type:
pathlib.Path | list[pathlib.Path] | None
- test_dataset_paths¶
Path(s) to where the test set(s) are located. This can be empty. Will be converted to a list after validation.
- Type:
pathlib.Path | list[pathlib.Path] | None
- dataset_splits¶
Split the train dataset(s) into train, validation, and test datasets. Cannot be provided if `valid_dataset_paths` or `test_dataset_paths` are not empty. If `None`, no splitting is done.
- Type:
tuple[int, int, int] | tuple[float, float, float] | None
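As a sketch of the documented semantics (floats are proportions, ints are absolute counts), proportion-based splits can be converted to per-split counts as follows. The helper `split_counts` is hypothetical and only illustrates the interpretation; the library's internal splitting and rounding logic may differ.

```python
# Illustration only: how float dataset_splits map to absolute counts.
# split_counts is a hypothetical helper, not part of dipm.

def split_counts(n_total: int, splits: tuple[float, float, float]) -> tuple[int, int, int]:
    """Convert proportion-based splits into data-point counts."""
    n_train = int(n_total * splits[0])
    n_valid = int(n_total * splits[1])
    # Assign the remainder to the test split so the counts sum to n_total.
    n_test = n_total - n_train - n_valid
    return n_train, n_valid, n_test

print(split_counts(1000, (0.8, 0.1, 0.1)))  # (800, 100, 100)
```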
- shuffle¶
Whether to shuffle the data before splitting and loading. Default is `True`.
- Type:
bool
- parallel¶
Whether to use parallel loading. Every dataset file uses a separate process to load data. Default is `True`.
- Type:
bool
- train_num_to_load¶
Number of training set data points to load from the given dataset(s). By default, this is `None`, which means all data points are loaded. If multiple dataset paths are given, this limit applies in total.
- Type:
int | float | None
- valid_num_to_load¶
Number of validation set data points to load from the given dataset(s). By default, this is `None`, which means all data points are loaded. If multiple dataset paths are given, this limit applies in total.
- Type:
int | float | None
- test_num_to_load¶
Number of test set data points to load from the given dataset(s). By default, this is `None`, which means all data points are loaded. If multiple dataset paths are given, this limit applies in total.
- Type:
int | float | None
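A minimal usage sketch, assuming `ChemicalDatasetsConfig` is importable as shown in the class path above. The dataset paths are placeholders; note that `dataset_splits` is only valid here because no validation or test paths are provided.

```python
# Hypothetical usage sketch; paths are placeholders.
from pathlib import Path

from dipm.data.configs import ChemicalDatasetsConfig

# Pointing train_dataset_paths at a directory: .hdf5/.h5 files inside
# it are detected automatically. A single Path is coerced to a list.
config = ChemicalDatasetsConfig(
    train_dataset_paths=Path("data/train"),  # placeholder path
    dataset_splits=(0.8, 0.1, 0.1),          # floats = proportions
    shuffle=True,
    train_num_to_load=10_000,                # int = absolute count
)
```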
- class dipm.data.configs.GraphDatasetBuilderConfig(*, graph_cutoff_angstrom: float = 5.0, max_n_node: int | None = None, max_n_edge: int | None = None, max_neighbors_per_atom: int | None = None, batch_size: int = 16, num_batch_prefetch: int = 1, batch_prefetch_num_devices: int = 1, use_formation_energies: bool = False, drop_unseen_elements: bool = False)¶
Pydantic-based config related to graph dataset building and preprocessing.
- graph_cutoff_angstrom¶
Graph cutoff distance in Angstrom to apply when creating the graphs. Default is 5.0.
- Type:
float
- max_n_node¶
This value is multiplied by the batch size to determine the maximum number of nodes allowed in a batch. Note that a batch always contains exactly `max_n_node * batch_size` nodes, as any remaining slots are filled with dummy nodes. If set to `None`, a reasonable value is computed automatically. Default is `None`.
- Type:
int | None
- max_n_edge¶
This value is multiplied by the batch size to determine the maximum number of edges allowed in a batch. Note that a batch always contains exactly `max_n_edge * batch_size` edges, as any remaining slots are filled with dummy edges. If set to `None`, a reasonable value is computed automatically. Default is `None`.
- Type:
int | None
- max_neighbors_per_atom¶
The maximum number of neighbors to consider for each atom. If `None`, all neighbors within the cutoff are considered. Default is `None`.
- Type:
int | None
- batch_size¶
The number of graphs in a batch. The batch is filled up with dummy graphs if the maximum number of nodes or edges is reached before this number of graphs is reached. Default is 16.
- Type:
int
- num_batch_prefetch¶
Number of batched graphs to prefetch while iterating over batches. Default is 1.
- Type:
int
- batch_prefetch_num_devices¶
Number of threads to use for prefetching. Default is 1.
- Type:
int
- use_formation_energies¶
Whether the energies in the dataset should already be transformed by subtracting the average atomic energies. Default is `False`. If you set this to `True`, make sure the models assume `"zero"` atomic energies, as can be set in the model hyperparameters.
- Type:
bool
- drop_unseen_elements¶
If `dataset_info` is provided, whether to drop elements unseen in the training dataset from the `dataset_info` atomic numbers table. If `True`, remember to remove unused embeddings from the model yourself. Default is `False`.
- Type:
bool
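A minimal usage sketch for the builder config, assuming the import path shown in the class signature above. The values below are illustrative, not recommendations.

```python
# Hypothetical usage sketch based on the defaults documented above.
from dipm.data.configs import GraphDatasetBuilderConfig

# With max_n_node / max_n_edge left as None, reasonable padding
# limits are computed automatically from the data.
builder_config = GraphDatasetBuilderConfig(
    graph_cutoff_angstrom=5.0,  # neighbor cutoff in Angstrom
    batch_size=32,              # graphs per batch (dummy-padded)
    num_batch_prefetch=2,       # batches to prefetch while iterating
)
```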