Dataset Configs¶
- class dipm.data.configs.DatasetCreationConfig(*, train_dataset_paths: list[Path] | dict[str, list[Path]], valid_dataset_paths: list[Path] | dict[str, list[Path]] | None = None, test_dataset_paths: list[Path] | dict[str, list[Path]] | None = None, dataset_splits: tuple[Annotated[int, Gt(gt=0)], Annotated[int, Gt(gt=0)], Annotated[int, Gt(gt=0)]] | tuple[Annotated[float, Ge(ge=0.0), Le(le=1.0)], Annotated[float, Ge(ge=0.0), Le(le=1.0)], Annotated[float, Ge(ge=0.0), Le(le=1.0)]] | None = None, train_num_to_load: Annotated[int, Gt(gt=0)] | Annotated[float, Ge(ge=0.0), Le(le=1.0)] | None = None, valid_num_to_load: Annotated[int, Gt(gt=0)] | Annotated[float, Ge(ge=0.0), Le(le=1.0)] | None = None, test_num_to_load: Annotated[int, Gt(gt=0)] | Annotated[float, Ge(ge=0.0), Le(le=1.0)] | None = None, random_subset: bool = True)¶
Pydantic-based config related to creating dataset classes.
When directories are given in `*_dataset_paths`, files ending with `.hdf5` or `.h5` in those directories will be automatically detected and added. If a dict is given, the keys will be used as task/dataset names and the values as paths.
In `dataset_splits` and `*_num_to_load`, if a float is given, it will be interpreted as a proportion. If an integer is given, it will be interpreted as a number of data points.
- train_dataset_paths¶
Path(s) to where the training set(s) are located. Cannot be empty. Will be converted to a list after validation.
- Type:
list[pathlib.Path] | dict[str, list[pathlib.Path]]
- valid_dataset_paths¶
Path(s) to where the validation set(s) are located. This can be empty. Will be converted to a list after validation.
- Type:
list[pathlib.Path] | dict[str, list[pathlib.Path]] | None
- test_dataset_paths¶
Path(s) to where the test set(s) are located. This can be empty. Will be converted to a list after validation.
- Type:
list[pathlib.Path] | dict[str, list[pathlib.Path]] | None
- dataset_splits¶
Split train dataset(s) into train, validation and test datasets. Cannot be provided if `valid_dataset_paths` or `test_dataset_paths` are not empty. If `None`, then no splitting will be done.
- Type:
tuple[int, int, int] | tuple[float, float, float] | None
- train_num_to_load¶
Number of training set data points to load from the given dataset. By default, this is `None`, which means all the data points are loaded. If multiple dataset paths are given, this limit will apply in total.
- Type:
int | float | None
- valid_num_to_load¶
Number of validation set data points to load from the given dataset. By default, this is `None`, which means all the data points are loaded. If multiple dataset paths are given, this limit will apply in total.
- Type:
int | float | None
- test_num_to_load¶
Number of test set data points to load from the given dataset. By default, this is `None`, which means all the data points are loaded. If multiple dataset paths are given, this limit will apply in total.
- Type:
int | float | None
- random_subset¶
Whether to randomly split dataset(s) when `dataset_splits` or `*_num_to_load` is provided. Default is `True`.
- Type:
bool
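Based on the signature above, a minimal construction sketch (the dataset paths are placeholders, and the import path is taken from the class name shown above):

```python
from pathlib import Path

from dipm.data.configs import DatasetCreationConfig

# Load everything from one training directory and split it 80/10/10;
# floats are interpreted as proportions, ints as absolute counts.
config = DatasetCreationConfig(
    train_dataset_paths=[Path("data/train")],  # .hdf5/.h5 files are auto-detected
    dataset_splits=(0.8, 0.1, 0.1),
    random_subset=True,
)

# Alternatively, name tasks explicitly via a dict and cap the number of
# training points loaded in total across all paths:
multi_task_config = DatasetCreationConfig(
    train_dataset_paths={"water": [Path("data/water")], "salts": [Path("data/salts")]},
    valid_dataset_paths={"water": [Path("data/water_val")], "salts": [Path("data/salts_val")]},
    train_num_to_load=10_000,  # applies in total over both paths
)
```

Note that `dataset_splits` cannot be combined with non-empty `valid_dataset_paths` or `test_dataset_paths`, so each config above uses only one of the two mechanisms.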
- class dipm.data.configs.DatasetManagerConfig(*, graph_cutoff_angstrom: Annotated[float, Gt(gt=0)] = 5.0, shuffle: bool = True, num_workers: Annotated[int, Ge(ge=0)] | None = None, use_shared_memory: bool | None = None, max_n_node: Annotated[int, Gt(gt=0)] | None = None, max_n_edge: Annotated[int, Gt(gt=0)] | None = None, max_neighbors_per_atom: Annotated[int, Gt(gt=0)] | None = None, batch_size: Annotated[int, Gt(gt=0)] = 16, num_batch_prefetch: Annotated[int, Ge(ge=0)] = 128, load_into_memory: bool = False, use_formation_energies: bool = False, drop_unseen_elements: bool = False, update_dataset_info: bool = False)¶
Pydantic-based config related to data preprocessing and loading into `ChemicalSystem`s.
- graph_cutoff_angstrom¶
Graph cutoff distance in Angstrom to apply when creating the graphs. Default is 5.0.
- Type:
float
- shuffle¶
Whether to shuffle the data before splitting and loading. Default is `True`.
- Type:
bool
- num_workers¶
Number of subprocesses to load data. If `None`, will use `min(num_files, num_cpus)` subprocesses. Default is `None`.
- Type:
int | None
- use_shared_memory¶
Whether to use shared memory for data loading. If `None`, shared memory will be used if multiprocessing is enabled. Default is `None`.
- Type:
bool | None
- max_n_node¶
This value will be multiplied with the batch size to determine the maximum number of nodes we allow in a batch. Note that a batch will always contain `max_n_node * batch_size` nodes, as the remaining ones are filled up with dummy nodes. If set to `None`, a reasonable value will be automatically computed. Default is `None`.
- Type:
int | None
- max_n_edge¶
This value will be multiplied with the batch size to determine the maximum number of edges we allow in a batch. Note that a batch will always contain `max_n_edge * batch_size` edges, as the remaining ones are filled up with dummy edges. If set to `None`, a reasonable value will be automatically computed. Default is `None`.
- Type:
int | None
- max_neighbors_per_atom¶
The maximum number of neighbors to consider for each atom. If `None`, all neighbors within the cutoff will be considered. Default is `None`.
- Type:
int | None
- batch_size¶
The number of graphs in a batch. Will be filled up with dummy graphs if either the maximum number of nodes or edges are reached before the number of graphs is reached. Default is 16.
- Type:
int
- num_batch_prefetch¶
Number of batched graphs to prefetch while iterating over batches. Default is 128.
- Type:
int
- load_into_memory¶
Whether to load the entire dataset into memory. Default is `False`.
- Type:
bool
- use_formation_energies¶
Whether the energies in the dataset should already be transformed to subtract the average atomic energies. Default is `False`. If you set this to `True`, make sure the models assume `"zero"` atomic energies, as can be set in the model hyperparameters.
- Type:
bool
- drop_unseen_elements¶
If `dataset_info` is provided, whether to drop elements unseen in the training dataset from the `dataset_info` atomic numbers table. If `True`, remember to remove unused embeddings from the model yourself. Default is `False`.
- Type:
bool
- update_dataset_info¶
If `dataset_info` is provided, whether to update the dataset-related information in the `dataset_info` object.
- Type:
bool
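A corresponding sketch for the manager config, with values chosen purely for illustration:

```python
from dipm.data.configs import DatasetManagerConfig

# Batches are padded with dummy nodes/edges, so each batch holds exactly
# max_n_node * batch_size nodes and max_n_edge * batch_size edges.
manager_config = DatasetManagerConfig(
    graph_cutoff_angstrom=5.0,
    batch_size=32,
    max_n_node=64,    # 64 * 32 = 2048 (padded) nodes per batch
    max_n_edge=512,   # 512 * 32 = 16384 (padded) edges per batch
    num_workers=4,
    load_into_memory=False,
)
```

Leaving `max_n_node` and `max_n_edge` as `None` lets reasonable values be computed automatically, which is usually the safer starting point.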