Dataset Info

class dipm.data.dataset_info.DatasetInfo(*, cutoff_distance_angstrom: float, max_neighbors_per_atom: int | None = None, task_list: list[str] | None = None, atomic_energies_map: dict[int, float | list[float]], avg_num_neighbors: float = 1.0, avg_num_nodes: float = 1.0, avg_r_min_angstrom: float | None = None, scaling_mean: float = 0.0, scaling_stdev: float = 1.0, median_num_neighbors: int = 1, max_total_edges: int = 1, median_num_nodes: int = 1, max_num_nodes: int = 1)

Pydantic dataclass holding information computed from the dataset that is (potentially) required by the models. There are three types of fields:

  1. User-specified fields: These fields are set by the user and cannot be changed when fine-tuning.

  2. Model-related computed fields: These fields are computed from the dataset but are bound to the model, so they cannot be changed when fine-tuning.

  3. Dataset-related computed fields: These fields are computed from the dataset. They can be changed when fine-tuning, and changing them is recommended.

cutoff_distance_angstrom

The graph cutoff distance that was used in the dataset in Angstrom.

Type:

float

max_neighbors_per_atom

The maximum number of neighbors to consider for each atom. Leave this unset in most cases, as limiting the number of neighbors breaks the smoothness of the model's predictions.
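
The smoothness warning can be illustrated with a toy site energy that is unrelated to the library's actual models. This is a minimal sketch, assuming a weighted sum of cosine-cutoff pair terms; the names smooth_cutoff, site_energy, and energy_jump are hypothetical. When the neighbor list is capped, two neighbors swapping rank by a tiny displacement changes which one is kept, and the energy jumps.

```python
import math


def smooth_cutoff(r, cutoff):
    """Cosine envelope that decays smoothly to zero at the cutoff."""
    if r >= cutoff:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / cutoff) + 1.0)


def site_energy(neighbors, cutoff, max_neighbors=None):
    """Toy energy: weighted sum over (distance, weight) neighbor pairs.

    If max_neighbors is set, only the closest neighbors are kept.
    """
    inside = sorted((d, w) for d, w in neighbors if d < cutoff)
    if max_neighbors is not None:
        inside = inside[:max_neighbors]
    return sum(w * smooth_cutoff(d, cutoff) for d, w in inside)


def energy_jump(max_neighbors, eps=1e-6, cutoff=5.0):
    """Energy change when one neighbor moves by 2 * eps past another."""
    before = site_energy([(2.0, 1.0), (2.0 + eps, 3.0)], cutoff, max_neighbors)
    after = site_energy([(2.0, 1.0), (2.0 - eps, 3.0)], cutoff, max_neighbors)
    return abs(after - before)


jump_uncapped = energy_jump(max_neighbors=None)  # tiny, proportional to eps
jump_capped = energy_jump(max_neighbors=1)       # finite, independent of eps
print(jump_uncapped, jump_capped)
```

Without a cap the change shrinks with the displacement, whereas with max_neighbors=1 it stays of order one however small the displacement is.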

Type:

int | None

task_list

List of the tasks/datasets used in training. None (the default) means that no task embedding is used, i.e. there is only one task. If provided, the values of atomic_energies_map must be lists of floats, one for each task.

Type:

list[str] | None
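
The relationship between task_list and atomic_energies_map can be checked with a few lines of plain Python. This is only a sketch of the documented constraint: the helper find_inconsistent_elements is hypothetical and not part of the library, and the task names and energy values are made up for illustration.

```python
def find_inconsistent_elements(task_list, atomic_energies_map):
    """Return atomic numbers whose entry does not hold one energy per task.

    With task_list=None there is nothing to check, since the documented
    constraint only applies when a task list is provided.
    """
    if task_list is None:
        return []
    bad = []
    for atomic_number, energies in atomic_energies_map.items():
        if not isinstance(energies, list) or len(energies) != len(task_list):
            bad.append(atomic_number)
    return sorted(bad)


# Illustrative values only, not real reference energies.
single_task_map = {1: -13.6, 8: -2040.0}
multi_task_map = {1: [-13.6, -13.5], 8: [-2040.0, -2041.2]}
tasks = ["task_a", "task_b"]  # hypothetical task names

print(find_inconsistent_elements(None, single_task_map))
print(find_inconsistent_elements(tasks, multi_task_map))
print(find_inconsistent_elements(tasks, single_task_map))
```

The last call reports both elements, because scalar energies cannot be matched to two tasks.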

atomic_energies_map

A dictionary mapping the atomic numbers to the computed average atomic energies for that element.

Type:

dict[int, float | list[float]]

avg_num_neighbors

The mean number of neighbors an atom has in the dataset.

Type:

float

avg_num_nodes

The mean number of nodes per graph in the dataset.

Type:

float

avg_r_min_angstrom

The minimum edge distance of each structure, averaged over all structures in the dataset.

Type:

float | None
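
As a rough illustration of this quantity, the sketch below takes the shortest interatomic distance in each structure and averages over structures. It assumes non-periodic structures in which every pair lies within the cutoff; the helper min_pair_distance and the toy coordinates are hypothetical, and the library's own computation may differ.

```python
import math
from itertools import combinations


def min_pair_distance(positions):
    """Shortest distance between any two atoms of one structure (Angstrom)."""
    return min(math.dist(a, b) for a, b in combinations(positions, 2))


# Two toy structures given as lists of Cartesian coordinates in Angstrom.
structures = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 2.0, 0.0)],  # shortest pair: 1.0
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],                   # shortest pair: 1.5
]

avg_r_min = sum(min_pair_distance(s) for s in structures) / len(structures)
print(avg_r_min)
```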

scaling_mean

The mean used for the rescaling of the dataset values, the default being 0.0.

Type:

float

scaling_stdev

The standard deviation used for the rescaling of the dataset values, the default being 1.0.

Type:

float
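
The source does not spell out how scaling_mean and scaling_stdev are applied. A common convention, assumed here and not confirmed by this page, is to standardize values as (x - mean) / stdev and to undo that with x * stdev + mean. The functions standardize and rescale below are hypothetical helpers written under that assumption.

```python
def standardize(value, mean, stdev):
    """Map a raw value to zero mean and unit spread (assumed convention)."""
    return (value - mean) / stdev


def rescale(value, scaling_mean=0.0, scaling_stdev=1.0):
    """Inverse of standardize; the defaults leave the value unchanged."""
    return value * scaling_stdev + scaling_mean


raw = -3.2
scaled = standardize(raw, mean=-4.0, stdev=0.5)
restored = rescale(scaled, scaling_mean=-4.0, scaling_stdev=0.5)
print(scaled, restored)
```

With the documented defaults of 0.0 and 1.0, rescale is the identity, which matches a setup where no rescaling is applied.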

median_num_neighbors

The median number of neighbors an atom has in the dataset.

Type:

int

max_total_edges

The maximum number of edges in the dataset.

Type:

int

median_num_nodes

The median number of nodes per graph in the dataset.

Type:

int

max_num_nodes

The maximum number of nodes per graph in the dataset.

Type:

int
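
The counting statistics above (avg_num_neighbors, median_num_neighbors, avg_num_nodes, median_num_nodes, max_num_nodes, max_total_edges) can be illustrated on a toy dataset. This is a sketch under stated assumptions: edges are stored as directed (sender, receiver) pairs, max_total_edges is taken as the largest edge count of a single graph, and medians are truncated to int. The toy graphs and the helper neighbor_counts are hypothetical, and the library's own procedure may differ.

```python
from statistics import mean, median

# Toy graphs: node count plus directed edges, each bond listed in both directions.
graphs = [
    {"num_nodes": 3, "edges": [(0, 1), (1, 0), (1, 2), (2, 1)]},
    {"num_nodes": 2, "edges": [(0, 1), (1, 0)]},
    {"num_nodes": 4, "edges": [(0, 1), (1, 0), (0, 2), (2, 0), (0, 3), (3, 0)]},
]


def neighbor_counts(graph):
    """Number of outgoing edges per node, i.e. neighbors per atom."""
    counts = [0] * graph["num_nodes"]
    for sender, _receiver in graph["edges"]:
        counts[sender] += 1
    return counts


all_counts = [c for g in graphs for c in neighbor_counts(g)]
node_counts = [g["num_nodes"] for g in graphs]

stats = {
    "avg_num_neighbors": mean(all_counts),
    "median_num_neighbors": int(median(all_counts)),
    "avg_num_nodes": mean(node_counts),
    "median_num_nodes": int(median(node_counts)),
    "max_num_nodes": max(node_counts),
    "max_total_edges": max(len(g["edges"]) for g in graphs),
}
print(stats)
```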

__init__(**data: Any) -> None

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.