Structure Datasets#
The boltz_data.dataset module provides the logic for creating and accessing large-scale structure datasets for machine learning and analysis workflows.
What is a Structure Dataset?#
A structure dataset is a parsed collection of 3D structures of proteins, nucleic acids and small molecules. It consists of structures, parsed as BZBioMol files and saved as compressed CBOR (.cbor.gz) files, and metadata, stored as Parquet files. The metadata includes information about structures, chains and interfaces across all structures in the dataset.
Dataset Structure#
A typical structure dataset has the following form:
dataset/
├── dataset.yaml # Dataset metadata, such as generation time and aggregated statistics
├── structures/ # Compressed BZBioMol files of each structure
│ ├── 1abc.cbor.gz
│ ├── 2xyz.cbor.gz
│ └── ...
├── structures.parquet # Structure information, such as number of atoms, bonds, residues, chains, interfaces
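├── chains.parquet # Chain information, such as chain ID, type, sequence, number of residues, number of atoms
└── interfaces.parquet # Interface information, such as the two chain IDs and the residues within 5 Å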
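As a quick sanity check before loading, the layout can be verified with a few lines of standard-library Python. This is a minimal sketch: `missing_dataset_entries` and `EXPECTED_ENTRIES` are hypothetical names that are not part of boltz_data, and treating all four entries as required is an assumption.

```python
from pathlib import Path

# Core entries from the layout above; treating all of them as required is an assumption.
EXPECTED_ENTRIES = ("dataset.yaml", "structures", "structures.parquet", "chains.parquet")


def missing_dataset_entries(dataset_dir: Path) -> list[str]:
    """Return the expected entries that are absent from dataset_dir."""
    return [name for name in EXPECTED_ENTRIES if not (dataset_dir / name).exists()]
```

An empty result means the directory at least has the expected shape. It says nothing about whether the files themselves are valid.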
Generating Datasets#
Generating a structure dataset takes a set of input files (in mmCIF format) and produces a new structure dataset. This involves two stages:
Processing each structure in parallel. Each input structure may produce multiple output structures due to the presence of multiple assemblies.
Running clustering on the sets of protein and nucleic acid sequences to identify homologous sequences.
Running the Pipeline#
To run the dataset generation pipeline, call the boltz_data.dataset.generate module. This can be done either through pixi (pixi run generate-dataset) or directly (python -m boltz_data.dataset.generate).
pixi run generate-dataset dataset /path/to/input
╭──────────────────────────────────────────────────────────────────────────────╮
│ Generating Structure Dataset │
╰──────────────────────────────────────────────────────────────────────────────╯
Processing 13 input CIFs...
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 • 0:00:03 • 0:00:00
Processed 13 input CIFs into 19 structures in 3 seconds
Clustered 16 unique protein sequences into 11 clusters in 5 seconds
Clustered 6 unique nucleic acid sequences into 6 clusters in 2 seconds
Found 14 unique SMILES
Wrote dataset to /tmp/tmp168gy_c8
Dataset size: 689.9 kB
2025-11-14 13:57:04,799 INFO worker.py:2013 -- Started a local Ray instance.
Generation options#
Run pixi run generate-dataset --help to see the available options for generating a dataset.
If files already exist in the output directory, the generation will fail unless the --replace-existing flag is used. If this is set, the output directory will be cleared before generation begins.
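The behaviour described above can be sketched as follows. `prepare_output_dir` is a hypothetical helper written for illustration. It mirrors the documented semantics of `--replace-existing` and is not the pipeline's actual implementation.

```python
import shutil
from pathlib import Path


def prepare_output_dir(output_dir: Path, replace_existing: bool = False) -> None:
    """Fail on a non-empty output directory unless replace_existing is set, then clear it."""
    output_dir.mkdir(parents=True, exist_ok=True)
    existing = list(output_dir.iterdir())
    if not existing:
        return
    if not replace_existing:
        raise FileExistsError(f"{output_dir} is not empty; pass replace_existing=True to clear it")
    # Remove every entry so that generation starts from an empty directory.
    for entry in existing:
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
```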
The --sample flag can be used to randomly subsample the input files. This can be useful for testing or debugging.
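To illustrate what random subsampling of the inputs means, here is a minimal sketch. It assumes the flag takes a number of files, which the text above does not state. `subsample_inputs` and its `seed` parameter are hypothetical and not part of boltz_data.

```python
import random
from pathlib import Path
from typing import Optional


def subsample_inputs(paths: list[Path], sample: Optional[int], seed: int = 0) -> list[Path]:
    """Return at most `sample` randomly chosen paths; None keeps every path."""
    if sample is None or sample >= len(paths):
        return list(paths)
    rng = random.Random(seed)  # fixed seed so that a debugging run can be repeated
    return rng.sample(paths, sample)
```

Seeding the generator is a design choice for this sketch: it makes a failing test run reproducible with the same subset of inputs.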
Steps during generation#
The input directories are searched recursively for mmCIF files.
All mmCIF files are parsed in parallel (using Ray). The resulting files are written to the structures directory.
Protein and nucleic acid sequences are clustered with MMseqs2.
Metadata about structures is collated and written to the various metadata files.
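The first step, recursive discovery of input files, can be sketched with `pathlib`. The accepted suffixes and the name `find_mmcif_files` are assumptions made for illustration. The real pipeline may match files differently.

```python
from pathlib import Path

# Assumed suffixes; the real pipeline may accept a different set.
MMCIF_SUFFIXES = (".cif", ".cif.gz")


def find_mmcif_files(input_dirs: list[Path]) -> list[Path]:
    """Recursively collect mmCIF files under each input directory, sorted and de-duplicated."""
    found = set()
    for directory in input_dirs:
        for path in directory.rglob("*"):
            if path.is_file() and path.name.lower().endswith(MMCIF_SUFFIXES):
                found.add(path)
    return sorted(found)
```

Sorting the result keeps the processing order stable between runs, which makes logs easier to compare.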
Loading Datasets#
Generally, interact with a dataset through the boltz_data.dataset.StructureDataset class rather than by accessing the files in the dataset directory directly. This allows backwards-incompatible changes to be made to the on-disk format without breaking code that uses the class.
The entry point to this is the boltz_data.dataset.structure_dataset_from_path function. This can take a path to either the enclosing dataset folder or the dataset.yaml file itself.
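Accepting either form of path amounts to a small normalization step. The sketch below shows one way this could work. `resolve_dataset_yaml` is a hypothetical helper and not the library's implementation.

```python
from pathlib import Path


def resolve_dataset_yaml(path: Path) -> Path:
    """Accept a dataset folder or its dataset.yaml and return the path to the YAML file."""
    if path.is_dir():
        return path / "dataset.yaml"
    return path
```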
from boltz_data.dataset import structure_dataset_from_path
from rich import print
dataset = structure_dataset_from_path("../../tests/data/structure_dataset")
print(dataset.model_dump())
{ 'started': datetime.datetime(2025, 11, 14, 12, 7, 8, 696174, tzinfo=datetime.timezone.utc), 'finished': datetime.datetime(2025, 11, 14, 12, 7, 12, 494391, tzinfo=datetime.timezone.utc), 'num_structures': 19, 'num_chains': 121, 'num_interfaces': 151, 'structures_path': PosixPath('structures'), 'templates_path': None, 'structure_metadata_path': PosixPath('structures.parquet'), 'chain_metadata_path': PosixPath('chains.parquet'), 'interface_metadata_path': PosixPath('interfaces.parquet'), 'template_metadata_path': None }
Accessing Structures#
The individual structures in a dataset are stored as BZBioMol files in the structures directory. These can be loaded using the boltz_data.dataset.StructureDataset.get_structure_bzmol method.
# Get BZBioMol for a structure
bzmol = dataset.get_structure_bzmol(structure_id="1abc")
print(f"Atoms: {bzmol.num_atoms}")
print(f"Chains: {bzmol.num_chains}")
Working with Metadata#
The individual metadata files can be opened as Polars dataframes using the boltz_data.dataset.StructureDataset.structures_df, boltz_data.dataset.StructureDataset.chains_df and boltz_data.dataset.StructureDataset.interfaces_df properties.
from boltz_data.dataset import structure_dataset_from_path
dataset = structure_dataset_from_path("../../tests/data/structure_dataset")
dataset.structures_df
| structure_id | original_structure_id | title | keywords | deposited_date | released_date | revised_date | structure_version | num_atoms | num_bonds | num_residues | num_chains | num_interfaces | method | resolution |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | date | date | date | str | i64 | i64 | i64 | i64 | i64 | str | f64 |
| "101D" | "101D" | "'REFINEMENT OF NETROPSIN BOUND… | "DNA" | 1994-12-14 | 1995-02-27 | 2023-11-22 | "1.3" | 528 | 586 | 26 | 4 | 0 | "'X-RAY DIFFRACTION'" | 2.25 |
| "157D" | "157D" | "'CRYSTAL AND MOLECULAR STRUCTU… | "RNA" | 1994-02-01 | 1994-05-31 | 2024-02-07 | "1.3" | 518 | 578 | 24 | 2 | 0 | "'X-RAY DIFFRACTION'" | 1.8 |
| "1HV4_ASM1" | "1HV4" | "'CRYSTAL STRUCTURE ANALYSIS OF… | "'OXYGEN STORAGE/TRANSPORT'" | 2001-01-07 | 2001-01-17 | 2023-08-09 | "1.4" | 4644 | 4780 | 578 | 8 | 0 | "'X-RAY DIFFRACTION'" | 2.8 |
| "1HV4_ASM2" | "1HV4" | "'CRYSTAL STRUCTURE ANALYSIS OF… | "'OXYGEN STORAGE/TRANSPORT'" | 2001-01-07 | 2001-01-17 | 2023-08-09 | "1.4" | 4644 | 4780 | 578 | 8 | 0 | "'X-RAY DIFFRACTION'" | 2.8 |
| "1L6X_ASM1" | "1L6X" | "'FC FRAGMENT OF RITUXIMAB BOUN… | "'IMMUNE SYSTEM'" | 2002-03-14 | 2002-04-10 | 2024-10-16 | "2.2" | 2073 | 2132 | 251 | 3 | 0 | "'X-RAY DIFFRACTION'" | 1.65 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "6B0B_ASM2" | "6B0B" | "'Crystal structure of human AP… | "HYDROLASE/RNA" | 2017-09-14 | 2017-10-25 | 2024-10-16 | "1.5" | 3925 | 4056 | 459 | 5 | 0 | "'X-RAY DIFFRACTION'" | 3.280062 |
| "7K2A_ASM1" | "7K2A" | "'Kelch domain of human KEAP1 b… | "'PROTEIN BINDING/Inhibitor'" | 2020-09-08 | 2021-04-07 | 2024-11-20 | "1.2" | 2387 | 2441 | 312 | 2 | 0 | "'X-RAY DIFFRACTION'" | 1.9 |
| "7K2A_ASM2" | "7K2A" | "'Kelch domain of human KEAP1 b… | "'PROTEIN BINDING/Inhibitor'" | 2020-09-08 | 2021-04-07 | 2024-11-20 | "1.2" | 2313 | 2367 | 301 | 1 | 0 | "'X-RAY DIFFRACTION'" | 1.9 |
| "7LL8" | "7LL8" | "'D-Protein RFX-V1 Bound to the… | "'BIOSYNTHETIC PROTEIN'" | 2021-02-03 | 2021-02-17 | 2024-11-06 | "1.5" | 2496 | 2554 | 312 | 4 | 0 | "'X-RAY DIFFRACTION'" | 2.31 |
| "8HNH" | "8HNH" | "'Cryo-EM structure of human OA… | "'MEMBRANE PROTEIN'" | 2022-12-07 | 2023-09-13 | 2024-10-30 | "1.2" | 5739 | 5878 | 728 | 4 | 0 | "'ELECTRON MICROSCOPY'" | 3.73 |
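Because one input file can yield several assemblies, it is sometimes useful to group structure IDs by their original entry. The sketch below works on plain dictionaries copied from the table above. Rows in this shape can be obtained from a Polars dataframe with `to_dicts()`. `group_by_original` is a hypothetical helper.

```python
from collections import defaultdict

# A few (structure_id, original_structure_id) pairs copied from the table above.
rows = [
    {"structure_id": "101D", "original_structure_id": "101D"},
    {"structure_id": "1HV4_ASM1", "original_structure_id": "1HV4"},
    {"structure_id": "1HV4_ASM2", "original_structure_id": "1HV4"},
    {"structure_id": "7K2A_ASM1", "original_structure_id": "7K2A"},
    {"structure_id": "7K2A_ASM2", "original_structure_id": "7K2A"},
]


def group_by_original(rows):
    """Map each original structure ID to the structure IDs derived from it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["original_structure_id"]].append(row["structure_id"])
    return dict(groups)


assemblies = group_by_original(rows)
print(assemblies["1HV4"])  # ['1HV4_ASM1', '1HV4_ASM2']
```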
from boltz_data.dataset import structure_dataset_from_path
dataset = structure_dataset_from_path("../../tests/data/structure_dataset")
dataset.chains_df
| structure_id | chain_id | description | sequence | smiles | resname | resnum | cluster_idx | num_atoms | num_residues | num_bonds | type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | i64 | i64 | i64 | i64 | i64 | str |
| "101D" | "A" | "DNA (5'-D(*CP*GP*CP*GP*AP*AP*T… | "CGCGAATT(CBR)GCG" | null | null | null | 12 | 248 | 12 | 277 | "dna" |
| "101D" | "B" | "DNA (5'-D(*CP*GP*CP*GP*AP*AP*T… | "CGCGAATT(CBR)GCG" | null | null | null | 12 | 248 | 12 | 277 | "dna" |
| "101D" | "C" | "MAGNESIUM ION" | null | "[Mg+2]" | "MG" | 26 | 28 | 1 | 1 | 0 | "small_molecule" |
| "101D" | "D" | "NETROPSIN" | null | "Cn1cc(NC(=O)c2cc(NC(=O)CNC(=N)… | "NT" | 25 | 24 | 31 | 1 | 32 | "small_molecule" |
| "157D" | "A" | "RNA (5'-R(*CP*GP*CP*GP*AP*AP*U… | "CGCGAAUUAGCG" | null | null | null | 13 | 259 | 12 | 289 | "rna" |
| … | … | … | … | … | … | … | … | … | … | … | … |
| "7LL8" | "D" | "RFX-V1" | null | "CCC(C)C(NC(=O)CNC(=O)C(C)NC(=O… | null | null | 22 | 413 | 53 | 422 | "small_molecule" |
| "8HNH" | "A" | "Solute carrier organic anion t… | "MDQNQHLNKTAEAQPSENKKTRYCNGLKMF… | null | null | null | 6 | 5643 | 724 | 5775 | "protein" |
| "8HNH" | "B" | "2-acetamido-2-deoxy-beta-D-glu… | null | "CC(=O)NC1C(O)OC(CO)C(OC2OC(CO)… | null | null | 19 | 29 | 2 | 30 | "small_molecule" |
| "8HNH" | "C" | "(2R,3aR,10Z,11aS,12aR,14aR)-N-… | null | "COc1ccc2c(OC3CC4C(=O)NC5(C(=O)… | "30B" | 801 | 23 | 52 | 1 | 58 | "small_molecule" |
| "8HNH" | "D" | "2-acetamido-2-deoxy-beta-D-glu… | null | "CC(=O)NC1C(O)OC(CO)C(O)C1O" | "NAG" | 802 | 18 | 15 | 1 | 15 | "small_molecule" |
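The `cluster_idx` column makes it possible to treat similar chains as a unit, for example when building splits that should not share clusters. The sketch below groups chains of one type by cluster, using a few rows copied from the table above as plain dictionaries. `chains_by_cluster` is a hypothetical helper.

```python
from collections import defaultdict

# Rows copied from the table above, reduced to the columns used here.
chain_rows = [
    {"structure_id": "101D", "chain_id": "A", "cluster_idx": 12, "type": "dna"},
    {"structure_id": "101D", "chain_id": "B", "cluster_idx": 12, "type": "dna"},
    {"structure_id": "101D", "chain_id": "C", "cluster_idx": 28, "type": "small_molecule"},
    {"structure_id": "157D", "chain_id": "A", "cluster_idx": 13, "type": "rna"},
    {"structure_id": "8HNH", "chain_id": "A", "cluster_idx": 6, "type": "protein"},
]


def chains_by_cluster(rows, chain_type):
    """Map cluster_idx to the (structure_id, chain_id) pairs of the given chain type."""
    clusters = defaultdict(list)
    for row in rows:
        if row["type"] == chain_type:
            clusters[row["cluster_idx"]].append((row["structure_id"], row["chain_id"]))
    return dict(clusters)


print(chains_by_cluster(chain_rows, "dna"))  # {12: [('101D', 'A'), ('101D', 'B')]}
```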
from boltz_data.dataset import structure_dataset_from_path
dataset = structure_dataset_from_path("../../tests/data/structure_dataset")
dataset.interfaces_df
| structure_id | chain1_id | chain2_id | num_atoms_within_5a | chain1_residues_within_5a | chain2_residues_within_5a |
|---|---|---|---|---|---|
| str | str | str | i64 | list[i64] | list[i64] |
| "101D" | "A" | "B" | 1168 | [0, 0, … 11] | [0, 0, … 11] |
| "101D" | "A" | "C" | 1168 | [0, 1] | [0, 0] |
| "101D" | "A" | "D" | 1168 | [3, 4, … 9] | [0, 0, … 0] |
| "101D" | "B" | "C" | 1168 | [9] | [0] |
| "101D" | "B" | "D" | 1168 | [3, 4, … 9] | [0, 0, … 0] |
| … | … | … | … | … | … |
| "7LL8" | "B" | "D" | 1571 | [39, 41, … 84] | [22, 24, … 51] |
| "8HNH" | "A" | "B" | 282 | [501, 502, 515] | [0, 0, 0] |
| "8HNH" | "A" | "C" | 282 | [37, 40, … 579] | [0, 0, … 0] |
| "8HNH" | "A" | "D" | 282 | [513, 515] | [0, 0] |
| "8HNH" | "B" | "D" | 282 | [0] | [0] |
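A common question to ask of the interface table is which chains a given chain is in contact with. The sketch below answers it for the 8HNH rows shown above, again using plain dictionaries. `interface_partners` is a hypothetical helper. It checks both chain columns, so it does not rely on any particular ordering of `chain1_id` and `chain2_id`.

```python
# Interface rows for structure 8HNH, copied from the table above.
interface_rows = [
    {"structure_id": "8HNH", "chain1_id": "A", "chain2_id": "B"},
    {"structure_id": "8HNH", "chain1_id": "A", "chain2_id": "C"},
    {"structure_id": "8HNH", "chain1_id": "A", "chain2_id": "D"},
    {"structure_id": "8HNH", "chain1_id": "B", "chain2_id": "D"},
]


def interface_partners(rows, structure_id, chain_id):
    """Return the sorted IDs of all chains sharing an interface with chain_id."""
    partners = set()
    for row in rows:
        if row["structure_id"] != structure_id:
            continue
        if row["chain1_id"] == chain_id:
            partners.add(row["chain2_id"])
        elif row["chain2_id"] == chain_id:
            partners.add(row["chain1_id"])
    return sorted(partners)


print(interface_partners(interface_rows, "8HNH", "B"))  # ['A', 'D']
```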