Structure Datasets#

The boltz_data.dataset module provides the logic for creating and accessing large-scale structure datasets for machine learning and analysis workflows.

What is a Structure Dataset?#

A structure dataset is a parsed collection of 3D structures of proteins, nucleic acids and small molecules. It consists of the structures themselves, parsed into BZBioMol objects and saved as compressed CBOR (.cbor.gz) files, and metadata, stored as Parquet files. The metadata describes the structures, chains and interfaces across the whole dataset.

Dataset Structure#

A typical structure dataset has the following form:

dataset/
├── dataset.yaml          # Dataset metadata, such as generation time and aggregated statistics
├── structures/           # Compressed BZBioMol files of each structure
│   ├── 1abc.cbor.gz
│   ├── 2xyz.cbor.gz
│   └── ...
├── structures.parquet    # Structure information, such as number of atoms, bonds, residues, chains, interfaces
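├── chains.parquet        # Chain information, such as chain ID, type, sequence, number of residues, number of atoms
└── interfaces.parquet    # Interface information, such as the pair of chain IDs and the residues in contact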
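
As a quick sanity check on a dataset folder, the layout can be verified with the standard library alone. This is a minimal sketch, not part of boltz_data: the helper names are hypothetical, the default entry list covers only dataset.yaml, structures/ and the structure and chain metadata files, and it assumes that a structure's ID is its file name without the .cbor.gz suffix.

```python
from pathlib import Path

# Illustrative default; pass a different tuple if your dataset has further files.
EXPECTED_ENTRIES = ("dataset.yaml", "structures", "structures.parquet", "chains.parquet")


def missing_entries(dataset_dir: Path, expected=EXPECTED_ENTRIES) -> list:
    """Return the expected top-level entries that are absent from dataset_dir."""
    return [name for name in expected if not (dataset_dir / name).exists()]


def list_structure_ids(dataset_dir: Path) -> list:
    """Derive structure IDs from file names in structures/ (1abc.cbor.gz -> 1abc).

    Assumption: the file stem equals the structure ID.
    """
    suffix = ".cbor.gz"
    return sorted(
        path.name[: -len(suffix)]
        for path in (dataset_dir / "structures").glob(f"*{suffix}")
    )
```

For normal use, prefer the StructureDataset class described under Loading Datasets; a check like this is only useful for spotting an incomplete or interrupted generation run.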

Generating Datasets#

Generating a structure dataset takes a set of input files (in mmCIF format) and produces a new structure dataset. This involves two stages:

1. Processing each structure in parallel. Each input structure may produce multiple output structures due to the presence of multiple assemblies.
2. Running clustering on the sets of protein and nucleic acid sequences to identify homologous sequences.
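
The one-to-many shape of the processing stage can be illustrated with a small sketch. It uses concurrent.futures from the standard library as a stand-in for the pipeline's own parallelism, and split_assemblies is a hypothetical placeholder for the real per-file parsing. The `<ID>_ASM<n>` naming mirrors the structure IDs that appear in the example metadata tables on this page; how the pipeline actually assigns names is an assumption here.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def split_assemblies(entry):
    """Hypothetical per-input step: return one output ID per assembly.

    `entry` is a (structure_id, num_assemblies) pair standing in for a parsed file.
    """
    structure_id, num_assemblies = entry
    if num_assemblies <= 1:
        return [structure_id]
    return [f"{structure_id}_ASM{i}" for i in range(1, num_assemblies + 1)]


def process_all(entries):
    """Process inputs concurrently and flatten the one-to-many results."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so the flattened output is deterministic.
        return list(chain.from_iterable(pool.map(split_assemblies, entries)))


outputs = process_all([("101D", 1), ("1HV4", 2), ("7K2A", 2)])
print(outputs)
```

Three inputs yield five outputs here, which is the same effect that turns 13 input CIFs into 19 structures in the example run below.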

Running the Pipeline#

To run the dataset generation pipeline, invoke the boltz_data.dataset.generate module, either through pixi (pixi run generate-dataset) or directly (python -m boltz_data.dataset.generate).

pixi run generate-dataset dataset /path/to/input

╭──────────────────────────────────────────────────────────────────────────────╮
│ Generating Structure Dataset                                                 │
╰──────────────────────────────────────────────────────────────────────────────╯
Processing 13 input CIFs...
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 • 0:00:03 • 0:00:00
Processed 13 input CIFs into 19 structures in 3 seconds
Clustered 16 unique protein sequences into 11 clusters in 5 seconds
Clustered 6 unique nucleic acid sequences into 6 clusters in 2 seconds
Found 14 unique SMILES
Wrote dataset to /tmp/tmp168gy_c8
Dataset size: 689.9 kB

2025-11-14 13:57:04,799	INFO worker.py:2013 -- Started a local Ray instance.

Generation options#

Run pixi run generate-dataset --help to see the available options for generating a dataset.

If files already exist in the output directory, generation fails unless the --replace-existing flag is used. When the flag is set, the output directory is cleared before generation begins.
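
The behaviour described for --replace-existing can be pictured with the following sketch. It is an illustration of the documented semantics, not the pipeline's code; prepare_output_dir and its exception type are hypothetical.

```python
import shutil
from pathlib import Path


def prepare_output_dir(output_dir: Path, replace_existing: bool = False) -> None:
    """Leave output_dir existing and empty, or refuse if it already has content."""
    if output_dir.exists() and any(output_dir.iterdir()):
        if not replace_existing:
            raise FileExistsError(
                f"{output_dir} is not empty; use replace_existing=True to clear it"
            )
        # Clearing is destructive: everything under output_dir is deleted.
        shutil.rmtree(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
```

Because clearing removes the whole directory tree, it is worth pointing the output at a dedicated folder rather than one shared with other data.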

The --sample flag can be used to randomly subsample the input files. This can be useful for testing or debugging.
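
Random subsampling of input files can be sketched as below. This is only an illustration: whether --sample takes a count or a fraction, and whether the real pipeline seeds its random generator, are not stated here, so the count-based signature and the seed parameter are assumptions.

```python
import random
from pathlib import Path
from typing import Optional


def sample_inputs(paths, sample: Optional[int], seed: int = 0):
    """Return `sample` randomly chosen paths, or all of them if no sampling is requested."""
    if sample is None or sample >= len(paths):
        return list(paths)
    # A dedicated Random instance keeps the choice reproducible for a given seed.
    return random.Random(seed).sample(list(paths), sample)
```

A fixed seed makes a debugging run repeatable, which helps when a specific input file triggers a failure.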

Steps during generation#

1. The input directories are searched recursively for mmCIF files.
2. All mmCIF files are parsed in parallel (using Ray). The resulting files are written to the structures directory.
3. Protein and nucleic acid sequences are clustered with MMseqs2.
4. Metadata about the structures is collated and written to the various metadata files.
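
The recursive search in the first step can be approximated with pathlib. This is a sketch under stated assumptions: find_mmcif_files is a hypothetical helper, and the set of file suffixes the real pipeline accepts is not documented here, so .cif and .cif.gz are guesses.

```python
from pathlib import Path


def find_mmcif_files(input_dirs, suffixes=(".cif", ".cif.gz")):
    """Recursively collect files whose names end with one of `suffixes`."""
    found = set()
    for input_dir in input_dirs:
        for path in Path(input_dir).rglob("*"):
            # Match on the lower-cased name so that .CIF files are also picked up.
            if path.is_file() and path.name.lower().endswith(tuple(suffixes)):
                found.add(path)
    # Sorting gives a stable order regardless of filesystem traversal order.
    return sorted(found)
```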

Loading Datasets#

Generally, a dataset should be accessed through the boltz_data.dataset.StructureDataset class rather than by reading the files in the dataset directory directly. This allows backwards-incompatible changes to be made to the on-disk format without breaking code that uses the class.

The entry point to this is the boltz_data.dataset.structure_dataset_from_path function. This can take a path to either the enclosing dataset folder or the dataset.yaml file itself.
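
Accepting either form of path amounts to a small normalisation step. The sketch below shows one way this could be done; it is not the implementation of structure_dataset_from_path, and resolve_dataset_yaml is a hypothetical name.

```python
from pathlib import Path


def resolve_dataset_yaml(path) -> Path:
    """Map a dataset folder or a dataset.yaml path to the dataset.yaml file."""
    path = Path(path)
    yaml_path = path / "dataset.yaml" if path.is_dir() else path
    if yaml_path.name != "dataset.yaml" or not yaml_path.is_file():
        raise FileNotFoundError(f"No dataset.yaml found for {path}")
    return yaml_path
```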

from boltz_data.dataset import structure_dataset_from_path
from rich import print

dataset = structure_dataset_from_path("../../tests/data/structure_dataset")
print(dataset.model_dump())

{
    'started': datetime.datetime(2025, 11, 14, 12, 7, 8, 696174, tzinfo=datetime.timezone.utc),
    'finished': datetime.datetime(2025, 11, 14, 12, 7, 12, 494391, tzinfo=datetime.timezone.utc),
    'num_structures': 19,
    'num_chains': 121,
    'num_interfaces': 151,
    'structures_path': PosixPath('structures'),
    'templates_path': None,
    'structure_metadata_path': PosixPath('structures.parquet'),
    'chain_metadata_path': PosixPath('chains.parquet'),
    'interface_metadata_path': PosixPath('interfaces.parquet'),
    'template_metadata_path': None
}
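
Two things can be read off this dump with plain Python: how long generation took, and where each component lives on disk. The sketch below copies a few values from the output above. It assumes that the path fields are relative to the dataset folder and that None marks an optional component that is absent; both helpers are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path

# Values copied from the model_dump() output above.
dump = {
    "started": datetime(2025, 11, 14, 12, 7, 8, 696174, tzinfo=timezone.utc),
    "finished": datetime(2025, 11, 14, 12, 7, 12, 494391, tzinfo=timezone.utc),
    "structures_path": Path("structures"),
    "templates_path": None,
}


def generation_seconds(dump) -> float:
    """Wall-clock duration of generation, in seconds."""
    return (dump["finished"] - dump["started"]).total_seconds()


def resolve_component(dataset_root, relative):
    """Resolve a path field against the dataset folder (assumed to be the base)."""
    return None if relative is None else Path(dataset_root) / relative


print(f"{generation_seconds(dump):.2f} s")
print(resolve_component("dataset", dump["structures_path"]))
```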

Accessing Structures#

The individual structures in a dataset are stored as BZBioMol files in the structures directory. These can be loaded using the boltz_data.dataset.StructureDataset.get_structure_bzmol method.

# Get BZBioMol for a structure
bzmol = dataset.get_structure_bzmol(structure_id="1abc")

print(f"Atoms: {bzmol.num_atoms}")
print(f"Chains: {bzmol.num_chains}")

Working with Metadata#

The individual metadata files can be opened as Polars dataframes using the boltz_data.dataset.StructureDataset.structures_df, boltz_data.dataset.StructureDataset.chains_df and boltz_data.dataset.StructureDataset.interfaces_df properties.

from boltz_data.dataset import structure_dataset_from_path

dataset = structure_dataset_from_path("../../tests/data/structure_dataset")
dataset.structures_df

shape: (19, 15)

structure_id	original_structure_id	title	keywords	deposited_date	released_date	revised_date	structure_version	num_atoms	num_bonds	num_residues	num_chains	num_interfaces	method	resolution
str	str	str	str	date	date	date	str	i64	i64	i64	i64	i64	str	f64
"101D"	"101D"	"'REFINEMENT OF NETROPSIN BOUND…	"DNA"	1994-12-14	1995-02-27	2023-11-22	"1.3"	528	586	26	4	0	"'X-RAY DIFFRACTION'"	2.25
"157D"	"157D"	"'CRYSTAL AND MOLECULAR STRUCTU…	"RNA"	1994-02-01	1994-05-31	2024-02-07	"1.3"	518	578	24	2	0	"'X-RAY DIFFRACTION'"	1.8
"1HV4_ASM1"	"1HV4"	"'CRYSTAL STRUCTURE ANALYSIS OF…	"'OXYGEN STORAGE/TRANSPORT'"	2001-01-07	2001-01-17	2023-08-09	"1.4"	4644	4780	578	8	0	"'X-RAY DIFFRACTION'"	2.8
"1HV4_ASM2"	"1HV4"	"'CRYSTAL STRUCTURE ANALYSIS OF…	"'OXYGEN STORAGE/TRANSPORT'"	2001-01-07	2001-01-17	2023-08-09	"1.4"	4644	4780	578	8	0	"'X-RAY DIFFRACTION'"	2.8
"1L6X_ASM1"	"1L6X"	"'FC FRAGMENT OF RITUXIMAB BOUN…	"'IMMUNE SYSTEM'"	2002-03-14	2002-04-10	2024-10-16	"2.2"	2073	2132	251	3	0	"'X-RAY DIFFRACTION'"	1.65
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
"6B0B_ASM2"	"6B0B"	"'Crystal structure of human AP…	"HYDROLASE/RNA"	2017-09-14	2017-10-25	2024-10-16	"1.5"	3925	4056	459	5	0	"'X-RAY DIFFRACTION'"	3.280062
"7K2A_ASM1"	"7K2A"	"'Kelch domain of human KEAP1 b…	"'PROTEIN BINDING/Inhibitor'"	2020-09-08	2021-04-07	2024-11-20	"1.2"	2387	2441	312	2	0	"'X-RAY DIFFRACTION'"	1.9
"7K2A_ASM2"	"7K2A"	"'Kelch domain of human KEAP1 b…	"'PROTEIN BINDING/Inhibitor'"	2020-09-08	2021-04-07	2024-11-20	"1.2"	2313	2367	301	1	0	"'X-RAY DIFFRACTION'"	1.9
"7LL8"	"7LL8"	"'D-Protein RFX-V1 Bound to the…	"'BIOSYNTHETIC PROTEIN'"	2021-02-03	2021-02-17	2024-11-06	"1.5"	2496	2554	312	4	0	"'X-RAY DIFFRACTION'"	2.31
"8HNH"	"8HNH"	"'Cryo-EM structure of human OA…	"'MEMBRANE PROTEIN'"	2022-12-07	2023-09-13	2024-10-30	"1.2"	5739	5878	728	4	0	"'ELECTRON MICROSCOPY'"	3.73

from boltz_data.dataset import structure_dataset_from_path

dataset = structure_dataset_from_path("../../tests/data/structure_dataset")
dataset.chains_df

shape: (121, 12)

structure_id	chain_id	description	sequence	smiles	resname	resnum	cluster_idx	num_atoms	num_residues	num_bonds	type
str	str	str	str	str	str	i64	i64	i64	i64	i64	str
"101D"	"A"	"DNA (5'-D(CPGPCPGPAPAP*T…	"CGCGAATT(CBR)GCG"	null	null	null	12	248	12	277	"dna"
"101D"	"B"	"DNA (5'-D(CPGPCPGPAPAP*T…	"CGCGAATT(CBR)GCG"	null	null	null	12	248	12	277	"dna"
"101D"	"C"	"MAGNESIUM ION"	null	"[Mg+2]"	"MG"	26	28	1	1	0	"small_molecule"
"101D"	"D"	"NETROPSIN"	null	"Cn1cc(NC(=O)c2cc(NC(=O)CNC(=N)…	"NT"	25	24	31	1	32	"small_molecule"
"157D"	"A"	"RNA (5'-R(CPGPCPGPAPAP*U…	"CGCGAAUUAGCG"	null	null	null	13	259	12	289	"rna"
…	…	…	…	…	…	…	…	…	…	…	…
"7LL8"	"D"	"RFX-V1"	null	"CCC(C)C(NC(=O)CNC(=O)C(C)NC(=O…	null	null	22	413	53	422	"small_molecule"
"8HNH"	"A"	"Solute carrier organic anion t…	"MDQNQHLNKTAEAQPSENKKTRYCNGLKMF…	null	null	null	6	5643	724	5775	"protein"
"8HNH"	"B"	"2-acetamido-2-deoxy-beta-D-glu…	null	"CC(=O)NC1C(O)OC(CO)C(OC2OC(CO)…	null	null	19	29	2	30	"small_molecule"
"8HNH"	"C"	"(2R,3aR,10Z,11aS,12aR,14aR)-N-…	null	"COc1ccc2c(OC3CC4C(=O)NC5(C(=O)…	"30B"	801	23	52	1	58	"small_molecule"
"8HNH"	"D"	"2-acetamido-2-deoxy-beta-D-glu…	null	"CC(=O)NC1C(O)OC(CO)C(O)C1O"	"NAG"	802	18	15	1	15	"small_molecule"

from boltz_data.dataset import structure_dataset_from_path

dataset = structure_dataset_from_path("../../tests/data/structure_dataset")
dataset.interfaces_df

shape: (151, 6)

structure_id	chain1_id	chain2_id	num_atoms_within_5a	chain1_residues_within_5a	chain2_residues_within_5a
str	str	str	i64	list[i64]	list[i64]
"101D"	"A"	"B"	1168	[0, 0, … 11]	[0, 0, … 11]
"101D"	"A"	"C"	1168	[0, 1]	[0, 0]
"101D"	"A"	"D"	1168	[3, 4, … 9]	[0, 0, … 0]
"101D"	"B"	"C"	1168	[9]	[0]
"101D"	"B"	"D"	1168	[3, 4, … 9]	[0, 0, … 0]
…	…	…	…	…	…
"7LL8"	"B"	"D"	1571	[39, 41, … 84]	[22, 24, … 51]
"8HNH"	"A"	"B"	282	[501, 502, 515]	[0, 0, 0]
"8HNH"	"A"	"C"	282	[37, 40, … 579]	[0, 0, … 0]
"8HNH"	"A"	"D"	282	[513, 515]	[0, 0]
"8HNH"	"B"	"D"	282	[0]	[0]
