Structure Definitions#

The boltz_data.definition module provides classes for defining molecular structures programmatically, separate from coordinate data.

What are Structure Definitions?#

Structure definitions describe molecular topology (sequences, chains, and bonds) without requiring 3D coordinates. They’re useful for:

  • Creating structures from scratch

  • Defining theoretical complexes before generating coordinates

  • Storing structure metadata independently

  • Round-trip conversion to and from BZMol

Core Classes#

EntityDefinition#

An entity is a distinct chemical component (protein chain, DNA strand, ligand, etc.).

Types:

  • ProteinDefinition - Protein sequence

  • DNADefinition - DNA sequence

  • RNADefinition - RNA sequence

  • LigandCCDDefinition - Small molecule from CCD

  • LigandSMILESDefinition - Small molecule from SMILES

  • BranchedPolymerDefinition - Glycans and branched structures

StructureDefinition#

A complete structure containing one or more entities and the chains that instantiate them. Each chain refers to its entity by index into the entities list.

from boltz_data.definition import StructureDefinition, ChainDefinition

structure = StructureDefinition(
    entities=[...],  # List of EntityDefinition objects
    chains={...},    # Dict mapping chain IDs to ChainDefinition
    bonds=None       # Optional inter-chain bonds
)

Creating Protein Structures#

Simple Protein#

from boltz_data.definition import ProteinDefinition

# Define a protein entity
protein = ProteinDefinition(
    type="protein",
    sequence="MKFLKFSLLTAVLLSVVFAFSSCGDDDDTGYLPPSQAIQDLLKRM",
    description="Example protein"
)

Protein with Non-Standard Residues#

Write a non-standard residue as its component ID in parentheses within the sequence string:

protein = ProteinDefinition(
    type="protein",
    sequence="MKF(MSE)KF",  # Selenomethionine at position 4
    description="Protein with selenomethionine"
)
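
To make the notation concrete, here is a minimal sketch of how such a sequence string could be split into one token per residue. The helper `tokenize_sequence` is hypothetical and written for illustration only; it is not the parser boltz_data uses, and it assumes parentheses are never nested.

```python
def tokenize_sequence(sequence: str) -> list[str]:
    """Split a sequence string into one token per residue.

    Single letters are standard residues; a parenthesized group such as
    "(MSE)" is one non-standard residue. Hypothetical helper, not part
    of boltz_data.
    """
    tokens = []
    i = 0
    while i < len(sequence):
        if sequence[i] == "(":
            close = sequence.find(")", i)
            if close == -1:
                raise ValueError(f"unclosed '(' at offset {i}")
            tokens.append(sequence[i + 1:close])  # component ID without parentheses
            i = close + 1
        else:
            tokens.append(sequence[i])
            i += 1
    return tokens


tokens = tokenize_sequence("MKF(MSE)KF")
print(tokens)                   # ['M', 'K', 'F', 'MSE', 'K', 'F']
print(tokens.index("MSE") + 1)  # 4, the one-based position of the selenomethionine
```

Counting tokens rather than characters is what makes "position 4" well defined once a parenthesized residue appears in the string.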

Protein with Custom Bonds#

from boltz_data.definition import ProteinDefinition, InternalBond

# Define disulfide bonds
protein = ProteinDefinition(
    type="protein",
    sequence="CGGGC",
    bonds=[
        InternalBond(
            residue_index_1=0,  # First cysteine
            atom_name_1="SG",
            residue_index_2=4,  # Last cysteine
            atom_name_2="SG",
            bond_order=1
        )
    ]
)
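
Because the residue indices in the example are zero-based, an off-by-one mistake is easy to make. The sketch below checks a bond against a plain one-letter sequence before it is handed to the library. `Bond` is a hypothetical stand-in that only mirrors the field names shown above, and `find_bond_problems` is an illustrative helper; neither is part of boltz_data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bond:
    """Hypothetical stand-in with the same field names as the example above."""
    residue_index_1: int
    atom_name_1: str
    residue_index_2: int
    atom_name_2: str
    bond_order: int = 1


def find_bond_problems(sequence: str, bond: Bond) -> list[str]:
    """Return a list of human-readable problems; empty if the bond looks sane.

    Assumes a plain one-letter sequence without parenthesized residues.
    """
    problems = []
    ends = (
        (bond.residue_index_1, bond.atom_name_1),
        (bond.residue_index_2, bond.atom_name_2),
    )
    for index, atom in ends:
        if not 0 <= index < len(sequence):
            problems.append(f"residue index {index} is outside 0..{len(sequence) - 1}")
        elif atom == "SG" and sequence[index] != "C":
            # Among the standard amino acids only cysteine has an SG atom.
            problems.append(f"residue {index} is {sequence[index]}, which has no SG atom")
    return problems


print(find_bond_problems("CGGGC", Bond(0, "SG", 4, "SG")))  # []
print(find_bond_problems("CGGGC", Bond(1, "SG", 4, "SG")))
```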

Creating Nucleic Acid Structures#

DNA#

from boltz_data.definition import DNADefinition

dna = DNADefinition(
    type="dna",
    sequence="ATCGATCG",
    description="Example DNA strand"
)

RNA#

from boltz_data.definition import RNADefinition

rna = RNADefinition(
    type="rna",
    sequence="AUCGAUCG",
    description="Example RNA strand"
)
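
The DNA and RNA examples differ only in their alphabets (T versus U), so passing a sequence to the wrong class is a plausible mistake. A small pre-check such as the following can catch it early. This is an illustrative sketch: the helper is hypothetical, and whether boltz_data accepts further symbols (ambiguity codes, parenthesized modified nucleotides) is not stated here.

```python
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")


def invalid_letters(sequence: str, alphabet: set[str]) -> list[str]:
    """Return the sorted letters of `sequence` that are not in `alphabet`."""
    return sorted(set(sequence) - alphabet)


print(invalid_letters("ATCGATCG", DNA_ALPHABET))  # []
print(invalid_letters("ATCGATCG", RNA_ALPHABET))  # ['T']
print(invalid_letters("AUCGAUCG", RNA_ALPHABET))  # []
```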

Creating Ligand Structures#

From CCD#

from boltz_data.definition import LigandCCDDefinition

ligand = LigandCCDDefinition(
    type="ligand_ccd",
    comp_id="ATP",
    description="Adenosine triphosphate"
)

From SMILES#

from boltz_data.definition import LigandSMILESDefinition

ligand = LigandSMILESDefinition(
    type="ligand_smiles",
    smiles="CCO",
    description="Ethanol"
)

Creating Multi-Chain Structures#

Protein-Ligand Complex#

from boltz_data.definition import (
    StructureDefinition,
    ChainDefinition,
    ProteinDefinition,
    LigandCCDDefinition,
)

# Define entities
protein = ProteinDefinition(
    type="protein",
    sequence="MKFLKFSLLTAVLLSVVFAFSSCGDDDDTGYLPPSQAIQDLLKRM"
)

ligand = LigandCCDDefinition(
    type="ligand_ccd",
    comp_id="ATP"
)

# Create structure with two chains
structure = StructureDefinition(
    entities=[protein, ligand],
    chains={
        "A": ChainDefinition(entity_idx=0),  # Protein on chain A
        "B": ChainDefinition(entity_idx=1),  # Ligand on chain B
    }
)

Protein Dimer#

# Same entity used for both chains
structure = StructureDefinition(
    entities=[protein],  # Single entity
    chains={
        "A": ChainDefinition(entity_idx=0),  # First copy
        "B": ChainDefinition(entity_idx=0),  # Second copy (same entity)
    }
)
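
The relationship between chains and entities is a many-to-one mapping: several chain IDs may point at the same entity index. The sketch below models that mapping with plain dictionaries to show how copies per entity can be counted and how a dangling index can be detected. The data layout and the helper name are assumptions made for illustration, not boltz_data types.

```python
from collections import Counter


def copies_per_entity(chains: dict[str, int], n_entities: int) -> Counter:
    """Count how many chains reference each entity index.

    `chains` maps chain ID -> entity index (a simplified stand-in for
    ChainDefinition.entity_idx). Raises if an index points past the
    end of the entity list.
    """
    for chain_id, entity_idx in chains.items():
        if not 0 <= entity_idx < n_entities:
            raise IndexError(f"chain {chain_id!r} references missing entity {entity_idx}")
    return Counter(chains.values())


# A homodimer plus one ligand: chains A and B share entity 0.
chains = {"A": 0, "B": 0, "C": 1}
counts = copies_per_entity(chains, n_entities=2)
print(counts[0], counts[1])  # 2 1
```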

With Residue Numbers#

structure = StructureDefinition(
    entities=[protein],
    chains={
        "A": ChainDefinition(
            entity_idx=0,
            residue_numbers=list(range(1, 46))  # Custom numbering, one number per residue (45 residues)
        ),
    }
)
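
Custom numbering is most useful when a chain does not start at residue 1, for example when a construct covers only part of a longer protein. The sketch below derives the list from the sequence length so the two cannot drift apart. It assumes that one residue number is expected per residue; the helper name and the start value of 101 are illustrative.

```python
def make_residue_numbers(n_residues: int, start: int = 1) -> list[int]:
    """Return consecutive residue numbers beginning at `start`."""
    return list(range(start, start + n_residues))


sequence = "MKFLKFSLLTAVLLSVVFAFSSCGDDDDTGYLPPSQAIQDLLKRM"
numbers = make_residue_numbers(len(sequence), start=101)

print(len(sequence))            # 45
print(numbers[0], numbers[-1])  # 101 145
```

The resulting list could then be passed as `residue_numbers` when building the chain.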

Inter-Chain Bonds#

Connect different chains with InterChainBond:

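from boltz_data.definition import ChainDefinition, InterChainBond, StructureDefinition

# protein1 and protein2 are ProteinDefinition objects defined elsewhere;
# the bonded residues need an SG atom (for example cysteine).
structure = StructureDefinition(
    entities=[protein1, protein2],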
    chains={
        "A": ChainDefinition(entity_idx=0),
        "B": ChainDefinition(entity_idx=1),
    },
    bonds=[
        InterChainBond(
            chain_id_1="A",
            residue_index_1=5,  # Residue ordinal in chain
            atom_name_1="SG",
            chain_id_2="B",
            residue_index_2=12,
            atom_name_2="SG",
            bond_order=1
        )
    ]
)

Converting to BZMol#

Convert a single entity definition, or a full structure definition, to a BZMol:

from boltz_data.ccd import chemical_component_dictionary_from_path
from boltz_data.mol import bzmol_from_definition, bzmol_from_structure

# Load CCD dictionary
ccd = chemical_component_dictionary_from_path("ccd.pkl.gz")

# Convert a single entity definition to BZMol
bzmol = bzmol_from_definition(
    protein,
    chemical_component_dictionary=ccd
)

# Or convert a full StructureDefinition
bzmol = bzmol_from_structure(structure, ccd)

Converting from BZMol#

Extract a structure definition from a BZMol:

from boltz_data.mol import structure_from_bzmol

# Convert BZMol back to a definition
structure = structure_from_bzmol(bzmol, chemical_component_dictionary=ccd)

# Now you can inspect or modify the definition
for entity in structure.entities:
    if entity.type == "protein":
        print(f"Protein sequence: {entity.sequence}")

Branched Polymers#

For glycans and other branched structures:

from boltz_data.definition import BranchedPolymerDefinition, InternalBond

glycan = BranchedPolymerDefinition(
    type="branched_polymer",
    comp_ids=["NAG", "NAG", "BMA", "MAN"],
    bonds=[
        InternalBond(
            residue_index_1=0,
            atom_name_1="O4",
            residue_index_2=1,
            atom_name_2="C1",
            bond_order=1
        ),
        InternalBond(
            residue_index_1=1,
            atom_name_1="O4",
            residue_index_2=2,
            atom_name_2="C1",
            bond_order=1
        ),
        # 1-6 linkage; BMA becomes a branch point once another residue is attached (e.g. at O3)
        InternalBond(
            residue_index_1=2,
            atom_name_1="O6",
            residue_index_2=3,
            atom_name_2="C1",
            bond_order=1
        ),
    ]
)
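
A residue acts as a branch point when it is bonded to three or more other residues. The following sketch finds such residues from a list of residue-index pairs. The pairs for `linear` mirror the three bonds in the example above; the fifth residue in `branched` is hypothetical, and the helper is not part of boltz_data.

```python
from collections import defaultdict


def branch_points(bonds: list[tuple[int, int]]) -> list[int]:
    """Return residue indices bonded to three or more other residues."""
    neighbours = defaultdict(set)
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return sorted(r for r, linked in neighbours.items() if len(linked) >= 3)


# Residue-index pairs taken from the bonds in the example above.
linear = [(0, 1), (1, 2), (2, 3)]
# Hypothetical extra residue 4 attached to residue 2.
branched = linear + [(2, 4)]

print(branch_points(linear))    # []
print(branch_points(branched))  # [2]
```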

Serialization#

Structure definitions are Pydantic models and can be serialized to JSON or YAML:

from boltz_data import fs
from boltz_data.definition import StructureDefinition

# Save as JSON
fs.write_json("structure.json", structure)

# Save as YAML
fs.write_yaml("structure.yaml", structure)

# Load back
structure = fs.read_object("structure.yaml", as_=StructureDefinition)
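
To give a feel for what a serialized definition might contain, the sketch below round-trips a plain dictionary through the standard `json` module. The field names are taken from the constructor arguments used throughout this page; the exact layout that `fs.write_json` produces is not shown here, so treat the structure as an assumption rather than the library's file format.

```python
import json

# Assumed layout: keys mirror the constructor keywords used on this page.
definition = {
    "entities": [
        {"type": "protein", "sequence": "CGGGC"},
        {"type": "ligand_ccd", "comp_id": "ATP"},
    ],
    "chains": {
        "A": {"entity_idx": 0},
        "B": {"entity_idx": 1},
    },
    "bonds": None,
}

text = json.dumps(definition, indent=2)
restored = json.loads(text)

print(restored == definition)           # True
print(restored["entities"][1]["type"])  # ligand_ccd
```

The `type` field on each entity is what would let a loader decide which entity class to rebuild.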

Use Cases#

1. Theoretical Structure Design#

Design a structure before generating coordinates:

from boltz_data.definition import ProteinDefinition
from boltz_data.mol import bzmol_from_definition, generate_conformer

# Design a fusion protein
fusion = ProteinDefinition(
    type="protein",
    sequence="MKFLK" + "GGS" + "AIQDK",  # Protein1 + linker + Protein2
)

# Generate coordinates later (ccd is the dictionary loaded in "Converting to BZMol")
bzmol = bzmol_from_definition(fusion, chemical_component_dictionary=ccd)
bzmol = generate_conformer(bzmol, seed=42)
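
When a fusion has more than two domains, it helps to assemble the sequence in one place and keep track of where each domain ends up. The helper below is a hypothetical sketch that joins domains with a linker and records zero-based, end-exclusive spans; it works on plain strings and does not call boltz_data.

```python
def build_fusion(domains: list[str], linker: str = "GGS") -> tuple[str, list[tuple[int, int]]]:
    """Join domains with a linker.

    Returns the full sequence and, for each domain, its (start, end) span
    in that sequence (zero-based, end-exclusive).
    """
    sequence = ""
    spans = []
    for i, domain in enumerate(domains):
        if i > 0:
            sequence += linker
        start = len(sequence)
        sequence += domain
        spans.append((start, len(sequence)))
    return sequence, spans


sequence, spans = build_fusion(["MKFLK", "AIQDK"])
print(sequence)  # MKFLKGGSAIQDK
print(spans)     # [(0, 5), (8, 13)]
```

The resulting string could be passed as `sequence` to `ProteinDefinition`, and the spans are handy later for selecting one domain from the generated structure.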

2. Structure Modification#

Modify sequences programmatically:

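from boltz_data.definition import ProteinDefinition, StructureDefinition
from boltz_data.mol import bzmol_from_structure, structure_from_bzmol

# Load structure (bzmol and ccd as in the conversion sections)
structure = structure_from_bzmol(bzmol, ccd)

# Get entity
protein_entity = structure.entities[0]

# Modify sequence: replace residue 11 (zero-based index 10) with alanine.
# Plain string slicing assumes the sequence has no parenthesized residues.
mutated_sequence = protein_entity.sequence[:10] + "A" + protein_entity.sequence[11:]
mutated_entity = ProteinDefinition(
    type="protein",
    sequence=mutated_sequence
)

# Create new structure (assumes the original has a single entity,
# so every chain still refers to entity_idx=0)
new_structure = StructureDefinition(
    entities=[mutated_entity],
    chains=structure.chains
)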

# Convert back to BZMol
new_bzmol = bzmol_from_structure(new_structure, ccd)
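
Slicing the sequence string by character offset only works while every residue is a single letter. If the sequence contains parenthesized residues such as `(MSE)`, the mutation has to be applied per residue instead. The helper below is a hypothetical sketch of that idea using one-based positions; it is not a boltz_data function and assumes parentheses are balanced and not nested.

```python
import re


def mutate(sequence: str, position: int, new_residue: str) -> str:
    """Replace the residue at one-based `position` with `new_residue`.

    A multi-character `new_residue` is treated as a component ID and
    written in parentheses.
    """
    # Each token is either a parenthesized group or a single character.
    tokens = re.findall(r"\([^()]+\)|.", sequence)
    if not 1 <= position <= len(tokens):
        raise IndexError(f"position {position} is outside 1..{len(tokens)}")
    tokens[position - 1] = new_residue if len(new_residue) == 1 else f"({new_residue})"
    return "".join(tokens)


print(mutate("MKF(MSE)KF", 5, "A"))  # MKF(MSE)AF
print(mutate("MKF(MSE)KF", 4, "M"))  # MKFMKF
print(mutate("MKFLK", 2, "MSE"))     # M(MSE)FLK
```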

3. Complex Assembly#

Build multi-component complexes:

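from boltz_data.definition import (
    ChainDefinition,
    DNADefinition,
    LigandCCDDefinition,
    ProteinDefinition,
    StructureDefinition,
)
from boltz_data.mol import bzmol_from_structure

# Define all components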
protein = ProteinDefinition(type="protein", sequence="MKFLKF...")
dna = DNADefinition(type="dna", sequence="ATCGATCG")
ligand = LigandCCDDefinition(type="ligand_ccd", comp_id="ATP")

# Assemble into structure
complex_structure = StructureDefinition(
    entities=[protein, dna, ligand],
    chains={
        "A": ChainDefinition(entity_idx=0),  # Protein
        "B": ChainDefinition(entity_idx=1),  # DNA
        "C": ChainDefinition(entity_idx=2),  # Ligand
    }
)

# Generate 3D structure
bzmol = bzmol_from_structure(complex_structure, ccd)
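
For larger assemblies, writing the chain mapping by hand becomes error-prone. The sketch below generates it from a list of copy counts, one per entity. The helper is hypothetical, and the use of single uppercase letters as chain IDs simply follows the examples on this page; it is an assumption, not a documented requirement.

```python
from string import ascii_uppercase


def assign_chain_ids(copies: list[int]) -> dict[str, int]:
    """Map chain IDs "A", "B", ... to entity indices.

    `copies[i]` is the number of chains to create for entity `i`.
    """
    if sum(copies) > len(ascii_uppercase):
        raise ValueError("more chains requested than single-letter IDs available")
    mapping = {}
    letters = iter(ascii_uppercase)
    for entity_idx, n_copies in enumerate(copies):
        for _ in range(n_copies):
            mapping[next(letters)] = entity_idx
    return mapping


# Two copies of entity 0, one each of entities 1 and 2.
print(assign_chain_ids([2, 1, 1]))  # {'A': 0, 'B': 0, 'C': 1, 'D': 2}
```

Each value could then be wrapped as `ChainDefinition(entity_idx=...)` when building the `chains` argument.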

API Reference#

For detailed API documentation, see: