File I/O and Serialization#

The boltz_data.fs module provides a unified interface for reading and writing data in multiple formats, with automatic compression support.

Overview#

Key features:

  • Multiple formats: Support for JSON, YAML, CBOR, pickle, Parquet, CSV, and CIF

  • Compression: Paths ending in .gz, .bz2, or .xz are compressed and decompressed automatically

  • Smart I/O: read_object() and write_object() infer the format from the file extension

  • Type safety: Pydantic model serialization/deserialization

  • Remote files: Works with remote sources such as S3, GCS, and HTTP via smart_open

Quick Start#

Smart Object I/O#

The simplest way to save/load data:

from boltz_data import fs
from boltz_data.mol import bzmol_from_smiles

# Create a molecule
mol = bzmol_from_smiles("CCO")

# Write automatically detects format from extension
fs.write_object("molecule.json", mol)
fs.write_object("molecule.yaml", mol)
fs.write_object("molecule.cbor", mol)
fs.write_object("molecule.pkl", mol)

# Read automatically detects format
mol = fs.read_object("molecule.json")

With Compression#

Compression is automatic based on extension:

# These are automatically compressed
fs.write_object("molecule.json.gz", mol)    # gzip
fs.write_object("molecule.cbor.gz", mol)    # gzip
fs.write_object("molecule.pkl.bz2", mol)    # bzip2
fs.write_object("molecule.yaml.xz", mol)    # lzma

# Automatically decompressed when reading
mol = fs.read_object("molecule.cbor.gz")

Format-Specific Functions#

JSON#

Human-readable text format:

from boltz_data import fs

# Write JSON
fs.write_json("data.json", {"key": "value"})
fs.write_json("data.json.gz", {"key": "value"})  # Compressed

# Read JSON
data = fs.read_json("data.json")

Use JSON when:

  • Human readability is important

  • Data will be consumed by other tools

  • Debugging or inspection is needed

Pros: Human-readable, universal support
Cons: Larger file size, slower than binary formats

YAML#

Human-friendly configuration format:

from boltz_data import fs

# Write YAML
fs.write_yaml("config.yaml", {
    "model": "my_model",
    "parameters": {"lr": 0.001}
})

# Read YAML
config = fs.read_yaml("config.yaml")

Use YAML when:

  • Writing configuration files

  • Human editing is expected

  • Comments are needed

Pros: Most human-readable, supports comments
Cons: Slower parsing, larger files

CBOR#

Binary format, optimized for scientific data:

from boltz_data import fs

# Write CBOR
fs.write_cbor("data.cbor", mol)
fs.write_cbor("data.cbor.gz", mol)  # Recommended for BZMol

# Read CBOR
mol = fs.read_cbor("data.cbor")

Use CBOR when:

  • Storing BZMol objects

  • Binary efficiency is important

  • NumPy arrays are involved (RFC8746 support)

Pros: Fast, compact, preserves NumPy dtypes
Cons: Not human-readable

Special feature: Full support for NumPy arrays via RFC8746:

import numpy as np

data = {
    "coordinates": np.random.rand(1000, 3).astype(np.float32),
    "labels": np.array([1, 2, 3], dtype=np.uint8)
}

# Array dtypes are preserved
fs.write_cbor("arrays.cbor.gz", data)
loaded = fs.read_cbor("arrays.cbor.gz")

assert loaded["coordinates"].dtype == np.float32
assert loaded["labels"].dtype == np.uint8

Pickle#

Python-specific binary format:

from boltz_data import fs

# Write pickle
fs.write_pickle("data.pkl", any_python_object)
fs.write_pickle("data.pkl.gz", any_python_object)

# Read pickle
obj = fs.read_pickle("data.pkl")

Use Pickle when:

  • Working in Python-only workflows

  • Serializing custom Python objects

  • Maximum compatibility with existing code is needed

Pros: Handles any Python object
Cons: Python-only, security concerns, version sensitivity

Polars DataFrames#

Efficient storage for tabular data:

from boltz_data import fs
import polars as pl

# Create DataFrame
df = pl.DataFrame({
    "id": ["1abc", "2xyz"],
    "resolution": [1.5, 2.0]
})

# Write as CSV or Parquet
fs.write_polars_df("data.csv", df)
fs.write_polars_df("data.parquet", df)
fs.write_polars_df("data.parquet.gz", df)

# Read back
df = fs.read_polars_df("data.csv")
df = fs.read_polars_df("data.parquet")

Use Polars when:

  • Working with tabular/structured data

  • Handling large datasets

  • SQL-like operations are needed

Pros: Very fast, columnar format, great for analytics
Cons: Requires Polars

Type-Safe I/O#

With Pydantic Models#

from boltz_data import fs
from boltz_data.definition import ProteinDefinition

# Create model
protein = ProteinDefinition(
    type="protein",
    sequence="MKFLKF"
)

# Write
fs.write_object("protein.yaml", protein)

# Read with type checking
protein = fs.read_object("protein.yaml", as_=ProteinDefinition)

# This will raise a validation error if the file doesn't match
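
A minimal error-handling sketch (assuming read_object() surfaces Pydantic's ValidationError when the file does not match the model):

from pydantic import ValidationError

try:
    protein = fs.read_object("protein.yaml", as_=ProteinDefinition)
except ValidationError as e:
    print(f"Invalid ProteinDefinition: {e}")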

With BZMol#

from boltz_data import fs
from boltz_data.mol import BZMol, bzmol_from_smiles

# Create molecule
mol = bzmol_from_smiles("CCO")

# Write (CBOR recommended for BZMol)
fs.write_object("molecule.cbor.gz", mol)

# Read with type hint
mol = fs.read_object("molecule.cbor.gz", as_=BZMol)

File Path Utilities#

Get Extension#

from boltz_data import fs

# Get extension, ignoring compression
ext = fs.get_extension("data.json.gz", ignore_compression=True)
# Returns: ".json"

ext = fs.get_extension("data.json.gz", ignore_compression=False)
# Returns: ".gz"
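
One common use is dispatching on the logical format yourself; a small sketch (the path is illustrative):

from boltz_data import fs

path = "results/metrics.parquet.gz"

# Dispatch on the logical format, ignoring the compression suffix
ext = fs.get_extension(path, ignore_compression=True)
if ext == ".parquet":
    data = fs.read_polars_df(path)
elif ext == ".json":
    data = fs.read_json(path)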

Copy Files#

from boltz_data import fs

# Copy file (handles remote sources/destinations)
fs.copy_file("source.cbor.gz", "destination.cbor.gz")

# Works with S3, GCS, etc.
fs.copy_file("s3://bucket/file.cbor.gz", "local.cbor.gz")

Remote File Support#

All I/O functions support remote files via smart_open:

S3 Example#

from boltz_data import fs

# Write to S3
fs.write_object("s3://my-bucket/molecule.cbor.gz", mol)

# Read from S3
mol = fs.read_object("s3://my-bucket/molecule.cbor.gz")

HTTP Example#

# Read from HTTP
data = fs.read_json("https://example.com/data.json")

GCS Example#

# Google Cloud Storage
fs.write_object("gs://my-bucket/data.cbor.gz", mol)
mol = fs.read_object("gs://my-bucket/data.cbor.gz")

Format Comparison#

Format  | Size  | Speed     | Human-Readable | NumPy Support       | Use Case
--------|-------|-----------|----------------|---------------------|-------------------------
CBOR    | Small | Fast      | No             | Excellent (RFC8746) | BZMol, scientific data
JSON    | Large | Medium    | Yes            | Poor                | Debug, interchange
YAML    | Large | Slow      | Yes            | Poor                | Config, human editing
Pickle  | Small | Fast      | No             | Good                | Python-only workflows
Parquet | Small | Very fast | No             | Good                | Tabular data, analytics

Best Practices#

1. Choose the Right Format#

# For BZMol objects - use CBOR with compression
fs.write_object("molecule.cbor.gz", mol)

# For configuration - use YAML
fs.write_yaml("config.yaml", config)

# For tabular data - use Parquet
fs.write_polars_df("chains.parquet", chains_df)

# For debugging - use JSON
fs.write_json("debug.json", data)

2. Always Compress Large Files#

# Good - compressed
fs.write_object("large_mol.cbor.gz", large_mol)

# Bad - uncompressed wastes space
fs.write_object("large_mol.cbor", large_mol)

3. Use Type Hints#

from boltz_data.mol import BZMol

# Good - type checked
mol = fs.read_object("molecule.cbor.gz", as_=BZMol)

# Less safe - no type checking
mol = fs.read_object("molecule.cbor.gz")

4. Handle Errors#

from pathlib import Path

path = Path("molecule.cbor.gz")
if path.exists():
    mol = fs.read_object(path)
else:
    print(f"File not found: {path}")

Common Workflows#

Save Dataset of Molecules#

from boltz_data import fs
from boltz_data.mol import bzmol_from_smiles
from pathlib import Path

# Create molecules
molecules = {
    "ethanol": bzmol_from_smiles("CCO"),
    "methanol": bzmol_from_smiles("CO"),
    "propanol": bzmol_from_smiles("CCCO"),
}

# Save each
output_dir = Path("molecules")
output_dir.mkdir(exist_ok=True)

for name, mol in molecules.items():
    fs.write_object(output_dir / f"{name}.cbor.gz", mol)

# Load all back
loaded_molecules = {}
for path in output_dir.glob("*.cbor.gz"):
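    # Path.stem drops only the trailing ".gz", so strip the remaining ".cbor" manually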
    name = path.stem.replace(".cbor", "")
    loaded_molecules[name] = fs.read_object(path)

Configuration Management#

# Save configuration
config = {
    "model": {
        "hidden_size": 256,
        "num_layers": 4,
    },
    "training": {
        "learning_rate": 0.001,
        "batch_size": 32,
    }
}

fs.write_yaml("config.yaml", config)

# Load and use
config = fs.read_yaml("config.yaml")
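# Model is a placeholder for your own model class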
model = Model(hidden_size=config["model"]["hidden_size"])

Export Results#

import polars as pl

# Process results
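# (molecules and calculate_mass() are placeholders defined elsewhere)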
results = []
for mol in molecules:
    result = {
        "name": mol.name,
        "num_atoms": mol.num_atoms,
        "mass": calculate_mass(mol),
    }
    results.append(result)

# Save as DataFrame
df = pl.DataFrame(results)
fs.write_polars_df("results.csv", df)
fs.write_polars_df("results.parquet.gz", df)  # More efficient

Batch Processing with Checkpointing#

from pathlib import Path
from tqdm import tqdm

checkpoint_dir = Path("checkpoints")
checkpoint_dir.mkdir(exist_ok=True)

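# structure_ids and process_structure() are placeholders defined elsewhere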
for i, structure_id in enumerate(tqdm(structure_ids)):
    checkpoint_file = checkpoint_dir / f"{structure_id}.cbor.gz"

    # Skip if already processed
    if checkpoint_file.exists():
        continue

    # Process
    bzmol = process_structure(structure_id)

    # Save checkpoint
    fs.write_object(checkpoint_file, bzmol)

Advanced: Direct CBOR Usage#

For advanced users who need fine-grained control:

from boltz_data.cbor import dumps, loads
import numpy as np

# Serialize to bytes
data = {
    "array": np.array([1, 2, 3], dtype=np.float32),
    "value": 42
}
cbor_bytes = dumps(data)

# Deserialize from bytes
loaded = loads(cbor_bytes)

# Can be used with any transport (network, database, etc.)

Performance Tips#

  1. Use CBOR for BZMol objects - 2-3x faster than JSON

  2. Compress large files - .gz adds minimal time, saves 70-90% space

  3. Use Parquet for DataFrames - Much faster than CSV for large data

  4. Batch small files - Combine many small objects into one file (see the first sketch below)

  5. Stream large data - Process records one at a time instead of loading everything into memory (see the second sketch below)
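
A minimal batching sketch, assuming write_object()/read_object() round-trip a plain dict of molecules:

from boltz_data import fs
from boltz_data.mol import bzmol_from_smiles

molecules = {
    "ethanol": bzmol_from_smiles("CCO"),
    "methanol": bzmol_from_smiles("CO"),
}

# One write instead of one file per molecule
fs.write_object("all_molecules.cbor.gz", molecules)

# One read brings them all back
molecules = fs.read_object("all_molecules.cbor.gz")

And a streaming sketch using smart_open directly, assuming a JSON-lines layout; process() is a hypothetical callback:

import json
from smart_open import open as smart_open_file

# smart_open decompresses .gz transparently; records are parsed one at a time
with smart_open_file("s3://my-bucket/records.jsonl.gz", "r") as f:
    for line in f:
        process(json.loads(line))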

API Reference#

For detailed API documentation, see: