File I/O and Serialization#

The boltz_data.fs module provides a unified interface for reading and writing data in multiple formats, with automatic compression support.

Overview#

Key features:

  • Multiple formats: Support for JSON, YAML, CBOR, pickle, Parquet, CSV, and CIF

  • Compression: Paths ending in .gz, .bz2, or .xz are compressed and decompressed automatically

  • Smart I/O: read_object() and write_object() infer the format from the file extension

  • Type safety: Pydantic model serialization/deserialization

  • Remote files: Works with remote sources such as S3, GCS, and HTTP via smart_open

Quick Start#

Smart Object I/O#

The simplest way to save/load data:

from boltz_data import fs
from boltz_data.mol import bzmol_from_smiles

# Create a molecule
mol = bzmol_from_smiles("CCO")

# Write automatically detects format from extension
fs.write_object("molecule.json", mol)
fs.write_object("molecule.yaml", mol)
fs.write_object("molecule.cbor", mol)
fs.write_object("molecule.pkl", mol)

# Read automatically detects format
mol = fs.read_object("molecule.json")

With Compression#

Compression is automatic based on extension:

# These are automatically compressed
fs.write_object("molecule.json.gz", mol)    # gzip
fs.write_object("molecule.cbor.gz", mol)    # gzip
fs.write_object("molecule.pkl.bz2", mol)    # bzip2
fs.write_object("molecule.yaml.xz", mol)    # lzma

# Automatically decompressed when reading
mol = fs.read_object("molecule.cbor.gz")

Format-Specific Functions#

JSON#

Human-readable text format:

from boltz_data import fs

# Write JSON
fs.write_json("data.json", {"key": "value"})
fs.write_json("data.json.gz", {"key": "value"})  # Compressed

# Read JSON
data = fs.read_json("data.json")

Use JSON when:

  • Human readability is important

  • Data will be consumed by other tools

  • Debugging or inspection is needed

Pros: Human-readable, universal support
Cons: Larger file size, slower than binary formats

YAML#

Human-friendly configuration format:

from boltz_data import fs

# Write YAML
fs.write_yaml("config.yaml", {
    "model": "my_model",
    "parameters": {"lr": 0.001}
})

# Read YAML
config = fs.read_yaml("config.yaml")

Use YAML when:

  • Writing configuration files

  • Human editing is expected

  • Comments are needed

Pros: Most human-readable, supports comments
Cons: Slower parsing, larger files

CBOR#

Binary format, optimized for scientific data:

from boltz_data import fs

# Write CBOR
fs.write_cbor("data.cbor", mol)
fs.write_cbor("data.cbor.gz", mol)  # Recommended for BZMol

# Read CBOR
mol = fs.read_cbor("data.cbor")

Use CBOR when:

  • Storing BZMol objects

  • Binary efficiency is important

  • NumPy arrays are involved (RFC8746 support)

Pros: Fast, compact, preserves NumPy dtypes
Cons: Not human-readable

Special feature: Full support for NumPy arrays via RFC8746:

import numpy as np

data = {
    "coordinates": np.random.rand(1000, 3).astype(np.float32),
    "labels": np.array([1, 2, 3], dtype=np.uint8)
}

# Array dtypes are preserved
fs.write_cbor("arrays.cbor.gz", data)
loaded = fs.read_cbor("arrays.cbor.gz")

assert loaded["coordinates"].dtype == np.float32
assert loaded["labels"].dtype == np.uint8

Pickle#

Python-specific binary format:

from boltz_data import fs

# Write pickle
fs.write_pickle("data.pkl", any_python_object)
fs.write_pickle("data.pkl.gz", any_python_object)

# Read pickle
obj = fs.read_pickle("data.pkl")

Use Pickle when:

  • Working in Python-only workflows

  • Serializing custom Python objects

  • Maximum compatibility with existing code is needed

Pros: Handles any Python object
Cons: Python-only, security concerns, version sensitivity

Polars DataFrames#

Efficient storage for tabular data:

from boltz_data import fs
import polars as pl

# Create DataFrame
df = pl.DataFrame({
    "id": ["1abc", "2xyz"],
    "resolution": [1.5, 2.0]
})

# Write as CSV or Parquet
fs.write_polars_df("data.csv", df)
fs.write_polars_df("data.parquet", df)
fs.write_polars_df("data.parquet.gz", df)

# Read back
df = fs.read_polars_df("data.csv")
df = fs.read_polars_df("data.parquet")

Use Polars when:

  • Working with tabular/structured data

  • Handling large datasets

  • SQL-like operations are needed

Pros: Very fast, columnar format, great for analytics
Cons: Requires Polars

Type-Safe I/O#

With Pydantic Models#

from boltz_data import fs
from boltz_data.definition import ProteinDefinition

# Create model
protein = ProteinDefinition(
    type="protein",
    sequence="MKFLKF"
)

# Write
fs.write_object("protein.yaml", protein)

# Read with type checking
protein = fs.read_object("protein.yaml", as_=ProteinDefinition)

# This will raise a validation error if the file doesn't match
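
A minimal error-handling sketch (assuming read_object() surfaces Pydantic's ValidationError when the file does not match the model):

from pydantic import ValidationError

try:
    protein = fs.read_object("protein.yaml", as_=ProteinDefinition)
except ValidationError as e:
    print(f"Invalid ProteinDefinition: {e}")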

With BZMol#

from boltz_data import fs
from boltz_data.mol import BZMol, bzmol_from_smiles

# Create molecule
mol = bzmol_from_smiles("CCO")

# Write (CBOR recommended for BZMol)
fs.write_object("molecule.cbor.gz", mol)

# Read with type hint
mol = fs.read_object("molecule.cbor.gz", as_=BZMol)

File Path Utilities#

Get Extension#

from boltz_data import fs

# Get extension, ignoring compression
ext = fs.get_extension("data.json.gz", ignore_compression=True)
# Returns: ".json"

ext = fs.get_extension("data.json.gz", ignore_compression=False)
# Returns: ".gz"
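
One common use is dispatching on the logical format yourself; a small sketch (the path is illustrative):

from boltz_data import fs

path = "results/metrics.parquet.gz"

# Dispatch on the logical format, ignoring the compression suffix
ext = fs.get_extension(path, ignore_compression=True)
if ext == ".parquet":
    data = fs.read_polars_df(path)
elif ext == ".json":
    data = fs.read_json(path)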

Copy Files#

from boltz_data import fs

# Copy file (handles remote sources/destinations)
fs.copy_file("source.cbor.gz", "destination.cbor.gz")

# Works with S3, GCS, etc.
fs.copy_file("s3://bucket/file.cbor.gz", "local.cbor.gz")

Remote File Support#

All I/O functions support remote files via smart_open:

S3 Example#

from boltz_data import fs

# Write to S3
fs.write_object("s3://my-bucket/molecule.cbor.gz", mol)

# Read from S3
mol = fs.read_object("s3://my-bucket/molecule.cbor.gz")

HTTP Example#

# Read from HTTP
data = fs.read_json("https://example.com/data.json")

GCS Example#

# Google Cloud Storage
fs.write_object("gs://my-bucket/data.cbor.gz", mol)
mol = fs.read_object("gs://my-bucket/data.cbor.gz")

Format Comparison#

Format  | Size  | Speed     | Human-Readable | NumPy Support       | Use Case
--------|-------|-----------|----------------|---------------------|-------------------------
CBOR    | Small | Fast      | No             | Excellent (RFC8746) | BZMol, scientific data
JSON    | Large | Medium    | Yes            | Poor                | Debug, interchange
YAML    | Large | Slow      | Yes            | Poor                | Config, human editing
Pickle  | Small | Fast      | No             | Good                | Python-only workflows
Parquet | Small | Very fast | No             | Good                | Tabular data, analytics

Best Practices#

1. Choose the Right Format#

# For BZMol objects - use CBOR with compression
fs.write_object("molecule.cbor.gz", mol)

# For configuration - use YAML
fs.write_yaml("config.yaml", config)

# For tabular data - use Parquet
fs.write_polars_df("chains.parquet", chains_df)

# For debugging - use JSON
fs.write_json("debug.json", data)

2. Always Compress Large Files#

# Good - compressed
fs.write_object("large_mol.cbor.gz", large_mol)

# Bad - uncompressed wastes space
fs.write_object("large_mol.cbor", large_mol)

3. Use Type Hints#

from boltz_data.mol import BZMol

# Good - type checked
mol = fs.read_object("molecule.cbor.gz", as_=BZMol)

# Less safe - no type checking
mol = fs.read_object("molecule.cbor.gz")

4. Handle Errors#

from pathlib import Path

path = Path("molecule.cbor.gz")
if path.exists():
    mol = fs.read_object(path)
else:
    print(f"File not found: {path}")

Common Workflows#

Save Dataset of Molecules#

from boltz_data import fs
from boltz_data.mol import bzmol_from_smiles
from pathlib import Path

# Create molecules
molecules = {
    "ethanol": bzmol_from_smiles("CCO"),
    "methanol": bzmol_from_smiles("CO"),
    "propanol": bzmol_from_smiles("CCCO"),
}

# Save each
output_dir = Path("molecules")
output_dir.mkdir(exist_ok=True)

for name, mol in molecules.items():
    fs.write_object(output_dir / f"{name}.cbor.gz", mol)

# Load all back
loaded_molecules = {}
for path in output_dir.glob("*.cbor.gz"):
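    # Path.stem drops only the trailing ".gz", so strip the remaining ".cbor" manually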
    name = path.stem.replace(".cbor", "")
    loaded_molecules[name] = fs.read_object(path)

Configuration Management#

# Save configuration
config = {
    "model": {
        "hidden_size": 256,
        "num_layers": 4,
    },
    "training": {
        "learning_rate": 0.001,
        "batch_size": 32,
    }
}

fs.write_yaml("config.yaml", config)

# Load and use
config = fs.read_yaml("config.yaml")
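# Model is a placeholder for your own model class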
model = Model(hidden_size=config["model"]["hidden_size"])

Export Results#

import polars as pl

# Process results
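# (molecules and calculate_mass() are placeholders defined elsewhere)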
results = []
for mol in molecules:
    result = {
        "name": mol.name,
        "num_atoms": mol.num_atoms,
        "mass": calculate_mass(mol),
    }
    results.append(result)

# Save as DataFrame
df = pl.DataFrame(results)
fs.write_polars_df("results.csv", df)
fs.write_polars_df("results.parquet.gz", df)  # More efficient

Batch Processing with Checkpointing#

from pathlib import Path
from tqdm import tqdm

checkpoint_dir = Path("checkpoints")
checkpoint_dir.mkdir(exist_ok=True)

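# structure_ids and process_structure() are placeholders defined elsewhere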
for i, structure_id in enumerate(tqdm(structure_ids)):
    checkpoint_file = checkpoint_dir / f"{structure_id}.cbor.gz"

    # Skip if already processed
    if checkpoint_file.exists():
        continue

    # Process
    bzmol = process_structure(structure_id)

    # Save checkpoint
    fs.write_object(checkpoint_file, bzmol)

Advanced: Direct CBOR Usage#

For advanced users who need fine-grained control:

from boltz_data.cbor import dumps, loads
import numpy as np

# Serialize to bytes
data = {
    "array": np.array([1, 2, 3], dtype=np.float32),
    "value": 42
}
cbor_bytes = dumps(data)

# Deserialize from bytes
loaded = loads(cbor_bytes)

# Can be used with any transport (network, database, etc.)

Performance Tips#

  1. Use CBOR for BZMol objects - 2-3x faster than JSON

  2. Compress large files - .gz adds minimal time, saves 70-90% space

  3. Use Parquet for DataFrames - Much faster than CSV for large data

  4. Batch small files - Combine many small objects into one file (see the first sketch below)

  5. Stream large data - Process records one at a time instead of loading everything into memory (see the second sketch below)
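
A minimal batching sketch, assuming write_object()/read_object() round-trip a plain dict of molecules:

from boltz_data import fs
from boltz_data.mol import bzmol_from_smiles

molecules = {
    "ethanol": bzmol_from_smiles("CCO"),
    "methanol": bzmol_from_smiles("CO"),
}

# One write instead of one file per molecule
fs.write_object("all_molecules.cbor.gz", molecules)

# One read brings them all back
molecules = fs.read_object("all_molecules.cbor.gz")

And a streaming sketch using smart_open directly, assuming a JSON-lines layout; process() is a hypothetical callback:

import json
from smart_open import open as smart_open_file

# smart_open decompresses .gz transparently; records are parsed one at a time
with smart_open_file("s3://my-bucket/records.jsonl.gz", "r") as f:
    for line in f:
        process(json.loads(line))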

API Reference#

For detailed API documentation, see: