File I/O and Serialization#
The boltz_data.fs module provides a unified interface for reading and writing data in multiple formats, with automatic compression support.
Overview#
Key features:
Multiple formats: Support for JSON, YAML, CBOR, pickle, Parquet, CSV, and CIF
Compression: Automatically handles paths ending with .gz and .bz2
Smart I/O: read_object() and write_object() infer the format from the file extension
Type safety: Pydantic model serialization/deserialization
Remote files: Works with remote sources such as S3, GCS, and HTTP via smart_open
Quick Start#
Smart Object I/O#
The simplest way to save/load data:
from boltz_data import fs
from boltz_data.mol import bzmol_from_smiles
# Create a molecule
mol = bzmol_from_smiles("CCO")
# Write automatically detects format from extension
fs.write_object("molecule.json", mol)
fs.write_object("molecule.yaml", mol)
fs.write_object("molecule.cbor", mol)
fs.write_object("molecule.pkl", mol)
# Read automatically detects format
mol = fs.read_object("molecule.json")
With Compression#
Compression is automatic based on extension:
# These are automatically compressed
fs.write_object("molecule.json.gz", mol) # gzip
fs.write_object("molecule.cbor.gz", mol) # gzip
fs.write_object("molecule.pkl.bz2", mol) # bzip2
fs.write_object("molecule.yaml.xz", mol) # lzma
# Automatically decompressed when reading
mol = fs.read_object("molecule.cbor.gz")
Format-Specific Functions#
JSON#
Human-readable text format:
from boltz_data import fs
# Write JSON
fs.write_json("data.json", {"key": "value"})
fs.write_json("data.json.gz", {"key": "value"}) # Compressed
# Read JSON
data = fs.read_json("data.json")
Use JSON when:
Human readability is important
Data will be consumed by other tools
Debugging or inspection is needed
Pros: Human-readable, universal support
Cons: Larger file size, slower than binary formats
YAML#
Human-friendly configuration format:
from boltz_data import fs
# Write YAML
fs.write_yaml("config.yaml", {
"model": "my_model",
"parameters": {"lr": 0.001}
})
# Read YAML
config = fs.read_yaml("config.yaml")
Use YAML when:
Configuration files
Human editing is expected
Comments are needed
Pros: Most human-readable, supports comments
Cons: Slower parsing, larger files
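Because YAML allows comments, a hand-edited config can carry inline documentation and still load normally. A minimal sketch (the file contents here are illustrative; note that write_yaml, like most YAML emitters, does not write comments back out):
from pathlib import Path
from boltz_data import fs
# Write a hand-authored config with comments (write_yaml would not emit them)
Path("config.yaml").write_text(
    "# Training configuration\n"
    "model: my_model\n"
    "parameters:\n"
    "  lr: 0.001  # initial learning rate\n"
)
config = fs.read_yaml("config.yaml")  # comments are simply ignored when parsing
assert config["parameters"]["lr"] == 0.001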
CBOR#
Binary format, optimized for scientific data:
from boltz_data import fs
# Write CBOR
fs.write_cbor("data.cbor", mol)
fs.write_cbor("data.cbor.gz", mol) # Recommended for BZMol
# Read CBOR
mol = fs.read_cbor("data.cbor")
Use CBOR when:
Storing BZMol objects
Binary efficiency is important
NumPy arrays are involved (RFC8746 support)
Pros: Fast, compact, preserves NumPy dtypes
Cons: Not human-readable
Special feature: Full support for NumPy arrays via RFC8746:
import numpy as np
data = {
"coordinates": np.random.rand(1000, 3).astype(np.float32),
"labels": np.array([1, 2, 3], dtype=np.uint8)
}
# Array dtypes are preserved
fs.write_cbor("arrays.cbor.gz", data)
loaded = fs.read_cbor("arrays.cbor.gz")
assert loaded["coordinates"].dtype == np.float32
assert loaded["labels"].dtype == np.uint8
Pickle#
Python-specific binary format:
from boltz_data import fs
# Write pickle
fs.write_pickle("data.pkl", any_python_object)
fs.write_pickle("data.pkl.gz", any_python_object)
# Read pickle
obj = fs.read_pickle("data.pkl")
Use Pickle when:
Python-only workflows
Custom Python objects
Maximum compatibility with existing code
Pros: Handles any Python object
Cons: Python-only, security concerns, version sensitivity
Polars DataFrames#
Efficient storage for tabular data:
from boltz_data import fs
import polars as pl
# Create DataFrame
df = pl.DataFrame({
"id": ["1abc", "2xyz"],
"resolution": [1.5, 2.0]
})
# Write as CSV or Parquet
fs.write_polars_df("data.csv", df)
fs.write_polars_df("data.parquet", df)
fs.write_polars_df("data.parquet.gz", df)
# Read back
df = fs.read_polars_df("data.csv")
df = fs.read_polars_df("data.parquet")
Use Polars when:
Tabular/structured data
Large datasets
Need SQL-like operations
Pros: Very fast, columnar format, great for analytics
Cons: Requires Polars
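The DataFrames returned by read_polars_df are ordinary Polars frames, so the usual expression API applies directly. A quick illustration using plain Polars (the filter shown is hypothetical and not part of boltz_data.fs):
import polars as pl
from boltz_data import fs
df = fs.read_polars_df("data.parquet")
# Standard Polars expressions for filtering and sorting
high_res = df.filter(pl.col("resolution") < 2.0).sort("resolution")
print(high_res)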
Type-Safe I/O#
With Pydantic Models#
from boltz_data import fs
from boltz_data.definition import ProteinDefinition
# Create model
protein = ProteinDefinition(
type="protein",
sequence="MKFLKF"
)
# Write
fs.write_object("protein.yaml", protein)
# Read with type checking
protein = fs.read_object("protein.yaml", as_=ProteinDefinition)
# This will raise a validation error if the file doesn't match
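If the file parses but its contents do not match the requested model, the failure can be handled explicitly. A sketch assuming the error surfaces as a standard Pydantic ValidationError:
from pydantic import ValidationError
try:
    protein = fs.read_object("protein.yaml", as_=ProteinDefinition)
except ValidationError as err:
    # The file was readable, but its contents don't match ProteinDefinition
    print(f"Invalid protein definition: {err}")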
With BZMol#
from boltz_data import fs
from boltz_data.mol import BZMol, bzmol_from_smiles
# Create molecule
mol = bzmol_from_smiles("CCO")
# Write (CBOR recommended for BZMol)
fs.write_object("molecule.cbor.gz", mol)
# Read with type hint
mol = fs.read_object("molecule.cbor.gz", as_=BZMol)
File Path Utilities#
Get Extension#
from boltz_data import fs
# Get extension, ignoring compression
ext = fs.get_extension("data.json.gz", ignore_compression=True)
# Returns: ".json"
ext = fs.get_extension("data.json.gz", ignore_compression=False)
# Returns: ".gz"
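One practical use is grouping files by logical format regardless of compression. A small sketch (the outputs/ directory is hypothetical; paths are passed as strings, matching the examples above):
from pathlib import Path
from boltz_data import fs
# Collect every JSON file in a directory, whether compressed or not
json_files = [
    p for p in Path("outputs").iterdir()
    if fs.get_extension(str(p), ignore_compression=True) == ".json"
]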
Copy Files#
from boltz_data import fs
# Copy file (handles remote sources/destinations)
fs.copy_file("source.cbor.gz", "destination.cbor.gz")
# Works with S3, GCS, etc.
fs.copy_file("s3://bucket/file.cbor.gz", "local.cbor.gz")
Remote File Support#
All I/O functions support remote files via smart_open:
S3 Example#
from boltz_data import fs
# Write to S3
fs.write_object("s3://my-bucket/molecule.cbor.gz", mol)
# Read from S3
mol = fs.read_object("s3://my-bucket/molecule.cbor.gz")
HTTP Example#
# Read from HTTP
data = fs.read_json("https://example.com/data.json")
GCS Example#
# Google Cloud Storage
fs.write_object("gs://my-bucket/data.cbor.gz", mol)
mol = fs.read_object("gs://my-bucket/data.cbor.gz")
Format Comparison#
| Format | Size | Speed | Human-Readable | NumPy Support | Use Case |
|---|---|---|---|---|---|
| CBOR | Small | Fast | No | Excellent (RFC8746) | BZMol, scientific data |
| JSON | Large | Medium | Yes | Poor | Debug, interchange |
| YAML | Large | Slow | Yes | Poor | Config, human editing |
| Pickle | Small | Fast | No | Good | Python-only workflows |
| Parquet | Small | Very fast | No | Good | Tabular data, analytics |
Best Practices#
1. Choose the Right Format#
# For BZMol objects - use CBOR with compression
fs.write_object("molecule.cbor.gz", mol)
# For configuration - use YAML
fs.write_yaml("config.yaml", config)
# For tabular data - use Parquet
fs.write_polars_df("chains.parquet", chains_df)
# For debugging - use JSON
fs.write_json("debug.json", data)
2. Always Compress Large Files#
# Good - compressed
fs.write_object("large_mol.cbor.gz", large_mol)
# Bad - uncompressed wastes space
fs.write_object("large_mol.cbor", large_mol)
3. Use Type Hints#
from boltz_data.mol import BZMol
# Good - type checked
mol = fs.read_object("molecule.cbor.gz", as_=BZMol)
# Less safe - no type checking
mol = fs.read_object("molecule.cbor.gz")
4. Handle Errors#
from pathlib import Path
from boltz_data import fs
path = Path("molecule.cbor.gz")
if path.exists():
mol = fs.read_object(path)
else:
print(f"File not found: {path}")
Common Workflows#
Save Dataset of Molecules#
from boltz_data import fs
from boltz_data.mol import bzmol_from_smiles
from pathlib import Path
# Create molecules
molecules = {
"ethanol": bzmol_from_smiles("CCO"),
"methanol": bzmol_from_smiles("CO"),
"propanol": bzmol_from_smiles("CCCO"),
}
# Save each
output_dir = Path("molecules")
output_dir.mkdir(exist_ok=True)
for name, mol in molecules.items():
fs.write_object(output_dir / f"{name}.cbor.gz", mol)
# Load all back
loaded_molecules = {}
for path in output_dir.glob("*.cbor.gz"):
name = path.stem.replace(".cbor", "")
loaded_molecules[name] = fs.read_object(path)
Configuration Management#
from boltz_data import fs
# Save configuration
config = {
"model": {
"hidden_size": 256,
"num_layers": 4,
},
"training": {
"learning_rate": 0.001,
"batch_size": 32,
}
}
fs.write_yaml("config.yaml", config)
# Load and use
config = fs.read_yaml("config.yaml")
model = Model(hidden_size=config["model"]["hidden_size"])  # Model is a user-defined class
Export Results#
import polars as pl
from boltz_data import fs
# Process results (molecules is an iterable of BZMol objects; calculate_mass is a user-supplied helper)
results = []
for mol in molecules:
result = {
"name": mol.name,
"num_atoms": mol.num_atoms,
"mass": calculate_mass(mol),
}
results.append(result)
# Save as DataFrame
df = pl.DataFrame(results)
fs.write_polars_df("results.csv", df)
fs.write_polars_df("results.parquet.gz", df) # More efficient
Batch Processing with Checkpointing#
from pathlib import Path
from tqdm import tqdm
from boltz_data import fs
checkpoint_dir = Path("checkpoints")
checkpoint_dir.mkdir(exist_ok=True)
# structure_ids and process_structure are user-supplied
for structure_id in tqdm(structure_ids):
checkpoint_file = checkpoint_dir / f"{structure_id}.cbor.gz"
# Skip if already processed
if checkpoint_file.exists():
continue
# Process
bzmol = process_structure(structure_id)
# Save checkpoint
fs.write_object(checkpoint_file, bzmol)
Advanced: Direct CBOR Usage#
For advanced users needing fine control:
from boltz_data.cbor import dumps, loads
import numpy as np
# Serialize to bytes
data = {
"array": np.array([1, 2, 3], dtype=np.float32),
"value": 42
}
cbor_bytes = dumps(data)
# Deserialize from bytes
loaded = loads(cbor_bytes)
# Can be used with any transport (network, database, etc.)
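For example, the raw bytes can be stored in a database blob column and decoded later. A sketch using the standard-library sqlite3 module (the table and key names are hypothetical):
import sqlite3
conn = sqlite3.connect("cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS blobs (key TEXT PRIMARY KEY, payload BLOB)")
conn.execute("INSERT OR REPLACE INTO blobs VALUES (?, ?)", ("example", cbor_bytes))
conn.commit()
(payload,) = conn.execute("SELECT payload FROM blobs WHERE key = ?", ("example",)).fetchone()
restored = loads(payload)  # round-trips back to the original dict, NumPy dtypes included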
Performance Tips#
Use CBOR for BZMol objects - 2-3x faster than JSON
Compress large files - .gz adds minimal time and saves 70-90% of space
Use Parquet for DataFrames - Much faster than CSV for large data
Batch small files - Combine many small objects into one file (see the sketch after this list)
Stream large data - Process records line by line instead of loading everything at once
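For the batching tip, one simple approach is to put many small objects into a single container and write it once; a sketch, assuming a plain dict of BZMol values round-trips through write_object the same way a single BZMol does:
from boltz_data import fs
from boltz_data.mol import bzmol_from_smiles
# One file instead of many tiny ones (assumes a dict of BZMol serializes via CBOR)
batch = {
    "ethanol": bzmol_from_smiles("CCO"),
    "methanol": bzmol_from_smiles("CO"),
}
fs.write_object("molecules_batch.cbor.gz", batch)
batch = fs.read_object("molecules_batch.cbor.gz")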
API Reference#
For detailed API documentation, see: