Polymer Sequences

Biological polymers (proteins, DNA, RNA) are chains of monomers/residues. Each standard residue has a one-letter code (A, K, G) and a component ID (ALA, LYS, GLY).

Sequence Notation

Protein sequences are written with single-letter amino acid codes (MKFLKF...), DNA with the letters ATCG, and RNA with AUCG. A non-standard residue is written as its component ID in parentheses: MKF(MSE)KF.
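To make the notation concrete, here is a minimal sketch of how such a string could be split into per-residue tokens. The function name tokenize_sequence is hypothetical and is not part of boltz_data; the library may parse sequences differently.

```python
def tokenize_sequence(sequence: str) -> list[str]:
    """Split a sequence string into one token per residue.

    A bare character is kept as a one-letter code. A parenthesized run such
    as "(MSE)" becomes a single token holding the component ID.
    """
    tokens: list[str] = []
    i = 0
    while i < len(sequence):
        char = sequence[i]
        if char == "(":
            close = sequence.find(")", i)
            if close == -1:
                raise ValueError(f"Unclosed parenthesis at position {i}")
            comp_id = sequence[i + 1 : close]
            if not comp_id:
                raise ValueError(f"Empty parentheses at position {i}")
            tokens.append(comp_id)
            i = close + 1  # continue after the closing parenthesis
        elif char == ")":
            raise ValueError(f"Unmatched ')' at position {i}")
        else:
            tokens.append(char)
            i += 1
    return tokens


print(tokenize_sequence("MKF(MSE)KF"))
# ['M', 'K', 'F', 'MSE', 'K', 'F']
```

Note that this sketch does not record whether a token came from parentheses, so a one-letter component ID written as (A) would be indistinguishable from the plain code A.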

Conversions

Sequence to Residue Names

from boltz_data.sequence import residue_names_from_sequence

residues = residue_names_from_sequence("ACDEF", polymer_type="protein")
# ['ALA', 'CYS', 'ASP', 'GLU', 'PHE']

residues = residue_names_from_sequence("ATCG", polymer_type="dna")
# ['DA', 'DT', 'DC', 'DG']

Residue Names to Sequence

from boltz_data.sequence import sequence_from_residue_names

seq = sequence_from_residue_names(
    ['ALA', 'CYS', 'ASP'],
    polymer_type="protein"
)  # "ACD"

Non-Standard Residues

The nonstandard_handling option controls what happens to residues that have no standard one-letter code:

  • "X": Replace the residue with X

  • "parentheses": Wrap the component ID in parentheses, e.g. (MSE)

  • "error": Raise an exception
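The three modes can be illustrated with a small stand-alone sketch. This is not the boltz_data implementation: the function to_sequence and the three-entry table are hypothetical stand-ins, and only the mode names come from the list above.

```python
# Stand-in table for illustration; a real mapping would cover all 20 amino acids.
COMP_ID_TO_ONE_LETTER = {"MET": "M", "LYS": "K", "PHE": "F"}

VALID_MODES = ("X", "parentheses", "error")


def to_sequence(residue_names: list[str], nonstandard_handling: str = "X") -> str:
    """Convert component IDs to a sequence string (hypothetical helper)."""
    if nonstandard_handling not in VALID_MODES:
        raise ValueError(f"Unknown nonstandard_handling: {nonstandard_handling!r}")

    parts: list[str] = []
    for name in residue_names:
        if name in COMP_ID_TO_ONE_LETTER:
            parts.append(COMP_ID_TO_ONE_LETTER[name])
        elif nonstandard_handling == "X":
            parts.append("X")          # mask the residue
        elif nonstandard_handling == "parentheses":
            parts.append(f"({name})")  # keep the component ID
        else:  # "error"
            raise ValueError(f"Non-standard residue: {name}")
    return "".join(parts)


residues = ["MET", "LYS", "PHE", "MSE", "LYS", "PHE"]
print(to_sequence(residues, "X"))            # MKFXKF
print(to_sequence(residues, "parentheses"))  # MKF(MSE)KF
```

Of the two non-raising modes, "parentheses" is the only one that preserves enough information to recover the original residue names from the sequence string.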

Mappings

Available dictionaries: PROTEIN_ONE_LETTER_TO_COMP_ID, DNA_ONE_LETTER_TO_COMP_ID, RNA_ONE_LETTER_TO_COMP_ID and their inverse mappings.
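As a rough illustration of how a forward mapping relates to its inverse, the sketch below uses small hand-written stand-in tables taken from the reference section of this page. The names PROTEIN_SUBSET, DNA_SUBSET, RNA_SUBSET, and invert_mapping are hypothetical and are not the library's dictionaries.

```python
# Stand-in tables for illustration only.
PROTEIN_SUBSET = {"A": "ALA", "C": "CYS", "G": "GLY"}
DNA_SUBSET = {"A": "DA", "C": "DC", "G": "DG", "T": "DT"}
RNA_SUBSET = {"A": "A", "C": "C", "G": "G", "U": "U"}


def invert_mapping(mapping: dict[str, str]) -> dict[str, str]:
    """Build a component-ID -> one-letter table, rejecting duplicate values."""
    inverse: dict[str, str] = {}
    for one_letter, comp_id in mapping.items():
        if comp_id in inverse:
            raise ValueError(f"{comp_id} maps to more than one letter")
        inverse[comp_id] = one_letter
    return inverse


# The same letter resolves to a different component ID per polymer type,
# which is why the conversion functions take a polymer_type argument.
for label, table in (("protein", PROTEIN_SUBSET), ("dna", DNA_SUBSET), ("rna", RNA_SUBSET)):
    print(label, table["A"])

print(invert_mapping(DNA_SUBSET)["DT"])  # T
```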

Backbone Connectivity

BACKBONE_ATOMS defines the atoms that link each residue to the next one in the chain:

  • Proteins: C → N (peptide bond)

  • DNA/RNA: O3' → P (phosphodiester bond)
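The sketch below shows one way these atom pairs could be used to enumerate backbone bonds along a chain. The layout of BACKBONE_ATOMS is not shown on this page, so the BACKBONE_LINK table and the backbone_bonds function here are assumptions made purely for illustration.

```python
# Hypothetical table: (atom in residue i, atom in residue i + 1).
BACKBONE_LINK = {
    "protein": ("C", "N"),
    "dna": ("O3'", "P"),
    "rna": ("O3'", "P"),
}


def backbone_bonds(num_residues: int, polymer_type: str):
    """List the bonds joining consecutive residues of a linear chain.

    Each bond is ((residue_index, atom_name), (residue_index, atom_name)).
    """
    left_atom, right_atom = BACKBONE_LINK[polymer_type]
    return [
        ((i, left_atom), (i + 1, right_atom))
        for i in range(num_residues - 1)
    ]


print(backbone_bonds(3, "protein"))
# [((0, 'C'), (1, 'N')), ((1, 'C'), (2, 'N'))]
```

A chain of n residues yields n - 1 backbone bonds, and a single residue yields none.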

Extract Sequences from Structures

from boltz_data.mol import bzmol_from_mmcif
from boltz_data.sequence import sequence_from_residue_names

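# Load the structure from an mmCIF file.
bzmol = bzmol_from_mmcif("structure.cif")
# Keep only the residue names that belong to chain A.
residues = bzmol.residue_name[bzmol.residue_chain_id == "A"]
# Convert the component IDs to a one-letter sequence string.
sequence = sequence_from_residue_names(list(residues), polymer_type="protein")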
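To process every chain rather than a single one, the residue names can first be grouped by chain ID. The helper below is a hypothetical, library-independent sketch; it assumes only that residue names and chain IDs are available as parallel per-residue sequences, as the boolean mask in the example above suggests.

```python
def residues_by_chain(residue_names, chain_ids) -> dict[str, list[str]]:
    """Group residue names by chain ID, preserving residue order."""
    if len(residue_names) != len(chain_ids):
        raise ValueError("residue_names and chain_ids must have the same length")
    grouped: dict[str, list[str]] = {}
    for name, chain in zip(residue_names, chain_ids):
        grouped.setdefault(chain, []).append(name)
    return grouped


names = ["ALA", "CYS", "DA", "DT", "ASP"]
chains = ["A", "A", "B", "B", "A"]
print(residues_by_chain(names, chains))
# {'A': ['ALA', 'CYS', 'ASP'], 'B': ['DA', 'DT']}
```

Each per-chain list could then be passed to sequence_from_residue_names with the polymer_type that matches that chain.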

Reference

Standard Amino Acids

One-Letter  Three-Letter  Name
A           ALA           Alanine
C           CYS           Cysteine
D           ASP           Aspartic acid
E           GLU           Glutamic acid
F           PHE           Phenylalanine
G           GLY           Glycine
H           HIS           Histidine
I           ILE           Isoleucine
K           LYS           Lysine
L           LEU           Leucine
M           MET           Methionine
N           ASN           Asparagine
P           PRO           Proline
Q           GLN           Glutamine
R           ARG           Arginine
S           SER           Serine
T           THR           Threonine
V           VAL           Valine
W           TRP           Tryptophan
Y           TYR           Tyrosine

Standard Nucleotides

DNA Nucleotides:

One-Letter  Component ID  Name
A           DA            Deoxyadenosine
C           DC            Deoxycytidine
G           DG            Deoxyguanosine
T           DT            Deoxythymidine

RNA Nucleotides:

One-Letter  Component ID  Name
A           A             Adenosine
C           C             Cytidine
G           G             Guanosine
U           U             Uridine

API Reference

For detailed API documentation, see: