Polymer Sequences

Biological polymers (proteins, DNA, RNA) are chains of monomers, called residues. Each standard residue has a one-letter code (for example A, K, G) and a component ID (for example ALA, LYS, GLY).

Sequence Notation

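Protein sequences are written with one-letter amino acid codes (MKFLKF...), DNA sequences with A, T, C, G, and RNA sequences with A, U, C, G. A non-standard residue is written as its component ID in parentheses, for example MKF(MSE)KF, where MSE is selenomethionine.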
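
As a rough illustration of the notation, the sketch below splits a sequence string into one-letter codes and parenthesized component IDs. The helper name tokenize_sequence and its regular expression are assumptions made for this example; they are not part of boltz_data, whose own parsing may differ.

```python
import re

# One token is either a parenthesized component ID such as "(MSE)"
# or a single one-letter code such as "M".
_TOKEN = re.compile(r"\(([A-Za-z0-9]+)\)|([A-Za-z])")


def tokenize_sequence(sequence: str) -> list[str]:
    """Split a sequence string into one-letter codes and component IDs.

    Hypothetical helper for illustration only; not a boltz_data function.
    """
    tokens = []
    position = 0
    while position < len(sequence):
        match = _TOKEN.match(sequence, position)
        if match is None:
            raise ValueError(
                f"Unexpected character at index {position}: {sequence[position]!r}"
            )
        # Group 1 holds a component ID from inside parentheses,
        # group 2 holds a single one-letter code.
        tokens.append(match.group(1) or match.group(2))
        position = match.end()
    return tokens


print(tokenize_sequence("MKF(MSE)KF"))
# ['M', 'K', 'F', 'MSE', 'K', 'F']
```

Tokens of length one are one-letter codes; longer tokens are component IDs that were written in parentheses.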

Conversions

Sequence to Residue Names

from boltz_data.sequence import residue_names_from_sequence

residues = residue_names_from_sequence("ACDEF", polymer_type="protein")
# ['ALA', 'CYS', 'ASP', 'GLU', 'PHE']

residues = residue_names_from_sequence("ATCG", polymer_type="dna")
# ['DA', 'DT', 'DC', 'DG']

Residue Names to Sequence

from boltz_data.sequence import sequence_from_residue_names

seq = sequence_from_residue_names(
    ['ALA', 'CYS', 'ASP'],
    polymer_type="protein"
)  # "ACD"

Non-Standard Residues

The nonstandard_handling option sets how a residue without a standard one-letter code is handled:

"X": replace the residue with X
"parentheses": wrap the component ID in parentheses, e.g. (MSE)
"error": raise an exception

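The three modes can be pictured with a small self-contained sketch. The function encode_residues and the three-entry mapping are stand-ins invented for this example; they only mimic the behaviour described in the list above and are not the boltz_data implementation.

```python
# Stand-in for a small part of a component ID -> one-letter mapping.
# The real dictionaries live in boltz_data and are not reproduced here.
COMP_ID_TO_ONE_LETTER = {"MET": "M", "LYS": "K", "PHE": "F"}


def encode_residues(residue_names, nonstandard_handling):
    """Join residue names into a sequence string (illustrative only)."""
    parts = []
    for name in residue_names:
        letter = COMP_ID_TO_ONE_LETTER.get(name)
        if letter is not None:
            # Standard residue: emit its one-letter code.
            parts.append(letter)
        elif nonstandard_handling == "X":
            parts.append("X")
        elif nonstandard_handling == "parentheses":
            parts.append(f"({name})")
        elif nonstandard_handling == "error":
            raise ValueError(f"Non-standard residue: {name}")
        else:
            raise ValueError(
                f"Unknown nonstandard_handling: {nonstandard_handling!r}"
            )
    return "".join(parts)


names = ["MET", "LYS", "PHE", "MSE", "LYS", "PHE"]
print(encode_residues(names, "X"))            # MKFXKF
print(encode_residues(names, "parentheses"))  # MKF(MSE)KF
```

Of the three modes, "parentheses" is the only one that keeps the identity of the non-standard residue in the output string.
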
Mappings

Available dictionaries: PROTEIN_ONE_LETTER_TO_COMP_ID, DNA_ONE_LETTER_TO_COMP_ID, RNA_ONE_LETTER_TO_COMP_ID and their inverse mappings.
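
To show how a forward mapping and its inverse relate, here is a minimal sketch using a stand-in dictionary with the four DNA entries from the reference table on this page. The names DNA_SUBSET and invert_mapping are assumptions for illustration; the library's inverse dictionaries have their own names and may be built differently.

```python
# Stand-in for a one-letter -> component ID dictionary (DNA entries only).
DNA_SUBSET = {"A": "DA", "C": "DC", "G": "DG", "T": "DT"}


def invert_mapping(mapping):
    """Return a value -> key dictionary, rejecting duplicate values."""
    inverse = {}
    for key, value in mapping.items():
        if value in inverse:
            # Two keys sharing a value would make the inverse ambiguous.
            raise ValueError(f"Duplicate value {value!r} cannot be inverted")
        inverse[value] = key
    return inverse


comp_id_to_letter = invert_mapping(DNA_SUBSET)
print(comp_id_to_letter["DT"])  # T
```

The duplicate check matters because an inverse is only well defined when every component ID corresponds to exactly one letter.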

Backbone Connectivity

BACKBONE_ATOMS defines the atoms that link each residue to the next one along the chain:

Proteins: C → N (peptide bond)
DNA/RNA: O3' → P (phosphodiester linkage)
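
The connectivity rule can be sketched as a list of inter-residue bonds. The dictionary LINK_ATOMS and the function backbone_bonds are hypothetical stand-ins; the actual structure of BACKBONE_ATOMS is not shown on this page and may differ.

```python
# Hypothetical stand-in: (atom on residue i, atom on residue i + 1).
LINK_ATOMS = {
    "protein": ("C", "N"),
    "dna": ("O3'", "P"),
    "rna": ("O3'", "P"),
}


def backbone_bonds(residue_names, polymer_type):
    """List the bonds joining consecutive residues (illustrative only).

    Each bond is ((residue index, atom name), (next index, atom name)).
    """
    left_atom, right_atom = LINK_ATOMS[polymer_type]
    return [
        ((index, left_atom), (index + 1, right_atom))
        for index in range(len(residue_names) - 1)
    ]


print(backbone_bonds(["ALA", "CYS", "ASP"], "protein"))
# [((0, 'C'), (1, 'N')), ((1, 'C'), (2, 'N'))]
```

A chain of n residues yields n - 1 backbone bonds, so a single residue produces an empty list.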

Extract Sequences from Structures

from boltz_data.mol import bzmol_from_mmcif
from boltz_data.sequence import sequence_from_residue_names

bzmol = bzmol_from_mmcif("structure.cif")
residues = bzmol.residue_name[bzmol.residue_chain_id == "A"]
sequence = sequence_from_residue_names(list(residues), polymer_type="protein")
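
The boolean-mask selection above treats residue_name and residue_chain_id as parallel per-residue arrays. The sketch below groups residue names for every chain at once, using plain Python lists as stand-ins for those arrays. The helper residues_by_chain is an assumption for illustration and is not part of boltz_data.

```python
from collections import defaultdict


def residues_by_chain(residue_names, chain_ids):
    """Group residue names by chain ID, keeping residue order.

    Hypothetical helper; inputs are parallel sequences of equal length.
    """
    if len(residue_names) != len(chain_ids):
        raise ValueError("residue_names and chain_ids must have equal length")
    grouped = defaultdict(list)
    for name, chain in zip(residue_names, chain_ids):
        grouped[chain].append(name)
    return dict(grouped)


names = ["ALA", "CYS", "DA", "DT"]
chains = ["A", "A", "B", "B"]
print(residues_by_chain(names, chains))
# {'A': ['ALA', 'CYS'], 'B': ['DA', 'DT']}
```

Each per-chain list could then be passed to sequence_from_residue_names with the polymer type that matches that chain.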

Reference

Standard Amino Acids

One-Letter	Three-Letter	Name
A	ALA	Alanine
C	CYS	Cysteine
D	ASP	Aspartic acid
E	GLU	Glutamic acid
F	PHE	Phenylalanine
G	GLY	Glycine
H	HIS	Histidine
I	ILE	Isoleucine
K	LYS	Lysine
L	LEU	Leucine
M	MET	Methionine
N	ASN	Asparagine
P	PRO	Proline
Q	GLN	Glutamine
R	ARG	Arginine
S	SER	Serine
T	THR	Threonine
V	VAL	Valine
W	TRP	Tryptophan
Y	TYR	Tyrosine

Standard Nucleotides

DNA Nucleotides:

One-Letter	Component ID	Name
A	DA	Deoxyadenosine
C	DC	Deoxycytidine
G	DG	Deoxyguanosine
T	DT	Deoxythymidine

RNA Nucleotides:

One-Letter	Component ID	Name
A	A	Adenosine
C	C	Cytidine
G	G	Guanosine
U	U	Uridine

API Reference

For detailed API documentation, see:

boltz_data.sequence API reference
