Polymer Sequences#
Biological polymers (proteins, DNA, RNA) are chains of monomers/residues. Each standard residue has a one-letter code (A, K, G) and a component ID (ALA, LYS, GLY).
Sequence Notation#
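Proteins are written with single-letter codes (MKFLKF...), DNA with the letters A, T, C, G, and RNA with A, U, C, G. A non-standard residue is written as its component ID in parentheses: MKF(MSE)KF, where MSE is selenomethionine.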
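The parenthesis convention can be tokenized with a regular expression. The following is a minimal sketch, not the library's own parser: `tokenize_sequence` is a hypothetical helper name, and it assumes component IDs consist only of letters and digits.

```python
import re

# One token is either a parenthesized component ID, e.g. "(MSE)", or a single letter.
_TOKEN = re.compile(r"\(([A-Za-z0-9]+)\)|([A-Za-z])")


def tokenize_sequence(sequence: str) -> list[str]:
    """Split a sequence string into one token per residue.

    Hypothetical helper for illustration; parenthesized component IDs are
    returned without the parentheses.
    """
    tokens = []
    pos = 0
    while pos < len(sequence):
        match = _TOKEN.match(sequence, pos)
        if match is None:
            raise ValueError(
                f"Unexpected character at position {pos}: {sequence[pos]!r}"
            )
        # Exactly one of the two groups is set for any successful match.
        tokens.append(match.group(1) or match.group(2))
        pos = match.end()
    return tokens


print(tokenize_sequence("MKF(MSE)KF"))
# ['M', 'K', 'F', 'MSE', 'K', 'F']
```

Each token then corresponds to exactly one residue, so the token count equals the chain length even when the string contains multi-character component IDs.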
Conversions#
Sequence to Residue Names#
from boltz_data.sequence import residue_names_from_sequence
residues = residue_names_from_sequence("ACDEF", polymer_type="protein")
# ['ALA', 'CYS', 'ASP', 'GLU', 'PHE']
residues = residue_names_from_sequence("ATCG", polymer_type="dna")
# ['DA', 'DT', 'DC', 'DG']
Residue Names to Sequence#
from boltz_data.sequence import sequence_from_residue_names
seq = sequence_from_residue_names(
    ['ALA', 'CYS', 'ASP'],
    polymer_type="protein",
)  # "ACD"
Non-Standard Residues#
Handle non-standard residues with nonstandard_handling:
"X": Replace with X
"parentheses": Wrap as (MSE)
"error": Raise exception
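To make the three modes concrete, here is a minimal pure-Python sketch of the behavior they describe. It is an illustration under stated assumptions, not the library's implementation: `encode_residues` and `COMP_ID_TO_ONE_LETTER` are hypothetical names, the mapping is a small hand-written subset, and the exception type raised by the library in "error" mode is not specified here (the sketch uses `ValueError`).

```python
# Small illustrative subset; a real mapping would cover all standard residues.
COMP_ID_TO_ONE_LETTER = {"ALA": "A", "LYS": "K", "PHE": "F", "MET": "M"}


def encode_residues(residue_names, nonstandard_handling="X"):
    """Hypothetical re-implementation of the three modes, for illustration only."""
    letters = []
    for name in residue_names:
        if name in COMP_ID_TO_ONE_LETTER:
            letters.append(COMP_ID_TO_ONE_LETTER[name])
        elif nonstandard_handling == "X":
            letters.append("X")
        elif nonstandard_handling == "parentheses":
            letters.append(f"({name})")
        elif nonstandard_handling == "error":
            # Assumption: the library's actual exception type may differ.
            raise ValueError(f"Non-standard residue: {name}")
        else:
            raise ValueError(
                f"Unknown nonstandard_handling: {nonstandard_handling!r}"
            )
    return "".join(letters)


names = ["MET", "LYS", "PHE", "MSE", "LYS", "PHE"]
print(encode_residues(names, "X"))            # MKFXKF
print(encode_residues(names, "parentheses"))  # MKF(MSE)KF
```

Note that "X" loses information (the original component ID cannot be recovered), whereas "parentheses" keeps the sequence reversible.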
Mappings#
Available dictionaries: PROTEIN_ONE_LETTER_TO_COMP_ID, DNA_ONE_LETTER_TO_COMP_ID, RNA_ONE_LETTER_TO_COMP_ID and their inverse mappings.
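As a sketch of how a forward mapping and its inverse relate, the example below builds both for DNA from a hand-written dictionary. The `_EXAMPLE` names are hypothetical stand-ins, and the contents are copied from the DNA table on this page rather than imported from the library.

```python
# Hand-written subset for illustration; not imported from the library.
DNA_ONE_LETTER_TO_COMP_ID_EXAMPLE = {"A": "DA", "C": "DC", "G": "DG", "T": "DT"}

# Inverting is only safe because the mapping is one-to-one.
DNA_COMP_ID_TO_ONE_LETTER_EXAMPLE = {
    comp_id: letter
    for letter, comp_id in DNA_ONE_LETTER_TO_COMP_ID_EXAMPLE.items()
}

# Forward lookup: one-letter codes to component IDs.
residues = [DNA_ONE_LETTER_TO_COMP_ID_EXAMPLE[letter] for letter in "ATCG"]
print(residues)  # ['DA', 'DT', 'DC', 'DG']

# Inverse lookup: component IDs back to the one-letter sequence.
round_trip = "".join(DNA_COMP_ID_TO_ONE_LETTER_EXAMPLE[r] for r in residues)
print(round_trip)  # ATCG
```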
Backbone Connectivity#
BACKBONE_ATOMS defines connecting atoms:
Proteins: C → N
DNA/RNA: O3' → P
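The connectivity rule can be sketched as a function that emits one bond per pair of consecutive residues. This is an illustration only: `BACKBONE_ATOMS_EXAMPLE` and `backbone_bonds` are hypothetical names, and the dictionary layout is an assumption, since the structure of the library's `BACKBONE_ATOMS` is not described here.

```python
# (atom on residue i, atom on residue i + 1) for each polymer type.
# Assumed layout; the library's BACKBONE_ATOMS may be structured differently.
BACKBONE_ATOMS_EXAMPLE = {
    "protein": ("C", "N"),
    "dna": ("O3'", "P"),
    "rna": ("O3'", "P"),
}


def backbone_bonds(residue_names, polymer_type):
    """Return one bond per consecutive residue pair.

    Each bond is ((residue_index, atom_name), (residue_index, atom_name)).
    """
    prev_atom, next_atom = BACKBONE_ATOMS_EXAMPLE[polymer_type]
    return [
        ((i, prev_atom), (i + 1, next_atom))
        for i in range(len(residue_names) - 1)
    ]


for bond in backbone_bonds(["ALA", "CYS", "ASP"], "protein"):
    print(bond)
# ((0, 'C'), (1, 'N'))
# ((1, 'C'), (2, 'N'))
```

A chain of n residues yields n - 1 backbone bonds; a single residue or an empty chain yields none.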
Extract Sequences from Structures#
from boltz_data.mol import bzmol_from_mmcif
from boltz_data.sequence import sequence_from_residue_names
bzmol = bzmol_from_mmcif("structure.cif")
residues = bzmol.residue_name[bzmol.residue_chain_id == "A"]
sequence = sequence_from_residue_names(list(residues), polymer_type="protein")
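To process every chain rather than a single one, the residue names can first be grouped by chain ID. The sketch below uses plain Python lists as stand-ins for the per-residue arrays; it assumes, as the indexing above suggests, that `residue_name` and `residue_chain_id` are parallel (one entry per residue). `residues_by_chain` is a hypothetical helper, not part of the library.

```python
from collections import defaultdict


def residues_by_chain(residue_names, chain_ids):
    """Group residue names by chain ID, preserving residue order within a chain."""
    grouped = defaultdict(list)
    for name, chain in zip(residue_names, chain_ids):
        grouped[chain].append(name)
    return dict(grouped)


# Stand-in data; in practice these would come from the loaded structure.
names = ["ALA", "CYS", "DA", "DT", "ASP"]
chains = ["A", "A", "B", "B", "A"]
print(residues_by_chain(names, chains))
# {'A': ['ALA', 'CYS', 'ASP'], 'B': ['DA', 'DT']}
```

Each per-chain list can then be passed to `sequence_from_residue_names` with the `polymer_type` appropriate for that chain.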
Reference#
Standard Amino Acids#
| One-Letter | Three-Letter | Name |
|---|---|---|
| A | ALA | Alanine |
| C | CYS | Cysteine |
| D | ASP | Aspartic acid |
| E | GLU | Glutamic acid |
| F | PHE | Phenylalanine |
| G | GLY | Glycine |
| H | HIS | Histidine |
| I | ILE | Isoleucine |
| K | LYS | Lysine |
| L | LEU | Leucine |
| M | MET | Methionine |
| N | ASN | Asparagine |
| P | PRO | Proline |
| Q | GLN | Glutamine |
| R | ARG | Arginine |
| S | SER | Serine |
| T | THR | Threonine |
| V | VAL | Valine |
| W | TRP | Tryptophan |
| Y | TYR | Tyrosine |
Standard Nucleotides#
DNA Nucleotides:
| One-Letter | Component ID | Name |
|---|---|---|
| A | DA | Deoxyadenosine |
| C | DC | Deoxycytidine |
| G | DG | Deoxyguanosine |
| T | DT | Deoxythymidine |
RNA Nucleotides:
| One-Letter | Component ID | Name |
|---|---|---|
| A | A | Adenosine |
| C | C | Cytidine |
| G | G | Guanosine |
| U | U | Uridine |
API Reference#
For detailed API documentation, see: