boltz_data.sequence

Code for parsing and handling biological sequences.

Functions

`cluster_sequences`(*, sequences[, ...])	Cluster sequences using MMseqs2.
`residue_names_from_sequence`(sequence, /, *, ...)	Convert a sequence string to a list of residue names.
`sequence_from_residue_names`(residue_names, ...)	Convert a list of residue names to a sequence string.

boltz_data.sequence.cluster_sequences(*, sequences, min_seq_id=0.4, polymer_type=None)

Cluster sequences using MMseqs2.

Parameters:

sequences (Collection[str]) – List of sequences to cluster.
min_seq_id (float) – Minimum sequence identity for clustering.
polymer_type (Optional[Literal['protein', 'rna', 'dna']]) – Type of the sequences. One of “protein”, “rna”, or “dna”. If None, the type is inferred from the sequences.

Return type:

list[int]

Returns:

List of cluster IDs, in the same order as the input sequences.
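
Because the returned cluster IDs are in the same order as the input sequences, the two can be paired position by position. The following is a minimal sketch of that pairing; `group_by_cluster` is a hypothetical helper that is not part of `boltz_data`, and the `cluster_ids` values are made up for illustration rather than produced by MMseqs2.

```python
from collections import defaultdict


def group_by_cluster(sequences, cluster_ids):
    """Group sequences by cluster ID, keeping input order within each cluster.

    Hypothetical helper, not part of boltz_data.
    """
    if len(sequences) != len(cluster_ids):
        raise ValueError("sequences and cluster_ids must have the same length")
    clusters = defaultdict(list)
    for seq, cluster_id in zip(sequences, cluster_ids):
        clusters[cluster_id].append(seq)
    return dict(clusters)


# Illustrative values standing in for a real cluster_sequences() result.
sequences = ["MKTAYIAK", "MKTAYIAR", "GGSGGS"]
cluster_ids = [0, 0, 1]
groups = group_by_cluster(sequences, cluster_ids)
```

In real use, `cluster_ids` would presumably come from a call such as `cluster_sequences(sequences=sequences, min_seq_id=0.4)`, which requires keyword arguments according to the signature above.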

boltz_data.sequence.residue_names_from_sequence(sequence, /, *, polymer_type)

Convert a sequence string to a list of residue names.

Parameters:

sequence (str) – The sequence string.
polymer_type (Literal['protein', 'dna', 'rna']) – Type of polymer. One of “protein”, “dna”, or “rna”.

Return type:

list[str]

Returns:

A list of residue names corresponding to the sequence.
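
To illustrate the kind of conversion described here, the sketch below maps one-letter protein codes to the standard three-letter residue names. It is a standalone approximation, not the library's implementation: `protein_residue_names` is a hypothetical name, only the 20 standard amino acids are covered, and raising `ValueError` on an unknown letter is an assumption about behaviour the documentation does not specify.

```python
# Standard one-letter to three-letter codes for the 20 canonical amino acids.
PROTEIN_1_TO_3 = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def protein_residue_names(sequence):
    """Convert a one-letter protein sequence to three-letter residue names.

    Hypothetical stand-in for residue_names_from_sequence(..., polymer_type="protein").
    """
    names = []
    for letter in sequence:
        if letter not in PROTEIN_1_TO_3:
            # Assumption: unknown letters are rejected rather than mapped to a placeholder.
            raise ValueError(f"Unknown residue letter: {letter!r}")
        names.append(PROTEIN_1_TO_3[letter])
    return names


names = protein_residue_names("MKT")
```

The library function takes the sequence positionally and `polymer_type` as a keyword, for example `residue_names_from_sequence("MKT", polymer_type="protein")`.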

boltz_data.sequence.sequence_from_residue_names(residue_names, /, *, polymer_type, nonstandard_handling)

Convert a list of residue names to a sequence string.

Parameters:

residue_names (list[str]) – List of residue names.
polymer_type (Literal['protein', 'dna', 'rna']) – Type of polymer. One of “protein”, “dna”, or “rna”.
nonstandard_handling (Literal['X', 'error', 'parentheses']) – How to handle non-standard residues. One of:
- “X”: Replace non-standard residues with ‘X’.
- “error”: Raise an error if a non-standard residue is encountered.
- “parentheses”: Wrap non-standard residues in parentheses.

Return type:

str

Returns:

A string representing the sequence.
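
The three `nonstandard_handling` modes can be sketched as follows for protein residues. This is an illustration under stated assumptions, not the library's code: `to_protein_sequence` is a hypothetical name, the exception type is assumed to be `ValueError`, and the "parentheses" mode is assumed to emit the residue name itself inside the parentheses (for example `(MSE)`).

```python
# Three-letter to one-letter codes for the 20 canonical amino acids.
_THREE = "ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL".split()
_ONE = "ARNDCQEGHILKMFPSTWYV"
PROTEIN_3_TO_1 = dict(zip(_THREE, _ONE))


def to_protein_sequence(residue_names, nonstandard_handling):
    """Build a sequence string from residue names.

    Hypothetical stand-in for sequence_from_residue_names(..., polymer_type="protein").
    """
    if nonstandard_handling not in ("X", "error", "parentheses"):
        raise ValueError(f"Unsupported mode: {nonstandard_handling!r}")
    parts = []
    for name in residue_names:
        if name in PROTEIN_3_TO_1:
            parts.append(PROTEIN_3_TO_1[name])
        elif nonstandard_handling == "X":
            parts.append("X")
        elif nonstandard_handling == "error":
            # Assumption: the real function may raise a different exception type.
            raise ValueError(f"Non-standard residue: {name!r}")
        else:
            # Assumption: the residue name is kept verbatim inside parentheses.
            parts.append(f"({name})")
    return "".join(parts)


residues = ["MET", "MSE", "LYS"]  # MSE (selenomethionine) is non-standard
masked = to_protein_sequence(residues, "X")
wrapped = to_protein_sequence(residues, "parentheses")
```

Under these assumptions, "X" keeps the sequence length equal to the residue count, while "parentheses" preserves the identity of each non-standard residue at the cost of a longer string.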
