boltz_data.sequence

Code for parsing and handling biological sequences.

Functions

  • cluster_sequences(*, sequences[, ...]) – Cluster sequences using MMseqs2.

  • residue_names_from_sequence(sequence, /, *, ...) – Convert a sequence string to a list of residue names.

  • sequence_from_residue_names(residue_names, ...) – Convert a list of residue names to a sequence string.

boltz_data.sequence.cluster_sequences(*, sequences, min_seq_id=0.4, polymer_type=None)

Cluster sequences using MMseqs2.

Parameters:
  • sequences (Collection[str]) – Sequences to cluster.

  • min_seq_id (float) – Minimum sequence identity for clustering. Defaults to 0.4.

  • polymer_type (Optional[Literal['protein', 'rna', 'dna']]) – Type of the sequences. One of “protein”, “dna”, or “rna”. If None (the default), the type is inferred from the sequences.

Return type:

list[int]

Returns:

List of cluster IDs, in the same order as the input sequences.
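
Because the result is a flat list aligned with the input, a typical next step is to group the sequences by cluster ID. The sketch below shows one way to do that. `group_by_cluster` is a hypothetical helper, not part of `boltz_data`, and the cluster IDs are made-up values standing in for a real `cluster_sequences` result.

```python
from collections import defaultdict


def group_by_cluster(sequences, cluster_ids):
    """Group sequences by cluster ID (hypothetical helper, not part of boltz_data)."""
    if len(sequences) != len(cluster_ids):
        raise ValueError("sequences and cluster_ids must have the same length")
    clusters = defaultdict(list)
    # The two lists are aligned, so position i in one matches position i in the other.
    for sequence, cluster_id in zip(sequences, cluster_ids):
        clusters[cluster_id].append(sequence)
    return dict(clusters)


sequences = ["MKTAYIAKQR", "MKTAYIAKQK", "GSHMLEDPVA"]
# Made-up IDs for illustration only. With MMseqs2 available, they would come from
# a call such as: cluster_sequences(sequences=sequences, min_seq_id=0.4)
cluster_ids = [0, 0, 1]

clusters = group_by_cluster(sequences, cluster_ids)
```

Grouping by ID this way is useful, for example, when a train/validation split has to keep all members of a cluster on the same side.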

boltz_data.sequence.residue_names_from_sequence(sequence, /, *, polymer_type)

Convert a sequence string to a list of residue names.

Parameters:
  • sequence (str) – The sequence string.

  • polymer_type (Literal['protein', 'dna', 'rna']) – Type of polymer. One of “protein”, “dna”, or “rna”.

Return type:

list[str]

Returns:

A list of residue names corresponding to the sequence.
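
To illustrate the kind of conversion described here, the following is a minimal standalone sketch, not the library's implementation. The function name `sketch_residue_names` and the lookup tables are assumptions: they follow the common PDB naming convention (three-letter amino acid codes, `DA`/`DC`/`DG`/`DT` for DNA, single letters for RNA), and the tables `boltz_data` actually uses may differ.

```python
# Illustrative lookup tables (assumed, based on common PDB residue naming).
PROTEIN_CODES = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}
DNA_CODES = {"A": "DA", "C": "DC", "G": "DG", "T": "DT"}
RNA_CODES = {"A": "A", "C": "C", "G": "G", "U": "U"}
CODES = {"protein": PROTEIN_CODES, "dna": DNA_CODES, "rna": RNA_CODES}


def sketch_residue_names(sequence, /, *, polymer_type):
    """Map each one-letter code to a residue name (sketch, not boltz_data code)."""
    table = CODES[polymer_type]
    # An unknown letter raises KeyError here; the real function may behave differently.
    return [table[letter] for letter in sequence]


protein_names = sketch_residue_names("MKT", polymer_type="protein")
dna_names = sketch_residue_names("ACGT", polymer_type="dna")
```

The sketch mirrors the documented signature: the sequence is positional-only and `polymer_type` is keyword-only, so the same one-letter string can be read as protein, DNA, or RNA depending on the argument.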

boltz_data.sequence.sequence_from_residue_names(residue_names, /, *, polymer_type, nonstandard_handling)

Convert a list of residue names to a sequence string.

Parameters:
  • residue_names (list[str]) – List of residue names.

  • polymer_type (Literal['protein', 'dna', 'rna']) – Type of polymer. One of “protein”, “dna”, or “rna”.

  • nonstandard_handling (Literal['X', 'error', 'parentheses']) – How to handle non-standard residues. One of:

      - “X”: Replace non-standard residues with ‘X’.
      - “error”: Raise an error if a non-standard residue is encountered.
      - “parentheses”: Wrap non-standard residues in parentheses.

Return type:

str

Returns:

A string representing the sequence.
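
The three `nonstandard_handling` modes can be compared with a small standalone sketch. It is not the library's implementation: `sketch_sequence` and its lookup tables are hypothetical. It also assumes that "parentheses" emits the residue name itself inside parentheses, and that "error" raises `ValueError`. The documentation does not specify either detail.

```python
# Illustrative lookup tables (assumed, based on common PDB residue naming).
PROTEIN_LETTERS = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
DNA_LETTERS = {"DA": "A", "DC": "C", "DG": "G", "DT": "T"}
RNA_LETTERS = {"A": "A", "C": "C", "G": "G", "U": "U"}
LETTERS = {"protein": PROTEIN_LETTERS, "dna": DNA_LETTERS, "rna": RNA_LETTERS}


def sketch_sequence(residue_names, /, *, polymer_type, nonstandard_handling):
    """Join residue names into a sequence string (sketch, not boltz_data code)."""
    if nonstandard_handling not in ("X", "error", "parentheses"):
        raise ValueError(f"unknown nonstandard_handling: {nonstandard_handling!r}")
    table = LETTERS[polymer_type]
    parts = []
    for name in residue_names:
        if name in table:
            parts.append(table[name])
        elif nonstandard_handling == "X":
            parts.append("X")
        elif nonstandard_handling == "parentheses":
            # Assumed output format: the residue name wrapped in parentheses.
            parts.append(f"({name})")
        else:  # "error"
            raise ValueError(f"non-standard residue: {name}")
    return "".join(parts)


# MSE (selenomethionine) is not among the 20 standard amino acids.
names = ["MET", "MSE", "LYS"]
masked = sketch_sequence(names, polymer_type="protein", nonstandard_handling="X")
wrapped = sketch_sequence(names, polymer_type="protein", nonstandard_handling="parentheses")
```

In this sketch, "X" keeps the string length equal to the number of residues, while "parentheses" preserves the identity of the modified residue at the cost of a longer string.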