Info

What is TEA?

TEA is a novel approach that uses contrastive learning to convert protein language model embeddings into a new 20-letter alphabet. This enables highly sensitive and efficient large-scale protein homology searches without requiring a 3D structure. See the preprint for more.

What is STEAM?

STEAM (Search with TEA Against Many) performs a fast search against large datasets of proteins translated to TEA.

When should I use STEAM instead of other search engines?

STEAM opens a new way to navigate the protein universe by leveraging protein language model (pLM) representations (specifically ESM2 650M). These models have introduced a powerful new approach to representing proteins through representation learning, compressing the vast sequence space into high-dimensional numerical vectors, or embeddings. These embeddings capture intricate evolutionary and structural information, allowing us to identify proteins that map to similar regions of the representation space.

Essentially, STEAM identifies proteins that are represented similarly by language models, reflecting deep structural ties that go beyond simple amino acid comparison.

In contrast to traditional sequence-based comparison tools, which rely on aligning amino acids to find exact or near-exact matches, STEAM excels in the "twilight zone" of low sequence identity. While alignment-based methods are effective for closely related proteins, their sensitivity drops sharply as evolutionary distance increases. STEAM bypasses this limitation by looking for proteins that are similar in “pLM structural syntax”. This allows for the detection of very remote homologs, recovering vital biological relationships that standard search engines often miss because the underlying sequences appear unrelated.

Furthermore, STEAM provides a powerful alternative to structural comparison methods. While it achieves performance levels comparable to structural alignment tools, it does so orthogonally and without the need for an actual 3D structure or a predicted model. This means you gain the sensitivity of structural search with the speed and ease of a sequence-based tool.

In short, STEAM finds proteins that share structural identities even when their sequences have drifted apart, providing a faster, structure-free path to deep biological discovery.

How do I use STEAM on this website?

The first step is to translate your protein sequence to TEA. This translation is used to search the TEA datasets.

The next step aligns the TEA sequence against several datasets and presents the top hits as a series of charts representing the alignment coverage of your query, as well as essential alignment details.

The colours of this chart depend on the TEA substitution matrix. Dark green represents the same TEA character in query and hit. For all cases where TEA characters in query and hit do not match, light green is a positive substitution score, olive is a negative score and orange represents the lowest substitution score.
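The colour rule above can be sketched in a few lines of Python. This is a minimal illustration, not the site's implementation: the `TOY_SCORES` table is a made-up stand-in for the TEA substitution matrix, the function names are hypothetical, and two details are assumptions (a score of exactly zero is grouped with negative scores, and "lowest" means the minimum value in the whole matrix).

```python
# Hypothetical toy scores for a 4-letter alphabet; NOT the real TEA matrix.
TOY_SCORES = {
    ("A", "C"): 1, ("A", "D"): -2, ("A", "E"): -5,
    ("C", "D"): -1, ("C", "E"): -3, ("D", "E"): 2,
}
LOWEST = min(TOY_SCORES.values())


def pair_score(a, b):
    """Look up a mismatch score regardless of the order of the two letters."""
    return TOY_SCORES[(a, b)] if (a, b) in TOY_SCORES else TOY_SCORES[(b, a)]


def column_colour(query_char, hit_char):
    """Map one aligned TEA column to a colour class."""
    if query_char == hit_char:
        return "dark green"          # identical TEA characters
    score = pair_score(query_char, hit_char)
    if score == LOWEST:
        return "orange"              # lowest substitution score
    if score > 0:
        return "light green"         # positive substitution score
    return "olive"                   # negative (assumption: zero lands here too)


for q, h in [("A", "A"), ("A", "C"), ("D", "A"), ("E", "A")]:
    print(q, h, column_colour(q, h))
```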

The height of the coverage bar represents the entropy of the TEA alignment at that position. Low entropy (~ high confidence) is full height. High entropy (~ low confidence) shows a reduced height.
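As a rough illustration of how entropy could translate into bar height, the sketch below computes the Shannon entropy of a per-position probability distribution over the 20 TEA letters and scales the bar linearly. Both the existence of such a per-position distribution and the linear scaling are assumptions made for this example; the function names are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def bar_height(probs, max_height=1.0, alphabet_size=20):
    """Full height at zero entropy, shrinking linearly to zero at maximum entropy."""
    max_entropy = math.log2(alphabet_size)
    return max_height * (1.0 - shannon_entropy(probs) / max_entropy)


confident = [1.0]            # all probability on one TEA letter
uncertain = [1 / 20] * 20    # uniform over the 20-letter alphabet
print(bar_height(confident), bar_height(uncertain))
```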

Clicking the coverage bar toggles the display of the full alignment. If the entropy colour scheme is selected, the same high entropy (~low confidence) columns will be highlighted in the alignment.

For hits which map directly to a pre-existing 3D structure (currently AlphaFold Version 4), the structure is displayed in a simple view, coloured to match the hit amino acid sequence. If the aligned region does not cover the entire hit, the N-terminus of the structure will appear slightly blue and the C-terminus slightly red. Clicking or dragging on the hit sequence will zoom to the selected region. Ctrl/Cmd and Shift allow multiple regions to be selected. Press the Escape key to clear the selection and recentre the viewer. Making a selection on any sequence, not only the hit amino acid sequence, will copy those residues to the clipboard.

Each hit is associated with four metrics that describe its similarity to the query sequence. Sequence identity is reported both in amino acid space and in TEA space, providing a straightforward measure of similarity as the percentage of identical characters across aligned positions.
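Percentage identity can be computed the same way in amino acid space and in TEA space, since both are strings over a 20-letter alphabet. The sketch below is a minimal example; the function name is hypothetical, and excluding gap columns from the denominator is an assumption, as the exact convention is not stated here.

```python
def percent_identity(aligned_query, aligned_hit, gap="-"):
    """Percentage of identical characters over aligned (non-gap) columns."""
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: columns containing a gap are not counted as aligned positions.
    pairs = [(q, h) for q, h in zip(aligned_query, aligned_hit)
             if q != gap and h != gap]
    if not pairs:
        return 0.0
    matches = sum(q == h for q, h in pairs)
    return 100.0 * matches / len(pairs)


print(percent_identity("ACDE", "ACDF"))  # 3 of 4 columns identical
```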

The alignment score corresponds to the raw Smith–Waterman algorithm score. It is computed from contributions of the TEA substitution matrix (MATCHA), the BLOSUM62 substitution matrix, and gap penalties.
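To make the idea of a combined score concrete, here is a small Smith–Waterman sketch in which each column's score is a weighted sum of a TEA term and an amino acid term. It is only an illustration under stated assumptions: `toy_score` stands in for both MATCHA and BLOSUM62, the weights `w_tea` and `w_aa` and the linear gap penalty are invented for the example, and the actual scoring scheme used by STEAM may differ.

```python
def toy_score(a, b):
    """Stand-in substitution score (NOT MATCHA or BLOSUM62): +2 match, -1 mismatch."""
    return 2 if a == b else -1


def smith_waterman_score(q_tea, q_aa, h_tea, h_aa,
                         tea_score=toy_score, aa_score=toy_score,
                         gap=3.0, w_tea=1.0, w_aa=1.0):
    """Best local alignment score using a combined TEA + amino acid column score."""
    if len(q_tea) != len(q_aa) or len(h_tea) != len(h_aa):
        raise ValueError("TEA and amino acid strings must have equal lengths")
    n, m = len(q_tea), len(h_tea)
    prev = [0.0] * (m + 1)   # previous row of the dynamic programming matrix
    best = 0.0
    for i in range(1, n + 1):
        curr = [0.0] * (m + 1)
        for j in range(1, m + 1):
            s = (w_tea * tea_score(q_tea[i - 1], h_tea[j - 1])
                 + w_aa * aa_score(q_aa[i - 1], h_aa[j - 1]))
            curr[j] = max(0.0,               # local alignment: never below zero
                          prev[j - 1] + s,   # align the two positions
                          prev[j] - gap,     # gap in the hit
                          curr[j - 1] - gap) # gap in the query
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ABCD", "MKVL", "ABCD", "MKVL"))
```

Keeping only two rows of the matrix is enough when just the score is needed; recovering the alignment itself would require a full matrix and a traceback.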

The E-value offers a more intuitive interpretation of the score. It is derived from an empirical model optimized using proteins assigned to different folds in the CATH classification, which are treated as false positives. The E-value represents the expected number of such false-positive matches that would achieve a score equal to or better than the observed one by chance. Lower E-values indicate more significant matches and a reduced likelihood that the observed similarity arises between proteins with different structural folds.
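The meaning of "expected number of false positives at or above a score" can be illustrated with a simple counting estimate. This is deliberately not the fitted empirical model described above, just a sketch of the concept: the function name, the list of false-positive scores, and the number of queries are all hypothetical inputs.

```python
def counting_evalue(score, false_positive_scores, n_queries):
    """Average number of known false-positive hits per query scoring >= `score`."""
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    count = sum(1 for s in false_positive_scores if s >= score)
    return count / n_queries


# Hypothetical scores of different-fold (false-positive) hits from 2 queries.
decoy_scores = [10, 20, 55, 60, 30]
print(counting_evalue(50, decoy_scores, n_queries=2))
```

A fitted model is preferable in practice because a raw count cannot distinguish between scores higher than any observed false positive; they would all receive an estimate of zero.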

Colour schemes

TEA Entropy

XXX

Hydrophobic

RKDENQHPYWSTGAMCFLVI

Size

GASPVTCLINDKQEMHFRYW

Charged

ED (Negative)

HKR (Positive)

Polar

STNQ

Proline

Ser/Thr

Cysteine

Aliphatic

ILV

Aromatic

FYWH

Clustal
This is an emulation of the default colour scheme used for alignments in Clustal X, a graphical interface for the ClustalW multiple sequence alignment program. Each residue in the alignment is assigned a colour if the amino acid profile of the alignment at that position meets some minimum criteria specific to the residue type.
The table below gives these criteria as clauses: {> X% xx,y }, where X is the threshold percentage presence for any of the xx (or y) residue types. For example, K or R is coloured red if the column includes more than 60% K or R (combined), or more than 80% of either K or R or Q (individually).

| Category | Colour | Residue at Position | {Threshold, Residue group} |
|---|---|---|---|
| Hydrophobic | | A I L M F W V | {> 60% WLVIMAFCYHP } |
| | | C | {> 60% WLVIMAFCYHP } |
| Positive charge | red | K R | {> 60% KR }, {> 80% K,R,Q } |
| Negative charge | | E | {> 50% ED }, {> 50% QE }, {> 60% KR }, {> 85% D,E,Q } |
| | | D | {> 50% ED }, {> 60% KR }, {> 85% D,E,N } |
| Polar | | N | {> 50% N }, {> 85% N,Y } |
| | | Q | {> 50% QE }, {> 60% KR }, {> 85% Q,E,K,R } |
| | | S T | {> 50% TS }, {> 60% WLVIMAFCYHP }, {> 85% S,T } |
| Cysteine | | C | {> 85% C } |
| Glycine | | G | {> 0% G } |
| Proline | | P | {> 0% P } |
| Aromatic | | H Y | {> 60% WLVIMAFCYHP }, {> 85% W,Y,A,C,P,Q,F,H,I,L,M,V } |
| Unconserved | | any / gap | If none of the above criteria are met |
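
The K/R example given before the table can be checked with a short Python sketch. It only covers that one rule, the function name is hypothetical, and counting every symbol in the column (including gaps) towards the total is an assumption.

```python
from collections import Counter


def kr_is_coloured(column):
    """True if K or R would be coloured in this alignment column.

    Rule: more than 60% K or R combined, or more than 80% of any one of
    K, R or Q individually.
    """
    total = len(column)
    if total == 0:
        return False
    counts = Counter(column)  # missing residues count as zero
    if (counts["K"] + counts["R"]) / total > 0.60:
        return True
    return any(counts[residue] / total > 0.80 for residue in "KRQ")


print(kr_is_coloured("KKRRKAAA"))    # 5 of 8 are K or R
print(kr_is_coloured("KRAAAAAAAA"))  # only 2 of 10 are K or R
```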