Info

What is TEA?

TEA is a novel approach using contrastive learning to convert protein language model embeddings into a new 20-letter alphabet, enabling highly sensitive and efficient large-scale protein homology searches, without the need for structure. See the preprint for more.


What is STEAM?

STEAM (Search with TEA Against Many) performs a fast search against large datasets of proteins translated to TEA.


When should I use STEAM instead of other search engines?

STEAM opens a new way to navigate the protein universe by leveraging protein language model (pLM) representations (specifically ESM2 650M). These models offer a powerful way to represent proteins through representation learning, compressing the vast sequence space into high-dimensional numerical vectors, or embeddings. These embeddings capture intricate evolutionary and structural information, allowing us to identify proteins that map to similar regions of the representation space.

Essentially, STEAM identifies proteins that are represented similarly by language models, reflecting deep structural ties that go beyond simple amino acid comparison.

In contrast to traditional sequence-based comparison tools, which rely on aligning amino acids to find exact or near-exact matches, STEAM excels in the "twilight zone" of low sequence identity. While alignment-based methods are effective for closely related proteins, their sensitivity drops sharply as evolutionary distance increases. STEAM bypasses this limitation by looking for proteins that are similar in "pLM structural syntax". This allows for the detection of very remote homologs, recovering vital biological relationships that standard search engines often miss because the underlying sequences appear unrelated.

Furthermore, STEAM provides a powerful alternative to structural comparison methods. While it achieves performance levels comparable to structural alignment tools, it does so orthogonally and without the need for an actual 3D structure or a predicted model. This means you gain the sensitivity of structural search with the speed and ease of a sequence-based tool.

In short, STEAM finds proteins that share structural identities even when their sequences have drifted apart, providing a faster, structure-free path to deep biological discovery.


How do I use STEAM on this website?

The first step is to translate your protein sequence to TEA. This translation is used to search the TEA datasets.

The next step aligns the TEA sequence against several datasets and presents the top hits as a series of charts representing the alignment coverage of your query, as well as essential alignment details.

The colours of this chart depend on the TEA substitution matrix. Dark green marks positions where the query and hit have the same TEA character. Where the TEA characters differ, light green indicates a positive substitution score, olive a negative score, and orange the lowest substitution score.
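As a rough illustration of the colour rules above, the following Python sketch classifies each aligned pair of TEA characters. The three-letter toy matrix, the function names and the handling of a zero score (treated here like a negative score) are assumptions made for this example; the real TEA substitution matrix is not reproduced, and gaps are not handled.

```python
# Toy off-diagonal substitution scores (hypothetical values, not the TEA matrix).
TOY_MATRIX = {
    ("A", "C"): 2,
    ("A", "D"): -1,
    ("C", "D"): -4,
}


def pair_score(matrix, a, b):
    """Look up a symmetric score for two different characters."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def classify_pair(query_char, hit_char, matrix):
    """Return the colour name for one aligned position."""
    if query_char == hit_char:
        return "dark green"
    score = pair_score(matrix, query_char, hit_char)
    if score == min(matrix.values()):
        return "orange"       # lowest substitution score in the matrix
    if score > 0:
        return "light green"  # positive substitution score
    return "olive"            # negative score (zero also lands here: assumption)


def colour_alignment(query, hit, matrix):
    """Classify every aligned pair of two equal-length, gap-free strings."""
    return [classify_pair(q, h, matrix) for q, h in zip(query, hit)]


colours = colour_alignment("AAAC", "ACDD", TOY_MATRIX)
```

Here `colours` holds one colour name per aligned position, in alignment order.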

The height of the coverage bar represents the entropy of the TEA alignment at that position. Low entropy (~ high confidence) is shown at full height; high entropy (~ low confidence) is shown at reduced height.
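The relationship between entropy and bar height can be sketched as follows. This is only an illustration: it assumes the entropy is the Shannon entropy of a per-position probability distribution over the 20 TEA letters and that height falls linearly with normalised entropy. Neither the exact entropy definition nor the scaling used by the site is stated here, and the function names are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability letters contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def bar_height(probs, max_height=1.0):
    """Map low entropy to full height and high entropy to reduced height.

    Assumes a linear scaling against the maximum possible entropy,
    log2(number of letters).
    """
    max_entropy = math.log2(len(probs))
    normalised = shannon_entropy(probs) / max_entropy
    return max_height * (1.0 - normalised)


confident = [1.0] + [0.0] * 19  # all probability on one TEA letter
uncertain = [1.0 / 20] * 20     # probability spread evenly over 20 letters

full_bar = bar_height(confident)
short_bar = bar_height(uncertain)
```

Under these assumptions a fully confident position gets the full bar, and a uniformly uncertain position gets a bar of (almost exactly) zero height.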

Clicking the coverage bar toggles the display of the full alignment. If the entropy colour scheme is selected, the same high-entropy (~ low confidence) columns are highlighted in the alignment.

For hits that map directly to a pre-existing 3D structure (currently AlphaFold Version 4), the structure is displayed in a simple view, coloured to match the hit amino acid sequence. If the aligned region does not cover the entire hit, the N-terminus of the structure appears slightly blue and the C-terminus slightly red. Clicking or dragging on the hit sequence zooms to the selected region. Holding Ctrl/Cmd or Shift allows multiple regions to be selected. Press the Escape key to clear the selection and recentre the viewer. Making a selection on any sequence, not only the hit amino acid sequence, copies those residues to the clipboard.


Colour schemes

TEA Entropy: XXX
Hydrophobic: RKDENQHPYWSTGAMCFLVI
Size: GASPVTCLINDKQEMHFRYW
Charged: ED (Negative), HKR (Positive)
Polar: STNQ
Proline: P
Ser/Thr: ST
Cysteine: C
Aliphatic: ILV
Aromatic: FYWH

Clustal
This is an emulation of the default colour scheme used for alignments in Clustal X, a graphical interface for the ClustalW multiple sequence alignment program. Each residue in the alignment is assigned a colour if the amino acid profile of the alignment at that position meets some minimum criteria specific to the residue type.
The table below gives these criteria as clauses: {> X% xx,y }, where X is the threshold percentage presence for any of the xx (or y) residue types. For example, K or R is coloured red if the column includes more than 60% K or R (combined), or more than 80% of either K or R or Q (individually).

Category          Residue at position   {Threshold, residue group}
Hydrophobic       A I L M F W V         {>60% WLVIMAFCYHP}
                  C                     {>60% WLVIMAFCYHP}
Positive charge   K R                   {>60% KR}, {>80% K,R,Q}
Negative charge   E                     {>50% ED}, {>50% QE}, {>60% KR}, {>85% D,E,Q}
                  D                     {>50% ED}, {>60% KR}, {>85% D,E,N}
Polar             N                     {>50% N}, {>85% N,Y}
                  Q                     {>50% QE}, {>60% KR}, {>85% Q,E,K,R}
                  S T                   {>50% TS}, {>60% WLVIMAFCYHP}, {>85% S,T}
Cysteine          C                     {>85% C}
Glycine           G                     {>0% G}
Proline           P                     {>0% P}
Aromatic          H Y                   {>60% WLVIMAFCYHP}, {>85% W,Y,A,C,P,Q,F,H,I,L,M,V}
Unconserved       any / gap             If none of the above criteria are met
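
The K/R example above can be read as a small rule check. The Python sketch below is one possible reading, not the site's implementation: residues written together (KR) are counted as a combined share of the column, while comma-separated residues (K,R,Q) are tested individually. The function names are hypothetical, and counting every character in the column (including gaps) in the denominator is an assumption.

```python
from collections import Counter


def clause_met(column, threshold_pct, groups):
    """Check one clause such as {>60% KR} or {>80% K,R,Q}.

    `groups` is a list of strings; the residues inside one string are
    counted together, and the clause holds if any group exceeds the threshold.
    """
    if not column:
        return False
    counts = Counter(column)
    total = len(column)  # assumption: gaps count towards the column size
    for group in groups:
        share = 100.0 * sum(counts[res] for res in group) / total
        if share > threshold_pct:
            return True
    return False


def lys_arg_is_coloured(column):
    """K or R is coloured if {>60% KR} or {>80% K,R,Q} holds for the column."""
    return clause_met(column, 60, ["KR"]) or clause_met(column, 80, ["K", "R", "Q"])


combined_hit = lys_arg_is_coloured("KKRRKRKAAA")    # K+R = 70% of the column
exactly_sixty = lys_arg_is_coloured("KKRRKRAAAA")   # K+R = 60%, not above threshold
single_residue = lys_arg_is_coloured("QQQQQQQQQK")  # Q alone = 90%
```

The same `clause_met` helper could be reused for the other rows of the table by passing each row's thresholds and residue groups.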