HH-suite: Difference between revisions

From Wikipedia, the free encyclopedia
(Rescuing 1 sources and tagging 0 as dead.) #IABot (v2.0.9.5)
 
(23 intermediate revisions by 11 users not shown)
Line 1: Line 1:
{{merge from|HHpred / HHsearch}}
{{Multiple issues|
{{Multiple issues|
{{COI|date=August 2018}}
{{COI|date=August 2018}}
{{primary sources|date=July 2012}}
{{primary sources|date=July 2012}}
{{technical|date=July 2012}}
}}
}}


{{Infobox Software
{{Infobox software
| name = HH-suite
| name = HH-suite
| developer = Johannes Söding, Michael Remmert, Andreas Biegert, Andreas Hauser, Markus Meier, Martin Steinegger
| developer = Johannes Söding, Michael Remmert, Andreas Biegert, Andreas Hauser, Markus Meier, Martin Steinegger
| latest_release_version = 3.1.0
| latest_release_date = {{release date|2019|02|25|df=yes}}
| programming language = [[C++]]
| programming language = [[C++]]
| latest_release_version = 3.3.0
| latest_release_date = {{release date|2020|08|25|df=yes}}
| language = [[English language|English]]
| language = [[English language|English]]
| genre = [[Bioinformatics]] tool
| genre = [[Bioinformatics]] tool
| license = [[GNU General Public License|GPL v3]]
| license = [[GNU General Public License|GPL v3]]
| website = https://github.com/soedinglab/hh-suite
| website = https://github.com/soedinglab/hh-suite
| operating_system = [[Unix-like]]; [[Debian]] package available<ref>[http://packages.debian.org/unstable/science/hhsuite Debian hhsuite package]</ref>
}}
}}


The '''HH-suite''' is an open-source software package for sensitive [[protein]] sequence searching. It contains programs that can search for similar protein sequences in protein sequence databases. Sequence searches are a standard tool in modern biology with which the function of unknown proteins can be inferred from the functions of proteins with similar sequences.
The '''HH-suite''' is an [[open-source software]] package for sensitive [[protein]] sequence searching. It contains programs that can search for similar protein sequences in protein sequence databases. Sequence searches are a standard tool in modern biology with which the function of unknown proteins can be inferred from the functions of proteins with similar sequences. '''HHsearch''' and '''HHblits''' are the two main programs in the package and the entry points to its search function; HHblits is the faster, iterative one.<ref name="hhsearch">{{ cite journal | author = Söding J | title = Protein homology detection by HMM-HMM comparison | journal = Bioinformatics | year = 2005 | volume = 21 | issue = 7 | pages = 951–960 | pmid = 15531603 | doi = 10.1093/bioinformatics/bti125|doi-access=free| hdl = 11858/00-001M-0000-0017-EC7A-F | hdl-access = free }}</ref><ref name="hhblits">{{ cite journal |vauthors=Remmert M, Biegert A, Hauser A, Söding J | title = HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment. | journal = Nat. Methods | year = 2011 | volume = 9 | issue = 2 | pages = 173–175 | pmid = 22198341 | doi = 10.1038/NMETH.1818| hdl = 11858/00-001M-0000-0015-8D56-A | s2cid = 205420247 |url=http://wwwuser.gwdg.de/~compbiol/pdf/Preprint-2011-HHblits-inkl-Suppl2.pdf| hdl-access = free }}</ref> '''HHpred''' is an online server for [[protein structure prediction]] that uses homology information from HH-suite.<ref name="hhpred">{{cite journal |vauthors=Söding J, Biegert A, Lupas AN | title = The HHpred interactive server for protein homology detection and structure prediction | journal = Nucleic Acids Research | year = 2005 | volume = 33 | issue = Web Server issue | pages = W244–248 | pmid = 15980461 | doi = 10.1093/nar/gki408 | pmc = 1160169}}</ref>


The HH-suite searches for sequences using [[hidden Markov model]]s (HMMs). The name comes from the fact that it performs HMM-HMM alignments. The programs are among the most popular methods for protein sequence matching and have been cited more than 5,000 times in total according to [[Google Scholar]].<ref>[https://scholar.google.de/scholar?cluster=1978364893787489896 Citations to HHpred], [https://scholar.google.de/scholar?cluster=14407278965620614178 to HHsearch], [https://scholar.google.de/scholar?cluster=12710838460360732088 to HHblits]</ref>
==Sequence searches in biology==


== Background ==
Proteins are central players in all of life's processes. To understand how life in cells is organised, we have to understand what each of the proteins involved in these molecular processes does. This is particularly important in order to understand the origin of diseases. But for a large fraction of the approximately 20 000 human proteins the structures and functions remain unknown. Many proteins have been investigated in model organisms such as many bacteria, baker's yeast, fruit flies, zebra fish or mice, for which experiments can be often done more easily than with human cells. To predict the function, structure, or other properties of a protein for which only its sequence of amino acids is known, the protein sequence is compared to the sequences of other proteins in public databases. If a protein with sufficiently similar sequence is found, the two proteins are likely to be evolutionarily related ([[Homology (biology)#Sequence homology|"homologous"]]). In that case, they are likely to share similar structures and functions. Therefore, if a protein with a sufficiently similar sequence and with known functions and/or structure can be found by the sequence search, the unknown protein's functions, structure, and domain composition can be predicted. Such predictions greatly facilitate the determination of the function or structure by targeted validation experiments.
Proteins are central players in all of life's processes. Understanding them is essential to understanding molecular processes in cells, and in particular the origin of diseases. However, the structures and functions of a large fraction of the approximately 20,000 human proteins remain unknown. Many proteins have been investigated in model organisms such as bacteria, baker's yeast, fruit flies, zebrafish, or mice, for which experiments can often be done more easily than with human cells. To predict the function, structure, or other properties of a protein for which only its sequence of amino acids is known, the protein sequence is compared to the sequences of other proteins in public databases. If a protein with a sufficiently similar sequence is found, the two proteins are likely to be evolutionarily related ([[Homology (biology)#Sequence homology|"homologous"]]). In that case, they are likely to share similar structures and functions. Therefore, if a protein with a sufficiently similar sequence and with known functions and/or structure can be found by the sequence search, the unknown protein's functions, structure, and domain composition can be predicted. Such predictions greatly facilitate the determination of the function or structure by targeted validation experiments.


Sequence searches are frequently performed by biologists to infer the function of an unknown protein from its sequence. For this purpose, the protein's sequence is compared to the sequences of other proteins in public databases and its function is deduced from those of the most similar sequences. Often, no sequences with annotated functions can be found in such a search. In this case, more sensitive methods are required to identify more remotely related proteins or [[protein family|protein families]]. From these relationships, hypotheses about the protein's functions, [[Protein structure prediction|structure]], and [[Protein domain|domain composition]] can be inferred. HHsearch performs searches with a protein sequence through databases. The HHpred server and the HH-suite software package offer many popular, regularly updated databases, such as the [[Protein Data Bank]], as well as the [[InterPro]], [[Pfam]], [[Clusters of Orthologous Groups|COG]], and [[Structural Classification of Proteins|SCOP]] databases.
==Description==


== Algorithm ==
The HH-suite [[HHpred / HHsearch|HHsearch]] contains HHsearch
[[File:HHblits-Schematic.png|thumb|Iterative sequence search scheme of HHblits]]
<ref name="pmid15531603">{{ cite journal | author = Söding J | title = Protein homology detection by HMM-HMM comparison | journal = Bioinformatics | year = 2005 | volume = 21 | issue = 7 | pages = 951–960 | pmid = 15531603 | doi = 10.1093/bioinformatics/bti125}}</ref>
Modern sensitive methods for protein search utilize sequence profiles. They may be used to compare a sequence to a profile, or in more advanced cases such as HH-suite, to match among profiles.<ref name="hhsearch"/><ref name="pmid10975570">{{cite journal |vauthors=Jaroszewski L, Rychlewski L, Godzik A | title = Improving the quality of twilight-zone alignments | journal = Protein Science | year = 2000 | volume = 9 | issue = 8 | pages = 1487–1496 | pmid = 10975570 | doi = 10.1110/ps.9.8.1487 | pmc = 2144727}}</ref><ref>{{cite journal |vauthors=Sadreyev RI, Baker D, Grishin NV | title = Profile–profile comparisons by COMPASS predict intricate homologies between protein families | journal = Protein Science | year = 2003 | volume = 12 | issue = 10 | pages = 2262–2272 | pmid = 14500884 | doi = 10.1110/ps.03197403 | pmc = 2366929}}</ref><ref>{{cite journal | author = Dunbrack RL Jr | title = Sequence comparison and protein structure prediction | journal = Current Opinion in Structural Biology | year = 2006 | volume = 16 | issue = 3 | pages = 374–384 | pmid = 16713709 | doi = 10.1016/j.sbi.2006.05.006}}</ref> Profiles and alignments are themselves derived from matches, using for example [[BLAST (biotechnology)|PSI-BLAST]] or HHblits. A [[position-specific scoring matrix]] (PSSM) profile contains for each position in the query sequence the similarity score for the 20 amino acids. The profiles are derived from [[multiple sequence alignment]]s (MSAs), in which related proteins are written together (aligned), such that the frequencies of amino acids in each position can be interpreted as probabilities for amino acids in new related proteins, and be used to derive the "similarity scores". Because profiles contain much more information than a single sequence (e.g. 
the position-specific degree of conservation), profile-profile comparison methods are much more powerful than sequence-sequence comparison methods like [[BLAST (biotechnology)|BLAST]] or profile-sequence comparison methods like PSI-BLAST.<ref name="pmid10975570" />
and HHblits
<ref name="pmid22198341">{{ cite journal |vauthors=Remmert M, Biegert A, Hauser A, Söding J | title = HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment. | journal = Nat. Methods | year = 2011 | volume = 9 | issue = 2 | pages = 173–175 | pmid = 22198341 | doi = 10.1038/NMETH.1818}}</ref>
among other programs and utilities. HHsearch is among the most popular methods for the detection of remotely related sequences and for protein structure prediction, having been cited over 2000 times in Google Scholar.<ref>[https://scholar.google.de/scholar?cluster=1978364893787489896 Number of citations to HHsearch on Google Scholar]</ref> The HHsearch and HHblits programs owe their power to the fact that both the query and the database sequences are represented by [[multiple sequence alignment]]s (MSAs). In these MSAs, the query or database sequence is written in a table together with homologous (related) sequences in such a way that each column contains homologous amino acid residues, that is, residues that have descended from the same residue in the ancestral sequence. The frequencies of amino acids in the columns of such an MSA can be interpreted as probabilities to observe an amino acid in a further homologous sequence at that position. To facilitate automatic scoring of potential sequences for their relatedness to the sequences in the MSA, the MSAs are succinctly described by profile [[hidden Markov model]]s (HMMs). These are extensions of [[position-specific scoring matrix|''position-specific scoring matrices'']] (PSSMs). The core algorithms for '''H'''MM-'''H'''MM alignment give HH-suite its name.


HHpred and HHsearch represent query and database proteins by [[hidden Markov model|profile hidden Markov models]] (HMMs), an extension of PSSM sequence profiles that also records position-specific amino acid insertion and deletion frequencies. HHsearch searches a database of HMMs with a query HMM. Before starting the search through the actual database of HMMs, HHsearch/HHpred builds a [[multiple sequence alignment]] of sequences related to the query sequence/MSA using the HHblits program. From this alignment, a profile HMM is calculated. The databases contain HMMs that are precalculated in the same fashion using PSI-BLAST. The output of HHpred and HHsearch is a ranked list of database matches (including E-values and probabilities for a true relationship) and the pairwise query-database sequence alignments.
[[HHpred / HHsearch|HHsearch]] takes as input a [[multiple sequence alignment]] or a profile [[hidden Markov Model]] (HMM) and searches a database of profile HMMs for homologous (related) proteins.
[[HHpred / HHsearch|HHsearch]] is often used for [[homology modeling]], that is, to build a model of the structure of a query protein for which only the sequence is known: For that purpose, a database of proteins with known structures such as the [[Protein Data Bank|protein data bank]] is searched for "template" proteins similar to the query protein. If such a template protein is found, the structure of the protein of interest can be predicted based on a pairwise [[multiple sequence alignment|sequence alignment]] of the query with the template protein sequence. In the [[CASP| CASP9]] protein structure prediction competition in 2010, a fully automated version of HHpred based on HHsearch and HHblits was ranked best out of 81 servers in template-based structure prediction [http://predictioncenter.org/casp9/groups_analysis.cgi?type=server&tbm=on&tbmfm=on&submit=Filter CASP9 TBM/FM].


HHblits, part of the HH-suite since 2011, builds high-quality [[multiple sequence alignment]]s (MSAs) starting from a single query sequence or an MSA. As in PSI-BLAST, it works iteratively, repeatedly constructing new query profiles by adding the results found in the previous round. It matches against pre-built HMM databases derived from protein sequence databases, in which each HMM represents a "cluster" of related proteins. In HHblits, these matches are made by HMM-HMM comparison, which grants additional sensitivity. Its prefilter reduces the tens of millions of HMMs to match against to a few thousand, thus speeding up the slow HMM-HMM comparison process.<ref name="hhblits"/>
[[File:HHblits-Schematic.png|thumb|Iterative sequence search scheme of HHblits]]
HHblits was added to the HH-suite in 2011. It can build high-quality [[multiple sequence alignment]]s (MSAs) starting from a single query sequence or MSA. From the query, a profile HMM can be calculated. By using MSAs instead of single sequences, the sensitivity of sequence searches and the quality of the resulting sequence alignments can be improved dramatically {{Citation needed|date=February 2019}}. MSAs are also the starting point for a multitude of downstream computational methods, such as methods to predict the secondary and tertiary structure of proteins, to predict their molecular functions or cellular pathways, to predict the positions in their sequence or structure that contribute to enzymatic activity or ligand-binding, to predict evolutionarily conserved residues, disease-causing versus neutral mutations, the proteins' cellular localization and many more. This explains the importance to produce MSAs of the highest quality.


The HH-suite comes with a number of pre-built profile HMMs that can be searched using HHblits and HHsearch, among them a clustered version of the [[UniProt]] database, of the [[Protein Data Bank]] of proteins with known structures, of [[Pfam]] protein family alignments, of [[Structural Classification of Proteins database|SCOP]] structural protein domains, and many more.<ref>{{cite web |last1=Li |first1=Zhaoyu |title=Some Notes about HHSuite |url=https://zhaoyu.li/post/hhsuite_notes/ |accessdate=3 April 2019 |archive-date=3 April 2019 |archive-url=https://web.archive.org/web/20190403162551/https://zhaoyu.li/post/hhsuite_notes/ |url-status=dead }}</ref>
HHblits works similarly to [[PSI-BLAST]], the most popular{{Citation needed|date=February 2019}} iterative sequence search method. HHblits generates a profile HMM from the query sequence and iteratively searches through a large database of profile HMMs, such as HH-suite's uniprot20 database. The uniprot20 database contains all public, high-quality protein sequences that are collected in the [[UniProt]] database. These sequences are clustered and aligned into multiple sequence alignments, from which the profile HMMs in uniprot20 are generated. Significantly similar sequences from the previous search are added to the query profile HMM for the next search iteration. Compared to [[PSI-BLAST]] and [[HMMER]], HHblits is faster, up to twice as sensitive and produces more accurate alignments.<ref name="pmid22198341"/> HHblits uses the same HMM-HMM alignment algorithms as HHsearch, but it employs a fast prefilter that reduces the number of database HMMs for which to perform the slow HMM-HMM comparison from tens of millions to a few thousands.


== Applications ==
The HH-suite comes with a number of useful databases of profile HMMs that can be searched using HHblits and HHsearch, among them a clustered version of the [[UniProt| UniProt database]], HMMs for the [[Protein Data Bank|protein data bank]] of protein structures, for the [[Pfam| Pfam database]] of protein family alignments, the [[Structural Classification of Proteins database|SCOP database]] of structural protein domains, and many more.
Applications of HHpred and HHsearch include protein structure prediction, complex structure prediction, function prediction, domain prediction, domain boundary prediction, and evolutionary classification of proteins.<ref>{{cite journal |vauthors=Guerler A, Govindarajoo B, Zhang Y | title = Mapping Monomeric Threading to Protein–Protein Structure Prediction | journal = Journal of Chemical Information and Modeling | year = 2013 | doi = 10.1021/ci300579r| pmc = 4076494 | pmid=23413988 | volume=53 | issue = 3 | pages=717–25}}</ref>


HHsearch is often used for [[homology modeling]], that is, to build a model of the structure of a query protein for which only the sequence is known: For that purpose, a database of proteins with known structures such as the [[Protein Data Bank|protein data bank]] is searched for "template" proteins similar to the query protein. If such a template protein is found, the structure of the protein of interest can be predicted based on a pairwise [[multiple sequence alignment|sequence alignment]] of the query with the template protein sequence. For example, a search through the PDB database of proteins with solved 3D structure takes a few minutes. If a significant match with a protein of known structure (a "template") is found in the PDB database, HHpred allows the user to build a homology model using the [[MODELLER]] software, starting from the pairwise query-template alignment.
The HH-suite runs on most Linux and Unix distributions, including RedHat, Debian, Ubuntu, and Mac OS X. A [[Debian]] package is available.<ref>[http://packages.debian.org/unstable/science/hhsuite Debian hhsuite package]</ref>


HHpred servers have been ranked among the best servers during [[CASP]]7, 8, and 9, for blind protein structure prediction experiments. In CASP9, HHpredA, B, and C were ranked 1st, 2nd, and 3rd out of 81 participating automatic structure prediction servers in template-based modeling<ref>[http://predictioncenter.org/casp9/CD/data/html/groups.server.tbm.html Official CASP9 results for the template-based modeling category (121 targets)]</ref> and 6th, 7th, 8th on all 147 targets, while being much faster than the best 20 servers.<ref>[http://predictioncenter.org/casp9/CD/data/html/groups.2.html Official CASP9 results for all 147 targets]</ref> In [[CASP]]8, HHpred was ranked 7th on all targets and 2nd on the subset of single domain proteins, while still being more than 50 times faster than the top-ranked servers.<ref name=hhpred/>
The HMM-HMM alignment algorithm of HHblits and HHsearch was significantly accelerated using vector instruction in version 3 of the HH-suite<ref name="bioRxiv560029">{{ cite journal | vauthors=Steinegger M, Meier M, Mirdita M, Vöhringer H, Haunsberger S, Söding J | title = HH-suite3 for fast remote homology detection and deep protein annotation | journal = bioRxiv | year = 2019 | doi = 10.1101/560029}}</ref>.


== Contents ==
== Overview of programs in HH-suite ==


In addition to HHsearch and HHblits, the HH-suite contains programs and perl scripts for format conversion, filtering of MSAs, generation of profile HMMs, the addition of secondary structure predictions to MSAs, the extraction of alignments from program output, and the generation of customized databases.
In addition to HHsearch and HHblits, the HH-suite contains programs and perl scripts for format conversion, filtering of MSAs, generation of profile HMMs, the addition of secondary structure predictions to MSAs, the extraction of alignments from program output, and the generation of customized databases.


{| class=wikitable
{| border="0" cellpadding="0"
|-
|-
! hhblits
| hhblits || (Iteratively) search an HHblits database with a query sequence or MSA
| (Iteratively) search an HHblits database with a query sequence or MSA
|-
|-
| hhsearch || Search an HHsearch database of HMMs with a query MSA or HMM
! hhsearch
| Search an HHsearch database of HMMs with a query MSA or HMM
|-
|-
| hhmake || Build an HMM from an input MSA
! hhmake
| Build an HMM from an input MSA
|-
|-
| hhfilter || Filter an MSA by maximum sequence identity, coverage, and other criteria
! hhfilter
| Filter an MSA by maximum sequence identity, coverage, and other criteria
|-
|-
| hhalign || Calculate pairwise alignments, dot plots etc. for two HMMs/MSAs
! hhalign
| Calculate pairwise alignments, dot plots etc. for two HMMs/MSAs
|-
|-
| reformat.pl || Reformat one or many MSAs
! reformat.pl
| Reformat one or many MSAs
|-
|-
|addss.pl || Add [[Psipred]] predicted secondary structure to an MSA or HHM file
! addss.pl
| Add [[Psipred]] predicted secondary structure to an MSA or HHM file
|-
|-
| hhmakemodel.pl || Generate MSAs or coarse 3D models from HHsearch or HHblits results
! hhmakemodel.pl
| Generate MSAs or coarse 3D models from HHsearch or HHblits results
|-
|-
| hhblitsdb.pl || Build HHblits database with prefiltering, packed MSA/HMM, and index files
! hhblitsdb.pl
| Build HHblits database with prefiltering, packed MSA/HMM, and index files
|-
|-
| multithread.pl || Run a command for many files in parallel using multiple threads
! multithread.pl
| Run a command for many files in parallel using multiple threads
|-
|-
| splitfasta.pl || Split a multiple-sequence FASTA file into multiple single-sequence files
! splitfasta.pl
| Split a multiple-sequence FASTA file into multiple single-sequence files
|-
|-
| renumberpdb.pl || Generate PDB file with indices renumbered to match input sequence indices
! renumberpdb.pl
| Generate PDB file with indices renumbered to match input sequence indices
|-
|-
|}
|}

The HMM-HMM alignment algorithm of HHblits and HHsearch was significantly accelerated using [[SIMD|vector instructions]] in version 3 of the HH-suite.<ref name="bioRxiv560029">{{ cite journal | vauthors=Steinegger M, Meier M, Mirdita M, Vöhringer H, Haunsberger S, Söding J | title = HH-suite3 for fast remote homology detection and deep protein annotation | year = 2019 | journal = BMC Bioinformatics | pmid = 31521110 | doi = 10.1186/s12859-019-3019-7 |doi-access=free | volume=20 | issue = 1 | pmc=6744700 | page=473}}</ref>

== See also ==
*[[Sequence alignment software]]
*[[Protein structure prediction]]
*[[Position-specific scoring matrix]]
*[[Multiple sequence alignment]]
*[[CASP|CASP - Critical Assessment of Techniques for Protein Structure Prediction]]
*[[BLAST (biotechnology)| BLAST (Basic Local Alignment Search Tool)]]
*[[CS-BLAST|Context-specific BLAST (CS-BLAST)]]


== References ==
== References ==
Line 83: Line 103:
== External links ==
== External links ==
*[http://www.mpibpc.mpg.de/soeding Soeding Lab] at Max-Planck Institute in Göttingen - HH-suite developers
*[http://www.mpibpc.mpg.de/soeding Soeding Lab] at Max-Planck Institute in Göttingen - HH-suite developers
*[https://github.com/soedinglab/hh-suite HH-suite source code] download from github
*[http://wwwuser.gwdg.de/~compbiol/data/hhsuite/ Precompiled HH-suite binaries and databases] download from developers
*[http://wwwuser.gwdg.de/~compbiol/data/hhsuite/ Precompiled HH-suite binaries and databases] download from developers
*[http://toolkit.tuebingen.mpg.de/hhpred HHpred] &mdash; free server at Max-Planck Institute in Tuebingen
*[http://arquivo.pt/wayback/20160514083149/http%3A//toolkit.tuebingen.mpg.de/hhpred HHpred] &mdash; free server at Max-Planck Institute in Tuebingen
*[http://toolkit.tuebingen.mpg.de/hhblits HHblits] &mdash; free server at Max-Planck Institute in Tuebingen
*[http://toolkit.tuebingen.mpg.de/hhblits HHblits] &mdash; free server at Max-Planck Institute in Tuebingen
*[http://predictioncenter.org/ CASP website]
*[http://predictioncenter.org/ CASP website]

Latest revision as of 12:49, 3 July 2024

HH-suite
Developer(s): Johannes Söding, Michael Remmert, Andreas Biegert, Andreas Hauser, Markus Meier, Martin Steinegger
Stable release: 3.3.0 / 25 August 2020
Repository
Written in: C++
Operating system: Unix-like; Debian package available[1]
Available in: English
Type: Bioinformatics tool
License: GPL v3
Website: https://github.com/soedinglab/hh-suite

The HH-suite is an open-source software package for sensitive protein sequence searching. It contains programs that can search for similar protein sequences in protein sequence databases. Sequence searches are a standard tool in modern biology with which the function of unknown proteins can be inferred from the functions of proteins with similar sequences. HHsearch and HHblits are the two main programs in the package and the entry points to its search function; HHblits is the faster, iterative one.[2][3] HHpred is an online server for protein structure prediction that uses homology information from HH-suite.[4]

The HH-suite searches for sequences using hidden Markov models (HMMs). The name comes from the fact that it performs HMM-HMM alignments. The programs are among the most popular methods for protein sequence matching and have been cited more than 5,000 times in total according to Google Scholar.[5]

Background

Proteins are central players in all of life's processes. Understanding them is essential to understanding molecular processes in cells, and in particular the origin of diseases. However, the structures and functions of a large fraction of the approximately 20,000 human proteins remain unknown. Many proteins have been investigated in model organisms such as bacteria, baker's yeast, fruit flies, zebrafish, or mice, for which experiments can often be done more easily than with human cells. To predict the function, structure, or other properties of a protein for which only its sequence of amino acids is known, the protein sequence is compared to the sequences of other proteins in public databases. If a protein with a sufficiently similar sequence is found, the two proteins are likely to be evolutionarily related ("homologous"). In that case, they are likely to share similar structures and functions. Therefore, if a protein with a sufficiently similar sequence and with known functions and/or structure can be found by the sequence search, the unknown protein's functions, structure, and domain composition can be predicted. Such predictions greatly facilitate the determination of the function or structure by targeted validation experiments.

Sequence searches are frequently performed by biologists to infer the function of an unknown protein from its sequence. For this purpose, the protein's sequence is compared to the sequences of other proteins in public databases and its function is deduced from those of the most similar sequences. Often, no sequences with annotated functions can be found in such a search. In this case, more sensitive methods are required to identify more remotely related proteins or protein families. From these relationships, hypotheses about the protein's functions, structure, and domain composition can be inferred. HHsearch searches databases using a protein sequence as the query. The HHpred server and the HH-suite software package offer many popular, regularly updated databases, such as the Protein Data Bank, as well as the InterPro, Pfam, COG, and SCOP databases.

Algorithm

Iterative sequence search scheme of HHblits

Modern sensitive methods for protein search utilize sequence profiles. They may be used to compare a sequence to a profile, or in more advanced cases such as HH-suite, to match among profiles.[2][6][7][8] Profiles and alignments are themselves derived from matches, using for example PSI-BLAST or HHblits. A position-specific scoring matrix (PSSM) profile contains for each position in the query sequence the similarity score for the 20 amino acids. The profiles are derived from multiple sequence alignments (MSAs), in which related proteins are written together (aligned), such that the frequencies of amino acids in each position can be interpreted as probabilities for amino acids in new related proteins, and be used to derive the "similarity scores". Because profiles contain much more information than a single sequence (e.g. the position-specific degree of conservation), profile-profile comparison methods are much more powerful than sequence-sequence comparison methods like BLAST or profile-sequence comparison methods like PSI-BLAST.[6]
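
As a minimal sketch of how a profile can be derived from an alignment, the following Python snippet counts amino acids per column of a toy MSA, turns the counts into probabilities, and converts those into log-odds "similarity scores". The toy alignment, the uniform background frequency of 1/20, and the add-one pseudocount are assumptions chosen for illustration; real tools typically also weight sequences and use more elaborate pseudocounts.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency


def column_probabilities(msa, pseudocount=1.0):
    """Return one dict per alignment column mapping amino acid -> probability."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(length):
        # Gap characters ('-') are not in AMINO_ACIDS and are skipped.
        counts = Counter(seq[i] for seq in msa if seq[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return profile


def pssm(profile):
    """Convert column probabilities to log2-odds scores against the background."""
    return [{a: math.log2(p / BACKGROUND) for a, p in col.items()} for col in profile]


toy_msa = ["MKV-L", "MRVAL", "MKIAL", "MKVAI"]  # hypothetical aligned sequences
probabilities = column_probabilities(toy_msa)
scores = pssm(probabilities)
best = "".join(max(col, key=col.get) for col in scores)
print(best)
```

Amino acids that are frequent in a column receive positive scores and rare ones receive negative scores, which is the position-specific information that a single sequence cannot provide.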

HHpred and HHsearch represent query and database proteins by profile hidden Markov models (HMMs), an extension of PSSM sequence profiles that also records position-specific amino acid insertion and deletion frequencies. HHsearch searches a database of HMMs with a query HMM. Before starting the search through the actual database of HMMs, HHsearch/HHpred builds a multiple sequence alignment of sequences related to the query sequence/MSA using the HHblits program. From this alignment, a profile HMM is calculated. The databases contain HMMs that are precalculated in the same fashion using PSI-BLAST. The output of HHpred and HHsearch is a ranked list of database matches (including E-values and probabilities for a true relationship) and the pairwise query-database sequence alignments.
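
The comparison of two profile columns can be illustrated with the log-sum-of-odds column score described in the HHsearch paper,[2] S(q, p) = log Σ_a q(a) p(a) / f(a), where q and p are the amino acid distributions of a query column and a database column and f is the background distribution. The Python sketch below is illustrative only: the three-letter alphabet, the distributions, and the use of base-2 logarithms are assumptions, and the insertion and deletion transitions of a full HMM-HMM alignment are omitted.

```python
import math


def column_score(q, p, background):
    """Log-sum-of-odds score between two profile columns, in bits."""
    return math.log2(sum(q[a] * p[a] / background[a] for a in background))


# Toy three-letter alphabet and made-up distributions (not real frequencies).
background = {"A": 0.5, "G": 0.25, "W": 0.25}
conserved_w = {"A": 0.1, "G": 0.1, "W": 0.8}
also_w = {"A": 0.2, "G": 0.1, "W": 0.7}
mostly_a = {"A": 0.8, "G": 0.1, "W": 0.1}

similar = column_score(conserved_w, also_w, background)      # both columns favor W
different = column_score(conserved_w, mostly_a, background)  # columns disagree
neutral = column_score(background, also_w, background)       # uninformative column

print(similar > 0, different < 0, abs(neutral) < 1e-9)
```

Two columns that favor the same rare amino acid score positively, disagreeing columns score negatively, and a column equal to the background contributes a score of zero.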

HHblits, part of the HH-suite since 2011, builds high-quality multiple sequence alignments (MSAs) starting from a single query sequence or an MSA. As in PSI-BLAST, it works iteratively, repeatedly constructing new query profiles by adding the results found in the previous round. It matches against pre-built HMM databases derived from protein sequence databases, in which each HMM represents a "cluster" of related proteins. In HHblits, these matches are made by HMM-HMM comparison, which grants additional sensitivity. Its prefilter reduces the tens of millions of HMMs to match against to a few thousand, thus speeding up the slow HMM-HMM comparison process.[3]
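
The iterate-and-prefilter scheme can be sketched in Python as follows. This is a toy illustration and not the HHblits implementation: the three-letter alphabet, the ungapped equal-length profiles, the consensus-based prefilter, the score threshold, and all function names are hypothetical stand-ins for the real prefilter, HMM-HMM alignment, and E-value cutoffs.

```python
import math

ALPHABET = "AGW"  # toy alphabet (assumption); proteins use 20 letters


def one_hot_profile(seq, weight=0.8):
    """Turn a sequence into a profile with `weight` on each observed letter."""
    rest = (1.0 - weight) / (len(ALPHABET) - 1)
    return [{a: (weight if a == c else rest) for a in ALPHABET} for c in seq]


def consensus(profile):
    return "".join(max(col, key=col.get) for col in profile)


def prefilter_score(query, target):
    """Cheap filter: number of positions where the consensus letters agree."""
    return sum(x == y for x, y in zip(consensus(query), consensus(target)))


def profile_score(query, target):
    """Slower ungapped profile-profile score (sum of log-sum-of-odds columns)."""
    bg = 1.0 / len(ALPHABET)
    return sum(
        math.log2(sum(q[a] * t[a] / bg for a in ALPHABET))
        for q, t in zip(query, target)
    )


def merge(profiles):
    """Average several equal-length profiles column by column."""
    n = len(profiles)
    return [
        {a: sum(p[i][a] for p in profiles) / n for a in ALPHABET}
        for i in range(len(profiles[0]))
    ]


def iterative_search(query_seq, database, rounds=2, keep=2, threshold=1.0):
    query = one_hot_profile(query_seq)
    accepted = {}
    for _ in range(rounds):
        # Prefilter: keep only the `keep` best candidates by the cheap score.
        candidates = sorted(
            database,
            key=lambda name: prefilter_score(query, database[name]),
            reverse=True,
        )[:keep]
        # Run the expensive comparison only on the survivors.
        for name in candidates:
            if profile_score(query, database[name]) >= threshold:
                accepted[name] = database[name]
        # Rebuild the query profile from the query plus all accepted hits.
        query = merge([one_hot_profile(query_seq)] + list(accepted.values()))
    return sorted(accepted)


database = {
    name: one_hot_profile(seq)
    for name, seq in {
        "close": "AAAGWW",
        "remote": "AGGGWW",
        "unrelated1": "GGGGGG",
        "unrelated2": "WGWGAA",
    }.items()
}

first_round = iterative_search("AAAAWW", database, rounds=1)
second_round = iterative_search("AAAAWW", database, rounds=2)
print(first_round, second_round)
```

With these toy inputs, the "remote" entry falls below the threshold in the first round and is accepted only in the second, after the "close" hit has been merged into the query profile. This is the effect that makes iterative profile searches more sensitive than a single pass.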

The HH-suite comes with a number of pre-built profile HMM databases that can be searched using HHblits and HHsearch, among them databases derived from a clustered version of the UniProt database, the Protein Data Bank of proteins with known structures, Pfam protein family alignments, SCOP structural protein domains, and many more.[9]

Applications


Applications of HHpred and HHsearch include protein structure prediction, complex structure prediction, function prediction, domain prediction, domain boundary prediction, and evolutionary classification of proteins.[10]

HHsearch is often used for homology modeling, that is, to build a model of the structure of a query protein for which only the sequence is known. For that purpose, a database of proteins with known structures, such as the Protein Data Bank (PDB), is searched for "template" proteins similar to the query protein. If such a template is found, the structure of the protein of interest can be predicted from a pairwise sequence alignment of the query with the template protein sequence. A search through the PDB database of proteins with solved 3D structures takes a few minutes. If a significant match with a protein of known structure is found, HHpred allows the user to build a homology model using the MODELLER software, starting from the pairwise query-template alignment.

HHpred servers were ranked among the best servers in the blind protein structure prediction experiments CASP7, CASP8, and CASP9. In CASP9, HHpredA, B, and C were ranked 1st, 2nd, and 3rd out of 81 participating automatic structure prediction servers in template-based modeling[11] and 6th, 7th, and 8th on all 147 targets, while being much faster than the best 20 servers.[12] In CASP8, HHpred was ranked 7th on all targets and 2nd on the subset of single-domain proteins, while still being more than 50 times faster than the top-ranked servers.[4]

Contents


In addition to HHsearch and HHblits, the HH-suite contains programs and Perl scripts for format conversion, filtering of MSAs, generation of profile HMMs, the addition of secondary structure predictions to MSAs, the extraction of alignments from program output, and the generation of customized databases.

hhblits          (Iteratively) search an HHblits database with a query sequence or MSA
hhsearch         Search an HHsearch database of HMMs with a query MSA or HMM
hhmake           Build an HMM from an input MSA
hhfilter         Filter an MSA by maximum sequence identity, coverage, and other criteria
hhalign          Calculate pairwise alignments, dot plots etc. for two HMMs/MSAs
reformat.pl      Reformat one or many MSAs
addss.pl         Add Psipred predicted secondary structure to an MSA or HHM file
hhmakemodel.pl   Generate MSAs or coarse 3D models from HHsearch or HHblits results
hhblitsdb.pl     Build HHblits database with prefiltering, packed MSA/HMM, and index files
multithread.pl   Run a command for many files in parallel using multiple threads
splitfasta.pl    Split a multiple-sequence FASTA file into multiple single-sequence files
renumberpdb.pl   Generate PDB file with indices renumbered to match input sequence indices
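
A typical use of the two main programs chains them: hhblits builds an MSA for the query, and hhsearch then searches a database with that MSA. The Python sketch below only assembles the command lines and does not run them. The helper names, file names, and database paths are hypothetical placeholders, and the options shown (`-i`, `-d`, `-o`, `-oa3m`, `-n`, `-cpu`) should be checked against the help output of the installed HH-suite version.

```python
import shlex


def hhblits_command(query, database, out_a3m, iterations=3, cpu=2):
    """Build an hhblits call that writes the final MSA in A3M format."""
    return [
        "hhblits",
        "-i", query,            # input sequence or MSA
        "-d", database,         # database basename (hypothetical path below)
        "-oa3m", out_a3m,       # output MSA in A3M format
        "-n", str(iterations),  # number of search iterations
        "-cpu", str(cpu),       # number of CPU threads
    ]


def hhsearch_command(query_a3m, database, out_hhr, cpu=2):
    """Build an hhsearch call that searches an HMM database with an MSA."""
    return [
        "hhsearch",
        "-i", query_a3m,
        "-d", database,
        "-o", out_hhr,          # result file with ranked hits and alignments
        "-cpu", str(cpu),
    ]


# Hypothetical file and database names, for illustration only.
step1 = hhblits_command("query.fasta", "databases/uniclust", "query.a3m")
step2 = hhsearch_command("query.a3m", "databases/pdb70", "query.hhr")
pipeline = [shlex.join(step1), shlex.join(step2)]
```

Each list can be passed to `subprocess.run` on a system where HH-suite and the corresponding databases are installed; `pipeline` holds the same commands as shell-ready strings.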

The HMM-HMM alignment algorithm of HHblits and HHsearch was significantly accelerated using vector instructions in version 3 of the HH-suite.[13]

See also


References

  1. Debian hhsuite package
  2. Söding J (2005). "Protein homology detection by HMM-HMM comparison". Bioinformatics. 21 (7): 951–960. doi:10.1093/bioinformatics/bti125. hdl:11858/00-001M-0000-0017-EC7A-F. PMID 15531603.
  3. Remmert M, Biegert A, Hauser A, Söding J (2011). "HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment" (PDF). Nat. Methods. 9 (2): 173–175. doi:10.1038/NMETH.1818. hdl:11858/00-001M-0000-0015-8D56-A. PMID 22198341. S2CID 205420247.
  4. Söding J, Biegert A, Lupas AN (2005). "The HHpred interactive server for protein homology detection and structure prediction". Nucleic Acids Research. 33 (Web Server issue): W244–248. doi:10.1093/nar/gki408. PMC 1160169. PMID 15980461.
  5. Citations to HHpred, to HHsearch, to HHblits
  6. Jaroszewski L, Rychlewski L, Godzik A (2000). "Improving the quality of twilight-zone alignments". Protein Science. 9 (8): 1487–1496. doi:10.1110/ps.9.8.1487. PMC 2144727. PMID 10975570.
  7. Sadreyev RI, Baker D, Grishin NV (2003). "Profile–profile comparisons by COMPASS predict intricate homologies between protein families". Protein Science. 12 (10): 2262–2272. doi:10.1110/ps.03197403. PMC 2366929. PMID 14500884.
  8. Dunbrack RL Jr (2006). "Sequence comparison and protein structure prediction". Current Opinion in Structural Biology. 16 (3): 374–384. doi:10.1016/j.sbi.2006.05.006. PMID 16713709.
  9. Li, Zhaoyu. "Some Notes about HHSuite". Archived from the original on 3 April 2019. Retrieved 3 April 2019.
  10. Guerler A, Govindarajoo B, Zhang Y (2013). "Mapping Monomeric Threading to Protein–Protein Structure Prediction". Journal of Chemical Information and Modeling. 53 (3): 717–725. doi:10.1021/ci300579r. PMC 4076494. PMID 23413988.
  11. Official CASP9 results for the template-based modeling category (121 targets)
  12. Official CASP9 results for all 147 targets
  13. Steinegger M, Meier M, Mirdita M, Vöhringer H, Haunsberger S, Söding J (2019). "HH-suite3 for fast remote homology detection and deep protein annotation". BMC Bioinformatics. 20 (1): 473. doi:10.1186/s12859-019-3019-7. PMC 6744700. PMID 31521110.