Jump to content

Talk:Smith–Waterman algorithm: Difference between revisions

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia
Content deleted Content added
Implementing WP:PIQA (Task 26)
 
(21 intermediate revisions by 13 users not shown)
Line 1: Line 1:
{{WikiProject banner shell|class=B|
{{WikiProjectBannerShell|1=
{{WikiProject Computational Biology|importance=top|class=B}}
{{WikiProject Molecular Biology|MCB=yes|MCB-importance=Mid|COMPBIO=yes|COMPBIO-importance=top}}
{{WikiProject Computer science|importance=mid}}
{{WikiProject Computer science|importance=mid}}
{{WikiProject Molecular and Cellular Biology|importance=Mid|class=B}}
}}
}}


== Wrong formula for matrix? ==
== Wrong formula for matrix? ==
Referring back to the 1981 paper of Smith and Waterman (http://www.cmb.usc.edu/papers/msw_papers/msw-042.pdf, formula (1)), the definition formula for the cells of the matrix H is not the same as in this article. The gap scores do not only depend on the row above and the column to the left; they depend on all rows above and all columns to the left. There are examples where this makes a difference to the final score, so the definition here is not correct.
Referring back to the 1981 paper of Smith and Waterman (ftp://150.128.97.71/pub/Bioinformatica/msw-042.pdf, formula (1)), the definition formula for the cells of the matrix H is not the same as in this article. The gap scores do not only depend on the row above and the column to the left; they depend on all rows above and all columns to the left. There are examples where this makes a difference to the final score, so the definition here is not correct.


An example: Reference:
An example: Reference:
Line 15: Line 14:
Scoring +2 for match, -6 for mismatch, -5 for gap open, -3 for gap extend, the overall score is 66 by Smith-Waterman's algorithm and 62 by the algorithm presented here. <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/217.33.180.66|217.33.180.66]] ([[User talk:217.33.180.66|talk]]) 13:15, 17 March 2014 (UTC)</span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot-->
Scoring +2 for match, -6 for mismatch, -5 for gap open, -3 for gap extend, the overall score is 66 by Smith-Waterman's algorithm and 62 by the algorithm presented here. <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/217.33.180.66|217.33.180.66]] ([[User talk:217.33.180.66|talk]]) 13:15, 17 March 2014 (UTC)</span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot-->


I have now corrected this on the page. <small><span class="autosigned">—&nbsp;Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:BarrierBank|BarrierBank]] ([[User talk:BarrierBank|talk]] • [[Special:Contributions/BarrierBank|contribs]]) 10:49, 21 March 2014 (UTC)</span></small><!-- Template:Unsigned --> <!--Autosigned by SineBot-->
I have now corrected this on the page.

I updated the algorithm using the original notation. I also explained why this difference exists in the gap penalty section.
[[User:Yz cs5160|Yz cs5160]] ([[User talk:Yz cs5160|talk]]) 07:20, 8 January 2017 (UTC)


==WPMED==
==WPMED==
Line 48: Line 50:


Smith-Waterman is for local alignment, which means that if H(i, j) is 0, then it's best for the alignment to start there. I think the example is really confusing, because it doesn't illustrate this fact, so you get an optimum local alignment that is also a global alignment. [[Special:Contributions/69.218.212.242|69.218.212.242]] ([[User talk:69.218.212.242|talk]]) 17:20, 8 August 2009 (UTC)
Smith-Waterman is for local alignment, which means that if H(i, j) is 0, then it's best for the alignment to start there. I think the example is really confusing, because it doesn't illustrate this fact, so you get an optimum local alignment that is also a global alignment. [[Special:Contributions/69.218.212.242|69.218.212.242]] ([[User talk:69.218.212.242|talk]]) 17:20, 8 August 2009 (UTC)

See section "Comparison with the Needleman–Wunsch algorithm" for the explanation. I also redid the example completely. [[User:Yz cs5160|Yz cs5160]] ([[User talk:Yz cs5160|talk]]) 07:35, 8 January 2017 (UTC)


== Performance reports ==
== Performance reports ==
Line 63: Line 67:
==Example needs to totally be redone==
==Example needs to totally be redone==
As a commenter above points out, this example is not good. The example in the article currently results in a local alignment which is also a global alignment. As the whole point of Smith-Waterman is to find the best local alignment, the example should find the best alignment of the best matching substring(s) in the 2 strings. I don't have time to fix this right now, but wanted point that out. --[[User:Rajah|Rajah]] ([[User talk:Rajah|talk]]) 03:46, 23 March 2010 (UTC)
As a commenter above points out, this example is not good. The example in the article currently results in a local alignment which is also a global alignment. As the whole point of Smith-Waterman is to find the best local alignment, the example should find the best alignment of the best matching substring(s) in the 2 strings. I don't have time to fix this right now, but wanted point that out. --[[User:Rajah|Rajah]] ([[User talk:Rajah|talk]]) 03:46, 23 March 2010 (UTC)

:I'd also like a better example, or perhaps two examples, a simple one, and a longer one. I am having a hard time understanding how the algorithm works. What I have gathered so far is that it will build a matrix, and this matrix has the optimal alignment. But from this part, how to get the optimal sequence, is not completely clear to me. [[Special:Contributions/2A02:8388:1641:8D00:BE5F:F4FF:FECD:7CB2|2A02:8388:1641:8D00:BE5F:F4FF:FECD:7CB2]] ([[User talk:2A02:8388:1641:8D00:BE5F:F4FF:FECD:7CB2|talk]]) 15:44, 28 May 2016 (UTC)

The example was totally redone. An animated image was also added to demonstrate the algorithm. Please let me know if it helps. [[User:Yz cs5160|Yz cs5160]] ([[User talk:Yz cs5160|talk]]) 07:39, 8 January 2017 (UTC)


== Acceleration ==
== Acceleration ==
Line 74: Line 82:


The first reference to the paper in PDF seems to be dead, it looks like the university blocked it. There is another copy in here http://www.cmb.usc.edu/papers/msw_papers/msw-042.pdf but I'm not sure if putting that (or for that matter the link to the PDF that was here before) would involve copyright infringement. Can somebody enlighten me about this? [[User:Jorgeng87|Jorgeng87]] ([[User talk:Jorgeng87|talk]]) 19:21, 13 November 2012 (UTC)
The first reference to the paper in PDF seems to be dead, it looks like the university blocked it. There is another copy in here http://www.cmb.usc.edu/papers/msw_papers/msw-042.pdf but I'm not sure if putting that (or for that matter the link to the PDF that was here before) would involve copyright infringement. Can somebody enlighten me about this? [[User:Jorgeng87|Jorgeng87]] ([[User talk:Jorgeng87|talk]]) 19:21, 13 November 2012 (UTC)

== External links modified ==

Hello fellow Wikipedians,

I have just added archive links to {{plural:1|one external link|1 external links}} on [[Smith–Waterman algorithm]]. Please take a moment to review [https://en.wikipedia.org/enwiki/w/index.php?diff=prev&oldid=704686663 my edit]. If necessary, add {{tlx|cbignore}} after the link to keep me from modifying it. Alternatively, you can add {{tlx|nobots|deny{{=}}InternetArchiveBot}} to keep me off the page altogether. I made the following changes:
*Added archive https://web.archive.org/20080513152726/http://www.genomequest.com/contact-bioinformatics-ht.html to http://www.genomequest.com/contact-bioinformatics-ht.html

When you have finished reviewing my changes, please set the ''checked'' parameter below to '''true''' to let others know.

{{sourcecheck|checked=false}}

Cheers.—[[User:Cyberbot II|<sup style="color:green;font-family:Courier;">cyberbot II</sup>]]<small><sub style="margin-left:-14.9ex;color:green;font-family:Comic Sans MS;">[[User talk:Cyberbot II|<span style="color:green;">Talk to my owner</span>]]:Online</sub></small> 23:37, 12 February 2016 (UTC)

== Alignment construction explanation needs improvement ==

"Then, go backwards to one of positions (i − 1,j), (i, j − 1), and (i − 1, j − 1) depending on the direction of movement used to construct the matrix. "
This doesn't explain which direction the movement should go. I assume that the movement is diagonal if the value in the cell comes from a match. I'm trying to figure out this algorithm for some work I'm doing so if I figure it out, I'll contribute a clearer explanation.
[[User:Stochphon|Stochphon]] ([[User talk:Stochphon|talk]]) 05:35, 8 October 2016 (UTC)

: I updated the explanation. The movement depends on the source of each score, i.e., if match/mismatch from <math>H_{i-1,j-1}</math> gives the highest score, then go diagonal; if adding gap(s) horizontally or vertically gives the highest score, then go left or up; if none of them are positive, then this cell gets 0 and has no source, and traceback will end here if this cell is encountered. Note that:
:: Alternative paths are generated if a cell being traced back has more than one source.
:: Scores come not necessarily from cells directly adjacent to them (<math>H_{i-1,j}</math> and <math>H_{i,j-1}</math>). Instead, the gaps may come directly from all positions on that row or column that are left (<math>H_{i,j-l}</math>) or up (<math>H_{i-k,j}</math>) to the cell being scored.
: [[User:Yz cs5160|Yz cs5160]] ([[User talk:Yz cs5160|talk]]) 08:12, 8 January 2017 (UTC)

== External links modified ==

Hello fellow Wikipedians,

I have just modified 5 external links on [[Smith–Waterman algorithm]]. Please take a moment to review [https://en.wikipedia.org/enwiki/w/index.php?diff=prev&oldid=803739609 my edit]. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit [[User:Cyberpower678/FaQs#InternetArchiveBot|this simple FaQ]] for additional information. I made the following changes:
*Added archive https://web.archive.org/web/20080705191338/http://ft.ornl.gov/~olaf/pubs/OlafRSSI2July07.pdf to http://ft.ornl.gov/~olaf/pubs/OlafRSSI2July07.pdf
*Added archive https://web.archive.org/web/20080705191314/http://ft.ornl.gov/~olaf/pubs/CUG07Olaf17M07.pdf to http://ft.ornl.gov/~olaf/pubs/CUG07Olaf17M07.pdf
*Added archive https://web.archive.org/web/20110720080855/http://ft.ornl.gov/~olaf/pubs/RSSIOlafDave.pdf to http://ft.ornl.gov/~olaf/pubs/RSSIOlafDave.pdf
*Corrected formatting/usage for http://www.genomequest.com/contact-bioinformatics-ht.html
*Added archive https://web.archive.org/web/20070811101052/http://www.clccell.com/download.html to http://www.clccell.com/download.html

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

{{sourcecheck|checked=false|needhelp=}}

Cheers.—[[User:InternetArchiveBot|'''<span style="color:darkgrey;font-family:monospace">InternetArchiveBot</span>''']] <span style="color:green;font-family:Rockwell">([[User talk:InternetArchiveBot|Report bug]])</span> 09:37, 4 October 2017 (UTC)

== Implementation ==

The implementation section needs to be redone (maybe merged with the section "Accelerated versions"). Unfortunately I'm not familiar with those implementations. I would appreciate it if someone can help rewrite those two sections. It'll help people better use the algorithm.
[[User:Yz cs5160|Yz cs5160]] ([[User talk:Yz cs5160|talk]]) 18:17, 23 October 2017 (UTC)

== Running time: quadratic or qubic?? ==

The intro paragraph says "cubic", which seems to match the pseudocode.
The box on the right says <math>O(mn)</math>.
The history mentions cubic again, but points out to *different* algorithms that run in quadratic time (Gotoh and Altschul et al.).
Finally, under "Gap Penalty" one can see that the latter algorithms only solve an (important) special case of the problem, and the former doesn't even necessarily solve it!
[[User:Aviad.rubinstein|Aviad.rubinstein]] ([[User talk:Aviad.rubinstein|talk]]) 21:28, 7 September 2018 (UTC)

Latest revision as of 08:01, 17 February 2024

Wrong formula for matrix?

[edit]

Referring back to the 1981 paper of Smith and Waterman (ftp://150.128.97.71/pub/Bioinformatica/msw-042.pdf, formula (1)), the definition formula for the cells of the matrix H is not the same as in this article. The gap scores do not only depend on the row above and the column to the left; they depend on all rows above and all columns to the left. There are examples where this makes a difference to the final score, so the definition here is not correct.

An example: Reference: AACCCGAGTGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAAAGCAATGGCAAGAATTGGAATGGAAAGGAATC Read: GAATGGAATGGAATGCAATGGAATGGAATCATCCGTAATGGAATGGAAAGCAATGGAATG

Scoring +2 for match, -6 for mismatch, -5 for gap open, -3 for gap extend, the overall score is 66 by Smith-Waterman's algorithm and 62 by the algorithm presented here. — Preceding unsigned comment added by 217.33.180.66 (talk) 13:15, 17 March 2014 (UTC)[reply]

I have now corrected this on the page. — Preceding unsigned comment added by BarrierBank (talkcontribs) 10:49, 21 March 2014 (UTC)[reply]

I updated the algorithm using the original notation. I also explained why this difference exists in the gap penalty section. Yz cs5160 (talk) 07:20, 8 January 2017 (UTC)[reply]

WPMED

[edit]

I am not sure this article falls within the scope of WP:MED. Anyone who agrees, please delete the WPMED tag on this page. --Una Smith (talk) 04:29, 29 December 2007 (UTC) Biology - Yes, Medicine - well, there is very remote connection - but no. Crenshaw (talk) 02:03, 19 January 2010 (UTC)[reply]

Pseudocode of Algorithm and Remarks to Code are missing

[edit]

There is the simple but efficient pseudocode of the algorithm missing. It is tailored to fit to the properties of genes in common genomes. Maybe someone can link the properties of common (for example eucaryotic cell) genes having introns and exons to the properties of Smith-Waterman or pairwise Smith-Waterman pSW --84.157.243.81 (talk) 09:32, 24 July 2008 (UTC)[reply]

Eq. -- why the 0?

[edit]

Shouldn't be

What's that zero doing in the B-matrix? —Preceding unsigned comment added by 129.170.215.143 (talk) 23:38, 18 June 2009 (UTC)[reply]

Smith-Waterman is for local alignment, which means that if H(i, j) is 0, then it's best for the alignment to start there. I think the example is really confusing, because it doesn't illustrate this fact, so you get an optimum local alignment that is also a global alignment. 69.218.212.242 (talk) 17:20, 8 August 2009 (UTC)[reply]

See section "Comparison with the Needleman–Wunsch algorithm" for the explanation. I also redid the example completely. Yz cs5160 (talk) 07:35, 8 January 2017 (UTC)[reply]

Performance reports

[edit]

I find this article similar to post with "me faster" signs. As someone who actually takes part in "competition" of speeding up Smith-Waterman for different architectures I think it's a bit NPOV. (disclaimer: i haven't add mine implementation yet to the article but I intend to do so).


Performance results should be reported not as a single value but a graph with performance plotted against query size and database used. For example Fastflow implementation achieves 35GCUPS for __very__ long queries that are barely if ever used in practice. I think it's a bit misleading.

I guess that I'll rewrite this article and put pefrormance graph for different algorithms. But it would be nice if someone reviewed the article afterwards (coz I'll add my own work on SW too) to eliminate potential NPOV.

Crenshaw (talk) —Preceding undated comment added 22:20, 30 January 2010 (UTC).[reply]

Agree. The "754x speedup" is meaningless - compared to what? Ketil (talk) 21:32, 11 December 2011 (UTC)[reply]

Example needs to totally be redone

[edit]

As a commenter above points out, this example is not good. The example in the article currently results in a local alignment which is also a global alignment. As the whole point of Smith-Waterman is to find the best local alignment, the example should find the best alignment of the best matching substring(s) in the 2 strings. I don't have time to fix this right now, but wanted point that out. --Rajah (talk) 03:46, 23 March 2010 (UTC)[reply]

I'd also like a better example, or perhaps two examples, a simple one, and a longer one. I am having a hard time understanding how the algorithm works. What I have gathered so far is that it will build a matrix, and this matrix has the optimal alignment. But from this part, how to get the optimal sequence, is not completely clear to me. 2A02:8388:1641:8D00:BE5F:F4FF:FECD:7CB2 (talk) 15:44, 28 May 2016 (UTC)[reply]

The example was totally redone. An animated image was also added to demonstrate the algorithm. Please let me know if it helps. Yz cs5160 (talk) 07:39, 8 January 2017 (UTC)[reply]

Acceleration

[edit]

It seems to me that "shows considerable promise for both speed of the algorithm and a more accessible programming model" is just marketing with no numbers and no substance.

en-dashes

[edit]

Please: As you see, and en-dash rather than a hyphen is used in the title of this article, and of course should also be used in the body of the article. I just cleaned up a lot of that. Similarly, "Needleman-Wunsch" is wrong and "Needleman–Wunsch" is right. (Also in ranges of pages, years, etc., e.g.: pp. 305–42. See WP:MOS. Michael Hardy (talk) 17:25, 7 August 2012 (UTC)[reply]

Dead reference

[edit]

The first reference to the paper in PDF seems to be dead, it looks like the university blocked it. There is another copy in here http://www.cmb.usc.edu/papers/msw_papers/msw-042.pdf but I'm not sure if putting that (or for that matter the link to the PDF that was here before) would involve copyright infringement. Can somebody enlighten me about this? Jorgeng87 (talk) 19:21, 13 November 2012 (UTC)[reply]

[edit]

Hello fellow Wikipedians,

I have just added archive links to one external link on Smith–Waterman algorithm. Please take a moment to review my edit. If necessary, add {{cbignore}} after the link to keep me from modifying it. Alternatively, you can add {{nobots|deny=InternetArchiveBot}} to keep me off the page altogether. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true to let others know.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—cyberbot IITalk to my owner:Online 23:37, 12 February 2016 (UTC)[reply]

Alignment construction explanation needs improvement

[edit]

"Then, go backwards to one of positions (i − 1,j), (i, j − 1), and (i − 1, j − 1) depending on the direction of movement used to construct the matrix. " This doesn't explain which direction the movement should go. I assume that the movement is diagonal if the value in the cell comes from a match. I'm trying to figure out this algorithm for some work I'm doing so if I figure it out, I'll contribute a clearer explanation. Stochphon (talk) 05:35, 8 October 2016 (UTC)[reply]

I updated the explanation. The movement depends on the source of each score, i.e., if match/mismatch from gives the highest score, then go diagonal; if adding gap(s) horizontally or vertically gives the highest score, then go left or up; if none of them are positive, then this cell gets 0 and has no source, and traceback will end here if this cell is encountered. Note that:
Alternative paths are generated if a cell being traced back has more than one source.
Scores come not necessarily from cells directly adjacent to them ( and ). Instead, the gaps may come directly from all positions on that row or column that are left () or up () to the cell being scored.
Yz cs5160 (talk) 08:12, 8 January 2017 (UTC)[reply]
[edit]

Hello fellow Wikipedians,

I have just modified 5 external links on Smith–Waterman algorithm. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 09:37, 4 October 2017 (UTC)[reply]

Implementation

[edit]

The implementation section needs to be redone (maybe merged with the section "Accelerated versions"). Unfortunately I'm not familiar with those implementations. I would appreciate it if someone can help rewrite those two sections. It'll help people better use the algorithm. Yz cs5160 (talk) 18:17, 23 October 2017 (UTC)[reply]

Running time: quadratic or qubic??

[edit]

The intro paragraph says "cubic", which seems to match the pseudocode. The box on the right says . The history mentions cubic again, but points out to *different* algorithms that run in quadratic time (Gotoh and Altschul et al.). Finally, under "Gap Penalty" one can see that the latter algorithms only solve an (important) special case of the problem, and the former doesn't even necessarily solve it! Aviad.rubinstein (talk) 21:28, 7 September 2018 (UTC)[reply]