
Talk:FASTA format

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 202.131.227.149 (talk) at 10:56, 13 November 2009. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.


Merge with Fasta Sequence?

  1. support- Yeah, the article "Fasta Sequence" should just be merged into the article "FASTA format". A FASTA sequence isn't a good term anyway!
Merged. Changed FASTA sequence into a redirect. I tried to keep most of the material so that no previous work is lost. Anybody who cares, feel free to cut. Miguel Andrade 04:09, 30 March 2006 (UTC)
What does FASTA stand for? -- Preceding unsigned comment added by 71.157.135.130 (talkcontribs) 06:15, 12 December 2006 (UTC)
Snipped from the FASTA page: "FASTA is pronounced 'FAST-Aye', and stands for 'FAST-All', because it works with any alphabet, an extension of 'FAST-P' (protein) and 'FAST-N' (nucleotide) alignment." Paul Cotney 13:30, 15 May 2007 (UTC)

Comments in a FASTA record

I removed most of the useless and distracting references to comments in the FASTA file because no one supports them, and no one has for over 15 years. For details, see my research -- Andrew Dalke

Not true. Comments starting with a semicolon are supported, for instance, by the function read.fasta in the seqinR package and by the function readFASTA in the Biostrings package. 90.42.8.116 (talk)

Okay, then "almost no one". Who publishes/provides data in FASTA format with comments? What was the driving motivation for adding that support? Where's the push for Bioperl and other tools to support it? -- Andrew Dalke

As scientists we care about backward compatibility, which is essential for reproducibility. The fact that new tools such as Bioperl (or other recent me-too products in the bioinformatics world) do not support reproducibility is in no way a good argument for removing the documentation about the original format, which is sourced. This is Wikipedia, not the Bioperl point of view. 90.48.97.179 (talk) 21:28, 4 February 2009 (UTC)

Where's the data set where you have backwards compatibility concerns? I haven't seen such data outside of a test set from FASTA itself. I'm not arguing from a Bioperl point of view - I'm talking about the consensus view, held by the large majority of existing tools, of what a FASTA file is. If that isn't the Wikipedia point of view, then I don't know what is. Or, as I point out on that page which summarizes my research, the world uses something much closer to the NCBI FASTA format than the Pearson FASTA format. - Andrew Dalke -- Preceding unsigned comment added by 83.248.213.42 (talk) 12:36, 12 August 2009 (UTC)

I think you are missing the real point: backwards compatibility is not a problem, because comment lines are to be ignored and so don't matter that much. But "forward compatibility" does matter. With comments, one can include any additional information in a file, and future programs can rely on the comment lines to carry it. Without them, you are bound to reinvent the wheel and design a new, incompatible format just to be able to add more information.

If you leave the details about comments in the article, you allow future developers to add any information they want without breaking backwards compatibility. If they do not know that comments exist, they are bound to reinvent them, most probably in an incompatible way, thus breaking all pre-existing programs. If you want an example: any "FASTA" sequence with comments starting with hash marks "#" is NOT a FASTA sequence and will break all existing software. So, hiding information only suits selfish egos and creates problems for the future. Since Pearson was wise enough to prepare for the future, ignoring it is not only silly, it is also preposterous! -- Jose R Valverde
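
For illustration only, the distinction argued over in this thread can be sketched in Python. This is a minimal sketch, not the behaviour of any particular tool: the function name classify_line and the sample lines are hypothetical, and it assumes a reader that treats ">" as a record header and ";" as a Pearson-style comment. Under those assumptions a "#" line is not recognised as a comment and ends up mixed into the sequence data.

```python
def classify_line(line):
    """Classify one line under the assumed rules: '>' header, ';' comment."""
    if line.startswith(">"):
        return "header"
    if line.startswith(";"):
        return "comment"
    # Anything else, including a line starting with '#', counts as sequence.
    return "sequence"


# Hypothetical sample record, made up for this sketch.
lines = [
    ">seq1 demo record",
    "; a semicolon comment, skipped by this reader",
    "# a hash-mark 'comment', not part of the format",
    "ACGT",
]

kinds = [classify_line(line) for line in lines]

# Joining everything classified as sequence shows the '#' text leaking in.
residues = "".join(line for line in lines if classify_line(line) == "sequence")
```

A reader that does not know about ";" at all would treat both comment lines as sequence data, which is the compatibility problem both sides of the thread describe.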

Copyright?

Looks like the text under "Format" is copied from the NCBI website. Either that, or the NCBI copied from here. Does this need to be changed? -- Preceding unsigned comment added by 128.193.214.112 (talk) 22:07, 3 October 2007 (UTC)

The NCBI website, however, is a US Federal Government website, and since NCBI is the author of the text in question, that text is in the public domain. It should still be cited as the source. --AEMoreira042281 15:19, 8 November 2007 (UTC)

HUPO-PSI Format to another page

I suggest moving this addition on HUPO-PSI Format to another page, as it confuses the basic article, and opens the door to n other detailed sequence record proposals.

This is very worthy as a new sequence format proposal, but it is not Fasta format. HUPO-PSI is one of several variants that have been proposed or built to bring documentation and record structure back into Fasta (which was designed for its simplicity). It would be worth referring to other such formats in this article. NCBI's defline format is currently the most common variant of Fasta. It would also help to explain why Fasta is not the solution for documented, structured sequence records. Dongilbert (talk) 02:47, 18 February 2008 (UTC)

I do agree, the HUPO-PSI section should be moved to another page; here it is confusing for someone who is looking for the FASTA format. I'd like to do this, but I'm unsure how to fix it without causing trouble. 90.42.137.27 (talk) 19:58, 28 March 2009 (UTC)

Yes! Move the HUPO-PSI section to its own page. 90.53.111.75 (talk) 22:46, 30 April 2009 (UTC)

OK, I'm going to delete the disturbing HUPO-PSI stuff. Any objection? 83.205.188.235 (talk) 17:56, 26 October 2009 (UTC)

Done 90.14.223.244 (talk) 22:01, 9 November 2009 (UTC)

Great page

I'm not sure if this is the appropriate place for this, but I just wanted to say that as a molecular biology student I found this article extremely useful and very reader friendly. Cheers everybody, great work! --Ar-Pharazôn (talk) 19:34, 24 June 2009 (UTC)

Strange comment/question in first section

The first section ("Format") contains the following line:

"-- If FASTA files can contain multiple sequences, as suggested by the text below, that is a critical part of the format specification and should be described up front here please. If they cannot contain multiple sequences, this point should be clarified here."

However, right above that, the main text addresses that very question.

"The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence. A simple example of one sequence in FASTA format [follows]:" (emphasis mine)

This seems crystal-clear to me; in fact, the question is the most jarring part of the section. I'm tempted just to remove the question/comment, but I'm new around here so I don't want to overstep!

Jdrum00 (talk) 05:36, 2 September 2009 (UTC)
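
As a concrete illustration of the rule quoted above (a sequence ends when another line starting with ">" appears), here is a minimal Python sketch of splitting a multi-sequence file into records. The function name split_records and the sample text are hypothetical, and the sketch deliberately ignores comment lines and any validation of the sequence alphabet.

```python
def split_records(text):
    """Return a list of (description, sequence) pairs from FASTA-like text."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new '>' line closes the previous record and opens the next.
            records.append([line[1:], ""])
        elif records:
            # Sequence lines are concatenated onto the current record.
            records[-1][1] += line
    return [(description, sequence) for description, sequence in records]


# Hypothetical two-record input, made up for this sketch.
sample = ">a first\nACGT\nTT\n>b second\nGGCC\n"
records = split_records(sample)
```

With this input the two ">" lines yield two records, and the sequence lines between them are joined without line breaks.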

GenBank                           gi|gi-number|gb|accession|locus
EMBL Data Library                 gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan       gi|gi-number|dbj|accession|locus
NBRF PIR                          pir||entry
Protein Research Foundation       prf||name
SWISS-PROT                        sp|accession|name
Brookhaven Protein Data Bank (1)  pdb|entry|chain
Brookhaven Protein Data Bank (2)  entry:chain|PDBID|CHAIN|SEQUENCE
Patents                           pat|country|number 
GenInfo Backbone Id               bbs|number 
General database identifier       gnl|database|identifier
NCBI Reference Sequence           ref|accession|locus
Local Sequence identifier         lcl|identifier
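
The identifier layouts listed above lend themselves to a small table-driven parser. The following Python sketch is an assumption-laden illustration, not NCBI's own code: the names FIELDS and parse_identifier are hypothetical, the sample identifiers are made up, and the second Protein Data Bank form (entry:chain|PDBID|CHAIN|SEQUENCE) is not handled.

```python
# Field names per database tag, transcribed from the table above.
# None marks a slot that the table leaves empty (as in "pir||entry").
FIELDS = {
    "gi": ("gi-number",),
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_identifier(identifier):
    """Split a '|'-delimited identifier into {tag: {field: value}}."""
    tokens = identifier.split("|")
    parsed = {}
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        if tag not in FIELDS:
            raise ValueError(f"unknown database tag: {tag!r}")
        names = FIELDS[tag]
        values = tokens[i + 1 : i + 1 + len(names)]
        # Keep only the named slots; empty placeholder slots are dropped.
        parsed[tag] = {n: v for n, v in zip(names, values) if n is not None}
        i += 1 + len(names)
    return parsed


# Made-up identifiers that follow the layouts in the table.
combined = parse_identifier("gi|12345|gb|ABC123|LOCUS1")
pir_only = parse_identifier("pir||S12345")
```

A "gi" tag consumes one field and is then followed by a second database tag, which is why the loop advances by the number of fields each tag declares.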