
General feature format: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
KolbertBot (talk | contribs)
m Bot: HTTP→HTTPS (v485)
added bioinfo template
 
(44 intermediate revisions by 27 users not shown)
Line 1: Line 1:
{{Short description|File format for genomic features}}
The '''general feature format''' ('''gene-finding format''', '''generic feature format''', '''GFF''') is a [[file format]] used for describing [[gene]]s and other features of [[DNA]], [[RNA]] and [[protein]] sequences. The [[filename extension]] associated with such files is <code>.GFF</code> and the [[content type]] associated with them is <code>text/x-gff3</code>.
{{Infobox file format
| name = General feature format
| extensions = <code>.gff</code>, <code>.gff3</code>
| mime = {{code|text/gff3}}
| uniform_type =
| conforms_to =
| magic =
| developer = Sanger Centre (v2), Sequence Ontology Project (v3)
| released =
| latest_release_version =
| latest_release_date = <!-- {{start date and age|YYYY|mm|dd|df=yes/no}} -->
| genre = [[Bioinformatics]]
| standard = <!-- or: | standards = -->
| open = yes
| extended_from = [[Tab-separated values]]
| url = {{URL|https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md}}
}}


In [[bioinformatics]], the '''general feature format''' ('''gene-finding format''', '''generic feature format''', '''GFF''') is a [[file format]] used for describing [[gene]]s and other features of [[DNA]], [[RNA]] and [[protein]] sequences.
There are two versions of the GFF file format in general use:
* [http://www.sanger.ac.uk/resources/software/gff/spec.html General Feature Format Version 2 (Sanger Institute)]
* [https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md Generic Feature Format Version 3 (Sequence Ontology Project)]

Servers that generate this format:

{| class="wikitable"
! Server !! Example file
|-
| [[UniProt]] || [https://www.uniprot.org/uniprot/P0A7B8.gff]
|-
|}

Clients that use this format:

{| class="wikitable"
! Name !! Description !! Links
|-
| GBrowse || GMOD genome viewer || [http://gmod.org/wiki/Gbrowse GBrowse]
|-
| IGB || Integrated Genome Browser || [[Integrated Genome Browser]]
|-
| Jalview || A multiple sequence alignment editor & viewer || [[Jalview]]
|-
| STRAP || Underlining sequence features in multiple alignments. Example output: [https://web.archive.org/web/20090613045440/http://www.charite.de/bioinf/strap/exampleOutput.html] || [http://3d-alignment.eu/]
|-
| JBrowse || JBrowse is a fast, embeddable genome browser built completely with JavaScript and HTML5 || [http://jbrowse.org JBrowse.org]
|-
| ZENBU || A collaborative, omics data integration and interactive visualization system || [http://fantom.gsc.riken.jp/zenbu/]
|}


==GFF Versions==
==GFF Versions==
The following versions of GFF exist:
[http://gmod.org/wiki/GFF2 GFF Version 2] has a number of deficiencies, notably that it can only represent two-level feature hierarchies and thus cannot handle the three-level hierarchy of gene → transcript → exon.
* [http://gmod.org/wiki/GFF2 General Feature Format Version 2], generally deprecated
GFF3 addresses this and other deficiencies. For example, it supports arbitrarily many hierarchical levels, and gives specific meanings to certain tags in the attributes field.
** [http://mblab.wustl.edu/GTF22.html Gene Transfer Format 2.2], a derivative used by Ensembl
*[https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md Generic Feature Format Version 3]
** [https://github.com/The-Sequence-Ontology/Specifications/blob/master/gvf.md Genome Variation Format], with additional pragmas and attributes for sequence_alteration features


GFF2/GTF had a number of deficiencies, notably that it can only represent two-level feature hierarchies and thus cannot handle the three-level hierarchy of gene → transcript → exon. GFF3 addresses this and other deficiencies. For example, it supports arbitrarily many hierarchical levels, and gives specific meanings to certain tags in the attributes field.
The [[Gene transfer format]] (GTF) is a refinement of GFF Version 2 and is sometimes referred to as GFF2.5.<ref>http://gmod.org/wiki/GFF3</ref>

The [[Gene transfer format|GTF]] is identical to GFF, version 2.<ref>{{Cite web |title=GFF/GTF File Format |url=https://useast.ensembl.org/info/website/upload/gff.html |url-status=live |archive-url=https://web.archive.org/web/20220615180935/https://useast.ensembl.org/info/website/upload/gff.html |archive-date=2022-06-15 |access-date=2023-11-04 |website=[[Ensembl]]}}</ref>


==GFF general structure==
==GFF general structure==
All GFF formats (GFF2, GFF3 and GTF) are tabular files with 9 fields per line, separated by tabs. They all share the same structure for the first 7 fields, while differing in the definition of the ''eighth field'' and in the content and format of the ''ninth field''. The general structure is as follows:
All GFF formats (GFF2, GFF3 and GTF) are [[tab key|tab]] delimited with 9 fields per line. They all share the same structure for the first 7 fields, while differing in the content and format of the ''ninth field''. Some field names have been changed in GFF3 to avoid confusion. For example, the "seqid" field was formerly referred to as "sequence", which may be confused with a nucleotide or amino acid chain. The general structure is as follows:


{| class="wikitable"
{| class="wikitable"
|+ General GFF structure
|+ General GFF3 structure
|-
|-
! Position index
! Position index
Line 49: Line 42:
|-
|-
| 1
| 1
| seqid
| sequence
| The name of the sequence where the feature is located.
| The name of the sequence where the feature is located.
|-
|-
| 2
| 2
| source
| source
| The algorithm or procedure that generated the feature. This is typically the name of a software or database.
| Keyword identifying the source of the feature, like a program (e.g. [[Augustus]] or [[RepeatMasker]]) or an organization (like [[The Arabidopsis Information Resource|TAIR]]).
|-
|-
| 3
| 3
| feature
| type
| The feature type name, like "gene" or "exon". In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the [http://www.sequenceontology.org/gff3.shtml standards released by the Sequence Ontology Project].
| The feature type name, like "gene" or "exon". In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the [http://www.sequenceontology.org/gff3.shtml standards released by the Sequence Ontology Project].
|-
|-
| 4
| 4
| start
| start
| Genomic start of the feature, with a '''1-base offset'''. This is in contrast with other 0-offset half-open sequence formats, like [[BED files]].
| Genomic start of the feature, with a '''1-base offset'''. This is in contrast with other 0-offset half-open sequence formats, like [[BED (file format)|BED]].
|-
|-
| 5
| 5
| end
| end
| Genomic end of the feature, with a '''1-base offset'''. This is the same end coordinate as it is in 0-offset half-open sequence formats, like [[BED files]].{{Citation needed|reason=Asserts that GFF is closed rather than itself half-open — see talk page|date=May 2017}}
| Genomic end of the feature, with a '''1-base offset'''. This is the same end coordinate as it is in 0-offset half-open sequence formats, like [[BED (file format)|BED]].{{Citation needed|reason=Asserts that GFF is closed rather than itself half-open — see talk page|date=May 2017}}
|-
|-
| 6
| 6
| score
| score
| Numeric value that generally indicates the confidence of the source on the annotated feature. A value of "." (a dot) is used to define a null value.
| Numeric value that generally indicates the confidence of the source in the annotated feature. A value of "." (a dot) is used to define a null value.
|-
|-
| 7
| 7
| strand
| strand
| Single character that indicates the [[Sense (molecular biology) strand]] of the feature; it can assume the values of "+" (positive, or 5'->3'), "-", (negative, or 3'->5'), "." (undetermined).
| Single character that indicates the [[Sense_(molecular_biology)#DNA_sense|strand]] of the feature. This can be "+" (positive, or 5'->3'), "-", (negative, or 3'->5'), "." (undetermined), or "?" for features with relevant but unknown strands.
|-
|-
| 8
| 8
| phase
| frame (GTF, GFF2) '''''or''''' phase (GFF3)
| Frame or phase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or "." (for everything else). Frame and Phase are '''not''' the same, See following subsection.
| phase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or "." (for everything else). See the section below for a detailed explanation.
|-
|-
| 9
| 9
| attributes
| Attributes.
| A list of tag-value pairs separated by a semicolon with additional information about the feature.
| All the other information pertaining to this feature. The format, structure and content of this field is the one which varies the most between the three competing file formats.
|}
|}


===The 8th field: frame or phase of CDS features===
===The 8th field: phase of CDS features===


Simply put, CDS means "CoDing Sequence". The exact meaning of the term is defined by Sequence Ontology (SO). According to the '''GFF3''' specification:<ref>{{Cite web |date=2018-11-24 |title=GFF3 specification |url=https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md |url-status=live |archive-url=https://web.archive.org/web/20230704211817/https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md |archive-date=2023-07-04 |website=[[GitHub]]}}</ref><ref>{{Cite web |date=2016-07-12 |title=GFF3 |url=http://gmod.org/wiki/GFF3 |url-status=live |archive-url=https://web.archive.org/web/20230825143502/http://gmod.org/wiki/GFF3 |archive-date=2023-08-25 |website=GMOD}}</ref>
In '''GFF2''' and '''GTF''', the 8th field indicates the '''frame''' of the feature, that is, whether the first base of the CDS segment is the first (frame 0), second (frame 1) or third (frame 2) in the codon of the ORF. The formula to derive this attribute is therefore (sum of previous features) '''mod''' 3.


{{Quote|text=For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. |author=|source=}}
Simply put, CDS means "CoDing Sequence". The exact meaning of the term is defined by Sequence Ontology (SO). In '''GFF3''', the 8th field indicates instead the '''phase''' of the CDS feature, i.e. according to SO:
{{Quote|text=where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. |author=http://gmod.org/wiki/GFF3}}. [N.B.: can't find a reference to this in SO][Found this reference, but don't know how to add it: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md ]


=== Meta Directives ===
It is therefore the '''reverse''' of the frame: (3 - (sum of previous features) '''mod''' 3) '''mod''' 3 = (3 - phase) '''mod''' 3.
In GFF files, additional meta information can be included and follows after the ## directive. This meta information can detail GFF version, sequence region, or species (full list of meta data types can be found at [https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md Sequence Ontology specifications]).


==Validation==
==GFF software==

The [[modENCODE]] project hosts an [http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online online GFF3 validation tool] with generous limits of 286.10 MB and 15 million lines.
===Servers===
Servers that generate this format:

{| class="wikitable"
! Server !! Example file
|-
| [[UniProt]] || [https://www.uniprot.org/uniprot/P0A7B8.gff]
|-
|}

===Clients===
Clients that use this format:

{| class="wikitable"
! Name !! Description !! Links
|-
| GBrowse || GMOD genome viewer || [http://gmod.org/wiki/Gbrowse GBrowse]
|-
| IGB || Integrated Genome Browser || [[Integrated Genome Browser]]
|-
| Jalview || A multiple sequence alignment editor & viewer || [[Jalview]]
|-
| STRAP || Underlining sequence features in multiple alignments. Example output: [https://web.archive.org/web/20090613045440/http://www.charite.de/bioinf/strap/exampleOutput.html]|| [http://3d-alignment.eu/]
|-
| JBrowse || JBrowse is a fast, embeddable genome browser built completely with JavaScript and HTML5 || [http://jbrowse.org JBrowse.org]
|-
| ZENBU || A collaborative, omics data integration and interactive visualization system || [http://fantom.gsc.riken.jp/zenbu/]
|}

===Validation===
The [[modENCODE]] project hosts an [https://archive.today/20121211073849/http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online online GFF3 validation tool] with generous limits of 286.10 MB and 15 million lines.


The Genome Tools software collection contains a ''gff3validator'' tool that can be used offline to validate and possibly tidy GFF3 files. An [http://genometools.org/cgi-bin/gff3validator.cgi online validation service] is also available.
The Genome Tools software collection contains a ''gff3validator'' tool that can be used offline to validate and possibly tidy GFF3 files. An [http://genometools.org/cgi-bin/gff3validator.cgi online validation service] is also available.
Line 106: Line 130:
==References==
==References==
<references/>
<references/>

{{Bioinformatics}}


{{DEFAULTSORT:General Feature Format}}
{{DEFAULTSORT:General Feature Format}}
[[Category:Computer file formats]]
[[Category:Bioinformatics]]
[[Category:Bioinformatics]]
[[Category:Biological sequence format]]

Latest revision as of 18:31, 5 June 2024

General feature format
Filename extensions: .gff, .gff3
Internet media type: text/gff3
Developed by: Sanger Centre (v2), Sequence Ontology Project (v3)
Type of format: Bioinformatics
Extended from: Tab-separated values
Open format: yes
Website: github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

In bioinformatics, the general feature format (gene-finding format, generic feature format, GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences.

GFF Versions

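The following versions of GFF exist:

* General Feature Format Version 2, generally deprecated
** Gene Transfer Format 2.2, a derivative used by Ensembl
* Generic Feature Format Version 3
** Genome Variation Format, with additional pragmas and attributes for sequence_alteration features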

GFF2/GTF has a number of deficiencies, notably that it can only represent two-level feature hierarchies and thus cannot handle the three-level hierarchy of gene → transcript → exon. GFF3 addresses this and other deficiencies. For example, it supports arbitrarily many hierarchical levels and gives specific meanings to certain tags in the attributes field.
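
As a rough illustration of a multi-level hierarchy, the sketch below links features through the `ID` and `Parent` attribute tags, which GFF3 reserves for this purpose. The feature names (`gene1`, `mRNA1`, `exon1`, `exon2`) and the helpers `build_children` and `depth` are invented for the example and are not part of any specification or library.

```python
from collections import defaultdict

# Hypothetical (ID, type, Parent) triples, as they might be read from column 9.
features = [
    ("gene1", "gene", None),
    ("mRNA1", "mRNA", "gene1"),
    ("exon1", "exon", "mRNA1"),
    ("exon2", "exon", "mRNA1"),
]


def build_children(features):
    """Map each parent ID to the IDs of the features that name it as Parent."""
    children = defaultdict(list)
    for feature_id, _type, parent in features:
        if parent is not None:
            children[parent].append(feature_id)
    return dict(children)


def depth(feature_id, parents):
    """Count the Parent links between a feature and its top-level ancestor."""
    level = 0
    while parents.get(feature_id) is not None:
        feature_id = parents[feature_id]
        level += 1
    return level


children = build_children(features)
parents = {fid: parent for fid, _type, parent in features}
```

Here `depth("exon1", parents)` counts two links (exon → mRNA → gene), which is the three-level case that a two-level format cannot express.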

The Gene Transfer Format (GTF) is identical to GFF version 2.[1]

GFF general structure

All GFF formats (GFF2, GFF3 and GTF) are tab-delimited with 9 fields per line. They all share the same structure for the first 7 fields, while differing in the content and format of the ninth field. Some field names were changed in GFF3 to avoid confusion. For example, the "seqid" field was formerly referred to as "sequence", which may be confused with a nucleotide or amino acid chain. The general structure is as follows:

General GFF3 structure
Position index Position name Description
1 seqid The name of the sequence where the feature is located.
2 source The algorithm or procedure that generated the feature. This is typically the name of a software or database.
3 type The feature type name, like "gene" or "exon". In a well-structured GFF file, all child features follow their parents in a single block (so all exons of a transcript are placed after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project.
4 start Genomic start of the feature, 1-based and inclusive. This contrasts with 0-based half-open formats such as BED, where the start coordinate of the same feature is one less.
5 end Genomic end of the feature, 1-based and inclusive. Numerically, this is the same end coordinate as in 0-based half-open formats such as BED.[citation needed]
6 score Numeric value that generally indicates the confidence of the source in the annotated feature. A value of "." (a dot) is used to define a null value.
7 strand Single character that indicates the strand of the feature. This can be "+" (positive, or 5'->3'), "-" (negative, or 3'->5'), "." (undetermined), or "?" for features with relevant but unknown strands.
8 phase Phase of CDS features: one of 0, 1 or 2 for CDS features, or "." for everything else. See the section below for a detailed explanation.
9 attributes A list of tag-value pairs separated by a semicolon with additional information about the feature.
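
The nine-column layout above can be read with a few lines of code. The following is a minimal sketch, assuming GFF3-style `tag=value` attributes (GFF2 and GTF write the ninth field differently). The function names `parse_gff3_line` and `to_bed_interval` and the example line are illustrative assumptions, not part of any published tool.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # "." marks a null value for score and phase.
    record["score"] = None if record["score"] == "." else float(record["score"])
    record["phase"] = None if record["phase"] == "." else int(record["phase"])
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if pair:
                tag, _, value = pair.partition("=")
                # GFF3 percent-encodes reserved characters in tags and values.
                attributes[unquote(tag)] = unquote(value)
    record["attributes"] = attributes
    return record


def to_bed_interval(record):
    """Convert 1-based inclusive start/end to a 0-based half-open pair."""
    return record["start"] - 1, record["end"]


# Hypothetical feature line used only for illustration.
example = "ctg123\t.\texon\t1300\t1500\t.\t+\t.\tID=exon00001;Parent=mRNA00003"
rec = parse_gff3_line(example)
```

Only the start coordinate shifts when converting to a 0-based half-open interval; the end value is unchanged. Values that hold several comma-separated entries are left as a single string in this sketch.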

The 8th field: phase of CDS features

Simply put, CDS means "CoDing Sequence". The exact meaning of the term is defined by Sequence Ontology (SO). According to the GFF3 specification:[2][3]

For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon.
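
The definition quoted above implies a simple recurrence between consecutive CDS segments of the same coding sequence. The sketch below is derived from that definition under stated assumptions: segments are given as 1-based inclusive `(start, end)` pairs in 5'->3' order on the plus strand, and the first segment starts at a codon boundary. The names `next_phase` and `assign_phases` and the coordinates are hypothetical.

```python
def next_phase(length, phase):
    """Phase of the following CDS segment, given this segment's length and phase."""
    leftover = (length - phase) % 3   # bases of an unfinished codon at the end
    return (3 - leftover) % 3         # bases the next segment must contribute


def assign_phases(segments, first_phase=0):
    """Assign a phase to each (start, end) CDS segment, listed in 5'->3' order."""
    phases = []
    phase = first_phase
    for start, end in segments:
        phases.append(phase)
        length = end - start + 1      # coordinates are 1-based and inclusive
        phase = next_phase(length, phase)
    return phases


# Hypothetical plus-strand CDS segments of one coding sequence.
segments = [(3301, 3902), (5000, 5500), (7000, 7600)]
phases = assign_phases(segments)      # [0, 1, 1]
```

The first segment is 602 bases long, which leaves two bases of an incomplete codon, so the second segment has phase 1. For minus-strand features the same recurrence applies, with segments ordered from the highest coordinate to the lowest.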

Meta Directives

In GFF files, additional meta information can be included on lines that begin with ##, known as directives. This meta information can specify the GFF version, a sequence region, or the species (the full list of directive types can be found in the Sequence Ontology specification).
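
As an illustration, directive lines can be separated from ordinary comments and feature lines by their `##` prefix. This is a minimal sketch: the helper `read_directives` and the sample lines are assumptions made for the example.

```python
def read_directives(lines):
    """Collect '##' directive lines as (name, arguments) pairs."""
    directives = []
    for line in lines:
        if not line.startswith("##"):
            continue                  # feature line or ordinary '#' comment
        parts = line[2:].split()
        if parts:
            directives.append((parts[0], parts[1:]))
    return directives


# Hypothetical file content used only for illustration.
sample = [
    "##gff-version 3",
    "##sequence-region ctg123 1 1497228",
    "# an ordinary comment, not a directive",
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001",
]
found = read_directives(sample)
```

Lines that begin with a single `#` are treated as plain comments and skipped, along with feature lines.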

GFF software

Servers

Servers that generate this format:

Server Example file
UniProt https://www.uniprot.org/uniprot/P0A7B8.gff

Clients

Clients that use this format:

Name Description Links
GBrowse GMOD genome viewer GBrowse
IGB Integrated Genome Browser Integrated Genome Browser
Jalview A multiple sequence alignment editor & viewer Jalview
STRAP Underlining sequence features in multiple alignments. Example output: https://web.archive.org/web/20090613045440/http://www.charite.de/bioinf/strap/exampleOutput.html http://3d-alignment.eu/
JBrowse A fast, embeddable genome browser built completely with JavaScript and HTML5 JBrowse.org
ZENBU A collaborative omics data integration and interactive visualization system http://fantom.gsc.riken.jp/zenbu/

Validation

The modENCODE project hosts an online GFF3 validation tool with generous limits of 286.10 MB and 15 million lines.

The Genome Tools software collection contains a gff3validator tool that can be used offline to validate and possibly tidy GFF3 files. An online validation service is also available.
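
For a sense of what the simplest structural checks look like, the sketch below tests a single feature line against the column rules described in this article. It is far less thorough than the dedicated validators mentioned above (it does not check Sequence Ontology types or Parent relationships), and the function name `check_feature_line` is a hypothetical helper, not part of those tools.

```python
def check_feature_line(line):
    """Return a list of problems found in one tab-separated feature line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 fields, found {len(fields)}"]
    _seqid, _source, ftype, start, end, _score, strand, phase, _attrs = fields
    problems = []
    if not (start.isdigit() and end.isdigit()):
        problems.append("start and end must be integers")
    else:
        if int(start) < 1:
            problems.append("coordinates are 1-based")
        if int(start) > int(end):
            problems.append("start is greater than end")
    if strand not in {"+", "-", ".", "?"}:
        problems.append(f"unexpected strand {strand!r}")
    if phase not in {"0", "1", "2", "."}:
        problems.append(f"unexpected phase {phase!r}")
    elif ftype == "CDS" and phase == ".":
        problems.append("CDS features need a phase of 0, 1 or 2")
    return problems


# Hypothetical lines used only for illustration.
good = "ctg123\t.\tCDS\t1201\t1500\t.\t+\t0\tID=cds00001"
bad = "ctg123\t.\tCDS\t1500\t1201\t.\tx\t.\tID=cds00001"
```

An empty list means no problem was detected by these particular checks, not that the file is valid GFF3.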

References

  1. "GFF/GTF File Format". Ensembl. Archived from the original on 2022-06-15. Retrieved 2023-11-04.
  2. "GFF3 specification". GitHub. 2018-11-24. Archived from the original on 2023-07-04.
  3. "GFF3". GMOD. 2016-07-12. Archived from the original on 2023-08-25.