Pileup format: Difference between revisions

Pileup
Filename extensions	.msf, .pup, .pileup
Developed by	Tony Cox and Zemin Ning
Type of format	Bioinformatics
Extended from	Tab separated values
Website	www.htslib.org/doc/samtools-mpileup.html

Latest revision as of 02:25, 27 December 2023

Pileup format is a text-based format for summarizing the base calls of reads aligned to a reference sequence. The format facilitates visual inspection of alignments and of SNP/indel calls. It was first used by Tony Cox and Zemin Ning at the Wellcome Trust Sanger Institute, and became widely known through its implementation within the SAMtools software suite.^[1]

Format

Example

Sequence	Position	Reference Base	Read Count	Read Results	Quality
seq1	272	T	24	,.$.....,,.,.,...,,,.,..^+.	`<<<+;<<<<<<<<<<<=<;<;7<&`
seq1	273	T	23	,.....,,.,.,...,,,.,..A	`<<<;<<<<<<<<<3<=<<<;<<+`
seq1	274	T	23	,.$....,,.,.,...,,,.,...	`7<7;<;<<<<<<<<<=<;<;<<6`
seq1	275	A	23	,$....,,.,.,...,,,.,...^l.	`<+;9*<<<<<<<<<=<<:;<<<<`
seq1	276	G	22	...T,,.,.,...,,,.,....	`33;+<<7=7<<7<&<<1;<<6<`
seq1	277	T	22	....,,.,.,.C.,,,.,..G.	`+7<;<<<<<<<&<=<<:;<<&<`
seq1	278	G	23	....,,.,.,...,,,.,....^k.	`%38*<<;<7<<7<=<<<;<<<<<`
seq1	279	C	23	A..T,,.,.,...,,,.,.....	`75&<<<<<<<<<=<<<9<<:<<<`

The columns

Each line consists of 5 (or optionally 6) tab-separated columns:

1. Sequence identifier
2. Position in sequence (starting from 1)
3. Reference nucleotide at that position
4. Number of aligned reads covering that position (depth of coverage)
5. Bases at that position from aligned reads
6. Phred quality of those bases, encoded as ASCII characters with an offset of 33, i.e. character code minus 33 gives the quality (OPTIONAL)
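The column layout above can be illustrated with a minimal Python sketch that splits one line into named fields. The names `PileupRecord` and `parse_pileup_line` are hypothetical and chosen only for this illustration; they are not part of SAMtools or any other library.

```python
from typing import NamedTuple, Optional


class PileupRecord(NamedTuple):
    """One pileup line; field names are illustrative, not a standard."""
    seq_id: str
    pos: int                # 1-based position (column 2)
    ref_base: str           # column 3
    depth: int              # column 4
    bases: str              # column 5, still encoded
    quals: Optional[str]    # column 6, None when the column is absent


def parse_pileup_line(line: str) -> PileupRecord:
    """Split a tab-separated pileup line into its 5 or 6 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    seq_id, pos, ref_base, depth, bases = fields[:5]
    quals = fields[5] if len(fields) > 5 else None
    return PileupRecord(seq_id, int(pos), ref_base, int(depth), bases, quals)


record = parse_pileup_line(
    "seq1\t276\tG\t22\t...T,,.,.,...,,,.,....\t33;+<<7=7<<7<&<<1;<<6<\n"
)
print(record.seq_id, record.pos, record.ref_base, record.depth)
```

The sketch leaves columns 5 and 6 as raw strings, since decoding them requires the rules described in the following sections.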

Column 5: The bases string

. (dot) means a base that matched the reference on the forward strand
, (comma) means a base that matched the reference on the reverse strand
< or > (less-than or greater-than sign) denotes a reference skip. This occurs, for example, if a base in the reference genome is intronic and a read maps to two flanking exons. If quality scores are given in a sixth column, the score at such a position refers to the quality of the read and not to a specific base.
AGTCN (upper case) denotes a base that did not match the reference on the forward strand
agtcn (lower case) denotes a base that did not match the reference on the reverse strand
A sequence matching the regular expression \+[0-9]+[ACGTNacgtn]+ denotes an insertion of one or more bases starting from the next position. The integer gives the number of inserted bases that follow. For example, +2AG means an insertion of AG on the forward strand.
A sequence matching the regular expression \-[0-9]+[ACGTNacgtn]+ denotes a deletion of one or more bases starting from the next position. The integer gives the number of deleted bases that follow. For example, -2ct means a deletion of CT on the reverse strand.
^ (caret) marks the start of a read segment; the ASCII code of the character following the caret, minus 33, gives the mapping quality
$ (dollar) marks the end of a read segment
* (asterisk) is a placeholder for a deleted base in a multiple-base-pair deletion that was announced on a previous line by the -[0-9]+[ACGTNacgtn]+ notation
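The symbols listed above can be walked through with a short Python sketch that tallies what a bases string contains. The function name `summarize_bases` and the counter keys are assumptions made for this illustration. Two details matter: the character after `^` is a mapping quality and must be skipped even if it looks like another symbol, and the integer after `+` or `-` decides how many letters belong to the indel.

```python
from collections import Counter

DIGITS = "0123456789"


def summarize_bases(bases: str) -> Counter:
    """Tally the symbols in a pileup bases string (column 5)."""
    counts = Counter()
    i, n = 0, len(bases)
    while i < n:
        c = bases[i]
        if c == "^":
            # The next character encodes mapping quality; skip both.
            counts["read_start"] += 1
            i += 2
        elif c == "$":
            counts["read_end"] += 1
            i += 1
        elif c in "+-":
            # Read the length, then skip exactly that many indel bases.
            j = i + 1
            while j < n and bases[j] in DIGITS:
                j += 1
            length = int(bases[i + 1:j])
            counts["insertion" if c == "+" else "deletion"] += 1
            i = j + length
        elif c in ".,":
            counts["match"] += 1
            i += 1
        elif c in "<>":
            counts["ref_skip"] += 1
            i += 1
        elif c == "*":
            counts["deleted_base"] += 1
            i += 1
        elif c in "ACGTNacgtn":
            counts["mismatch"] += 1
            i += 1
        else:
            raise ValueError(f"unexpected character {c!r} at index {i}")
    return counts


# Position 272 of the example: the '+' after '^' is a mapping quality,
# not the start of an insertion.
print(summarize_bases(",.$.....,,.,.,...,,,.,..^+."))
```

Reading the indel length from the integer, rather than matching letters greedily with the regular expression, keeps a mismatch that directly follows an indel from being absorbed into it.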

Column 6: The base quality string

This is an optional column. If present, the ASCII value of each character minus 33 gives the Phred base quality of the corresponding base in column 5. This is the same encoding used for quality strings in the FASTQ format.
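As a small illustration of the offset-33 encoding, the following Python sketch decodes a quality string into integer Phred scores and converts a score to an error probability using the standard Phred definition Q = -10 log10(P). The function names are hypothetical.

```python
from typing import List


def decode_phred33(quals: str) -> List[int]:
    """Convert each quality character to its Phred score (ASCII code - 33)."""
    return [ord(ch) - 33 for ch in quals]


def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(P) to get the estimated error probability."""
    return 10 ** (-q / 10)


scores = decode_phred33("<+&")
print(scores)                        # '<' is 60, '+' is 43, '&' is 38
print(error_probability(scores[1]))
```

The same subtraction applies to the character following `^` in column 5, which encodes a mapping quality rather than a base quality.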

File extension

There is no standard file extension for a Pileup file, but .msf (multiple sequence file), .pup^[2] and .pileup^[3]^[4] are used.

References

[1] Li H.; Handsaker B.; Wysoker A.; Fennell T.; Ruan J.; Homer N.; Marth G.; Abecasis G.; Durbin R.; 1000 Genome Project Data Processing Subgroup (2009). "The Sequence alignment/map (SAM) format and SAMtools". Bioinformatics. 25 (16): 2078–2079. doi:10.1093/bioinformatics/btp352. PMC 2723002. PMID 19505943.
[2] Accelrys (1998-10-02). "QUANTA: Protein Design. 3. Reading and Writing Sequence Data Files". Université de Montréal. Retrieved 2020-03-27.
[3] Glez-Peña, Daniel; Gómez-López, Gonzalo; Reboiro-Jato, Miguel; Fdez-Riverola, Florentino; Pisano, David G (2011-01-24). "PileLine: a toolbox to handle genome position information in next-generation sequencing studies". BMC Bioinformatics. 12: 31. doi:10.1186/1471-2105-12-31. ISSN 1471-2105. PMC 3037855. PMID 21261974.
[4] Chisom, Halimat (2023-03-31). "File Formats Every Bioinformatician — Established or Upcoming — Must Know (and then some)". Medium. Retrieved 2023-11-11.


@@ Line 1: / Line 1: @@
+{{Short description|File format for sequence data}}
+{{Infobox file format
+| name          = Pileup
+| extensions    = .msf, .pup, .pileup
+| developer     = Tony Cox and Zemin Ning
+| type          = [[Bioinformatics]]
+| extended_from = [[Tab separated values]]
+| url           = {{URL|http://www.htslib.org/doc/samtools-mpileup.html}}
+}}
 '''Pileup format''' is a text-based [[File format|format]] for summarizing the base calls of aligned reads to a reference sequence. This format facilitates visual display of [[Single-nucleotide polymorphism|SNP]]/indel calling and alignment. It was first used by
-Tony Cox and Zemin Ning at the [[Wellcome Trust Sanger Institute]], but became widely known through its implementation within the [[SAMtools]] software suite.
+Tony Cox and Zemin Ning at the [[Wellcome Trust Sanger Institute]], and became widely known through its implementation within the [[SAMtools]] software suite.
-<ref name="Li et al 2009">
+<ref name="Li et al 2009">{{cite journal
+|doi=10.1093/bioinformatics/btp352 |date=2009 |journal = Bioinformatics |volume=25 |pages=2078–2079 |title=The Sequence alignment/map (SAM) format and SAMtools
-Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) ''The Sequence alignment/map (SAM) format and SAMtools''. '''Bioinformatics''', 25:2078-9. [http://www.ncbi.nlm.nih.gov/pubmed/19505943 PubMed]
+|author1=Li H. |author2= Handsaker B. |author3= Wysoker A. |author4= Fennell T. |author5= Ruan J.
-</ref>
+|author6= Homer N. |author7=Marth G. |author8= Abecasis G. |author9= Durbin R | author10= 1000 Genome Project Data Processing Subgroup (2009)|issue=16 |pmid=19505943 |pmc=2723002 }}</ref>
 ==Format==
 ===Example===
+{| class="wikitable"
-<pre>seq1 272 T 24  ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
+|-
-seq1 273 T 23  ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
+! Sequence !! Position !! Reference Base !! Read Count !! Read Results !! Quality
-seq1 274 T 23  ,.$....,,.,.,...,,,.,...    7<7;<;<<<<<<<<<=<;<;<<6
+|- style="font-family: monospace;"
-seq1 275 A 23  ,$....,,.,.,...,,,.,...^l.  <+;9*<<<<<<<<<=<<:;<<<<
-seq1 276 G 22  ...T,,.,.,...,,,.,....  33;+<<7=7<<7<&<<1;<<6<
+| seq1 || 272 || T || 24|| ,.$.....,,.,.,...,,,.,..^+. || {{code|2=bf|1=<<<+;<<<<<<<<<<<=<;<;7<&}}
+|- style="font-family: monospace;"
-seq1 277 T 22  ....,,.,.,.C.,,,.,..G.  +7<;<<<<<<<&<=<<:;<<&<
-seq1 278 G 23  ....,,.,.,...,,,.,....^k.   %38*<<;<7<<7<=<<<;<<<<<
+| seq1 || 273 || T || 23 || ,.....,,.,.,...,,,.,..A ||  {{code|2=bf|1=<<<;<<<<<<<<<3<=<<<;<<+}}
+|- style="font-family: monospace;"
-seq1 279 C 23  A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<<
+| seq1 || 274 || T || 23 || ,.$....,,.,.,...,,,.,... || {{code|2=bf|1=7<7;<;<<<<<<<<<=<;<;<<6}}
-</pre>
+|- style="font-family: monospace;"
+| seq1 || 275 || A|| 23 || ,$....,,.,.,...,,,.,...^l. || {{code|2=bf|1=<+;9*<<<<<<<<<=<<:;<<<<}}
+|- style="font-family: monospace;"
+| seq1 || 276 || G || 22 || ...T,,.,.,...,,,.,.... || {{code|2=bf|1=33;+<<7=7<<7<&<<1;<<6<}}
+|- style="font-family: monospace;"
+| seq1 || 277 || T || 22 || ....,,.,.,.C.,,,.,..G. || {{code|2=bf|1=+7<;<<<<<<<&<=<<:;<<&<}}
+|- style="font-family: monospace;"
+| seq1 || 278 || G || 23 || ....,,.,.,...,,,.,....^k. || {{code|2=bf|1=%38*<<;<7<<7<=<<<;<<<<<}}
+|- style="font-family: monospace;"
+| seq1 || 279 || C || 23 || A..T,,.,.,...,,,.,..... || {{code|2=bf|1=75&<<<<<<<<<=<<<9<<:<<<}}
+|}
 ===The columns===
@@ Line 25: / Line 47: @@
 #Number of aligned reads covering that position (depth of coverage)
 #Bases at that position from aligned reads
-#quality of those bases (OPTIONAL)
+#Phred Quality of those bases, represented in ASCII with -33 offset (OPTIONAL)
 ===Column 5: The bases string===
 *. (dot) means a base that matched the reference on the forward strand
 *, (comma) means a base that matched the reference on the reverse strand
+*</> (less-/greater-than sign) denotes a reference skip. This occurs, for example, if a base in the reference genome is intronic and a read maps to two flanking exons. If quality scores are given in a [[Pileup_format#Column_6:_The_base_quality_string|sixth column]], they refer to the quality of the read and not the specific base.
-*AGTCN denotes a base that did not match the reference on the forward strand
-*agtcn denotes a base that did not match the reference on the reverse strand
+*AGTCN (upper case) denotes a base that did not match the reference on the forward strand
+*agtcn (lower case) denotes a base that did not match the reference on the reverse strand
-*A sequence matching the [[regular expression]] \+[0-9]+[ACGTNacgtn]+ denotes an insertion of one or more bases starting from the next position
-*A sequence matching the regular expression -[0-9]+[ACGTNacgtn]+ denotes a deletion of one or more bases starting from the next position
+*A sequence matching the [[regular expression]] {{code|2=ragel|\+[0-9]+[ACGTNacgtn]+}} denotes an insertion of one or more bases starting from the next position. For example, +2AG means insertion of AG in the forward strand
+*A sequence matching the regular expression {{code|2=ragel|\-[0-9]+[ACGTNacgtn]+}} denotes a deletion of one or more bases starting from the next position. For example, -2ct means deletion of CT in the reverse strand
 *^ (caret) marks the start of a read segment and the ASCII of the character following `^' minus 33 gives the mapping quality
 *$ (dollar) marks the end of a read segment
-* * (asterisk) is a placeholder for a deleted base in a multiple basepair deletion that was mentioned in a previous line by the -[0-9]+[ACGTNacgtn]+ notation
+* * (asterisk) is a placeholder for a deleted base in a multiple basepair deletion that was mentioned in a previous line by the {{code|2=ragel|-[0-9]+[ACGTNacgtn]+}} notation
-*< (less-than sign) reference skip
-*> (greater-than sign) reference skip
 ===Column 6: The base quality string===
@@ Line 44: / Line 65: @@
 ==File extension==
+There is no standard [[file extension]] for a Pileup file, but .msf (multiple sequence file), .pup<ref>{{cite web |url=http://www.esi.umontreal.ca/accelrys/life/quanta2K/protein/03_Sequence_data_files.html |title=QUANTA: Protein Design. 3. Reading and Writing Sequence Data Files |author=[[Accelrys]] |date=1998-10-02 |publisher=[[Université de Montréal]] |access-date=2020-03-27}}</ref> and .pileup<ref>{{Cite journal |last1=Glez-Peña |first1=Daniel |last2=Gómez-López |first2=Gonzalo |last3=Reboiro-Jato |first3=Miguel |last4=Fdez-Riverola |first4=Florentino |last5=Pisano |first5=David G |date=2011-01-24 |title=PileLine: a toolbox to handle genome position information in next-generation sequencing studies |journal=BMC Bioinformatics |volume=12 |pages=31 |doi=10.1186/1471-2105-12-31 |issn=1471-2105 |pmc=3037855 |pmid=21261974 |doi-access=free }}</ref><ref>{{Cite web |last=Chisom |first=Halimat |date=2023-03-31 |title=File Formats Every Bioinformatician — Established or Upcoming — Must Know (and then some) |url=https://medium.com/@gearthdexter/bioinformatics-file-formats-3919a26b7679 |access-date=2023-11-11 |website=Medium |language=en}}</ref> are used.
-There is no standard [[file extension]] for a Pileup file, but .pileup is commonly used.
 ==See also==
@@ Line 58: / Line 79: @@
 *[https://github.com/wwood/bioruby-pileup_iterator bioruby-pileup_iterator (A Ruby pileup parser)]
 *[http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/usage.html pysam (A Python pileup parser)]
+{{Bioinformatics}}
 [[Category:Bioinformatics]]
