Pileup format: Difference between revisions

Revision as of 10:20, 9 January 2016

Pileup format is a text-based format for summarizing the base calls of aligned reads to a reference sequence. This format facilitates visual display of SNP/indel calling and alignment. It was first used by Tony Cox and Zemin Ning at the Wellcome Trust Sanger Institute, but became widely known through its implementation within the SAMtools software suite.[1]

Format

Example

seq1 272 T 24  ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23  ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23  ,.$....,,.,.,...,,,.,...    7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23  ,$....,,.,.,...,,,.,...^l.  <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22  ...T,,.,.,...,,,.,....  33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22  ....,,.,.,.C.,,,.,..G.  +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23  ....,,.,.,...,,,.,....^k.   %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23  A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<<

The columns

Each line consists of 5 (or optionally 6) tab-separated columns:

Sequence identifier
Position in sequence (starting from 1)
Reference nucleotide at that position
Number of aligned reads covering that position (depth of coverage)
Bases at that position from aligned reads
Quality of those bases (optional)
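
The column layout above can be sketched as a small parser. This is a minimal illustration, not part of SAMtools: the names `PileupRecord` and `parse_pileup_line` are hypothetical, and the sketch assumes that each line carries exactly 5 or 6 tab-separated fields as described above.

```python
from typing import NamedTuple, Optional


class PileupRecord(NamedTuple):
    """One pileup line (hypothetical container, named for illustration)."""
    seq_id: str           # column 1: sequence identifier
    pos: int              # column 2: 1-based position
    ref_base: str         # column 3: reference nucleotide
    depth: int            # column 4: number of reads covering the position
    bases: str            # column 5: bases string
    quals: Optional[str]  # column 6: base qualities, or None if absent


def parse_pileup_line(line: str) -> PileupRecord:
    """Split one tab-separated pileup line into typed fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}")
    seq_id, pos, ref_base, depth, bases = fields[:5]
    quals = fields[5] if len(fields) == 6 else None
    return PileupRecord(seq_id, int(pos), ref_base, int(depth), bases, quals)


# Usage with one row of the example above, written with explicit tabs.
record = parse_pileup_line(
    "seq1\t276\tG\t22\t...T,,.,.,...,,,.,....\t33;+<<7=7<<7<&<<1;<<6<"
)
print(record.seq_id, record.pos, record.ref_base, record.depth)
```

Splitting on the tab character only, rather than on any whitespace, keeps the bases and quality strings intact whatever characters they contain.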

Column 5: The bases string

. (dot) means a base that matched the reference on the forward strand
, (comma) means a base that matched the reference on the reverse strand
AGTCN denotes a base that did not match the reference on the forward strand
agtcn denotes a base that did not match the reference on the reverse strand
A sequence matching the regular expression \+[0-9]+[ACGTNacgtn]+ denotes an insertion of one or more bases starting from the next position
A sequence matching the regular expression -[0-9]+[ACGTNacgtn]+ denotes a deletion of one or more bases starting from the next position
^ (caret) marks the start of a read segment and the ASCII of the character following `^' minus 33 gives the mapping quality
$ (dollar) marks the end of a read segment
* (asterisk) is a placeholder for a deleted base in a multiple basepair deletion that was mentioned in a previous line by the -[0-9]+[ACGTNacgtn]+ notation
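< (less-than sign) denotes a reference skip (CIGAR "N" operation) on the reverse strand
> (greater-than sign) denotes a reference skip (CIGAR "N" operation) on the forward strand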
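
The symbols listed above can be separated with a single left-to-right scan. The sketch below is an illustration only, and the function name `tokenize_bases` and the token labels are hypothetical. It relies on two points: the character after ^ is a mapping quality, not a base, so it must be consumed together with the caret, and the integer after + or - gives the number of inserted or deleted bases, so exactly that many letters belong to the indel.

```python
import re
from typing import Iterator, Tuple

# Sign followed by the indel length; the bases themselves are sliced by length.
_INDEL_LEN = re.compile(r"[+-](\d+)")


def tokenize_bases(bases: str) -> Iterator[Tuple[str, str]]:
    """Yield (kind, text) tokens from a pileup column-5 string."""
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # Caret plus one mapping-quality character; neither is a base call.
            if i + 1 >= len(bases):
                raise ValueError("'^' is not followed by a mapping-quality character")
            yield ("read_start", bases[i:i + 2])
            i += 2
        elif ch == "$":
            yield ("read_end", ch)
            i += 1
        elif ch in "+-":
            m = _INDEL_LEN.match(bases, i)
            if m is None:
                raise ValueError(f"indel at offset {i} has no length")
            end = m.end() + int(m.group(1))
            if end > len(bases):
                raise ValueError(f"indel at offset {i} is truncated")
            yield ("insertion" if ch == "+" else "deletion", bases[i:end])
            i = end
        elif ch in ".,":
            yield ("match", ch)
            i += 1
        elif ch in "ACGTNacgtn":
            yield ("mismatch", ch)
            i += 1
        elif ch == "*":
            yield ("deleted", ch)
            i += 1
        elif ch in "<>":
            yield ("ref_skip", ch)
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")


# Usage: the '+' after '^' is a mapping quality, not the start of an insertion.
for kind, text in tokenize_bases(",.$^+.+2AGT"):
    print(kind, text)
```

Using the stated length rather than a greedy match on [ACGTNacgtn]+ matters when a mismatch follows an indel: in `+2AGT` only `AG` is inserted, and `T` is a separate mismatch from the next read.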

Column 6: The base quality string

This is an optional column. If present, the ASCII value of each character minus 33 gives the Phred-scaled base quality of the corresponding base in column 5. This is the same encoding as the Phred+33 quality encoding used in the FASTQ format.
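
The decoding rule can be written out directly. The sketch below is illustrative: `decode_qualities` and `error_probability` are hypothetical names, the offset of 33 comes from the description above, and the conversion to an error probability uses the standard Phred definition Q = -10 log10(p).

```python
from typing import List


def decode_qualities(quals: str, offset: int = 33) -> List[int]:
    """Convert a quality string to Phred scores (ASCII value minus offset)."""
    scores = [ord(ch) - offset for ch in quals]
    if any(score < 0 for score in scores):
        raise ValueError("quality string contains a character below the offset")
    return scores


def error_probability(phred: int) -> float:
    """Probability that a base call is wrong, from Q = -10 * log10(p)."""
    return 10 ** (-phred / 10)


# Usage: '<' is ASCII 60, '+' is ASCII 43 and '&' is ASCII 38.
print(decode_qualities("<+&"))  # [27, 10, 5]
print(error_probability(10))
```

The same subtraction applies to the mapping-quality character that follows ^ in column 5.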

File extension

There is no standard file extension for a Pileup file, but .pileup is commonly used.

References

[1] Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genomes Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25:2078-9.


@@ Line 32: / Line 32: @@
 *AGTCN denotes a base that did not match the reference on the forward strand
 *agtcn denotes a base that did not match the reference on the reverse strand
-*A sequence matching the [[regular expression]] \+[0-9]+[ACGTNacgtn]+ denotes an insertion of one or more bases
+*A sequence matching the [[regular expression]] \+[0-9]+[ACGTNacgtn]+ denotes an insertion of one or more bases starting from the next position
-*A sequence matching the regular expression -[0-9]+[ACGTNacgtn]+ denotes a deletion of one or more bases
+*A sequence matching the regular expression -[0-9]+[ACGTNacgtn]+ denotes a deletion of one or more bases starting from the next position
 *^ (caret) marks the start of a read segment and the ASCII of the character following `^' minus 33 gives the mapping quality
 *$ (dollar) marks the end of a read segment