Chemical file format: Difference between revisions

From Wikipedia, the free encyclopedia
m link to NYU Library of 3-D Molecular Structures
m v2.05b - Bot T20 CW#61 - Fix errors for CW project (Reference before punctuation)
 
(36 intermediate revisions by 26 users not shown)
Line 1: Line 1:
{{Short description|File format that stores chemical formulae and structures}}
This article discusses some common '''molecular file formats''', including usage and converting between them.
{{lead too short|date=August 2022}}

A '''chemical file format''' is a type of data file which is used specifically for depicting molecular data. One of the most widely used is the [[chemical table file]] format, which is similar to [[Chemical table file#SDF|''Structure Data Format'' (SDF)]] files. They are text files that represent multiple chemical structure records and associated data fields. The [[XYZ file format]] is a simple format that usually gives the number of atoms in the first line, a comment on the second, followed by a number of lines with atomic symbols (or atomic numbers) and cartesian coordinates. The [[Protein Data Bank (file format)|Protein Data Bank Format]] is commonly used for proteins but is also used for other types of molecules. There are many other types which are detailed below. Various software systems are available to convert from one format to another.
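The XYZ layout described above (atom count, comment line, then one line per atom) can be read with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name <code>parse_xyz</code> and the sample coordinates are illustrative assumptions, and only single-frame files with element symbols in the first column are handled.

```python
from typing import List, Tuple

Atom = Tuple[str, float, float, float]


def parse_xyz(text: str) -> Tuple[str, List[Atom]]:
    """Parse a single-frame XYZ string into (comment, atoms)."""
    lines = text.splitlines()
    n_atoms = int(lines[0].split()[0])             # line 1: number of atoms
    comment = lines[1] if len(lines) > 1 else ""   # line 2: free-form comment
    atoms: List[Atom] = []
    for line in lines[2:2 + n_atoms]:              # one line per atom
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError("atom count does not match the header line")
    return comment, atoms


# Illustrative input only; the coordinates are not taken from any database.
WATER_XYZ = """3
water (illustrative coordinates)
O 0.000 0.000 0.117
H 0.000 0.757 -0.471
H 0.000 -0.757 -0.471
"""

comment, atoms = parse_xyz(WATER_XYZ)
print(comment, len(atoms))
```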


==Distinguishing formats==
==Distinguishing formats==
Chemical information is usually provided as [[Computer file|files]] or [[Stream (computing)|streams]] and many formats have been created, with varying degrees of documentation. The format is indicated in three ways (see chemical MIME section)
Chemical information is usually provided as [[Computer file|files]] or [[Stream (computing)|streams]] and many formats have been created, with varying degrees of documentation. The format is indicated in three ways:<br>(see {{slink||The Chemical MIME Project}})
* ''file extension'' (usually 3 letters). This is widely used, but fragile as common suffixes such as ".mol" and ".dat" are used by many systems, including non-chemical ones.
* ''file extension'' (usually 3 letters). This is widely used, but fragile as common suffixes such as ''<code>.mol</code>'' and ''<code>.dat</code>'' are used by many systems, including non-chemical ones.
* ''self-describing files'' where the format information is included in the file. Examples are CIF and CML.
* ''self-describing files'' where the format information is included in the file. Examples are CIF and CML.
* ''chemical/MIME type'' added by a chemically-aware server.
* ''chemical/MIME type'' added by a chemically aware server.
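The first two mechanisms can be combined in a small heuristic, sketched below in Python. The extension table, the function name <code>guess_format</code>, and the content checks (a <code>data_</code> block header for CIF, a <code>cml</code> element or the CML schema namespace for CML) are simplifying assumptions made for illustration; a robust detector would need a real parser for each format.

```python
import os

# Hypothetical, deliberately tiny extension table used only for this sketch.
EXTENSION_HINTS = {
    ".cml": "CML",
    ".cif": "CIF",
    ".sdf": "SDF",
    ".pdb": "PDB",
    ".xyz": "XYZ",
}


def guess_format(filename: str, head: str = "") -> str:
    """Guess a chemical file format from file content, then from the extension."""
    # Self-describing content is checked first because extensions are fragile.
    if "<cml" in head or "xml-cml.org/schema" in head:
        return "CML"
    if any(line.startswith("data_") for line in head.splitlines()):
        return "CIF"
    # Fall back to the (ambiguous) file extension.
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_HINTS.get(extension, "unknown")


print(guess_format("structure.dat", "data_example\n_cell_length_a 5.0"))
```

A generic suffix such as <code>.dat</code> yields no answer on its own, which is why the content check runs before the extension lookup.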


== Chemical Markup Language ==
== Chemical Markup Language ==
Line 11: Line 14:
[[Chemical Markup Language]] (CML) is an open standard for representing molecular and other chemical data. The open source project includes XML Schema, source code for parsing and working with CML data, and an active community. The articles Tools for Working with Chemical Markup Language and XML for Chemistry and Biosciences discuss CML in more detail. CML data files are accepted by many tools, including [[JChemPaint]], [[Jmol]], [[XDrawChem]] and MarvinView.
[[Chemical Markup Language]] (CML) is an open standard for representing molecular and other chemical data. The open source project includes XML Schema, source code for parsing and working with CML data, and an active community. The articles Tools for Working with Chemical Markup Language and XML for Chemistry and Biosciences discuss CML in more detail. CML data files are accepted by many tools, including [[JChemPaint]], [[Jmol]], [[XDrawChem]] and MarvinView.


== Protein Data Bank Format ==
==Protein Data Bank Format==
The [[Protein Data Bank (file format)|Protein Data Bank Format]] is an obsolete format for protein structures developed in 1972.<ref>{{Cite web |last=wwPDB.org |title=wwPDB: File Format |url=https://www.wwpdb.org/documentation/file-format |access-date=2024-06-13 |website=www.wwpdb.org |language=en}}</ref> It is a [[Flat-file database#Fixed-width_formats|fixed-width format]] and thus limited to a maximum number of atoms, residues, and chains; this resulted in splitting very large structures such as [[ribosome|ribosomes]] into multiple files. For example, the E. coli 70S was represented as 4 PDB files in 2009: [http://www.rcsb.org/pdb/explore/obsolete.do?obsoleteId=3I1M 3I1M] {{Webarchive|url=https://web.archive.org/web/20161005064735/http://www.rcsb.org/pdb/explore/obsolete.do?obsoleteId=3I1M |date=2016-10-05 }}, [http://www.rcsb.org/pdb/explore/obsolete.do?obsoleteId=3I1N 3I1N] {{Webarchive|url=https://web.archive.org/web/20161016204903/http://www.rcsb.org/pdb/explore/obsolete.do?obsoleteId=3I1N |date=2016-10-16 }}, 3I1O, and 3I1P. In 2014, they were consolidated into a single file, [http://www.rcsb.org/pdb/explore.do?structureId=4V6C 4V6C].
In 2014, the PDB format was officially replaced with mmCIF, and newer PDB structures may not have PDB files available.
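Because the format is fixed-width, each field of an <code>ATOM</code> record is found by column position rather than by splitting on whitespace. The Python sketch below slices a subset of the fields using the column ranges from the wwPDB format description; the function name and the sample values are illustrative assumptions. The atom serial number occupies only five columns (7 to 11), which is one source of the size limits mentioned above.

```python
# (field name, first column, last column), 1-based and inclusive.
ATOM_COLUMNS = [
    ("record", 1, 6),
    ("serial", 7, 11),
    ("name", 13, 16),
    ("res_name", 18, 20),
    ("chain_id", 22, 22),
    ("res_seq", 23, 26),
    ("x", 31, 38),
    ("y", 39, 46),
    ("z", 47, 54),
]


def parse_atom_line(line: str) -> dict:
    """Slice one ATOM/HETATM record by column position."""
    padded = line.ljust(80)  # records are nominally 80 columns wide
    raw = {name: padded[first - 1:last].strip()
           for name, first, last in ATOM_COLUMNS}
    for key in ("serial", "res_seq"):
        raw[key] = int(raw[key])
    for key in ("x", "y", "z"):
        raw[key] = float(raw[key])
    return raw


# Illustrative record assembled with explicit field widths (values are made up).
SAMPLE_LINE = "{:<6}{:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
    "ATOM", 1, " N", "MET", "A", 1, 27.340, 24.430, 2.614)

atom = parse_atom_line(SAMPLE_LINE)
print(atom["name"], atom["res_name"], atom["x"])
```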


Some PDB files contained an optional section describing atom connectivity as well as position. Because these files were sometimes used to describe macromolecular assemblies or molecules represented in [[Molecular mechanics#Environment and solvation|explicit solvent]], they could grow very large and were often compressed. Some tools, such as Jmol and KiNG,<ref>{{cite journal|author=Chen, V.B.|year=2009|title= KING (Kinemage, Next Generation): A versatile interactive molecular and scientific visualization program | journal=Protein Science|pmid=19768809|volume=18|issue=11|pmc=2788294|pages=2403–2409|doi= 10.1002/pro.250|display-authors=etal}}</ref> could read PDB files in gzipped format. The wwPDB maintained the specifications of the PDB file format and its XML alternative, PDBML. There was a fairly major change in PDB format specification (to version 3.0) in August 2007, and a remediation of many file problems in the existing database.<ref>{{cite journal|author=Henrick, K.|year=2008|title=Remediation of the protein data bank archive|journal=Nucleic Acids Research|pmid=18073189|volume=36|issue=Database issue|pmc=2238854|pages=D426–D433|doi=10.1093/nar/gkm937|display-authors=etal}}</ref> The typical file extension for a PDB file was ''<code>.pdb</code>'', although some older files used ''<code>.ent</code>'' or ''<code>.brk</code>''. Some molecular modeling tools wrote nonstandard PDB-style files that adapted the basic format to their own needs.
The [[Protein Data Bank (file format)|Protein Data Bank Format]] is commonly used for proteins but it can be used for other types of molecules as well. It was originally designed as, and continues to be, a fixed-column-width format and thus officially has a built-in maximum number of atoms, of residues, and of chains; this resulted in splitting very large structures such as ribosomes into multiple files. However, many tools can read files that exceed those limits. For example, the E. coli 70S [[ribosome]] was represented as 4 PDB files in 2009: [http://www.rcsb.org/pdb/explore/obsolete.do?obsoleteId=3I1M 3I1M], [http://www.rcsb.org/pdb/explore/obsolete.do?obsoleteId=3I1N 3I1N], 3I1O and 3I1P. In 2014 they were consolidated into a single file, [http://www.rcsb.org/pdb/explore.do?structureId=4V6C 4V6C].


==GROMACS format==
Some PDB files contain an optional section describing atom connectivity as well as position. Because these files are sometimes used to describe macromolecular assemblies or molecules represented in [[Molecular mechanics#Environment and solvation|explicit solvent]], they can grow very large and are often compressed. Some tools, such as Jmol and KiNG,<ref>{{cite journal|author=Chen, V.B.|year=2009|title= KING (Kinemage, Next Generation): A versatile interactive molecular and scientific visualization program | journal=Protein Science|pmid=19768809|volume=18|issue=11|pmc=2788294|pages=2403–2409|doi= 10.1002/pro.250|display-authors=etal}}</ref> can read PDB files in gzipped format. The wwPDB maintains the specifications of the PDB file format and its XML alternative, PDBML. There was a fairly major change in PDB format specification (to version 3.0) in August 2007, and a remediation of many file problems in the existing database.<ref>{{cite journal|author=Henrick, K.|year=2008|title=Remediation of the protein data bank archive|journal=Nucleic Acids Research|pmid=18073189|volume=36|issue=Database issue|pmc=2238854|pages=D426–D433|doi=10.1093/nar/gkm937|display-authors=etal}}</ref> The typical file extension for a PDB file is ''.pdb'', although some older files use ''.ent'' or ''.brk''. Some molecular modeling tools write nonstandard PDB-style files that adapt the basic format to their own needs.
The GROMACS file format family was created for use with the molecular simulation software package [[GROMACS]]. It closely resembles the PDB format but was designed for storing output from [[molecular dynamics]] simulations, so it allows for additional numerical precision and optionally retains information about particle [[velocity]] as well as position at a given point in the simulation trajectory. It does not allow for the storage of connectivity information, which in GROMACS is obtained from separate molecule and system topology files. The typical file extension for a GROMACS file is ''<code>.gro</code>''.
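An atom line in a <code>.gro</code> file can be sliced in the same fixed-width manner. The sketch below assumes the default layout documented for GROMACS (residue number, residue name, atom name and atom number in five columns each, positions as <code>%8.3f</code>, optional velocities as <code>%8.4f</code>); it does not handle files written with higher precision, and the function name and sample values are illustrative.

```python
def parse_gro_atom(line: str) -> dict:
    """Parse one atom line of a .gro file written with default precision."""
    atom = {
        "res_num": int(line[0:5]),
        "res_name": line[5:10].strip(),
        "atom_name": line[10:15].strip(),
        "atom_num": int(line[15:20]),
        # Three position fields of width 8 (nanometres in GROMACS).
        "position": tuple(float(line[20 + 8 * i:28 + 8 * i]) for i in range(3)),
        "velocity": None,
    }
    # Velocities are optional: three further fields of width 8.
    if line[44:68].strip():
        atom["velocity"] = tuple(
            float(line[44 + 8 * i:52 + 8 * i]) for i in range(3))
    return atom


# Illustrative line built with the default field widths (values are made up).
SAMPLE = "{:5d}{:<5}{:>5}{:5d}{:8.3f}{:8.3f}{:8.3f}{:8.4f}{:8.4f}{:8.4f}".format(
    1, "SOL", "OW", 1, 0.126, 1.624, 1.679, 0.1227, -0.0580, 0.0434)

print(parse_gro_atom(SAMPLE))
```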


== GROMACS format ==
==CHARMM format==
The [[CHARMM]] molecular dynamics package<ref>{{cite journal |author= Brooks, B.M. | year= 1983 | title= CHARMM: A program for macromolecular energy, minimization, and dynamics calculations | journal= J. Comput. Chem. | volume= 4 | issue= 2 | pages= 187–217 | doi= 10.1002/jcc.540040211| s2cid= 91559650 |display-authors=etal}}</ref> can read and write a number of standard chemical and biochemical file formats; however, the CARD (coordinate) and PSF ([[protein structure]] file) are largely unique to CHARMM. The CARD format is fixed-column-width, resembles the PDB format, and is used exclusively for storing atomic coordinates. The PSF file contains atomic connectivity information (which describes atomic bonds) and is required before beginning a simulation. The typical file extensions used are ''<code>.crd</code>'' and ''<code>.psf</code>'' respectively.
The GROMACS file format family was created for use with the molecular simulation software package [[GROMACS]]. It closely resembles the PDB format but was designed for storing output from [[molecular dynamics]] simulations, so it allows for additional numerical precision and optionally retains information about particle [[velocity]] as well as position at a given point in the simulation trajectory. It does not allow for the storage of connectivity information, which in GROMACS is obtained from separate molecule and system topology files. The typical file extension for a GROMACS file is ''.gro''.


== CHARMM format ==
==GSD format==
The General Simulation Data (GSD) file format was created for efficient reading and writing of generic particle simulation data, primarily, but not exclusively, from [[HOOMD-blue]]. The package also contains a Python module that reads and writes HOOMD-schema GSD files with an easy-to-use syntax.[https://bitbucket.org/glotzer/gsd]
The [[CHARMM]] molecular dynamics package<ref>{{cite journal |author= Brooks, B.M. | year= 1983 | title= CHARMM: A program for macromolecular energy, minimization, and dynamics calculations | journal= J Comp Chem | volume= 4 | pages= 187–217 | doi= 10.1002/jcc.540040211|display-authors=etal}}</ref> can read and write a number of standard chemical and biochemical file formats; however, the CARD (coordinate) and PSF ([[protein structure]] file) are largely unique to CHARMM. The CARD format is fixed-column-width, resembles the PDB format, and is used exclusively for storing atomic coordinates. The PSF file contains atomic connectivity information (which describes atomic bonds) and is required before beginning a simulation. The typical file extensions used are ''.crd'' and ''.psf'' respectively.


== Ghemical file format ==
==Ghemical file format==
The [[Ghemical]] software can use OpenBabel to import and export a number of file formats. However, by default, it uses the GPR format. This file is composed of several parts, separated by a tag (!Header, !Info, !Atoms, !Bonds, !Coord, !PartialCharges and !End).
The [[Ghemical]] software can use OpenBabel to import and export a number of file formats. However, by default, it uses the GPR format. This file is composed of several parts, separated by a tag (<code>!Header</code>, <code>!Info</code>, <code>!Atoms</code>, <code>!Bonds</code>, <code>!Coord</code>, <code>!PartialCharges</code> and <code>!End</code>).


The proposed MIME type for this format is ''application/x-ghemical''.
The proposed MIME type for this format is ''application/x-ghemical''.


== SYBYL Line Notation ==
==SYBYL Line Notation==
[[SYBYL Line Notation]] (SLN) is a chemical [[line notation]]. Based on SMILES, it incorporates a complete syntax for specifying relative stereochemistry. SLN has a rich query syntax that allows for the specification of [[Markush]] queries. The syntax also supports the specification of combinatorial libraries of CD.
[[SYBYL Line Notation]] (SLN) is a chemical [[line notation]]. Based on SMILES, it incorporates a complete syntax for specifying relative stereochemistry. SLN has a rich query syntax that allows for the specification of [[Markush structure]] queries. The syntax also supports the specification of combinatorial libraries of ChemDraw.


:{| class="wikitable"
Example SLNs
|+Example SLNs

{| class="wikitable"
|-
! Description
! Description
! SLN String
! SLN string
|-
|-
| [[Benzene]]
| [[Benzene]]
| C[1]H:CH:CH:CH:CH:CH:@1
| <code>C[1]H:CH:CH:CH:CH:CH:@1</code>
|-
|-
| [[Alanine]]
| [[Alanine]]
| NH2C[s=n]H(CH3)C(=O)OH
| <code>NH2C[s=n]H(CH3)C(=O)OH</code>
|-
|-
| Query showing R sidechain
| Query showing R sidechain
| R1[hac>1]C[1]:C:C:C:C:C:@1
| <code>R1[hac>1]C[1]:C:C:C:C:C:@1</code>
|-
|-
| Query for amide/sulfamide
| Query for amide/sulfamide
| NHC=M1{M1:O,S}
| <code>NHC=M1{M1:O,S}</code>
|}
|}


== SMILES ==
== SMILES ==


The [[Simplified molecular input line entry specification|'''S'''implified '''M'''olecular '''I'''nput '''L'''ine '''E'''ntry '''S'''pecification]] (SMILES) is a [[line notation]] for molecules. SMILES strings include connectivity but do not include 2D or 3D coordinates.
The [[simplified molecular input line entry system]], or SMILES,<ref>{{cite journal|title=SMILES, a Chemical Language and Information System: 1: Introduction to Methodology and Encoding Rules|author=Weininger, David|journal=Journal of Chemical Information and Modeling|year=1988|volume=28|issue=1|pages=31–36|doi=10.1021/ci00057a005|s2cid=5445756 }}</ref> is a [[line notation]] for molecules. SMILES strings include connectivity but do not include 2D or 3D coordinates.


[[Hydrogen atoms]] are not represented. Other atoms are represented by their element symbols B, C, N, O, F, P, S, Cl, Br, and I. The symbol "=" represents double bonds and "#" represents triple bonds. Branching is indicated by (). Rings are indicated by pairs of digits.
[[Hydrogen atoms]] are not represented. Other atoms are represented by their element symbols <code>B</code>, <code>C</code>, <code>N</code>, <code>O</code>, <code>F</code>, <code>P</code>, <code>S</code>, <code>Cl</code>, <code>Br</code>, and <code>I</code>. The symbol <code>=</code> represents double bonds and <code>#</code> represents triple bonds. Branching is indicated by <code>( )</code>. Rings are indicated by pairs of digits.


Some examples are
Some examples are


{| class="wikitable"
:{| class="wikitable"
|-
|-
! Name
! Name
! Formula
! Formula
! SMILES String
! SMILES string
|-
|-
| [[Methane]]
| [[Methane]]
| CH<sub>4</sub>
| CH<sub>4</sub>
| <code>C</code>
| C
|-
|-
| [[Ethanol]]
| [[Ethanol]]
| C<sub>2</sub>H<sub>6</sub>O
| C<sub>2</sub>H<sub>6</sub>O
| CCO
| <code>CCO</code>
|-
|-
| [[Benzene]]
| [[Benzene]]
| C<sub>6</sub>H<sub>6</sub>
| C<sub>6</sub>H<sub>6</sub>
| <span class="moldetails_text">C1=CC=CC=C1 or c1ccccc1</span>
| <span class="moldetails_text"><code>C1=CC=CC=C1</code> or <code>c1ccccc1</code></span>
|-
|-
| [[Ethylene]]
| [[Ethylene]]
| C<sub>2</sub>H<sub>4</sub>
| C<sub>2</sub>H<sub>4</sub>
| C=C
| <code>C=C</code>
|}
|}
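The rules above are simple enough to tokenize with a regular expression. The Python sketch below covers only the subset described here (unbracketed element symbols plus lowercase aromatic atoms) and does not validate the string; the function names are illustrative. It lists the non-hydrogen atoms and checks that every ring-closure digit appears an even number of times.

```python
import re

# Two-letter symbols must come before one-letter ones ("Cl" before "C").
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOFPSI]|[bcnops]")


def heavy_atoms(smiles: str) -> list:
    """Return the non-hydrogen atom symbols in order of appearance."""
    return ATOM_PATTERN.findall(smiles)


def ring_digits_paired(smiles: str) -> bool:
    """Ring closures come in pairs, so each digit must occur an even number of times."""
    return all(smiles.count(digit) % 2 == 0 for digit in "123456789")


for example in ("C", "CCO", "C1=CC=CC=C1", "c1ccccc1", "C=C"):
    print(example, len(heavy_atoms(example)), ring_digits_paired(example))
```

The atom counts agree with the formulas in the table: both benzene strings contain six carbon atoms, and the six hydrogens are implicit.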


Line 88: Line 92:
The MDL number contains a unique identification number for each reaction and variation. The format is RXXXnnnnnnnn. R indicates a reaction, XXX indicates which database contains the reaction record. The numeric portion, nnnnnnnn, is an 8-digit number.
The MDL number contains a unique identification number for each reaction and variation. The format is RXXXnnnnnnnn. R indicates a reaction, XXX indicates which database contains the reaction record. The numeric portion, nnnnnnnn, is an 8-digit number.
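The pattern can be checked with a regular expression, as in the following Python sketch. The text does not say which characters may appear in the three-character database code, so treating it as alphanumeric is an assumption, and the identifier in the example is made up.

```python
import re

# "R", a three-character database code (assumed alphanumeric), then 8 digits.
MDL_NUMBER = re.compile(r"R([A-Za-z0-9]{3})(\d{8})")


def parse_mdl_number(identifier: str):
    """Return (database_code, number), or None if the pattern does not match."""
    match = MDL_NUMBER.fullmatch(identifier)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_mdl_number("RXYZ00000042"))  # hypothetical identifier
```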


== Other Common Formats ==
==Other common formats==


One of the most widely used industry standards is the [[chemical table file]] family of formats, which includes ''Structure Data Format'' (SDF) files. They are text files that adhere to a strict format for representing multiple chemical structure records and associated data fields. The format was originally developed and published by Molecular Design Limited (MDL). MOL is another file format from MDL. It is documented in Chapter 4 of ''CTfile Formats''.<ref>{{Harvnb|MDL Information Systems|2005}}</ref>
One of the most widely used industry standards is the [[chemical table file]] family of formats, which includes ''Structure Data Format'' (SDF) files. They are text files that adhere to a strict format for representing multiple chemical structure records and associated data fields. The format was originally developed and published by Molecular Design Limited (MDL). MOL is another file format from MDL. It is documented in Chapter 4 of ''CTfile Formats''.<ref>{{Harvnb|MDL Information Systems|2005}}</ref>
Line 96: Line 100:
A large number of other formats are listed in the table below.
A large number of other formats are listed in the table below.


== Converting Between Formats ==
==Converting between formats==

[[OpenBabel]] and [[JOELib]] are freely available open source tools specifically designed for converting between file formats. Their chemical expert systems support large atom type conversion tables.
[[OpenBabel]] and [[JOELib]] are freely available open source tools specifically designed for converting between file formats. Their chemical expert systems support large atom type conversion tables.

:<code>babel -i ''input_format'' ''input_file'' -o ''output_format'' ''output_file''</code>
:<code>obabel -i ''input_format'' ''input_file'' -o ''output_format'' -O ''output_file''</code>
For example, to convert the file epinephrine.sdf in SDF to CML use the command
For example, to convert the file epinephrine.sdf in SDF to CML use the command

:<code>babel -i sdf epinephrine.sdf -o cml epinephrine.cml</code>
:<code>obabel -i sdf epinephrine.sdf -o cml -O epinephrine.cml</code>

The resulting file is epinephrine.cml.
The resulting file is epinephrine.cml.
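The same conversion can be scripted. The Python sketch below only assembles the argument list and hands it to the operating system; it assumes that an <code>obabel</code> executable is installed and on the search path, and the helper names are illustrative. In the Open Babel documentation, <code>obabel</code> takes the output file name after the <code>-O</code> option, which is the form used here.

```python
import shutil
import subprocess


def obabel_command(in_format: str, in_file: str,
                   out_format: str, out_file: str) -> list:
    """Build the argument list for an obabel format conversion."""
    return ["obabel", "-i", in_format, in_file,
            "-o", out_format, "-O", out_file]


def convert(in_format: str, in_file: str,
            out_format: str, out_file: str) -> None:
    """Run the conversion; raises if obabel is missing or exits with an error."""
    if shutil.which("obabel") is None:
        raise FileNotFoundError("obabel executable not found on PATH")
    subprocess.run(
        obabel_command(in_format, in_file, out_format, out_file), check=True)


print(" ".join(obabel_command("sdf", "epinephrine.sdf", "cml", "epinephrine.cml")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.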

[https://github.com/theochem/iodata IOData] is a free and open-source Python library for parsing, storing, and converting various file formats commonly used by quantum chemistry, molecular dynamics, and plane-wave density-functional-theory software programs. It also supports a flexible framework for generating input files for various software packages. For a complete list of supported formats, please go to https://iodata.readthedocs.io/en/latest/formats.html.


A number of tools intended for viewing and editing molecular structures are able to read in files in a number of formats and write them out in other formats. The tools [[JChemPaint]] (based on the [[Chemistry Development Kit]]), [[XDrawChem]] (based on [[OpenBabel]]), [[MDL Chime|Chime]], [[Jmol]], Mol2mol<ref>[http://www.gunda.hu/mol2mol Mol2mol homepage]</ref>{{Citation needed|date=October 2010}}<!-- does Mol2mol merit coverage? --> and [[Discovery Studio]] fit into this category.
A number of tools intended for viewing and editing molecular structures are able to read in files in a number of formats and write them out in other formats. The tools [[JChemPaint]] (based on the [[Chemistry Development Kit]]), [[XDrawChem]] (based on [[OpenBabel]]), [[MDL Chime|Chime]], [[Jmol]], Mol2mol<ref>[http://www.gunda.hu/mol2mol Mol2mol homepage]</ref>{{Citation needed|date=October 2010}}<!-- does Mol2mol merit coverage? --> and [[Discovery Studio]] fit into this category.


== The Chemical MIME Project ==
==The Chemical MIME Project==
{{anchor|Chemical MIME}}


"Chemical MIME" is a de facto approach for adding [[MIME]] types to chemical streams.
"Chemical MIME" is a de facto approach for adding [[MIME]] types to chemical streams.
<blockquote>
<blockquote>
This project started in January 1994, and was first announced during the Chemistry workshop at the First WWW International Conference, held at CERN in May 1994. ... The first version of an Internet draft was published during May–October 1994, and the second revised version during April–September 1995. A paper presented to the CPEP (Committee on Printed and Electronic Publications) at the IUPAC meeting in August 1996 is available for discussion.<ref>[http://www.ch.ic.ac.uk/chemime/ The Chemical MIME Home Page] (accessed 2013-January-24)</ref></blockquote>
This project started in January 1994, and was first announced during the Chemistry workshop at the First WWW International Conference, held at CERN in May 1994. ... The first version of an Internet draft was published during May–October 1994, and the second revised version during April–September 1995. A paper presented to the CPEP (Committee on Printed and Electronic Publications) at the IUPAC meeting in August 1996 is available for discussion.<ref>[http://www.ch.ic.ac.uk/chemime/ The Chemical MIME Home Page] (accessed 2013-January-24)</ref></blockquote>
In 1998 the work was formally published in the [[JCIM]].<ref>{{Cite journal | last1 = Rzepa | first1 = H. S. | last2 = Murray-Rust | first2 = P. | last3 = Whitaker | first3 = B. J. | doi = 10.1021/ci9803233 | title = The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange | journal = Journal of Chemical Information and Modeling | volume = 38 | issue = 6 | pages = 976 | year = 1998 | pmid = | pmc = }}</ref>
In 1998 the work was formally published in the [[JCIM]].<ref>{{Cite journal | last1 = Rzepa | first1 = H. S. | last2 = Murray-Rust | first2 = P. | last3 = Whitaker | first3 = B. J. | doi = 10.1021/ci9803233 | title = The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange | journal = Journal of Chemical Information and Modeling | volume = 38 | issue = 6 | pages = 976 | year = 1998 }}</ref>


{| class="wikitable"
:{| class="wikitable"
! [[Filename extension|File Extension]]
! [[Filename extension|File extension]]
! [[MIME]] Type
! [[MIME]] Type
! Proper Name
! Proper Name
! Description
! Description
|-
|-
| alc
| <code>.alc</code>
| chemical/x-alchemy
| chemical/x-alchemy
| Alchemy Format
| Alchemy Format
|
|
|-
|-
| csf
| <code>.csf</code>
| chemical/x-cache-csf
| chemical/x-cache-csf
| CAChe MolStruct CSF
| CAChe MolStruct CSF
|
|
|-
|-
| cbin, cascii, ctab
| <code>.cbin</code>, <code>.cascii</code>, <code>.ctab</code>
| chemical/x-cactvs-binary
| chemical/x-cactvs-binary
| CACTVS format
| CACTVS format
|
|
|-
|-
| cdx
| <code>.cdx</code>
| chemical/x-cdx
| chemical/x-cdx
| ChemDraw eXchange file
| ChemDraw eXchange file
|
|
|-
|-
| cer
| <code>.cer</code>
| chemical/x-cerius
| chemical/x-cerius
| MSI Cerius II format
| MSI Cerius II format
|
|
|-
|-
| c3d
| <code>.c3d</code>
| chemical/x-chem3d
| chemical/x-chem3d
| Chem3D Format
| Chem3D Format
|
|
|-
|-
| chm
| <code>.chm</code>
| chemical/x-chemdraw
| chemical/x-chemdraw
| ChemDraw file
| ChemDraw file
|
|
|-
|-
| cif
| <code>.cif</code>
| chemical/x-cif
| chemical/x-cif
| [[Crystallographic Information File]], Crystallographic Information Framework
| [[Crystallographic Information File]], Crystallographic Information Framework
| Promulgated by the International Union of Crystallography
| Promulgated by the International Union of Crystallography
|-
|-
| cmdf
| <code>.cmdf</code>
| chemical/x-cmdf
| chemical/x-cmdf
| CrystalMaker Data format
| CrystalMaker Data format
|
|
|-
|-
| cml
| <code>.cml</code>
| chemical/x-cml
| chemical/x-cml
| [[Chemical Markup Language]]
| [[Chemical Markup Language]]
| [[XML]] based [[Chemical Markup Language]].
| [[XML]] based [[Chemical Markup Language]].
|-
|-
| cpa
| <code>.cpa</code>
| chemical/x-compass
| chemical/x-compass
| Compass program of the Takahashi
| Compass program of the Takahashi
|
|
|-
|-
| bsd
| <code>.bsd</code>
| chemical/x-crossfire
| chemical/x-crossfire
| Crossfire file
| Crossfire file
|
|
|-
|-
| csm, csml
| <code>.csm</code>, <code>.csml</code>
| chemical/x-csml
| chemical/x-csml
| Chemical Style Markup Language
| Chemical Style Markup Language
|
|
|-
|-
| ctx
| <code>.ctx</code>
| chemical/x-ctx
| chemical/x-ctx
| Gasteiger group CTX file format
| Gasteiger group CTX file format
|
|
|-
|-
| cxf, cef
| <code>.cxf</code>, <code>.cef</code>
| chemical/x-cxf
| chemical/x-cxf
| Chemical eXchange Format
| Chemical eXchange Format
|
|
|-
|-
| emb, embl
| <code>.emb</code>, <code>.embl</code>
| chemical/x-embl-dl-nucleotide
| chemical/x-embl-dl-nucleotide
| EMBL Nucleotide Format
| EMBL Nucleotide Format
|
|
|-
|-
| spc
| <code>.spc</code>
| chemical/x-galactic-spc
| chemical/x-galactic-spc
| SPC format for spectral and chromatographic data
| SPC format for spectral and chromatographic data
|
|
|-
|-
| inp, gam, gamin
| <code>.inp</code>, <code>.gam</code>, <code>.gamin</code>
| chemical/x-gamess-input
| chemical/x-gamess-input
| GAMESS Input format
| GAMESS Input format
|
|
|-
|-
| fch, fchk
| <code>.fch</code>, <code>.fchk</code>
| chemical/x-gaussian-checkpoint
| chemical/x-gaussian-checkpoint
| [[Gaussian (software)|Gaussian]] Checkpoint Format
| [[Gaussian (software)|Gaussian]] Checkpoint Format
|
|
|-
|-
| cub
| <code>.cub</code>
| chemical/x-gaussian-cube
| chemical/x-gaussian-cube
| [[Gaussian (software)|Gaussian]] Cube (Wavefunction) Format
| [[Gaussian (software)|Gaussian]] Cube (Wavefunction) Format
|
|
|-
|-
| gau, gjc, gjf, com
| <code>.gau</code>, <code>.gjc</code>, <code>.gjf</code>, <code>.com</code>
| chemical/x-gaussian-input
| chemical/x-gaussian-input
| [[Gaussian (software)|Gaussian]] Input Format
| [[Gaussian (software)|Gaussian]] Input Format
|
|
|-
|-
| gcg
| <code>.gcg</code>
| chemical/x-gcg8-sequence
| chemical/x-gcg8-sequence
| Protein Sequence Format
| Protein Sequence Format
|
|
|-
|-
| gen
| <code>.gen</code>
| chemical/x-genbank
| chemical/x-genbank
| ToGenBank Format
| ToGenBank Format
|
|
|-
|-
| istr,ist
| <code>.istr</code>, <code>.ist</code>
| chemical/x-isostar
| chemical/x-isostar
| IsoStar Library of Intermolecular Interactions
| IsoStar Library of Intermolecular Interactions
|
|
|-
|-
| jdx, dx
| <code>.jdx</code>, <code>.dx</code>
| chemical/x-jcamp-dx
| chemical/x-jcamp-dx
| [[JCAMP]] Spectroscopic Data Exchange Format
| [[JCAMP]] Spectroscopic Data Exchange Format
|
|
|-
|-
| kin
| <code>.kin</code>
| chemical/x-kinemage
| chemical/x-kinemage
| Kinetic (Protein Structure) Images; [[Kinemage]]
| Kinetic (Protein Structure) Images; [[Kinemage]]
|
|
|-
|-
| mcm
| <code>.mcm</code>
| chemical/x-macmolecule
| chemical/x-macmolecule
| MacMolecule File Format
| MacMolecule File Format
|
|
|-
|-
| mmd, mmod
| <code>.mmd</code>, <code>.mmod</code>
| chemical/x-macromodel-input
| chemical/x-macromodel-input
| [[MacroModel]] Molecular Mechanics
| [[MacroModel]] Molecular Mechanics
|
|
|-
|-
| mol
| <code>.mol</code>
| chemical/x-mdl-molfile
| chemical/x-mdl-molfile
| [[MDL Molfile]]
| [[MDL Molfile]]
|
|
|-
|-
| smiles, smi
| <code>.smiles</code>, <code>.smi</code>
| chemical/x-daylight-smiles
| chemical/x-daylight-smiles
| [[Simplified molecular input line entry specification]]
| [[Simplified molecular input line entry specification]]
| A line notation for molecules.
| A line notation for molecules.
|-
|-
| sdf
| <code>.sdf</code>
| chemical/x-mdl-sdfile
| chemical/x-mdl-sdfile
| [[SD format|Structure-Data File]]
| [[SD format|Structure-Data File]]
|
|
|-
|-
| el
| <code>.el</code>
| chemical/x-sketchel
| chemical/x-sketchel
| [[SketchEl]] Molecule
| SketchEl Molecule
|
|
|-
|-
| ds
| <code>.ds</code>
| chemical/x-datasheet
| chemical/x-datasheet
| SketchEl XML DataSheet
| SketchEl XML DataSheet
|
|
|-
|-
| inchi
| <code>.inchi</code>
| chemical/x-inchi
| chemical/x-inchi
| The IUPAC International Chemical Identifier
| IUPAC [[International Chemical Identifier]] (InChI)
|
|
|-
|-
| jsd, jsdraw
| <code>.jsd</code>, <code>.jsdraw</code>
| chemical/x-jsdraw
| chemical/x-jsdraw
| JSDraw native file format
| JSDraw native file format
|
|
|-
|-
| helm, ihelm
| <code>.helm</code>, <code>.ihelm</code>
| chemical/x-helm
| chemical/x-helm
| Pistoia Alliance [[Hierarchical Editing Language for Macromolecules|HELM]] string
| Pistoia Alliance [[Hierarchical Editing Language for Macromolecules|HELM]] string
| A line notation for biological molecules
| A line notation for biological molecules
|-
|-
| xhelm
| <code>.xhelm</code>
| chemical/x-xhelm
| chemical/x-xhelm
| Pistoia Alliance XHELM XML file
| Pistoia Alliance XHELM XML file
Line 306: Line 317:
===Support===
===Support===


For Linux/Unix, configuration files are available as a "''chemical-mime-data''" package in [[Deb (file format)|.deb]], [[RPM Package Manager|RPM]] and tar.gz formats to register chemical MIME types on a web server.<ref>http://packages.debian.org/search?keywords=chemical-mime</ref><ref>http://downloads.sourceforge.net/chemical-mime/</ref> Programs can then register as viewer, editor or processor for these formats so that full support for chemical MIME types is available.
For Linux/Unix, configuration files are available as a "''chemical-mime-data''" package in [[Deb (file format)|.deb]], [[RPM Package Manager|RPM]] and tar.gz formats to register chemical MIME types on a web server.<ref>{{Cite web|url=http://packages.debian.org/search?keywords=chemical-mime|title =Package Search Results for "chemical-mime" {{!}} Debian}}</ref><ref>{{cite web |url=http://downloads.sourceforge.net/chemical-mime/ |title = Why Use SourceForge? Features and Benefits}}</ref> Programs can then register as viewer, editor or processor for these formats so that full support for chemical MIME types is available.
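An application can also register a few chemical MIME types for itself. The Python sketch below uses the standard-library <code>mimetypes</code> module with a private registry so that the global table is left untouched; the selection of extensions is an illustrative subset of the table above, and the function name is an assumption.

```python
import mimetypes

# Illustrative subset of the chemical MIME table.
CHEMICAL_TYPES = {
    ".cml": "chemical/x-cml",
    ".cif": "chemical/x-cif",
    ".mol": "chemical/x-mdl-molfile",
    ".sdf": "chemical/x-mdl-sdfile",
    ".smi": "chemical/x-daylight-smiles",
}


def chemical_mime_registry() -> mimetypes.MimeTypes:
    """Return a private MIME registry extended with chemical types."""
    registry = mimetypes.MimeTypes()
    for extension, mime_type in CHEMICAL_TYPES.items():
        registry.add_type(mime_type, extension)
    return registry


registry = chemical_mime_registry()
print(registry.guess_type("epinephrine.cml")[0])
```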


== Sources of Chemical Data ==
==Sources of chemical data==


Here is a short list of sources of freely available molecular data. Many more resources are available on the Internet than are listed here. Links to these sources are given in the references below.
Here is a short list of sources of freely available molecular data. Many more resources are available on the Internet than are listed here. Links to these sources are given in the references below.


# The US [[National Institutes of Health]] [[PubChem]] database is a huge source of chemical data. All of the data is in two dimensions. Data includes SDF, SMILES, PubChem XML, and PubChem ASN1 formats.
# The US [[National Institutes of Health]] [[PubChem]] database is a huge source of chemical data. All of the data is in two dimensions. Data includes SDF, SMILES, PubChem XML, and PubChem ASN1 formats.
#The worldwide Protein Data Bank ([http://www.wwPDB.org/ wwPDB])<ref>{{cite journal |doi = 10.1038/nsb1203-980 |author = Berman, H.M. | year = 2003 | title = Announcing the worldwide Protein Data Bank | journal = Nature Structural Biology | volume = 10 |issue = 12 | pages = 980 |pmid = 14634627|display-authors=etal}}</ref> is an excellent source of protein and nucleic acid molecular coordinate data. The data is three-dimensional and provided in Protein Data Bank (PDB) format.
#The worldwide Protein Data Bank ([http://www.wwPDB.org/ wwPDB])<ref>{{cite journal |doi = 10.1038/nsb1203-980 |author = Berman, H.M. | year = 2003 | title = Announcing the worldwide Protein Data Bank | journal = Nature Structural Biology | volume = 10 |issue = 12 | pages = 980 |pmid = 14634627|display-authors=etal|doi-access = free }}</ref> is an excellent source of protein and nucleic acid molecular coordinate data. The data is three-dimensional and provided in Protein Data Bank (PDB) format.
#[[eMolecules]] is a commercial database for molecular data. The data includes a two-dimensional structure diagram and a smiles string for each compound. eMolecules supports fast substructure searching based on parts of the molecular structure.
#{{Proper name|eMolecules}} is a commercial database for molecular data. The data includes a two-dimensional structure diagram and a smiles string for each compound. {{Proper name|eMolecules}} supports fast substructure searching based on parts of the molecular structure.
#[[ChemExper]] is a commercial data base for molecular data. The search results include a two-dimensional structure diagram and a mole file for many compounds.
#[[ChemExper]] is a commercial data base for molecular data. The search results include a two-dimensional structure diagram and a mole file for many compounds.
#[[New York University]] [https://www.nyu.edu/pages/mathmol/library/ Library of 3-D Molecular Structures].
#[[New York University]] [https://www.nyu.edu/pages/mathmol/library/ Library of 3-D Molecular Structures].
Line 332: Line 343:


==External links==
==External links==
* {{Citation |author=[[MDL Information Systems]] |title=CTFile Formats |date=June 2005 |publisher=[[MDL Information Systems]] |location=San Leandro, California, United States |url=http://www.mdl.com/downloads/public/ctfile/ctfile.pdf |format=PDF |archiveurl=https://web.archive.org/web/20070630061308/http://www.mdl.com/downloads/public/ctfile/ctfile.pdf |archivedate=June 30, 2007 }}
* {{Citation |author=MDL Information Systems |author-link=MDL Information Systems |title=CTFile Formats |date=June 2005 |publisher=[[MDL Information Systems]] |location=San Leandro, California, United States |url=http://www.mdl.com/downloads/public/ctfile/ctfile.pdf |archive-url=https://web.archive.org/web/20070630061308/http://www.mdl.com/downloads/public/ctfile/ctfile.pdf |archive-date=June 30, 2007 }}
*{{cite web |url= http://cactus.nci.nih.gov/blog/?p=68 |title= Resolve a structure identifier as SDF, CML, MRV, PDB |date= July 2009 |publisher= CADD Group Chemoinformatics Tools and User Services (CACTUS) |location= [[NIH]] |work= [[National Cancer Institute|NCI]] }}
*{{cite web |url= http://cactus.nci.nih.gov/blog/?p=68 |title= Resolve a structure identifier as SDF, CML, MRV, PDB |date= July 2009 |publisher= CADD Group Chemoinformatics Tools and User Services (CACTUS)|location= [[NIH]]|work=[[National Cancer Institute|NCI]]}}


{{DEFAULTSORT:Chemical File Format}}
{{DEFAULTSORT:Chemical File Format}}

Latest revision as of 06:10, 19 July 2024

A chemical file format is a type of data file which is used specifically for depicting molecular data. One of the most widely used is the chemical table file format, which is similar to Structure Data Format (SDF) files. They are text files that represent multiple chemical structure records and associated data fields. The XYZ file format is a simple format that usually gives the number of atoms in the first line, a comment on the second, followed by a number of lines with atomic symbols (or atomic numbers) and cartesian coordinates. The Protein Data Bank Format is commonly used for proteins but is also used for other types of molecules. There are many other types which are detailed below. Various software systems are available to convert from one format to another.

Distinguishing formats

Chemical information is usually provided as files or streams and many formats have been created, with varying degrees of documentation. The format is indicated in three ways:
(see § The Chemical MIME Project)

  • file extension (usually 3 letters). This is widely used, but fragile as common suffixes such as .mol and .dat are used by many systems, including non-chemical ones.
  • self-describing files where the format information is included in the file. Examples are CIF and CML.
  • chemical/MIME type added by a chemically aware server.
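
The first of these mechanisms, lookup by file extension, can be sketched in a few lines of Python. This is a minimal illustration rather than a complete registry: the dictionary holds only a handful of pairs copied from the Chemical MIME table in this article, and the function name `guess_chemical_mime` is a hypothetical helper, not part of any library.

```python
import os

# A few extension-to-type pairs copied from the Chemical MIME table in this article.
CHEMICAL_MIME_BY_EXTENSION = {
    ".cif": "chemical/x-cif",
    ".cml": "chemical/x-cml",
    ".mol": "chemical/x-mdl-molfile",
    ".sdf": "chemical/x-mdl-sdfile",
    ".smi": "chemical/x-daylight-smiles",
}


def guess_chemical_mime(filename):
    """Return the chemical MIME type suggested by the extension, or None if unknown."""
    extension = os.path.splitext(filename)[1].lower()
    return CHEMICAL_MIME_BY_EXTENSION.get(extension)


print(guess_chemical_mime("epinephrine.sdf"))
print(guess_chemical_mime("results.dat"))
```

A lookup like this inherits the fragility noted above: an extension such as .mol or .dat says nothing about what the file actually contains, so a real tool would normally also inspect the content.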

Chemical Markup Language

Chemical Markup Language (CML) is an open standard for representing molecular and other chemical data. The open source project includes XML Schema, source code for parsing and working with CML data, and an active community. The articles Tools for Working with Chemical Markup Language and XML for Chemistry and Biosciences discuss CML in more detail. CML data files are accepted by many tools, including JChemPaint, Jmol, XDrawChem and MarvinView.

Protein Data Bank Format

The Protein Data Bank Format is an obsolete format for protein structures developed in 1972.[1] It is a fixed-width format and thus limited to a maximum number of atoms, residues, and chains; this resulted in splitting very large structures such as ribosomes into multiple files. For example, the E. coli 70S ribosome was represented as 4 PDB files in 2009: 3I1M, 3I1N, 3I1O, and 3I1P. In 2014, they were consolidated into a single file, 4V6C. In 2014, the PDB format was officially replaced with mmCIF, and newer PDB structures may not have PDB files available.

Some PDB files contained an optional section describing atom connectivity as well as position. Because these files were sometimes used to describe macromolecular assemblies or molecules represented in explicit solvent, they could grow very large and were often compressed. Some tools, such as Jmol and KiNG,[2] could read PDB files in gzipped format. The wwPDB maintained the specifications of the PDB file format and its XML alternative, PDBML. There was a fairly major change in PDB format specification (to version 3.0) in August 2007, and a remediation of many file problems in the existing database.[3] The typical file extension for a PDB file was .pdb, although some older files used .ent or .brk. Some molecular modeling tools wrote nonstandard PDB-style files that adapted the basic format to their own needs.
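
Because the format is fixed-width, a record is read by slicing at column positions rather than by splitting on whitespace. The sketch below assumes the ATOM record columns from the wwPDB format documentation (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain identifier in 22, residue number in 23-26, and x, y, z in 31-38, 39-46, and 47-54). The function name and the sample values are hypothetical.

```python
def parse_atom_record(line):
    """Slice one ATOM/HETATM record by fixed column positions (0-based slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


# Build a hypothetical record with a format string so the columns line up exactly.
sample = "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:8.3f}{:8.3f}{:8.3f}".format(
    "ATOM", 1, " N", " ", "ALA", "A", 1, " ", 11.104, 6.134, -6.504
)
atom = parse_atom_record(sample)
print(atom["name"], atom["res_name"], atom["chain_id"], atom["x"], atom["y"], atom["z"])
```

The five-character serial field is what caps a single file at 99,999 atoms, which is one of the limits that forced large structures to be split.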

GROMACS format

The GROMACS file format family was created for use with the molecular simulation software package GROMACS. It closely resembles the PDB format but was designed for storing output from molecular dynamics simulations, so it allows for additional numerical precision and optionally retains information about particle velocity as well as position at a given point in the simulation trajectory. It does not allow for the storage of connectivity information, which in GROMACS is obtained from separate molecule and system topology files. The typical file extension for a GROMACS file is .gro.

CHARMM format

The CHARMM molecular dynamics package[4] can read and write a number of standard chemical and biochemical file formats; however, the CARD (coordinate) and PSF (protein structure file) are largely unique to CHARMM. The CARD format is fixed-column-width, resembles the PDB format, and is used exclusively for storing atomic coordinates. The PSF file contains atomic connectivity information (which describes atomic bonds) and is required before beginning a simulation. The typical file extensions used are .crd and .psf respectively.

GSD format

The General Simulation Data (GSD) file format was created for efficient reading and writing of generic particle simulations, primarily, but not exclusively, those from HOOMD-blue. The package also contains a Python module that reads and writes HOOMD schema GSD files with an easy-to-use syntax.[1]

Ghemical file format

The Ghemical software can use OpenBabel to import and export a number of file formats. However, by default, it uses the GPR format. This file is composed of several parts, separated by a tag (!Header, !Info, !Atoms, !Bonds, !Coord, !PartialCharges and !End).
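
A tag-separated layout like this can be split into sections with a short loop. The sketch below is only an illustration: it assumes that each tag starts a line and that the lines following it belong to that tag, which the description above does not spell out. The sample text is a hypothetical placeholder, not real GPR content.

```python
def split_tagged_sections(text):
    """Group lines under the most recent line that starts with '!'."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("!"):
            # Assumption: the tag is the first token; keep any text after it on the same line.
            current, _, rest = line.partition(" ")
            sections[current] = [rest] if rest.strip() else []
        elif current is not None:
            sections[current].append(line)
    return sections


# Hypothetical placeholder content, used only to exercise the splitter.
sample = "\n".join([
    "!Header",
    "placeholder header line",
    "!Atoms",
    "placeholder atom line 1",
    "placeholder atom line 2",
    "!End",
])
sections = split_tagged_sections(sample)
print(list(sections))
```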

The proposed MIME type for this format is application/x-ghemical.

SYBYL Line Notation

SYBYL Line Notation (SLN) is a chemical line notation. Based on SMILES, it incorporates a complete syntax for specifying relative stereochemistry. SLN has a rich query syntax that allows for the specification of Markush structure queries. The syntax also supports the specification of combinatorial libraries of ChemDraw.

Example SLNs
Description                SLN string
Benzene                    C[1]H:CH:CH:CH:CH:CH:@1
Alanine                    NH2C[s=n]H(CH3)C(=O)OH
Query showing R sidechain  R1[hac>1]C[1]:C:C:C:C:C:@1
Query for amide/sulfamide  NHC=M1{M1:O,S}

SMILES

The simplified molecular input line entry system, or SMILES,[5] is a line notation for molecules. SMILES strings include connectivity but do not include 2D or 3D coordinates.

Hydrogen atoms are not represented. Other atoms are represented by their element symbols B, C, N, O, F, P, S, Cl, Br, and I. The symbol = represents double bonds and # represents triple bonds. Branching is indicated by ( ). Rings are indicated by pairs of digits.

Some examples are

Name      Formula  SMILES string
Methane   CH4      C
Ethanol   C2H6O    CCO
Benzene   C6H6     C1=CC=CC=C1 or c1ccccc1
Ethylene  C2H4     C=C
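
Since hydrogens are implicit and every other atom appears as an element symbol, the number of non-hydrogen atoms in a simple SMILES string can be found by scanning for those symbols. The sketch below is a deliberately limited illustration: it handles only the unbracketed symbols listed above plus their lowercase aromatic forms, it ignores bracket atoms such as [Na+], and the function name is hypothetical.

```python
import re

# Two-letter symbols come first so "Cl" is not read as "C" followed by an unknown "l".
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOFPSI]|[bcnops]")


def count_heavy_atoms(smiles):
    """Count non-hydrogen atoms in a SMILES string that uses no bracket atoms."""
    return len(ATOM_PATTERN.findall(smiles))


for smiles in ["C", "CCO", "C1=CC=CC=C1", "c1ccccc1", "C=C"]:
    print(smiles, count_heavy_atoms(smiles))
```

Both benzene spellings from the table give the same count, because ring-closure digits and bond symbols are skipped.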

XYZ

The XYZ file format is a simple format that usually gives the number of atoms in the first line, a comment on the second, followed by a number of lines with atomic symbols (or atomic numbers) and cartesian coordinates.
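
The layout described above maps directly onto a small parser. This is a minimal sketch that assumes whitespace-separated columns and ignores anything after the third coordinate. The function name is hypothetical and the water coordinates are illustrative values only.

```python
def parse_xyz(text):
    """Return (comment, atoms) where atoms is a list of (symbol, x, y, z) tuples."""
    lines = text.splitlines()
    atom_count = int(lines[0])      # line 1: number of atoms
    comment = lines[1]              # line 2: free-form comment
    atoms = []
    for line in lines[2:2 + atom_count]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != atom_count:
        raise ValueError("fewer atom lines than the count on the first line")
    return comment, atoms


# Illustrative coordinates for a water molecule.
sample = (
    "3\n"
    "water (illustrative coordinates)\n"
    "O 0.000 0.000 0.117\n"
    "H 0.000 0.757 -0.471\n"
    "H 0.000 -0.757 -0.471\n"
)
comment, atoms = parse_xyz(sample)
print(comment, len(atoms))
```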

MDL number

The MDL number contains a unique identification number for each reaction and variation. The format is RXXXnnnnnnnn. R indicates a reaction, XXX indicates which database contains the reaction record. The numeric portion, nnnnnnnn, is an 8-digit number.
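
The RXXXnnnnnnnn pattern can be checked with a regular expression. The sketch below assumes that the three-character database code consists of letters or digits, which the description above does not specify. The function name and the example identifiers are hypothetical.

```python
import re

# "R", then a three-character database code (assumed alphanumeric), then eight digits.
MDL_NUMBER_PATTERN = re.compile(r"R([A-Za-z0-9]{3})(\d{8})")


def parse_mdl_number(identifier):
    """Return (database_code, numeric_part) or None if the identifier does not match."""
    match = MDL_NUMBER_PATTERN.fullmatch(identifier)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_mdl_number("RABC00012345"))
print(parse_mdl_number("RABC123"))
```

The numeric part is returned as a string so that leading zeros are preserved.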

Other common formats

One of the most widely used industry standards is the family of chemical table file formats, such as Structure Data Format (SDF) files. They are text files that adhere to a strict format for representing multiple chemical structure records and associated data fields. The format was originally developed and published by Molecular Design Limited (MDL). MOL is another file format from MDL. It is documented in Chapter 4 of CTfile Formats.[6]

PubChem also has XML and ASN1 file formats, which are export options from the PubChem online database. Both are text-based as exported, although ASN1 is most often a binary format.

A large number of other formats are listed in the table below.

Converting between formats

OpenBabel and JOELib are freely available open source tools specifically designed for converting between file formats. Their chemical expert systems support large atom type conversion tables.

obabel -i input_format input_file -o output_format -O output_file

For example, to convert the file epinephrine.sdf in SDF to CML use the command

obabel -i sdf epinephrine.sdf -o cml -O epinephrine.cml

The resulting file is epinephrine.cml.
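
When conversions are scripted, the same command can be assembled and run from Python. The sketch below assumes that the `obabel` executable is installed and on the PATH, and that the output file is passed with the `-O` option. The helper names `build_obabel_command` and `convert` are hypothetical and not part of Open Babel.

```python
import shutil
import subprocess


def build_obabel_command(input_format, input_file, output_format, output_file):
    """Assemble the obabel argument list for a single-file conversion."""
    return [
        "obabel",
        f"-i{input_format}", input_file,
        f"-o{output_format}",
        "-O", output_file,
    ]


def convert(input_format, input_file, output_format, output_file):
    """Run the conversion; raises if obabel is missing or exits with an error."""
    if shutil.which("obabel") is None:
        raise RuntimeError("obabel executable not found on PATH")
    command = build_obabel_command(input_format, input_file, output_format, output_file)
    subprocess.run(command, check=True)


# Show the command for the epinephrine example without running it.
print(" ".join(build_obabel_command("sdf", "epinephrine.sdf", "cml", "epinephrine.cml")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.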

IOData is a free and open-source Python library for parsing, storing, and converting various file formats commonly used by quantum chemistry, molecular dynamics, and plane-wave density-functional-theory software programs. It also supports a flexible framework for generating input files for various software packages. For a complete list of supported formats, see https://iodata.readthedocs.io/en/latest/formats.html.

A number of tools intended for viewing and editing molecular structures are able to read in files in a number of formats and write them out in other formats. The tools JChemPaint (based on the Chemistry Development Kit), XDrawChem (based on OpenBabel), Chime, Jmol, Mol2mol[7][citation needed] and Discovery Studio fit into this category.

The Chemical MIME Project

"Chemical MIME" is a de facto approach for adding MIME types to chemical streams.

This project started in January 1994, and was first announced during the Chemistry workshop at the First WWW International Conference, held at CERN in May 1994. ... The first version of an Internet draft was published during May–October 1994, and the second revised version during April–September 1995. A paper presented to the CPEP (Committee on Printed and Electronic Publications) at the IUPAC meeting in August 1996 is available for discussion.[8]

In 1998 the work was formally published in the JCIM.[9]

File extension          MIME Type                       Proper Name                                                                Description
.alc                    chemical/x-alchemy              Alchemy Format
.csf                    chemical/x-cache-csf            CAChe MolStruct CSF
.cbin, .cascii, .ctab   chemical/x-cactvs-binary        CACTVS format
.cdx                    chemical/x-cdx                  ChemDraw eXchange file
.cer                    chemical/x-cerius               MSI Cerius II format
.c3d                    chemical/x-chem3d               Chem3D Format
.chm                    chemical/x-chemdraw             ChemDraw file
.cif                    chemical/x-cif                  Crystallographic Information File, Crystallographic Information Framework  Promulgated by the International Union of Crystallography
.cmdf                   chemical/x-cmdf                 CrystalMaker Data format
.cml                    chemical/x-cml                  Chemical Markup Language                                                   XML based Chemical Markup Language.
.cpa                    chemical/x-compass              Compass program of the Takahashi
.bsd                    chemical/x-crossfire            Crossfire file
.csm, .csml             chemical/x-csml                 Chemical Style Markup Language
.ctx                    chemical/x-ctx                  Gasteiger group CTX file format
.cxf, .cef              chemical/x-cxf                  Chemical eXchange Format
.emb, .embl             chemical/x-embl-dl-nucleotide   EMBL Nucleotide Format
.spc                    chemical/x-galactic-spc         SPC format for spectral and chromatographic data
.inp, .gam, .gamin      chemical/x-gamess-input         GAMESS Input format
.fch, .fchk             chemical/x-gaussian-checkpoint  Gaussian Checkpoint Format
.cub                    chemical/x-gaussian-cube        Gaussian Cube (Wavefunction) Format
.gau, .gjc, .gjf, .com  chemical/x-gaussian-input       Gaussian Input Format
.gcg                    chemical/x-gcg8-sequence        Protein Sequence Format
.gen                    chemical/x-genbank              ToGenBank Format
.istr, .ist             chemical/x-isostar              IsoStar Library of Intermolecular Interactions
.jdx, .dx               chemical/x-jcamp-dx             JCAMP Spectroscopic Data Exchange Format
.kin                    chemical/x-kinemage             Kinetic (Protein Structure) Images; Kinemage
.mcm                    chemical/x-macmolecule          MacMolecule File Format
.mmd, .mmod             chemical/x-macromodel-input     MacroModel Molecular Mechanics
.mol                    chemical/x-mdl-molfile          MDL Molfile
.smiles, .smi           chemical/x-daylight-smiles      Simplified molecular input line entry specification                        A line notation for molecules.
.sdf                    chemical/x-mdl-sdfile           Structure-Data File
.el                     chemical/x-sketchel             SketchEl Molecule
.ds                     chemical/x-datasheet            SketchEl XML DataSheet
.inchi                  chemical/x-inchi                IUPAC International Chemical Identifier (InChI)
.jsd, .jsdraw           chemical/x-jsdraw               JSDraw native file format
.helm, .ihelm           chemical/x-helm                 Pistoia Alliance HELM string                                               A line notation for biological molecules
.xhelm                  chemical/x-xhelm                Pistoia Alliance XHELM XML file                                            XML based HELM including monomer definitions

Support

For Linux/Unix, configuration files are available as a "chemical-mime-data" package in .deb, RPM and tar.gz formats to register chemical MIME types on a web server.[10][11] Programs can then register as viewer, editor or processor for these formats so that full support for chemical MIME types is available.

Sources of chemical data

Here is a short list of sources of freely available molecular data. Many more resources are available on the Internet than are listed here. Links to these sources are given in the references below.

  1. The US National Institutes of Health PubChem database is a huge source of chemical data. All of the data is in two dimensions. Data includes SDF, SMILES, PubChem XML, and PubChem ASN1 formats.
  2. The worldwide Protein Data Bank (wwPDB)[12] is an excellent source of protein and nucleic acid molecular coordinate data. The data is three-dimensional and provided in Protein Data Bank (PDB) format.
  3. eMolecules is a commercial database for molecular data. The data includes a two-dimensional structure diagram and a SMILES string for each compound. eMolecules supports fast substructure searching based on parts of the molecular structure.
  4. ChemExper is a commercial database for molecular data. The search results include a two-dimensional structure diagram and a MOL file for many compounds.
  5. New York University Library of 3-D Molecular Structures.
  6. The US Environmental Protection Agency's Distributed Structure-Searchable Toxicity (DSSTox) Database Network is a project of the EPA's Computational Toxicology Program. The database provides SDF molecular files with a focus on carcinogenic and otherwise toxic substances.

See also

References

  1. ^ wwPDB.org. "wwPDB: File Format". www.wwpdb.org. Retrieved 2024-06-13.
  2. ^ Chen, V.B.; et al. (2009). "KING (Kinemage, Next Generation): A versatile interactive molecular and scientific visualization program". Protein Science. 18 (11): 2403–2409. doi:10.1002/pro.250. PMC 2788294. PMID 19768809.
  3. ^ Henrick, K.; et al. (2008). "Remediation of the protein data bank archive". Nucleic Acids Research. 36 (Database issue): D426–D433. doi:10.1093/nar/gkm937. PMC 2238854. PMID 18073189.
  4. ^ Brooks, B.M.; et al. (1983). "CHARMM: A program for macromolecular energy, minimization, and dynamics calculations". J. Comput. Chem. 4 (2): 187–217. doi:10.1002/jcc.540040211. S2CID 91559650.
  5. ^ Weininger, David (1988). "SMILES, a Chemical Language and Information System: 1: Introduction to Methodology and Encoding Rules". Journal of Chemical Information and Modeling. 28 (1): 31–36. doi:10.1021/ci00057a005. S2CID 5445756.
  6. ^ MDL Information Systems 2005
  7. ^ Mol2mol homepage
  8. ^ The Chemical MIME Home Page (accessed 2013-January-24)
  9. ^ Rzepa, H. S.; Murray-Rust, P.; Whitaker, B. J. (1998). "The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange". Journal of Chemical Information and Modeling. 38 (6): 976. doi:10.1021/ci9803233.
  10. ^ "Package Search Results for "chemical-mime" | Debian".
  11. ^ "Why Use SourceForge? Features and Benefits".
  12. ^ Berman, H.M.; et al. (2003). "Announcing the worldwide Protein Data Bank". Nature Structural Biology. 10 (12): 980. doi:10.1038/nsb1203-980. PMID 14634627.