Digraphs and trigraphs (programming)
Revision as of 21:44, 9 November 2007
In the C family of programming languages, a trigraph is a sequence of three characters, the first two of which are both question marks, that represents a single character.
The reason for their existence is that the basic character set of C (a subset of the ASCII character set) includes nine characters which lie outside the ISO 646 invariant character set. This can pose a problem for writing source code if the keyboard being used does not support any of these nine characters. The ANSI C committee invented trigraphs as a way of entering source code using keyboards that supported any version of the ISO 646 character set. Non-ASCII ISO 646 character sets are not much used today, but trigraphs remain in the C99 standard[1].
Trigraphs may also be useful with some EBCDIC code pages that lack characters such as { and }.
Trigraphs are not commonly encountered outside compiler test suites. Some compilers either have an option to turn recognition of trigraphs off, or disable trigraphs by default and require an option to turn them on. Some can issue warnings when they encounter trigraphs in source files. Borland supplied a separate program, the trigraph preprocessor, to be used only when trigraph processing is desired.
Trigraph sequences
The C preprocessor replaces all occurrences of the following nine trigraph sequences by their single-character equivalents before any other processing.
Trigraph  Equivalent
========  ==========
??=       #
??/       \
??'       ^
??(       [
??)       ]
??!       |
??<       {
??>       }
??-       ~

Note that ??? is not a trigraph sequence.
Note also that the problematic characters are nevertheless required to exist within the implementation, in both the source and execution character sets.
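The replacement step described above can be sketched as a small C routine. This is only an illustration of the mapping in the table, not how any particular compiler implements it; the names trigraph_equivalent and replace_trigraphs are hypothetical, and the caller is assumed to supply an output buffer at least as large as the input.

```c
#include <stddef.h>

/* Hypothetical helper: maps the third character of a candidate trigraph
   to its single-character equivalent, or returns 0 if there is none. */
static char trigraph_equivalent(char third)
{
    switch (third) {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Hypothetical sketch: copies src to dst, replacing each of the nine
   trigraph sequences. Assumes dst has room for strlen(src) + 1 bytes. */
void replace_trigraphs(const char *src, char *dst)
{
    while (*src != '\0') {
        char equivalent = 0;

        /* src[2] is only read when src[1] is a question mark, so the
           read never goes past the terminating null character. */
        if (src[0] == '?' && src[1] == '?')
            equivalent = trigraph_equivalent(src[2]);

        if (equivalent != 0) {
            *dst++ = equivalent;
            src += 3;
        } else {
            /* Not a trigraph (this includes three question marks in a
               row): copy one character and rescan from the next one. */
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

Because the scan advances by only one character when no trigraph matches, three question marks followed by = come out as a single question mark followed by #.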
The ??/ trigraph can be used to introduce an escaped newline for line splicing; this must be taken into account for correct and efficient handling of trigraphs within the preprocessor. It can also cause surprises, particularly within comments. For example:

    // Will the next line be executed????????????????/
    a++;

which is a single logical comment line, and

    /??/
    * A comment *??/
    /

which is a correctly formed block comment.
Example
An example of a C program that uses all the defined trigraphs:

    ??=include <stdio.h>                      /* # */

    int main(void)
    ??<                                       /* { */
        char n??(5??);                        /* [ and ] */
        n??(4??) = '0' - (??-0 ??' 1 ??! 2);  /* ~, ^ and | */
        printf("%c??/n", n??(4??));           /* ??/ = \ */
        return 0;
    ??>                                       /* } */

Disambiguation
A programmer may want to place two question marks together yet not have the compiler treat them as introducing a trigraph. The C grammar does not permit two consecutive ? tokens, so the only places in a C file where two question marks in a row may appear are multi-character constants, string literals, and comments. To safely place two consecutive question marks within a string literal, the programmer can use string concatenation, "...?""?...", or an escape sequence, "...?\?...".
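As a minimal sketch of both techniques, the two functions below each return the text What followed by two question marks and an exclamation mark, without the source ever containing the ??! trigraph. The function names are hypothetical and chosen only for this illustration.

```c
/* Adjacent string literals are joined after trigraph replacement has
   already happened, so the two question marks are never adjacent in
   the source text. */
const char *question_by_concatenation(void)
{
    return "What?" "?!";
}

/* The escape sequence \? denotes an ordinary question mark, which
   breaks up the sequence in the same way. */
const char *question_by_escape(void)
{
    return "What?\?!";
}
```

Both forms produce the same seven-character string whether or not the compiler has trigraph recognition enabled.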
Alternatives
In 1994 a normative amendment to the C standard, included in C99, supplied so-called digraphs as more readable alternatives to trigraphs. They are:
Digraph  Equivalent
=======  ==========
<:       [
:>       ]
<%       {
%>       }
%:       #
%:%:     ##

Unlike trigraphs, digraphs are handled during tokenization. A digraph must always represent a full token by itself; it is not replaced when it occurs inside another token, such as a quoted string or a character constant.
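The difference can be illustrated with a short sketch, assuming a compiler that accepts the digraphs listed above. The function names digraph_sum and digraph_in_string_length are hypothetical. The first function is written with digraphs in place of brackets and braces; the second shows that the same two characters inside a string literal are left alone.

```c
#include <string.h>

/* Sums a three-element array. The tokens <: :> <% %> stand in for
   [ ] { } throughout this function. */
int digraph_sum(void)
<%
    int values<:3:> = <% 1, 2, 3 %>;
    int total = 0;
    int i;

    for (i = 0; i < 3; i++)
    <%
        total += values<:i:>;
    %>
    return total;
%>

/* Inside a string literal the characters < and : are part of the
   string token, so they remain two separate characters. */
size_t digraph_in_string_length(void)
{
    return strlen("<:");
}
```

A trigraph in the same position inside a string literal would have been replaced, because trigraph replacement happens before tokenization.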