Jump to content

Syntax (programming languages): Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
No edit summary
Blanked the page
Line 1: Line 1:
{{Refimprove|date=August 2013}}
[[Image:Python add5 syntax.svg|thumb|right|292px|[[Syntax highlighting]] and [[indent style]] are often used to aid programmers in recognizing elements of source code. Color coded highlighting is used in this piece of code written in [[Python (programming language)|Python]].]]

In [[computer science]], the '''syntax''' of a [[computer language]] is the set of rules that defines the combinations of symbols that are considered to be a correctly structured document or fragment in that language. This applies both to [[programming language]]s, where the document represents [[source code]], and [[markup language]]s, where the document represents data. The syntax of a language defines its surface form.<ref name="eopl">{{cite book|last=Friedman|first=Daniel P.|author2=Mitchell Wand |author3=Christopher T. Haynes |title=Essentials of Programming Languages|publisher=The MIT Press|date=1992|edition=1st|isbn=0-262-06145-7}}</ref> Text-based computer languages are based on sequences of characters, while [[visual programming languages]] are based on the spatial layout and connections between symbols (which may be textual or graphical). Documents that are syntactically invalid are said to have a [[syntax error]].

Syntax – the form – is contrasted with [[semantics (computer science)|semantics]] – the meaning. In processing computer languages, semantic processing generally comes after syntactic processing, but in some cases semantic processing is necessary for complete syntactic analysis, and these are done together or [[concurrency (computer science)|concurrently]]. In a [[compiler]], the syntactic analysis comprises the [[compiler frontend|frontend]], while semantic analysis comprises the [[compiler backend|backend]] (and middle end, if this phase is distinguished).

==Levels of syntax==
Computer language 'syntax' is generally distinguished into three levels:
* Words – the lexical level, determining how characters form tokens;
* Phrases – the granma level, narrowly speaking, determining how tokens form phrases;
* Context – determining what objects or variables names refer to, if types are valid, etc.

Distinguishing in this way yields modularity, allowing each level to be described and processed separately, and often independently. First a lexer turns the linear sequence of characters into a linear sequence of [[Token (parser)|token]]s; this is known as "[[lexical analysis]]" or "lexing". Second the parser turns the linear sequence of tokens into a hierarchical syntax tree; this is known as "[[parsing]]" narrowly speaking. Thirdly the contextual analysis resolves names and checks types. This modularity is sometimes possible, but in many real-world languages an earlier step depends on a later step – for example, [[the lexer hack]] in C is because tokenization depends on context. Even in these cases, syntactical analysis is often seen as approximating this ideal model.

The parsing stage itself can be divided into two parts: the [[parse tree]] or "concrete syntax tree" which is determined by the grammar, but is generally far too detailed for practical use, and the [[abstract syntax tree]] (AST), which simplifies this into a usable form. The AST and contextual analysis steps can be considered a form of semantic analysis, as they are adding meaning and interpretation to the syntax, or alternatively as informal, manual implementations of syntactical rules that would be difficult or awkward to describe or implement formally.

The levels generally correspond to levels in the [[Chomsky hierarchy]]. Words are in a [[regular language]], specified in the [[lexical grammar]], which is a Type-3 grammar, generally given as [[regular expression]]s. Phrases are in a [[context-free language]] (CFL), generally a [[deterministic context-free language]] (DCFL), specified in a [[phrase structure grammar]], which is a Type-2 grammar, generally given as [[Production (computer science)|production rules]] in [[Backus–Naur form]] (BNF). Phrase grammars are often specified in much more constrained grammars than full [[context-free grammar]]s, in order to make them easier to parse; while the [[LR parser]] can parse any DCFL in linear time, the simple [[LALR parser]] and even simpler [[LL parser]] are more efficient, but can only parse grammars whose production rules are constrained. Contextual structure can in principle be described by a [[context-sensitive grammar]], and automatically analyzed by means such as [[attribute grammar]]s, though in general this step is done manually, via [[Name resolution (programming languages)|name resolution]] rules and [[type checking]], and implemented via a [[symbol table]] which stores names and types for each scope.

Tools have been written that automatically generate a lexer from a lexical specification written in regular expressions and a parser from the phrase grammar written in BNF: this allows one to use [[declarative programming]], rather than need to have procedural or functional programming. A notable example is the [[Lex (software)|lex]]-[[yacc]] pair. These automatically produce a ''concrete'' syntax tree; the parser writer must then manually write code describing how this is converted to an ''abstract'' syntax tree. Contextual analysis is also generally implemented manually. Despite the existence of these automatic tools, parsing is often implemented manually, for various reasons – perhaps the phrase structure is not context-free, or an alternative implementation improves performance or error-reporting, or allows the grammar to be changed more easily. Parsers are often written in functional languages, such as Haskell, in scripting languages, such as Python or Perl, or in C or C++.

===Examples of errors===
{{main|Syntax error}}
As an example, <code>(add 1 1)</code> is a syntactically valid Lisp program (assuming the 'add' function exists, else name resolution fails), adding 1 and 1. However, the following are invalid:
(_ 1 1) lexical error: '_' is not valid
(add 1 1 parsing error: missing closing ')'
Note that the lexer is unable to identify the first error – all it knows is that, after producing the token LEFT_PAREN, '(' the remainder of the program is invalid, since no word rule begins with '_'. The second error is detected at the parsing stage: The parser has identified the "list" production rule due to the '(' token (as the only match), and thus can give an error message; in general it may be ambiguous.

Type errors and undeclared variable errors are sometimes considered to be syntax errors when they are detected at compile-time (which is usually the case when compiling strongly-typed languages), though it is common to classify these kinds of error as [[Programming language#Semantics|semantic]] errors instead.<ref>{{cite book|last=Aho|first=Alfred V.|author2=Monica S. Lam|author3=Ravi Sethi|author4=Jeffrey D. Ullman|title=Compilers: Principles, Techniques, and Tools|publisher=Addison Wesley|date=2007|edition=2nd|isbn=0-321-48681-1}}Section 4.1.3: Syntax Error Handling, pp.194&ndash;195.</ref><ref>{{cite book|last=Louden|first=Kenneth C.|title=Compiler Construction: Principles and Practice|publisher=Brooks/Cole|date=1997|isbn=981-243-694-4}} Exercise 1.3, pp.27&ndash;28.</ref><ref name="uninitialized var">[http://www.dummies.com/how-to/content/semantic-errors-in-java.html Semantic Errors in Java]</ref>

As an example, the Python code
'a' + 1
contains a type error because it adds a string literal to an integer literal. Type errors of this kind can be detected at compile-time: They can be detected during parsing (phrase analysis) if the compiler uses separate rules that allow "integerLiteral + integerLiteral" but not "stringLiteral + integerLiteral", though it is more likely that the compiler will use a parsing rule that allows all expressions of the form "LiteralOrIdentifier + LiteralOrIdentifier" and then the error will be detected during contextual analysis (when type checking occurs). In some cases this validation is not done by the compiler, and these errors are only detected at runtime.

In a dynamically typed language, where type can only be determined at runtime, many type errors can only be detected at runtime. For example, the Python code
a + b
is syntactically valid at the phrase level, but the correctness of the types of a and b can only be determined at runtime, as variables do not have types in Python, only values do. Whereas there is disagreement about whether a type error detected by the compiler should be called a syntax error (rather than a [[Programming language#Static semantics|static semantic]] error), type errors which can only be detected at program execution time are always regarded as semantic rather than syntax errors.

== Syntax definition ==

[[Image:Python add5 parse.svg|thumb|right|396px|[[Parse tree]] of Python code with inset tokenization]]

The syntax of textual programming languages is usually defined using a combination of [[regular expression]]s (for [[lexical analysis|lexical]] structure) and [[Backus–Naur form]] (for [[context-free grammar|grammatical]] structure) to inductively specify [[syntactic category|syntactic categories]] (nonterminals) and ''terminal'' symbols. Syntactic categories are defined by rules called ''productions'', which specify the values that belong to a particular syntactic category.<ref name="eopl"/> Terminal symbols are the concrete characters or strings of characters (for example [[Keyword (computer programming)|keyword]]s such as ''define'', ''if'', ''let'', or ''void'') from which syntactically valid programs are constructed.

A language can have different equivalent grammars, such as equivalent regular expressions (at the lexical levels), or different phrase rules which generate the same language. Using a broader category of grammars, such as LR grammars, can allow shorter or simpler grammars compared with more restricted categories, such as LL grammar, which may require longer grammars with more rules. Different but equivalent phrase grammars yield different parse trees, though the underlying language (set of valid documents) is the same.

===Example: Lisp===
Below is a simple grammar, defined using the notation of regular expressions and [[Extended Backus–Naur form]]. It describes the syntax of [[Lisp programming language|Lisp]], which defines productions for the syntactic categories ''expression'', ''atom'', ''number'', ''symbol'', and ''list'':

<source lang="ebnf">
expression = atom | list
atom = number | symbol
number = [+-]?['0'-'9']+
symbol = ['A'-'Z''a'-'z'].*
list = '(', expression*, ')'
</source>

This grammar specifies the following:
* an ''expression'' is either an ''atom'' or a ''list'';
* an ''atom'' is either a ''number'' or a ''symbol'';
* a ''number'' is an unbroken sequence of one or more decimal digits, optionally preceded by a plus or minus sign;
* a ''symbol'' is a letter followed by zero or more of any characters (excluding whitespace); and
* a ''list'' is a matched pair of parentheses, with zero or more ''expressions'' inside it.
Here the decimal digits, upper- and lower-case characters, and parentheses are terminal symbols.

The following are examples of well-formed token sequences in this grammar: '<code>12345</code>', '<code>()</code>', '<code>(a b c232 (1))</code>'

===Complex grammars===
The grammar needed to specify a programming language can be classified by its position in the [[Chomsky hierarchy]]. The phrase grammar of most programming languages can be specified using a Type-2 grammar, i.e., they are [[context-free grammar]]s,<ref>{{cite book|author = [[Michael Sipser]] | year = 1997 | title = Introduction to the Theory of Computation | publisher = PWS Publishing | isbn = 0-534-94728-X}} Section 2.2: Pushdown Automata, pp.101&ndash;114.</ref> though the overall syntax is context-sensitive (due to variable declarations and nested scopes), hence Type-1. However, there are exceptions, and for some languages the phrase grammar is Type-0 (Turing-complete).

In some languages like Perl and Lisp the specification (or implementation) of the language allows constructs that execute during the parsing phase. Furthermore, these languages have constructs that allow the programmer to alter the behavior of the parser. This combination effectively blurs the distinction between parsing and execution, and makes syntax analysis an [[undecidable problem]] in these languages, meaning that the parsing phase may not finish. For example, in Perl it is possible to execute code during parsing using a <code>BEGIN</code> statement, and Perl function prototypes may alter the syntactic interpretation, and possibly even the syntactic validity of the remaining code.<ref>The following discussions give examples:
* [http://www.jeffreykegler.com/Home/perl-and-undecidability Perl and Undecidability]
* [http://lambda-the-ultimate.org/node/3564#comment-50578 LtU comment clarifying that the undecidable problem is membership in the class of Perl programs]
* [http://www.modernperlbooks.com/mt/2009/08/on-parsing-perl-5.html chromatic's example of Perl code that gives a syntax error depending on the value of random variable]<!-- chromatic is a published author; see http://www.oreillynet.com/pub/au/176 -->
</ref> Colloquially this is referred to as "only Perl can parse Perl" (because code must be executed during parsing, and can modify the grammar), or more strongly "even Perl cannot parse Perl" (because it is undecidable). Similarly, Lisp [[macro instruction|macros]] introduced by the <code>defmacro</code> syntax also execute during parsing, meaning that a Lisp compiler must have an entire Lisp run-time system present. In contrast C macros are merely string replacements, and do not require code execution.<ref>{{cite web|url=http://www.apl.jhu.edu/~hall/Lisp-Notes/Macros.html |title=An Introduction to Common Lisp Macros |publisher=Apl.jhu.edu |date=1996-02-08 |accessdate=2013-08-17}}</ref><ref>{{cite web|url=http://cl-cookbook.sourceforge.net/macros.html |title=The Common Lisp Cookbook - Macros and Backquote |publisher=Cl-cookbook.sourceforge.net |date=2007-01-16 |accessdate=2013-08-17}}</ref>

== Syntax versus semantics ==

The syntax of a language describes the form of a valid program, but does not provide any information about the meaning of the program or the results of executing that program. The meaning given to a combination of symbols is handled by semantics (either [[Formal semantics of programming languages|formal]] or hard-coded in a [[Reference implementation (computing)|reference implementation]]). Not all syntactically correct programs are semantically correct. Many syntactically correct programs are nonetheless ill-formed, per the language's rules; and may (depending on the language specification and the soundness of the implementation) result in an error on translation or execution. In some cases, such programs may exhibit [[undefined behavior]]. Even when a program is well-defined within a language, it may still have a meaning that is not intended by the person who wrote it.

Using [[natural language]] as an example, it may not be possible to assign a meaning to a grammatically correct sentence or the sentence may be false:
* "[[Colorless green ideas sleep furiously]]." is grammatically well formed but has no generally accepted meaning.
* "John is a married bachelor." is grammatically well formed but expresses a meaning that cannot be true.

The following C language fragment is syntactically correct, but performs an operation that is not semantically defined (because <tt>p</tt> is a [[null pointer]], the operations <tt>p->real</tt> and <tt>p->im</tt> have no meaning):
<source lang=C>
complex *p = NULL;
complex abs_p = sqrt (p->real * p->real + p->im * p->im);
</source>

As a simpler example,
<source lang=C>
int x;
printf("%d", x);
</source>
is syntactically valid, but not semantically defined, as it uses an [[uninitialized variable]]. Even though compilers for some programming languages (e.g., Java and C#) would detect uninitialized variable errors of this kind, they should be regarded as [[Programming language#Semantics|semantic]] errors rather than syntax errors.<ref name="uninitialized var" /><ref>[https://stackoverflow.com/questions/8803718/issue-of-syntax-or-semantics/8803765#8803765 Issue of syntax or semantics?]</ref>

== See also ==
To quickly compare syntax of various programming languages, take a look at the list of [["Hello, World!" program]] examples:
* [[Prolog syntax and semantics]]
* [[Perl language structure#Basic syntax|Perl syntax]]
* [[PHP syntax and semantics]]
* [[C syntax]]
* [[C++ syntax]]
* [[Java syntax]]
* [[JavaScript syntax]]
* [[Python syntax and semantics]]
* [[Lua syntax]]
* [[Haskell syntax]]

== References ==
{{reflist}}

== External links ==
* Various syntactic constructs used in [http://merd.sourceforge.net/pixel/language-study/syntax-across-languages/ computer programming languages]

[[Category:Programming language syntax| ]]
[[Category:Programming language topics]]
[[Category:Source code]]

Revision as of 05:45, 20 August 2017