Jump to content

Maximal munch: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Adding short description: "Longest-match principle in parsing"
 
(19 intermediate revisions by 15 users not shown)
Line 1: Line 1:
{{Short description|Longest-match principle in parsing}}
In [[computer programming]] and [[computer science]], "'''maximal munch'''" or "'''longest match'''" is the principle that when creating some construct, as much of the available input as possible should be consumed.
In [[computer programming]] and [[computer science]], "'''maximal munch'''" or "'''longest match'''" is the principle that when creating some construct, as much of the available input as possible should be consumed.

The earliest known use of this term is by R.G.G. Cattell in his PhD thesis<ref>Cattell, R. G. G. “Formalization and Automatic Derivation of Code Generators”. PhD thesis, 1978. Carnegie Mellon University, Pittsburgh, Pennsylvania, USA</ref> on automatic derivation of [[Code generation (compiler)|code generators]] for [[compilers]].


==Application==
==Application==
Line 5: Line 8:
For instance, the [[lexical grammar|lexical syntax]] of many [[programming languages]] requires that [[Token (parser)|tokens]] be built from the maximum possible number of characters from the input stream. This is done to resolve the problem of inherent ambiguity in commonly used [[regular expression]]s such as <code>[a-z]+</code> (one or more lower-case letters).<ref>Aho ''et al.'', 168.</ref>
For instance, the [[lexical grammar|lexical syntax]] of many [[programming languages]] requires that [[Token (parser)|tokens]] be built from the maximum possible number of characters from the input stream. This is done to resolve the problem of inherent ambiguity in commonly used [[regular expression]]s such as <code>[a-z]+</code> (one or more lower-case letters).<ref>Aho ''et al.'', 168.</ref>


The term is also used in [[compilers]] in the [[instruction selection]] stage to describe a method of "tiling" — determining how a structured tree representing a program in an [[intermediate language]] should be converted into linear [[machine code]]. An entire subtree might be converted into just one machine instruction, and the problem is how to split the tree into non-overlapping "tiles," each representing one machine instruction. An effective strategy is simply to make a tile of the largest subtree possible at any given point, which is called "maximal munch."<ref>Page, 470.</ref>
The term is also used in [[compilers]] in the [[instruction selection]] stage to describe a method of "tiling" — determining how a structured tree representing a program in an [[intermediate language]] should be converted into linear [[machine code]]. An entire subtree might be converted into just one machine instruction, and the problem is how to split the tree into non-overlapping "tiles", each representing one machine instruction. An effective strategy is simply to make a tile of the largest subtree possible at any given point, which is called "maximal munch".<ref>Page, 470.</ref>


==Drawbacks==
==Drawbacks==


In some situations, "maximal munch" leads to undesirable or unintuitive outcomes. For instance, in the [[C]] programming language, the statement <code>x=y/*z;</code> (without any whitespace) will probably lead to a syntax error, since the <code>/*</code> character sequence initiates a (unintended) comment that is either unterminated or terminated by the end token <code>*/</code> of some later, unrelated ''actual'' comment (comments in C do not nest). What was actually meant in the statement was to assign to the variable <code>x</code> the result of dividing the value in <code>y</code> by the value obtained by dereferencing [[pointer]] <code>z</code>; this would be perfectly valid (though not very common) code. It cannot be stated without making use of whitespace, in a language whose syntax is supposedly whitespace-agnostic.
In some situations, "maximal munch" leads to undesirable or unintuitive outcomes. For instance, in the [[C (programming language)|C]] programming language, the statement <code>x=y/*z;</code> (without any whitespace) will probably lead to a syntax error since the <code>/*</code> character sequence (unintentionally) initiates a comment that is either unterminated or terminated by the end token <code>*/</code> of some later, unrelated ''actual'' comment (comments in C do not nest). What was actually meant in the statement was to assign to the variable <code>x</code> the result of dividing the value in <code>y</code> by the value obtained by dereferencing [[Pointer (computer programming)|pointer]] <code>z</code>; this would be valid code. It can be stated by making use of whitespace or using <code>x=y/(*z);</code>.


Another example, in [[C++]], uses the "angle bracket" characters <code>&lt;</code> and <code>&gt;</code> in the syntax for [[Template (programming)|template specialization]], but two consecutive <code>&gt;</code> characters are interpreted as the [[bit shift|right-shift]] operator <code>&gt;&gt;</code>.<ref>Vandevoorde.</ref> Prior to C++11, the following code would produce a parse error, because the right-shift operator token is encountered instead of two right-angle-bracket tokens:
Another example, in [[C++]], uses the "angle bracket" characters <code>&lt;</code> and <code>&gt;</code> in the syntax for [[Template (programming)|template specialization]], but two consecutive <code>&gt;</code> characters are interpreted as the [[bit shift|right-shift]] operator <code>&gt;&gt;</code>.<ref>Vandevoorde.</ref> Prior to C++11, the following code would produce a parse error, because the right-shift operator token is encountered instead of two right-angle-bracket tokens:


<source lang="Cpp">
<syntaxhighlight lang="Cpp">
std::vector<std::vector<int>> incorrect; //Incorrect in C++03, but correct in C++11.
std::vector<std::vector<int>> my_mat_11; //Incorrect in C++03, correct in C++11.
std::vector<std::vector<int> > correct; //Correct in both C++03 and C++11.
std::vector<std::vector<int> > my_mat_03; //Correct in either C++03 or C++11.
</syntaxhighlight>
</source>


The [[C++11]] standard adopted in August 2011 [[C++11#Right angle bracket|amended the grammar]] so that a right-shift token is accepted as synonymous with a pair of right-angle brackets (as in [[Java (programming language)|Java]]), which complicates the grammar but allows the continued use of the maximal munch principle.
The [[C++11]] standard adopted in August 2011 [[C++11#Right angle bracket|amended the grammar]] so that a right-shift token is accepted as synonymous with a pair of right angle brackets (as in [[Java (programming language)|Java]]), which complicates the grammar but allows the continued use of the maximal munch principle. An exception to the maximal munch rule had to be added anyway to deal with the sequence <code><::</code> which can appear in templates. In that case, unless the sequence is followed by <code>:</code> or <code>></code> the character <code><</code> is interpreted as its own token instead of part of the token <code>[[Digraph (computing)#C.2B.2B|<:]]</code>.


==Alternatives==
==Alternatives==


Programming languages researchers have also responded by replacing or supplementing the principle of maximal munch with other lexical disambiguation tactics. One approach is to utilize "follow restrictions," which instead of directly taking the longest match will put some restrictions on what characters can ''follow'' a valid match. For example, stipulating that strings matching <code>[a-z]+</code> cannot be followed by an alphabetic character achieves the same effect as maximal munch with that regular expression.<ref>Van den Brand ''et al.'', 26.</ref> Another approach is to keep the principle of maximal munch but make it subordinate to some other principle, such as context (''e.g.'', the right-shift token in Java would not be matched in the context of a [[Generics in Java|generics]] expression, where it is syntactically invalid).<ref>Van Wyk ''et al.'', 63.</ref>
Programming languages researchers have also responded by replacing or supplementing the principle of maximal munch with other lexical disambiguation tactics. One approach is to utilize "follow restrictions", which instead of directly taking the longest match will put some restrictions on what characters can ''follow'' a valid match. For example, stipulating that strings matching <code>[a-z]+</code> cannot be followed by an alphabetic character achieves the same effect as maximal munch with that regular expression.<ref>Van den Brand ''et al.'', 26.</ref> (In the context of regular expressions, the maximal munch principle is referred to as ''greediness'' and contrasted with ''[[Regular expression#Lazy matching|laziness]]''.) Another approach is to keep the principle of maximal munch but make it subordinate to some other principle, such as context (''e.g.'', the right-shift token in Java would not be matched in the context of a [[Generics in Java|generics]] expression, where it is syntactically invalid).<ref>Van Wyk ''et al.'', 63.</ref>


==References==
==References==
Line 31: Line 34:


* {{Cite book
* {{Cite book
|last=Aho
|last1=Aho
|first=Alfred V.
|first1=Alfred V.
|last2=Lam
|last2=Lam
|first2=Monica S.
|first2=Monica S.
Line 39: Line 42:
|last4=Ullman
|last4=Ullman
|first4=Jeffrey D.
|first4=Jeffrey D.
|title=[[Compilers: Principles, Techniques, and Tools#Second edition|Compilers: Principles, Techniques & Tools]]
|title=Compilers: Principles, Techniques & Tools
|edition=2nd
|edition=2nd
|place=Boston
|place=Boston
|publisher=Addison-Wesley
|publisher=Addison-Wesley
|year=2007
|year=2007
|isbn=978-0-321-48681-3}}
|isbn=978-0-321-48681-3|title-link=Compilers: Principles, Techniques, and Tools#Second edition
}}
* {{Cite book
* {{Cite book
|last=Page
|last=Page
|first=Daniel
|first=Daniel
|title=Practical Introduction to Computer Architecture
|title=Practical Introduction to Computer Architecture
|pages=451–493
|publisher=Springer
|publisher=Springer
|location=London
|location=London
|year=2009
|year=2009
|isbn=978-1-84882-255-9
|isbn=978-1-84882-255-9
|doi=10.1007/978-1-84882-256-6_11
|url=http://www.springerlink.com/content/j10237532887k344/}}
|chapter=Compilers
* {{Cite journal
|series=Texts in Computer Science
|last=Van den Brand
}}
|first=Mark G.J.
* {{Cite book
|last1=Van den Brand
|first1=Mark G.J.
|last2=Scheerder
|last2=Scheerder
|first2=Jeroen
|first2=Jeroen
Line 63: Line 71:
|last4=Visser
|last4=Visser
|first4=Eelco
|first4=Eelco
|title=Disambiguation Filters for Scannerless Generalized LR Parsers
|title=Compiler Construction
|chapter=Disambiguation Filters for Scannerless Generalized LR Parsers
|journal=Lecture Notes in Computer Science
|series=Lecture Notes in Computer Science
|volume=2304/2002
|volume=2304/2002
|pages=21–44
|pages=21–44
|publisher=Springer
|publisher=Springer
|location=Berlin/Heidelberg
|location=Berlin/Heidelberg
|year=2002
|year=2002
|url=http://www.springerlink.com/content/03359k0cerupftfh/
|issn=0302-9743
|issn=0302-9743
|doi=10.1007/3-540-45937-5_12}}
|doi=10.1007/3-540-45937-5_12|isbn=978-3-540-43369-9
}}
* {{Cite web
* {{Cite web
|last=Vandevoorde
|last=Vandevoorde
Line 80: Line 89:
|accessdate=31 March 2010
|accessdate=31 March 2010
|url=http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1757.html}}
|url=http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1757.html}}
* {{Cite journal
* {{Cite book
|last=Van Wyk
|last1=Van Wyk
|first=Eric
|first1=Eric
|last2=Schwerdfeger
|last2=Schwerdfeger
|first2=August
|first2=August
|title=Proceedings of the 6th international conference on Generative programming and component engineering
|title=Context-Aware Scanning for Parsing Extensible Languages
|chapter=Context-aware scanning for parsing extensible languages
|journal=GPCE '07: Proceedings of the 6th international conference on Generative programming and component engineering
|year=2007
|year=2007
|pages=63–72
|pages=63–72
|publisher=ACM
|publisher=ACM
|location=New York
|location=New York
|url=http://portal.acm.org/citation.cfm?id=1289983&dl=GUIDE,
|doi=10.1145/1289971.1289983|isbn=9781595938558
|s2cid=9145863
|doi=10.1145/1289971.1289983}}
}}


[[Category:Compilers]]
[[Category:Compilers]]

Latest revision as of 08:06, 23 April 2024

In computer programming and computer science, "maximal munch" or "longest match" is the principle that when creating some construct, as much of the available input as possible should be consumed.

The earliest known use of this term is by R.G.G. Cattell in his PhD thesis[1] on automatic derivation of code generators for compilers.

Application

[edit]

For instance, the lexical syntax of many programming languages requires that tokens be built from the maximum possible number of characters from the input stream. This is done to resolve the problem of inherent ambiguity in commonly used regular expressions such as [a-z]+ (one or more lower-case letters).[2]

The term is also used in compilers in the instruction selection stage to describe a method of "tiling" — determining how a structured tree representing a program in an intermediate language should be converted into linear machine code. An entire subtree might be converted into just one machine instruction, and the problem is how to split the tree into non-overlapping "tiles", each representing one machine instruction. An effective strategy is simply to make a tile of the largest subtree possible at any given point, which is called "maximal munch".[3]

Drawbacks

[edit]

In some situations, "maximal munch" leads to undesirable or unintuitive outcomes. For instance, in the C programming language, the statement x=y/*z; (without any whitespace) will probably lead to a syntax error since the /* character sequence (unintentionally) initiates a comment that is either unterminated or terminated by the end token */ of some later, unrelated actual comment (comments in C do not nest). What was actually meant in the statement was to assign to the variable x the result of dividing the value in y by the value obtained by dereferencing pointer z; this would be valid code. It can be stated by making use of whitespace or using x=y/(*z);.

Another example, in C++, uses the "angle bracket" characters < and > in the syntax for template specialization, but two consecutive > characters are interpreted as the right-shift operator >>.[4] Prior to C++11, the following code would produce a parse error, because the right-shift operator token is encountered instead of two right-angle-bracket tokens:

    std::vector<std::vector<int>> my_mat_11; //Incorrect in C++03, correct in C++11.
    std::vector<std::vector<int> > my_mat_03; //Correct in either C++03 or C++11.

The C++11 standard adopted in August 2011 amended the grammar so that a right-shift token is accepted as synonymous with a pair of right angle brackets (as in Java), which complicates the grammar but allows the continued use of the maximal munch principle. An exception to the maximal munch rule had to be added anyway to deal with the sequence <:: which can appear in templates. In that case, unless the sequence is followed by : or > the character < is interpreted as its own token instead of part of the token <:.

Alternatives

[edit]

Programming languages researchers have also responded by replacing or supplementing the principle of maximal munch with other lexical disambiguation tactics. One approach is to utilize "follow restrictions", which instead of directly taking the longest match will put some restrictions on what characters can follow a valid match. For example, stipulating that strings matching [a-z]+ cannot be followed by an alphabetic character achieves the same effect as maximal munch with that regular expression.[5] (In the context of regular expressions, the maximal munch principle is referred to as greediness and contrasted with laziness.) Another approach is to keep the principle of maximal munch but make it subordinate to some other principle, such as context (e.g., the right-shift token in Java would not be matched in the context of a generics expression, where it is syntactically invalid).[6]

References

[edit]
  1. ^ Cattell, R. G. G. “Formalization and Automatic Derivation of Code Generators”. PhD thesis, 1978. Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
  2. ^ Aho et al., 168.
  3. ^ Page, 470.
  4. ^ Vandevoorde.
  5. ^ Van den Brand et al., 26.
  6. ^ Van Wyk et al., 63.

Bibliography

[edit]
  • Aho, Alfred V.; Lam, Monica S.; Sethi, Ravi; Ullman, Jeffrey D. (2007). Compilers: Principles, Techniques & Tools (2nd ed.). Boston: Addison-Wesley. ISBN 978-0-321-48681-3.
  • Page, Daniel (2009). "Compilers". Practical Introduction to Computer Architecture. Texts in Computer Science. London: Springer. pp. 451–493. doi:10.1007/978-1-84882-256-6_11. ISBN 978-1-84882-255-9.
  • Van den Brand, Mark G.J.; Scheerder, Jeroen; Vinju, Jurgen J.; Visser, Eelco (2002). "Disambiguation Filters for Scannerless Generalized LR Parsers". Compiler Construction. Lecture Notes in Computer Science. Vol. 2304/2002. Berlin/Heidelberg: Springer. pp. 21–44. doi:10.1007/3-540-45937-5_12. ISBN 978-3-540-43369-9. ISSN 0302-9743.
  • Vandevoorde, Daveed (14 January 2005). "Right Angle Brackets". Retrieved 31 March 2010.
  • Van Wyk, Eric; Schwerdfeger, August (2007). "Context-aware scanning for parsing extensible languages". Proceedings of the 6th international conference on Generative programming and component engineering. New York: ACM. pp. 63–72. doi:10.1145/1289971.1289983. ISBN 9781595938558. S2CID 9145863.