
Talk:Earley parser



This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks. This article has been rated as Start-class on Wikipedia's content assessment scale and as Mid-importance on the project's importance scale.

The article says that rules like

   a -> a b

will cause both top-down and bottom-up parsers to go into an infinite loop, but I thought it was only top-down parsers that had a problem with left-recursion. Any comments? — Cadr

Since no-one's commented, I'll go ahead and change the page... — Cadr
It's not exactly true: top-down parsers like LL(1) will enter an infinite loop, but other top-down parsers, such as nondeterministic LL (and, I think, some kind of top-down parser that uses breadth-first search), will eventually finish.
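
To illustrate the point about left recursion, here is a minimal sketch (Python, hypothetical names; not taken from the article) of why a naive recursive-descent parser loops on a left-recursive rule such as A -> A b: the procedure for A re-invokes itself before consuming any input.

    # Hypothetical left-recursive grammar: A -> A "b" | "b".
    def parse_A(tokens, pos):
        """Naive recursive descent for A -> A 'b' | 'b'.
        Returns the position after a successful parse, or None."""
        # Alternative 1: A -> A "b".  The first step is to parse an A at the
        # same position, so the function calls itself without consuming any
        # input; this is the infinite regress (Python raises RecursionError).
        sub = parse_A(tokens, pos)
        if sub is not None and sub < len(tokens) and tokens[sub] == "b":
            return sub + 1
        # Alternative 2: A -> "b"
        if pos < len(tokens) and tokens[pos] == "b":
            return pos + 1
        return None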

Works on all CFGs?

http://pages.cpsc.ucalgary.ca/~aycock/spark/doc.html seems to suggest that "[...] should work with any Backus-Naur form (BNF) grammar, including grammars with empty right-hand sides and ambiguous grammars. There are apparently a few rather obscure cases where Earley's parsing algorithm fails, but they're not likely to crop up in any reasonable application." — Preceding unsigned comment added by 18.111.116.225 (talk) 16:26, 24 April 2012 (UTC)

The Dragon book claims that Earley parsers can handle all CFGs. Two other (googled) references repeat that (http://tratt.net/laurie/tech_articles/articles/parsing_the_solved_problem_that_isnt, http://tom-ridge.blogspot.com/2011/06/simple-functional-sound-and-complete_22.html). I can't find anywhere else that someone claims Earley parsers don't parse all CFGs, so I would tend to remove the discuss tag unless you can cite more substantial evidence. JustinBlank (talk) 16:42, 25 October 2012 (UTC)

NP-complete?

Parsing unrestricted languages is not NP-complete; it is much harder.

It can be easily shown (simulation of Turing machines by grammars gives an immediate reduction of the halting problem to parsing of general languages) that the problem is not computable.

Unless anybody objects, I'll correct it.

--UrsusArctos 22:37, 19 Oct 2004 (UTC)

Be careful with the meaning of the word "parse", though. You could argue that you don't have to be able to decide a language in order to parse it, and consequently even unrestricted languages would be "parseable" in a sense, i.e. the sense that if there was a parse, you would get it eventually. I'm not sure there's any general agreement on whether this would count as "parsing" or not, so I think we should word it carefully. Otherwise yeah, the NP-complete thing is incorrect. Cadr 12:15, 20 Oct 2004 (UTC)


Turning the recognizer into a parser

Everyone describes Earley's algorithm as a parser, but no-one ever bothers to explain how you turn the recognizer into a parser. My understanding is that you modify the completer to record a backpointer, but exactly what item gets the backpointer and what it is to point to is still kind of a mystery to me. I believe that the new item you are creating should receive the backpointer and that it should point to the item you are copying (to advance the dot in). However, this is not sufficient; you also need some mechanism to inherit backpointers, otherwise you end up with a disconnected tree. Can anyone clarify this process?

That would be very useful! Unfortunately, I haven't found any description about that yet... --ThSoft 21:39, 14 September 2006 (UTC)
My understanding is that you generate a certain number of Earley items, which may or may not be part of *a* complete parse. --92.137.198.224 (talk) 11:42, 1 August 2010 (UTC)
It's all laid out in Earley's dissertation (1968). I just added references to it. In the article that was already listed, the part that you're really wanting to see says "for more information, see my dissertation" (basically). I just got it implemented myself in Java. Good luck! Limited Atonement (talk) 19:12, 12 September 2012 (UTC)
Be forewarned that the method described in Earley's paper for reconstructing parse trees from the Earley sets may have problems. As stated by Elizabeth Scott, 2008 [1], "adding multiple pointers to a given instance of a non-terminal can create errors"; that paper, citing an example of Tomita [2], says the grammar "S ::= SS | b" on the input "bbb" will correctly identify derivations of bbb, but will also include erroneous derivations of bbbb and bb. (Scott's paper says it provides an algorithm for correct construction of the parse forest; I am trying to work through this as well as Aycock's various works on this subject now.) Pnkfelix (talk) 11:29, 2 March 2013 (UTC)
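
For reference, the two derivations that a correct parser should report for "bbb" under that grammar are the two possible bracketings of three b's; anything beyond them, such as the bbbb and bb derivations mentioned above, is spurious:

    S ::= S S | b

    (S (S b) (S (S b) (S b)))     i.e. b (b b)
    (S (S (S b) (S b)) (S b))     i.e. (b b) b
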
I just finished getting through Elizabeth Scott's paper and implementing a toy JavaScript parser (with an annotated version), so I added that to the JS implementations section and wrote a section for the main page covering the things that I understand about generating the parse forest. I think I've followed the rules, but it's the first edit I've made on Wikipedia that wasn't just a typo fix, so let me know if I've screwed anything up. JoshGrams (talk) 11:03, 16 September 2015 (UTC)
Actually, you don't need to modify the recognizer at all to reconstruct your parse tree. States are all you need. You need to start from the completed start rule, then search for the rules that may have caused its completion. I call that the "un-complete" step. There are analogous "un-predict" and "un-scan" steps. Which is applicable depends on the state rule you are examining. Also don't forget to backtrack. This should look like a straightforward tree construction recursive algorithm. If your parse is not ambiguous, you will always find one and only one possibility at each step. I have done one such reconstruction manually. I have yet to automate it. 88.180.140.48 (talk) 16:34, 23 October 2013 (UTC)
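
As a concrete illustration of the backpointer bookkeeping raised at the top of this section (a sketch only, with hypothetical names such as Item and add_to_chart; this is neither Earley's nor Scott's exact construction), the completer below records on each advanced item the pair of items that produced it: the item whose dot it advanced and the completed item that triggered the advance. Walking those links back from a completed start item yields a derivation; as the Scott paper cited earlier in this thread explains, this naive pointer merging can produce spurious derivations for ambiguous grammars, which is the problem her SPPF construction addresses.

    class Item:
        """One Earley item: lhs -> before . after, started at position 'origin'."""
        def __init__(self, lhs, before, after, origin):
            self.lhs = lhs
            self.before = tuple(before)   # symbols already recognized (left of the dot)
            self.after = tuple(after)     # symbols still expected (right of the dot)
            self.origin = origin
            self.links = []               # backpointers: (predecessor_item, cause_item) pairs

        def key(self):
            # Duplicate detection deliberately ignores the backpointer links.
            return (self.lhs, self.before, self.after, self.origin)

    def add_to_chart(chart, k, item):
        """Add 'item' to chart[k]; if an equal item is already there, merge the
        backpointer links onto the existing item instead of adding a duplicate."""
        for existing in chart[k]:
            if existing.key() == item.key():
                existing.links.extend(item.links)
                return existing
        chart[k].append(item)
        return item

    def completer(chart, k, completed):
        """When 'completed' (Y -> gamma ., origin j) appears in chart[k], advance
        every item in chart[j] whose dot stands before Y.  Each advanced item
        records the item it was copied from and the completed item that
        triggered the copy; those two links are the 'backpointers'."""
        j = completed.origin
        for waiting in list(chart[j]):
            if waiting.after and waiting.after[0] == completed.lhs:
                advanced = Item(waiting.lhs,
                                waiting.before + (completed.lhs,),
                                waiting.after[1:],
                                waiting.origin)
                advanced.links.append((waiting, completed))
                add_to_chart(chart, k, advanced)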

Error in the algorithm description?

I believe that:

  • Prediction: For every state in S(k) of the form (X → α • Y β, j) (where j is the origin position as above), add (Y → • γ, j+1) to S(k) for every production with Y on the left-hand side.

Should be changed to:

  • Prediction: For every state in S(k) of the form (X → α • Y β, j) (where j is the origin position as above), add (Y → • γ, k) to S(k) for every production with Y on the left-hand side.

This is a change of "j+1" to "k". See the example. —Preceding unsigned comment added by 87.116.85.20 (talk) 17:14, 4 December 2007 (UTC)
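
A minimal sketch of the corrected prediction step (Python, hypothetical names; for simplicity a state here is a plain (lhs, rhs, dot, origin) tuple rather than an object, and grammar maps each nonterminal to its list of productions) makes the point explicit: the new states begin at the current position, so their origin is k, independent of the triggering state's origin j.

    def predict(chart, k, state, grammar):
        """Prediction: if the symbol after the dot in 'state' is a nonterminal Y,
        add (Y -> . gamma, k) to chart[k] for every production Y -> gamma.
        The origin of each new state is k, the index of the current set,
        regardless of the origin j carried by the triggering state."""
        lhs, rhs, dot, origin = state                 # state = (X, (a, Y, b), dot, j)
        if dot < len(rhs) and rhs[dot] in grammar:    # nonterminal after the dot
            Y = rhs[dot]
            for production in grammar[Y]:
                new_state = (Y, tuple(production), 0, k)   # origin = k, not j + 1
                if new_state not in chart[k]:
                    chart[k].append(new_state)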

Unused symbol

Z is defined to represent a single nonterminal. However, it never gets used anywhere for the rest of the page. I believe that its definition should be removed. If nobody objects, I'll correct it. Opaldraggy (talk) 04:09, 14 December 2008 (UTC)

Added Parse::Marpa

I've added Parse::Marpa under External Links. It's a new implementation of a parser based on Earley's algorithm, with the improvements suggested by Aycock and Horspool. I'm the author of this Perl module, but I believe this edit is in compliance with WP:COS. --Jeffreykegler (talk) 01:58, 11 January 2009 (UTC)


I'd like to remove the subsections from the external links section, so it would look like the example below; any objections? This would make it more concise and avoid repeating information.


  • 'early' An Earley parser C library.
  • 'C Earley Parser' An Earley parser in C.
  • PEN A Java library that implements the Earley algorithm.
  • Pep A Java library that implements the Earley algorithm and provides charts and parse trees as parsing artifacts.
  • [1] A Java implementation of an Earley parser.
  • Marpa and Marpa::XS, Perl modules, incorporating improvements made to the Earley algorithm by Joop Leo, and by Aycock and Horspool.
  • Parse::Earley A Perl module that implements Jay Earley's original algorithm.
  • Charty A Python implementation of an Earley parser.
  • NLTK A Python toolkit that has an Earley parser.
  • Spark An object-oriented "little language framework" for Python that implements an Earley parser.
  • CL-EARLEY-PARSER A Common Lisp library that implements an Earley parser.
  • Charty-Racket A Scheme / Racket implementation of an Earley parser.

Architectual (talk) 12:52, 27 October 2011 (UTC)

In the current organization you can see at a glance which language each is in. Unfortunately, that's often even more important than the quality of the implementation. Most users won't or can't consider an implementation that's not on their (often very short) list of preferred languages. --Jeffreykegler (talk) 15:16, 27 October 2011 (UTC)

Example doesn't agree with stated algorithm

In the algorithm pseudocode, the line "for each state in chart[i]" is perhaps misleading. In the example, in S(0), say, rule 4 is generated from rule 3. This either means that chart[i] is looped over multiple times (?) or that it somehow self-generates more items. This should be made clear in the pseudocode. Limited Atonement (talk) 13:56, 10 September 2012 (UTC)
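
One way to make the pseudocode unambiguous (a sketch using the same hypothetical tuple-state representation as the predictor sketch above, not the article's exact wording) is to iterate over chart[i] by index, so that states appended by the predictor and completer during the pass are themselves processed before the pass ends:

    def process_state_set(chart, i, grammar, tokens):
        """Treat chart[i] as a growing work list.  States appended by the
        predictor and completer while we iterate are picked up later in the
        same pass, which is how a state in S(0) can give rise to further
        states in S(0).  complete, predict and scan are hypothetical helpers
        for the three Earley operations (predict is sketched in the section
        above)."""
        n = 0
        while n < len(chart[i]):               # chart[i] may grow during the loop
            state = chart[i][n]
            lhs, rhs, dot, origin = state
            if dot == len(rhs):                # dot at the end: completed state
                complete(chart, i, state)
            elif rhs[dot] in grammar:          # nonterminal after the dot
                predict(chart, i, state, grammar)
            else:                              # terminal after the dot
                scan(chart, i, state, tokens)
            n += 1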

Ambiguity Concerning "Input Position"

The words "Input Position" show up on the page twice (as of now), and the term means a different thing each time. The first occurrence is explained ("input position (which represents a position between tokens)"), but the second one is not ("input position k is called S(k)"); it must refer to the input position between grammatical terminals or nonterminals. I may be incorrect, but I've been working with this for a while now, and believe it to be true. Limited Atonement (talk) 21:13, 10 September 2012 (UTC)

The second and first uses of the term are consistent, and the second is correct, though it took me some thought to see that, and at this point I've more than a little experience with these matters. However, the original definition of "input position" is, strictly speaking, wrong: in an input of length n, positions 0 and n are not "between" tokens. I hope my new language is better.
Position in an Earley parse turns out to be a very deep concept, and it plays a major role in the Earley-derived Marpa parser.--Jeffreykegler (talk) 00:41, 11 September 2012 (UTC)
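
For concreteness, taking the article's example input "2 + 3 * 4" (five tokens, so n = 5), the input positions run from 0 through 5; positions 1 to 4 fall between adjacent tokens, while 0 and 5 sit before the first token and after the last:

    tokens:        2   +   3   *   4
    positions:   0   1   2   3   4   5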

Incomplete Example?

The algorithm states "Prediction: For every state in S(k) of the form (X -> a . Y b, j) ..., add [all productions with Y on the left]." However, in the example, the state in S(k) (M -> . M * T, 0) does not expand as expected (it is nowhere expanded!). Limited Atonement (talk) 15:11, 11 September 2012 (UTC)

I've just made some changes to the description of the algorithm, cleaning up terminology and clarifying. The pseudocode does not mention two important points: first, that duplicate states are never added to a state set; second, that the loop over the prediction/scanning/completion operations must be performed as long as new states are being added to the state set.
Regarding the loop: I think it might be worth noting this. For instance, I thought that you could not do it if you had a left-recursive grammar. However, I realized that this would not happen if duplicate states are ignored. This part is not obvious and is sometimes omitted in books.
The second point is often missing in both descriptions and implementations, in part because it is not in Earley's original paper. This is a well-known bug with Earley's original description of his algorithm. Newer variants of Earley's, such as Marpa and the Aycock-Horspool algorithm, have found ways to make one pass through the queue sufficient, but these are complex and adding the loop is the standard way to restate the classic Earley's algorithm in bug-free form.--Jeffreykegler (talk) 18:26, 13 September 2012 (UTC)
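
A minimal sketch of the duplicate check (hypothetical names; states are hashable tuples as in the sketches above): keeping a set of the states already in S(k) is what makes the "repeat until nothing new is added" loop terminate, including for left-recursive grammars, because a repeated prediction is simply ignored.

    def add_state(chart, seen, k, state):
        """Append 'state' to chart[k] only if it is not already there.
        seen[k] is a set mirroring chart[k]; without this check a
        left-recursive rule keeps re-predicting the same state and the
        pass over S(k) never finishes."""
        if state not in seen[k]:
            seen[k].add(state)
            chart[k].append(state)
            return True          # something new was added; keep looping
        return False

The boolean result can drive the outer loop over the three operations: stop once a full pass adds nothing new.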

Jurafsky & Martin description does not fit the generic case

The scanner phase of the algorithm is tailored for the output of a POS tagger. Furthermore, it apparently describes the case of several (possible) tags per word. This is not the original Earley algorithm, I believe. It is an extension. I suggest fixing the scanner phase. — Preceding unsigned comment added by Srchvrs (talkcontribs) 20:03, 11 December 2012 (UTC)
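
For comparison, here is a minimal sketch of the scanner in the plain (untagged) formulation, where the terminal after the dot is compared directly with the input token; a POS-tagged variant would instead test the terminal against the set of tags proposed for that token. The function name and the (lhs, rhs, dot, origin) state tuples follow the same hypothetical convention as the sketches above, and chart is assumed to hold len(tokens)+1 state lists.

    def scan(chart, k, state, tokens):
        """Scanning in the plain Earley formulation: if the symbol after the
        dot is a terminal equal to the k-th input token, copy the state into
        S(k+1) with the dot advanced; the origin is unchanged."""
        lhs, rhs, dot, origin = state
        if k < len(tokens) and dot < len(rhs) and rhs[dot] == tokens[k]:
            advanced = (lhs, rhs, dot + 1, origin)
            if advanced not in chart[k + 1]:
                chart[k + 1].append(advanced)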

Citations

  1. ^ Elizabeth Scott. 2008. SPPF-Style Parsing From Earley Recognisers. Electronic Notes in Theoretical Computer Science 203(2) (April 2008), 53–67. doi:10.1016/j.entcs.2008.03.044
  2. ^ Masaru Tomita. 1985. Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems. Kluwer Academic Publishers, Norwell, MA, USA.