LR剖析器
LR 剖析器是一種由下而上(bottom-up)的上下文無關語法剖析器。LR 意指由左(Left)至右處理輸入字串,並以最右邊優先衍生(Right derivation)的推導順序(相對於 LL 剖析器)建構語法樹。能以此方式剖析的語法稱為 LR 語法。而在 LR(k) 這樣的名稱中,k 代表的是剖析時所需前瞻符號(lookahead symbol)的數量,也就是除了目前處理到的輸入符號之外,還得再向右參照幾個符號之意;省略 (k) 時即視為 LR(1),而非 LR(0)。
由於LR 剖析器嘗試由剖析樹的葉節點開始,向上一層層透過文法規則的化簡,最後推導回到樹的根部(起始符號),所以它是一種由下而上的剖析方法。許多程式語言使用 LR(1) 描述文法,因此許多編譯器都使用 LR 剖析器分析原始碼的文法結構。LR 剖析的優點如下:
- 眾多的程式語言都可以用某種 LR 剖析器(或其變形)分析文法。(C++是個著名的例外)
- LR 剖析器可以很有效率的建置。
- 對所有「由左而右」掃描原始碼的剖析器而言,LR 剖析器可以在最短的時間內偵測到文法錯誤(這是指文法無法描述的字串)。
然而 LR 剖析器很難以人工的方式設計,一般使用「剖析產生器(parser generator)」或「編譯器的編譯器(compiler-compiler, 產生編譯器的工具)」來建構它。 LR 剖析器可根據剖析表(parsing table)的建構方式,分類為「簡單 LR 剖析器(SLR, Simple LR parser)」、「前瞻 LR 剖析器(LALR, Look-ahead LR parser)」以及「正統 LR 剖析器 (Canonical LR parser)」。這些解析器都可以處理大量的文法規則,其中 LALR 剖析器較 SLR 剖析器強大,而正統 LR 剖析器又比 LALR 剖析器能處理更多的文法。著名的 Yacc 即是用來產生 LALR 剖析器的工具。
LR 剖析器的結構
以表格為主(table-based)由下而上的剖析器可用圖一描述其結構,LR 剖析器是一種不確定自動推倒機(有限狀態自動機的一類),它包含:
- 一個輸入緩衝區,輸入的原始碼儲存於此,剖析將由第一個符號開始依序向後掃描。
- 一座堆疊,儲存過去的狀態與化簡中的符號。
- 一張狀態轉移表(goto table),決定狀態的移轉規則。
- 一張動作表(action table),決定目前的狀態碰到輸入符號時應採取的文法規則,輸入符號指的是終端符號(Terminals)與非終端符號(Non-terminals)。
剖析演算法
LR 剖析過程如下:
- 將結尾字元 $ 與起始狀態 0 依序壓入空堆疊,之後的狀態與符號會被壓入堆疊的頂端。
- 根據目前的狀態以及輸入的終端符號,到動作表中找到對應動作:
- 移位(shift) sn:
- 將目前的終端符號由輸入緩衝區中移出並壓入堆疊
- 再將狀態 n 壓入堆疊並成為最新的狀態
- 化簡(reduce) rm:
- 考慮第 m 條文法規則,假設該文法的右邊(right-hand side)有 X 個符號,則將 2X 個元素從堆疊中彈出
- 此時過去的某個狀態會回到堆疊頂端
- 在狀態轉移表中查找此狀態遇到文法左邊(left-hand side)的符號時的狀態轉移
- 將文法左手邊的符號壓入堆疊
- 將查找到的新狀態壓入堆疊
- 接受,輸入字串解析完成。
- 無對應動作,此情形即為文法錯誤。
- 移位(shift) sn:
- 重複步驟二直到輸入的字串被接受或偵測到文法錯誤。
範例
考慮以下文法:
- (1) E → E * B
- (2) E → E + B
- (3) E → B
- (4) B → 0
- (5) B → 1
待剖析的輸入字串是:
- 1 + 1
動作表與狀態轉移表
LR(0) 剖析器使用的表格如下:
動作 | 狀態轉移 | |||||||
狀態 | * | + | 0 | 1 | $ | E | B | |
0 | s1 | s2 | 3 | 4 | ||||
1 | r4 | r4 | r4 | r4 | r4 | |||
2 | r5 | r5 | r5 | r5 | r5 | |||
3 | s5 | s6 | acc | |||||
4 | r3 | r3 | r3 | r3 | r3 | |||
5 | s1 | s2 | 7 | |||||
6 | s1 | s2 | 8 | |||||
7 | r1 | r1 | r1 | r1 | r1 | |||
8 | r2 | r2 | r2 | r2 | r2 |
動作表 用以表示目前狀態遇到終端符號(包含結尾字元 $)的對應動作,欄位中可能有三種動作:
- 移位,記為 'sn',表示下個狀態是 n
- 化簡,記為 'rm',表示使用第 m 條文法規則化簡堆疊中的內容。
- 接受,記為 'acc',表示剖析正確的完成,輸入的字串被文法所定義的語言接受.
狀態轉移表 用以表示簡化後的狀態遇到非終端符號時的轉移規則。
剖析過程
下表是剖析過程中的各步驟,堆疊的頂端在最右邊,狀態的轉移與堆疊的化簡都以上表為依據,而特殊字元 '$' 也被加到輸入串的尾端表示結尾。
目前的狀態 | 堆疊 | 輸入 | 將採取的動作 |
---|---|---|---|
0 | $ 0 | 1+1$ | Shift 2 |
2 | $ 0 '1' 2 | +1$ | Reduce 5 |
4 | $ 0 B 4 | +1$ | Reduce 3 |
3 | $ 0 E 3 | +1$ | Shift 6 |
6 | $ 0 E 3 + 6 | 1$ | Shift 2 |
2 | $ 0 E 3 + 6 '1' 2 | $ | Reduce 5 |
8 | $ 0 E 3 + B 8 | $ | Reduce 2 |
3 | $ 0 E 3 | $ | Accept |
範例說明
剖析起始時堆疊會包含元素 $ 與 0:
- [$ 0]
剖析器首先從輸入緩衝區看到符號 '1',根據動作表當狀態 0 碰到終端符號 '1' 時採用移位動作 s2,即是將 '1' 從輸入緩衝區中移出並推入堆疊,再將新的狀態 2 也推入堆疊,這時堆疊會變成:
- [$ 0 '1' 2]
(為避免終端符號與狀態混淆,故堆疊中的終端符號都加上單引號區別)
接著看到的終端符號是 '+',根據動作表無論狀態 2 碰到任何終端符號,都執行 r5 動作(以第五條文法規則 B → 1 化簡堆疊內容)。此化簡的動作表示剖析器已經在堆疊中認出第五條文法規則的右手邊部分,因此可以用該規則的左手邊符號 B 取代。因為第五條文法的右邊有一個符號,因此我們將兩個元素(1 × 2 = 2)自堆疊彈出,此時會回到狀態 0,再推入符號 B,並查找轉移表中狀態 0 遇到非終端符號 B 後的新狀態。新的狀態是 4,完成此步驟後的堆疊是:
- [$ 0 B 4]
由於上一個終端符號 '+' 尚未被處理,因此仍保留在輸入緩衝區中。依據動作表,在狀態 4 碰到 '+' 時做 r3 化簡。根據第三條文法 E → B,我們將 4 與 B 從堆疊彈出,回到狀態 0。接著壓入 E ,根據狀態轉移表,當狀態 0 遇到非終端符號 E 時需轉移至狀態 3 ,因此將 3 壓入堆疊:
- [$ 0 E 3]
繼續尚未處理的符號 '+',當狀態 3 遇到 '+' 時的對應動作是 s6,將 '+' 從輸入中移出並壓入堆疊,再將新的狀態 6 也壓入堆疊:
- [$ 0 E 3 '+' 6]
下一個符號是 '1',在狀態 6 看到 '1' 時的動作是 s2,將 '1' 從輸入中移出並壓入堆疊,再將新的狀態 2 也壓入堆疊:
- [$ 0 E 3 '+' 6 '1' 2]
最後看到的輸入符號是 $,狀態 2 遇到 $ 時的動作是 r5,以第五條文法規則化簡堆疊內容。此化簡動作與第二步驟相似,堆疊彈出兩個元素後回到狀態 6,這時再壓入符號 B 後會進入狀態 8(根據狀態轉移表),因此也將 8 壓入堆疊:
- [$ 0 E 3 '+' 6 B 8]
在狀態 8 看到符號 $ 時剖析器會繼續化簡,根據動作表執行 r2 化簡動作,採用第二條文法規則 E → E + B 簡化堆疊。由於該規則的右手邊有三個符號,故從堆疊中彈出六個元素。這時回到狀態 0,將規則左邊的符號 E 推入堆疊後,進入新狀態 3(根據狀態轉移表),將之壓入後堆疊為:
- [$ 0 E 3]
最後在狀態 3 看到符號 $,對應的動作是 acc,表示剖析順利完成。
建構 LR(0) 剖析表
LR(0) 項目(Items)
建構剖析表的過程須使用 LR(0) 項目(以下簡稱為「項目」),這些項目是在文法規則的右手邊插入一個特殊的符號「‧」所產生。例如文法 E → E + B 有下列四個對應的項目:
- E → ‧ E + B
- E → E ‧ + B
- E → E + ‧ B
- E → E + B ‧
若文法規則的形式是 A → ε ,則對應的唯一項目是:
- A → ‧
建立項目的用意是要決定剖析器的狀態,例如 E → E ‧ + B 的意義是「剖析器已經在輸入的符號中認出 E 的部分,目前正等著看到一個 '+' 符號與接續的 B 的部份」。
項目集合
在一般的情形中,剖析器不能預知未來要用哪一條文法規則來化簡堆疊內容,因此很難以單一個項目決定狀態。例如以下文法:
- E → E + B
- E → E * B
當剖析器認出堆疊中的 E 部分時,它無法預測未來會繼續看到 '+' 或 '*',因此這時的狀態須以兩個項目表示:
- E → E ‧ + B
- E → E ‧ * B
故我們使用項目的集合 { E → E ‧ + B, E → E ‧ * B } 來表示「剖析器認出 E 並期待 + B 或 * B」的狀態。
元素集合的封閉性
An item with a dot in front of a nonterminal, such as E → E + ‧ B, indicates that the parser expects to parse the nonterminal B next. To ensure the item set contains all possible rules the parser may be in the midst of parsing, it must include all items describing how B itself will be parsed. This means that if there are rules such as B → 1 and B → 0 then the item set must also include the items B → ‧ 1 and B → ‧ 0. In general this can be formulated as follows:
- If there is an item of the form A → v‧Bw in an item set and in the grammar there is a rule of the form B → w' then the item B → ‧ w' should also be in the item set.
Any set of items can be extended such that it satisfies this rule: simply continue to add the appropriate items until all nonterminals preceded by dots are accounted for. The minimal extension is called the closure of an item set and written as clos(I) where I is an item set. It is these closed item sets that we will take as the states of the parser, although only the ones that are actually reachable from the begin state will be included in the tables.
擴充文法
在決定狀態間的轉移前,我們必須先加入一條擴充文法:
- (0) S → E
其中 S 是新的起始符號(start symbol)而 E 是原先的起始符號。考慮以下文法:
- (1) E → E * B
- (2) E → E + B
- (3) E → B
- (4) B → 0
- (5) B → 1
加入擴充文法後,我們使用下列規則來決定項目集合與狀態:
- (0) S → E
- (1) E → E * B
- (2) E → E + B
- (3) E → B
- (4) B → 0
- (5) B → 1
尋找可到達的元素集合與集合間的轉移
The first step of constructing the tables consists of determining the transitions between the closed item sets. These transitions will be determined as if we are considering a finite automaton that can read terminals as well as nonterminals. The begin state of this automaton is always the closure of the first item of the added rule: S → ‧ E:
- Item set 0
- S → ‧ E
- + E → ‧ E * B
- + E → ‧ E + B
- + E → ‧ B
- + B → ‧ 0
- + B → ‧ 1
The '+' in front of an item indicates the items that were added for the closure. The original items without a '+' are called the kernel of the item set.
Starting at the begin state (S0) we will now determine all the states that can be reached from this state. The possible transitions for an item set can be found by looking at the symbols (terminals and nonterminals) we find right after the dots; in the case of item set 0 these are the terminals '0' and '1' and the nonterminal E and B. To find the item set that a symbol x leads to we follow the following procedure:
- Take the set, S, of all items in the current item set where there is a dot in front of some symbol x.
- For each item in S, move the dot to the right of x.
- Close the resulting set of items.
For the terminal '0' this results in:
- Item set 1
- B → 0 ‧
and for the terminal '1' in:
- Item set 2
- B → 1 ‧
and for the nonterminal E in:
- Item set 3
- S → E ‧
- E → E ‧ * B
- E → E ‧ + B
and for the nonterminal B in:
- Item set 4
- E → B ‧
Note that the closure does not add new items in all cases. We continue this process until no more new item sets are found. For the item sets 1, 2, and 4 there will be no transitions since the dot is not in front of any symbol. For item set 3 we see that the dot is in front of the terminals '*' and '+'. For '*' the transition goes to:
- Item set 5
- E → E * ‧ B
- + B → ‧ 0
- + B → ‧ 1
and for '+' the transition goes to:
- Item set 6
- E → E + ‧ B
- + B → ‧ 0
- + B → ‧ 1
For item set 5 we have to consider the terminals '0' and '1' and the nonterminal B. For the terminals we see that the resulting closed item sets are equal to the already found item sets 1 and 2, respectively. For the nonterminal B the transition goes to:
- Item set 7
- E → E * B ‧
For item set 6 we also have to consider the terminal '0' and '1' and the nonterminal B. As before, the resulting item sets for the terminals are equal to the already found item sets 1 and 2. For the nontermal B the transition goes to:
- Item set 8
- E → E + B ‧
These final item sets have no symbols beyond their dots so no more new item sets are added and we are finished. The transition table for the automaton now looks as follows:
Item Set | * | + | 0 | 1 | E | B |
---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | ||
1 | ||||||
2 | ||||||
3 | 5 | 6 | ||||
4 | ||||||
5 | 1 | 2 | 7 | |||
6 | 1 | 2 | 8 | |||
7 | ||||||
8 |
建構動作表與狀態轉移表
From this table and the found item sets we construct the action and goto table as follows:
- the columns for nonterminals are copied to the goto table
- the columns for the terminals are copied to the action table as shift actions
- an extra column for '$' (end of input) is added to the action table that contains acc for every item set that contains S → E ‧.
- if an item set i contains an item of the form A → w ‧ and A → w is rule m with m > 0 then the row for state i in the action table is completely filled with the reduce action rm.
The reader may verify that this results indeed in the action and goto table that were presented earlier on.
關於 LR(0) 與 SLR、LALR 剖析
Note that only step 4 of the above procedure produces reduce actions, and so all reduce actions must occupy an entire table row, causing the reduction to occur regardless of the next symbol in the input stream. This is why these are LR(0) parse tables: they don't do any lookahead (that is, they look ahead zero symbols) before deciding which reduction to perform. A grammar that needs lookahead to disambiguate reductions would require a parse table row containing different reduce actions in different columns, and the above procedure is not capable of creating such rows.
Refinements to the LR(0) table construction procedure (such as SLR and LALR) are capable of constructing reduce actions that do not occupy entire rows. Therefore, they are capable of parsing more grammars than LR(0) parsers.
表格中的衝突
The automaton that is constructed is by the way it is constructed always a deterministic automaton. However, when reduce actions are added to the action table it can happen that the same cell is filled with a reduce action and a shift action (a shift-reduce conflict) or with two different reduce actions (a reduce-reduce conflict). However, it can be shown that when this happens the grammar is not an LR(0) grammar.
A small example of a non-LR(0) grammar with a shift-reduce conflict is:
- (1) E → 1 E
- (2) E → 1
One of the item sets we then find is:
- Item set 1
- E → 1 ‧ E
- E → 1 ‧
- + E → ‧ 1 E
- + E → ‧ 1
There is a shift-reduce conflict in this item set because in the cell in the action table for this item set and the terminal '1' there will be both a shift action to state 1 and a reduce action with rule 2.
A small example of a non-LR(0) grammar with a reduce-reduce conflict is:
- (1) E → A 1
- (2) E → B 2
- (3) A → 1
- (4) B → 1
In this case we obtain the following item set:
- Item set 1
- A → 1 ‧
- B → 1 ‧
There is a reduce-reduce conflict in this item set because in the cells in the action table for this item set there will be both a reduce action for rule 3 and one for rule 4.
Both examples above can be solved by letting the parser use the follow set (see LL parser) of a nonterminal A to decide if it is going to use one of As rules for a reduction; it will only use the rule A → w for a reduction if the next symbol on the input stream is in the follow set of A. This solution results in so-called Simple LR parsers.
Some more examples on LR(0)
E->E+T/T T->T*F/F F->id
S->AA A->aA/b
參考資料
- Compilers: Principles, Techniques, and Tools, Aho, Sethi, Ullman, Addison-Wesley, 1986. ISBN 0-201-10088-6
- Internals of an LALR(1) parser generated by GNU Bison
这是一篇與计算机相關的小作品。您可以通过编辑或修订扩充其内容。 |