In [[computer science]], the '''Cocke–Younger–Kasami (CYK) algorithm''' (alternatively called '''CKY''') is a [[parsing]] [[algorithm]] for [[context-free grammar]]s, named after its inventors [[John Cocke]], Daniel Younger and [[Tadao Kasami]]. It employs [[bottom-up parsing]] and [[dynamic programming]].

The standard version of CYK operates only on context-free grammars given in [[Chomsky normal form]] (CNF). However, any context-free grammar may be transformed into a CNF grammar expressing the same language {{harv|Sipser|1997}}.

The importance of the CYK algorithm stems from its high efficiency in certain situations. Using [[Landau symbol]]s, the [[Analysis of algorithms|worst case running time]] of CYK is <math>\Theta(n^3 \cdot |G|)</math>, where ''n'' is the length of the parsed string and ''|G|'' is the size of the CNF grammar ''G''. This makes it one of the most efficient parsing algorithms in terms of worst-case [[asymptotic complexity]], although other algorithms exist with better average running time in many practical scenarios.

==Standard form==

The algorithm requires the context-free grammar to be rendered into [[Chomsky normal form]] (CNF), because it tests for possibilities to split the current sequence in half. Any context-free grammar that does not generate the empty string can be represented in CNF using only [[Formal grammar#The syntax of grammars|production rules]] of the forms <math>A\rightarrow \alpha</math> and <math>A\rightarrow B C</math>.

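For concreteness, a CNF grammar can be represented in a program by one table for the terminal rules <math>A\rightarrow \alpha</math> and one for the binary rules <math>A\rightarrow B C</math>. The Python fragment below sketches one such encoding, using a few rules of the example grammar given later in this article; the names <code>terminal_rules</code>, <code>binary_rules</code> and <code>start_symbols</code> are only illustrative conventions, and the same layout is assumed by the other sketches below.

<syntaxhighlight lang="python">
# One possible encoding of a grammar in Chomsky normal form:
#   terminal rules A -> a   as a mapping  terminal -> set of left-hand sides,
#   binary rules   A -> B C as (A, B, C) triples.
terminal_rules = {
    "she":  {"NP"},
    "eats": {"V", "VP"},
}
binary_rules = [
    ("S", "NP", "VP"),   # S  -> NP VP
    ("VP", "V", "NP"),   # VP -> V NP
]
start_symbols = {"S"}
</syntaxhighlight>
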
==Algorithm==

===As pseudocode===

The algorithm in [[pseudocode]] is as follows:

 '''let''' the input be a string ''S'' consisting of ''n'' characters: ''a''<sub>1</sub> ... ''a''<sub>''n''</sub>.
 '''let''' the grammar contain ''r'' nonterminal symbols ''R''<sub>1</sub> ... ''R''<sub>''r''</sub>.
 This grammar contains the subset ''R''<sub>''s''</sub> which is the set of start symbols.
 '''let''' ''P''[''n'',''n'',''r''] be an array of booleans. Initialize all elements of ''P'' to false.
 '''for each''' ''i'' = 1 to ''n''
   '''for each''' unit production ''R''<sub>''j''</sub> -> ''a''<sub>''i''</sub>
     set ''P''[''i'',''1'',''j''] = true
 '''for each''' ''i'' = 2 to ''n'' ''-- Length of span''
   '''for each''' ''j'' = 1 to ''n''-''i''+1 ''-- Start of span''
     '''for each''' ''k'' = 1 to ''i''-1 ''-- Partition of span''
       '''for each''' production ''R''<sub>''A''</sub> -> ''R''<sub>''B''</sub> ''R''<sub>''C''</sub>
         '''if''' ''P''[''j'',''k'',''B''] and ''P''[''j''+''k'',''i''-''k'',''C''] '''then''' set ''P''[''j'',''i'',''A''] = true
 '''if''' any of ''P''[1,''n'',''x''] is true (''x'' is iterated over the set ''s'', where ''s'' are all the indices for ''R''<sub>''s''</sub>) '''then'''
   ''S'' is member of language
 '''else'''
   ''S'' is not member of language

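The pseudocode translates directly into a short program. The following Python sketch is one possible rendering (the function name <code>cyk_recognize</code> and the grammar encoding from the standard-form section are assumptions of this sketch, not part of the algorithm itself); it indexes the table by start position and span length, exactly as the pseudocode does, but stores a set of nonterminals per span instead of a three-dimensional boolean array.

<syntaxhighlight lang="python">
def cyk_recognize(words, terminal_rules, binary_rules, start_symbols):
    """Return True if the sequence `words` is derivable from a start symbol.

    terminal_rules: dict mapping each terminal to the set of nonterminals A
                    that have a rule A -> terminal.
    binary_rules:   iterable of (A, B, C) triples for rules A -> B C.
    start_symbols:  set of start symbols.
    """
    n = len(words)
    if n == 0:
        return False  # a CNF grammar cannot generate the empty string
    # table[(start, length)] holds the set of nonterminals that derive the
    # span of `length` words beginning at position `start` (1-based); it
    # plays the role of the boolean array P in the pseudocode.
    table = {(start, length): set()
             for start in range(1, n + 1)
             for length in range(1, n - start + 2)}
    # Spans of length 1: apply the unit productions R_j -> a_i.
    for i in range(1, n + 1):
        table[(i, 1)] = set(terminal_rules.get(words[i - 1], ()))
    # Longer spans: try every split point and every binary rule A -> B C.
    for length in range(2, n + 1):               # length of span
        for start in range(1, n - length + 2):   # start of span
            for split in range(1, length):       # partition of span
                left = table[(start, split)]
                right = table[(start + split, length - split)]
                for a, b, c in binary_rules:
                    if b in left and c in right:
                        table[(start, length)].add(a)
    return bool(table[(1, n)] & set(start_symbols))
</syntaxhighlight>

Storing sets rather than booleans does not change the asymptotic running time; the three nested loops over span length, start position and split point remain the dominant cost.
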
===As prose===

In informal terms, this algorithm considers every possible subsequence of the sequence of words and sets <math>P[i,j,k]</math> to be true if the subsequence of words starting from <math>i</math> of length <math>j</math> can be generated from <math>R_k</math>. Once it has considered subsequences of length 1, it goes on to subsequences of length 2, and so on. For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two parts, and checks to see if there is some production <math>P \to Q \; R</math> such that <math>Q</math> matches the first part and <math>R</math> matches the second part. If so, it records <math>P</math> as matching the whole subsequence. Once this process is completed, the sentence is recognized by the grammar if the subsequence containing the entire sentence is matched by the start symbol.

==Example==

This is an example grammar:

:<math>\begin{array}{lcl}
S &\to& NP \;\; VP\\
VP &\to& VP \;\; PP\\
VP &\to& V \;\; NP\\
VP &\to& \textit{eats}\\
PP &\to& P \;\; NP\\
NP &\to& Det \;\; N\\
NP &\to& \textit{she}\\
V &\to& \textit{eats}\\
P &\to& \textit{with}\\
N &\to& \textit{fish}\\
N &\to& \textit{fork}\\
Det &\to& \textit{a}
\end{array}</math>

Now the sentence ''she eats a fish with a fork'' is analyzed using the CYK algorithm. In the following table, in <math>P[i,j,k]</math>, <math>i</math> is the number of the column (starting at the left at 1), and <math>j</math> is the number of the row (starting at the bottom at 1).

{| class="wikitable"
|+CYK table
|-
| '''S'''
|-
|  || VP
|-
|  ||  ||
|-
| '''S''' ||  ||  ||
|-
|  || VP ||  ||  || PP
|-
| '''S''' ||  || NP ||  ||  || NP
|-
| NP || V, VP || Det || N || P || Det || N
|- style="border-top:3px solid grey;"
| she || eats || a || fish || with || a || fork
|}

Since <math>P[1,7,R_S]</math> is true, the example sentence can be generated by the grammar.

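Assuming the <code>cyk_recognize</code> sketch given after the pseudocode above, the same result can be reproduced programmatically; the encoding of the grammar is again only illustrative.

<syntaxhighlight lang="python">
terminal_rules = {
    "she":  {"NP"},
    "eats": {"V", "VP"},
    "a":    {"Det"},
    "fish": {"N"},
    "with": {"P"},
    "fork": {"N"},
}
binary_rules = [
    ("S",  "NP",  "VP"),
    ("VP", "VP",  "PP"),
    ("VP", "V",   "NP"),
    ("PP", "P",   "NP"),
    ("NP", "Det", "N"),
]
sentence = "she eats a fish with a fork".split()
print(cyk_recognize(sentence, terminal_rules, binary_rules, {"S"}))
# prints True, matching the table entry P[1,7,S] above
</syntaxhighlight>
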
==Extensions==

===Generating a parse tree===

It is simple to extend the above algorithm to not only determine whether a sentence is in a language, but also to construct a [[parse tree]], by storing parse tree nodes as elements of the array, instead of booleans. Since the grammars being recognized can be ambiguous, it is necessary to store a list of nodes (unless one wishes to only pick one possible parse tree); the end result is then a forest of possible parse trees.

An alternative formulation employs a second table B[n,n,r] of so-called ''backpointers'': each entry records which production and split point produced the corresponding entry of P, so that a parse tree can be reconstructed afterwards by following these backpointers from the entry for the whole sentence.

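A minimal sketch of the backpointer variant, under the same illustrative grammar encoding as the sketches above: whenever a rule <math>A\rightarrow B C</math> succeeds for a span, the split point and the pair of child nonterminals are recorded, and one parse tree is then read off by following these records downward from the entry for the whole sentence. Keeping only the first backpointer per entry corresponds to picking a single parse tree rather than the full forest.

<syntaxhighlight lang="python">
def cyk_parse(words, terminal_rules, binary_rules, start_symbols):
    """Return one parse tree as nested tuples, or None if the input is rejected."""
    n = len(words)
    if n == 0:
        return None
    table = {(start, length): set()
             for start in range(1, n + 1)
             for length in range(1, n - start + 2)}
    back = {}  # (start, length, A) -> (split, B, C): one backpointer per entry
    for i in range(1, n + 1):
        table[(i, 1)] = set(terminal_rules.get(words[i - 1], ()))
    for length in range(2, n + 1):
        for start in range(1, n - length + 2):
            for split in range(1, length):
                for a, b, c in binary_rules:
                    if (b in table[(start, split)]
                            and c in table[(start + split, length - split)]
                            and a not in table[(start, length)]):
                        table[(start, length)].add(a)
                        back[(start, length, a)] = (split, b, c)

    def build(start, length, symbol):
        # Follow the stored backpointers downward to rebuild one tree.
        if length == 1:
            return (symbol, words[start - 1])
        split, b, c = back[(start, length, symbol)]
        return (symbol,
                build(start, split, b),
                build(start + split, length - split, c))

    for s in start_symbols:
        if s in table[(1, n)]:
            return build(1, n, s)
    return None
</syntaxhighlight>
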
===Parsing non-CNF context-free grammars===

As pointed out by {{harvtxt|Lange|Leiß|2009}}, the drawback of all known transformations into Chomsky normal form is that they can lead to an undesirable bloat in grammar size. The size of a grammar is the sum of the sizes of its production rules, where the size of a rule is one plus the length of its right-hand side. Using <math>g</math> to denote the size of the original grammar, the size blow-up in the worst case may range from <math>g^2</math> to <math>2^{2 g}</math>, depending on the transformation algorithm used. For use in teaching, Lange and Leiß propose a slight generalization of the CYK algorithm, "without compromising efficiency of the algorithm, clarity of its presentation, or simplicity of proofs" {{harv|Lange|Leiß|2009}}.

===Parsing weighted context-free grammars===

It is also possible to extend the CYK algorithm to parse strings using [[weighted context-free grammar|weighted]] and [[stochastic context-free grammar]]s. Weights (probabilities) are then stored in the table P instead of booleans, so P[i,j,A] will contain the minimum weight (maximum probability) that the substring of length j starting at i can be derived from A. Further extensions of the algorithm allow all parses of a string to be enumerated from lowest to highest weight (highest to lowest probability).

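A sketch of the probabilistic variant, assuming an illustrative encoding in which terminal and binary rules carry probabilities: each table entry now maps a nonterminal to the probability of its best (most probable) derivation of the span, and the recognizer's set union is replaced by a maximization.

<syntaxhighlight lang="python">
def pcyk_best(words, terminal_rules, binary_rules, start_symbol):
    """Return the probability of the most probable parse, or 0.0 if none exists.

    terminal_rules: dict mapping each terminal to a dict {A: P(A -> terminal)}.
    binary_rules:   iterable of (A, B, C, P(A -> B C)) tuples.
    """
    n = len(words)
    if n == 0:
        return 0.0
    # best[(start, length)] maps a nonterminal to the probability of its best
    # derivation of that span; nonterminals not present have probability 0.
    best = {(start, length): {}
            for start in range(1, n + 1)
            for length in range(1, n - start + 2)}
    for i in range(1, n + 1):
        best[(i, 1)] = dict(terminal_rules.get(words[i - 1], {}))
    for length in range(2, n + 1):
        for start in range(1, n - length + 2):
            for split in range(1, length):
                left = best[(start, split)]
                right = best[(start + split, length - split)]
                for a, b, c, p in binary_rules:
                    if b in left and c in right:
                        candidate = p * left[b] * right[c]
                        if candidate > best[(start, length)].get(a, 0.0):
                            best[(start, length)][a] = candidate
    return best[(1, n)].get(start_symbol, 0.0)
</syntaxhighlight>
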
===Valiant's algorithm===

The [[Analysis of algorithms|worst case running time]] of CYK is <math>\Theta(n^3 \cdot |G|)</math>, where ''n'' is the length of the parsed string and ''|G|'' is the size of the CNF grammar ''G''. This makes it one of the most efficient algorithms for recognizing general context-free languages in practice. {{harvtxt|Valiant|1975}} gave an extension of the CYK algorithm. His algorithm computes the same parsing table as the CYK algorithm; yet he showed that [[Matrix multiplication#Algorithms for efficient matrix multiplication|algorithms for efficient multiplication]] of [[Boolean matrix|matrices with 0-1-entries]] can be utilized for performing this computation.

Using the [[Coppersmith–Winograd algorithm]] for multiplying these matrices, this gives an asymptotic worst-case running time of <math>O(n^{2.38} \cdot |G|)</math>. However, the constant term hidden by the [[big O notation]] is so large that the Coppersmith–Winograd algorithm is only worthwhile for matrices that are too large to handle on present-day computers {{harv|Knuth|1997}}, and this approach requires subtraction and so is only suitable for recognition. The dependence on efficient matrix multiplication cannot be avoided altogether: {{harvtxt|Lee|2002}} has proved that any parser for context-free grammars working in time <math>O(n^{3-\varepsilon} \cdot |G|)</math> can be effectively converted into an algorithm computing the product of <math>(n \times n)</math>-matrices with 0-1-entries in time <math>O(n^{3 - \varepsilon/3})</math>.

==See also==

* [[GLR parser]]
* [[Earley parser]]
* [[Packrat parser]]

==References==

{{Reflist}}

* [[John Cocke]] and Jacob T. Schwartz (1970). Programming languages and their compilers: Preliminary notes. Technical report, [[Courant Institute of Mathematical Sciences]], [[New York University]].
* [[Tadao Kasami|T. Kasami]] (1965). An efficient recognition and syntax-analysis algorithm for context-free languages. Scientific report AFCRL-65-758, Air Force Cambridge Research Lab, [[Bedford, MA]].
* Daniel H. Younger (1967). Recognition and parsing of context-free languages in time ''n''<sup>3</sup>. ''Information and Control'' 10(2): 189–208.
* {{citation |last=Knuth |first=Donald E. |authorlink=Donald E. Knuth |title=The Art of Computer Programming Volume 2: Seminumerical Algorithms |publisher=Addison-Wesley Professional |edition=3rd |date=November 14, 1997 |isbn=978-0-201-89684-8 |pages=501}}
* {{citation |last=Lange |first=Martin |last2=Leiß |first2=Hans |title=To CNF or not to CNF? An Efficient Yet Presentable Version of the CYK Algorithm |year=2009 |journal=Informatica Didactica |volume=8 |url=http://www.informatica-didactica.de/cmsmadesimple/index.php?page=LangeLeiss2009 |place=[http://www.informatica-didactica.de/cmsmadesimple/uploads/Artikel/LangeLeiss2009/LangeLeiss2009.pdf pdf]}}
* {{citation |last=Sipser |first=Michael |title=Introduction to the Theory of Computation |publisher=IPS |year=1997 |edition=1st |page=99 |isbn=0-534-94728-X}}
* {{citation |last=Lee |first=Lillian |title=Fast context-free grammar parsing requires fast Boolean matrix multiplication |journal=[[Journal of the ACM]] |volume=49 |issue=1 |pages=1–15 |year=2002 |doi=10.1145/505241.505242}}
* {{citation |last=Valiant |first=Leslie G. |authorlink=Leslie G. Valiant |title=General context-free recognition in less than cubic time |journal=Journal of Computer and System Sciences |volume=10 |issue=2 |year=1975 |pages=308–314}}

==External links==

* [http://www.diotavelli.net/people/void/demos/cky.html CYK parsing demo in JavaScript]
* [http://www.informatik.uni-leipzig.de/alg/lehre/ss08/AUTO-SPRACHEN/Java-Applets/CYK-Algorithmus.html Interactive applet from the University of Leipzig demonstrating the CYK algorithm (site is in German)]
* [http://www.swisseduc.ch/compscience/exorciser/ Exorciser, a Java application to generate exercises in the CYK algorithm as well as finite state machines, Markov algorithms, etc.]

[[Category:Parsing algorithms]]