Affine space: Difference between revisions

From formulasearchengine
Jump to navigation Jump to search
en>Rschwieb
en>D.Lazard
m top: misplaced wl
 
Line 1: Line 1:
Grammar theory to model symbol strings originated from work in [[computational linguistics]] aiming at understanding the structure of [[natural language]]s.<ref name="Chomsky_1956" /><ref name="Chomsky_1959" /><ref name="Chomsky_1957" /> Through [[controlled grammar]] exploring and scoring the correctness of a sentence construct in a language by [[computation]] is achievable. Grammars are said to be [[generative grammar]]s/[[transformational grammar]]s if their rules are used to predict/emit words forming grammatical sentences. '''Stochastic context free grammars (SCFG)''' have been applied in '''probabilistic modeling''' of [[RNA structure]]s almost 40 years post their introduction in [[computational linguistics]].<ref name="Eddy 1994" /><ref name="Sakakibara 1994" /><ref name="Grat 1995" /><ref name="Lefebvre 1995" /><ref name="Lefebvre 1996" />
My name's Alvin Hankinson but everybody calls me Alvin. I'm from Norway. I'm studying at the college (1st year) and I play the Dobro for 10 years. Usually I choose songs from the famous films :D. <br>I have two sister. I like Record collecting, watching movies and Fishkeeping.<br><br>my blog post :: [http://www.161991up.com http://www.161991up.com]
 
SCFGs extend [[context-free grammar]]s similar to how [[hidden Markov model]]s extend [[regular grammar]]s. Each [[Formal grammar#The syntax of grammars|production]] is assigned a probability. The probability of a derivation (parse) is the product of the probabilities of the productions used in that derivation. These probabilities are typically computed by [[machine learning]] programs operating on large databases. A probabilistic grammar's validity is constrained by context of its training dataset.
 
SCFGs have application in areas as diverse as [[natural language processing]] to the study the structure of [[RNA]] molecules and design of programming languages. Designing efficient SCFG grammars has to weigh factors of scalability and generality. Issues such as grammar ambiguity need to be resolved. The grammar design influences results accuracy. Grammar parsing algorithms have various time and memory requirements.
 
==Definitions==
'''Derivation:'''  The process of recursive generation of strings from a grammar.
                                         
'''[[Parsing]]:''' Finding a valid derivation using an automaton.
 
'''[[Parse tree|Parse Tree:]]''' The alignment of the grammar to a sequence.
 
An example of a parser for SCFG grammars is the [[pushdown automaton]]. The algorithm parses grammar nonterminals from left to right in a [[Stack (abstract data type)#Applications|stack-like]] manner. This [[Brute-force search|brute-force]] approach is not very efficient. In RNA secondary structure prediction variants of the '''[[CYK algorithm|Cocke–Younger–Kasami (CYK) algorithm]] (CYK)''' algorithm provide more efficient alternatives to grammar parsing than pushdown automata.<ref name="Durbin 1998" /> Another example of a  SCFG parser is the Stanford Statistical Parser which has been trained using Treebank,.<ref>{{cite journal|last=Klein|first=Daniel|coauthors=Manning, Christopher|title=Accurate Unlexicalized Parsing|journal=Proceedings of the 41st Meeting of the Association for Computational Linguistics|year=2003|pages=423–430.|url=http://nlp.stanford.edu/~manning/papers/unlexicalized-parsing.pdf}}</ref>
 
==Relation with Hidden Markov Models==
SCFGs models extend [[context-free grammar]]s the same way as [[hidden Markov model]]s extend [[regular grammar]]s.
 
The [[Inside–outside algorithm|Inside-Outside algorithm]] is an analogue of the [[Forward-backward algorithm|Forward-Backward algorithm]]. It computes the total probability of all derivations that are consistent with a given sequence, based on some SCFG.  This is equivalent to the probability of the SCFG generating the sequence, and is intuitively a measure of how consistent the sequence is with the given grammar. The Inside-Outside algorithm is used in model [[Parameterization|parametrization]] to estimate prior frequencies observed from training sequences in the case of RNAs.
 
[[Dynamic programming]] variants of the [[CYK algorithm]] find the [[Viterbi algorithm|Viterbi parse]] of a RNA sequence for a SCFG model.  This parse is the most likely derivation of the sequence by the given SCFG.
 
==Grammar Construction==
Context-free grammars are represented as a set of rules inspired from attempts to model natural languages.<ref name="Chomsky_1957" /> The rules are absolute and have a typical syntax representation known as [[Backus-Naur Form]]. The production rules consist of terminal '''<math>\left \{a, b \right\}</math>''' and non-terminal '''<math>S</math>''' symbols and a blank '''<math>\epsilon</math>''' may also be used as an end point.  In the production rules of CFG and SCFG the left side has only one nonterminal whereas the right side can be any string of terminal or nonterminals. In SCFG nulls are excluded.<ref name="Durbin 1998" /> An example of a grammar:
                                   
                                    <math>
                                        S \to aS, S \to bS, S \to \epsilon
    </math>                             
This grammar can be shortened using the '''''|''''' 'or' character into:
 
                                    <math>    S\to aS | bS |\epsilon </math>
 
Terminals in a grammar are words and through the grammar rules a non-terminal symbol is transformed into a string of either terminals and/or non-terminals. The above grammar is read as "beginning from a terminal <math>S</math> the emission can generate either <math>a</math> or <math>b</math> or <math>\epsilon</math>".
Its derivation is:
                                      <math>S \Rightarrow aS\Rightarrow abS\Rightarrow abbS \Rightarrow abb</math>
 
[[Ambiguous grammar]] may result in ambiguous parsing if applied on [[homograph]]s since the same word sequence can have more than one interpretation. [[pun|Pun sentences]] such as the newspaper headline "Iraqi Head Seeks Arms" are an example of ambiguous parses.
 
One strategy of dealing with ambiguous parses (originating with grammarians as early as [[Pāṇini]]) is to add yet more rules, or prioritize them so that one rule takes precedence over others. This, however, has the drawback of proliferating the rules, often to the point where they become difficult to manage. Another difficulty is overgeneration, where unlicensed structures are also generated.
 
Probabilistic grammars circumvent these problems by ranking various productions on frequency weights, resulting in a "most likely" (winner-take-all) interpretation.  As usage patterns are altered in [[Historical linguistics|diachronic]] shifts, these probabilistic rules can be re-learned, thus upgrading the grammar.
 
Assigning probability to production rules makes a SCFG. These probabilities are informed by observing distributions on a training set of similar composition to the language to be modeled. On most samples of broad language, probabilistic grammars where probabilities are estimated from data typically outperform hand-crafted grammars. CFGs when contrasted with SCFGs are not applicable to RNA structure prediction because while they incorporate sequence-structure relationship they lack the scoring metrics that reveal a sequence structural potential <ref name="Dowell 2004" />
 
==Applications==
 
===RNA Structure Prediction===
Energy minimization<ref name="McCaskill 1990" /><ref name="Juan 1999" /> and SCFG provide ways to predicting RNA secondary structure with comparable performance.<ref name="Eddy 1994" /><ref name="Sakakibara 1994" /><ref name="Durbin 1998" /> However structure prediction by SCFGs is scored probabilistically rather than by minimum free energy calculation. SCFG model parameters are directly derived from frequencies of different features observed in databases of RNA structures <ref name="Dowell 2004" /> rather than by experimental
determination as is the case with energy minimization methods.<ref name="Zuker 2000" /><ref name="Mathews 1999" />
 
The types of various structure that can be modeled by an SCFG include long range interactions, pairwise structure and other nested structures. However, pseudoknots can not be modeled.<ref name="Eddy 1994" /><ref name="Sakakibara 1994" /><ref name="Durbin 1998" /> SCFGs extend CFG by assigning probabilities to each production rule. A maximum probability parse tree from the grammar implies a maximum probability structure. Since RNAs preserve their structures over their primary sequence; RNA structure prediction can be guided by combining evolutionary information from comparative sequence analysis with biophysical knowledge about a structure plausibility based on such probabilities. Also search results for structural homologs using SCFG rules are scored according to SCFG derivations probabilities. Therefore building grammar to model the behavior of base-pairs and single-stranded regions starts with exploring features of structural multiple sequence alignment of related RNAs.<ref name="Durbin 1998" />
                                            <math>S \to aSa | bSb | aa | bb</math>
 
The above grammar generates a string in an outside-in fashion, that is the basepair on the furthest extremes of the terminal is derived first. So a string such as <math>aabaabaa</math> is derived by first generating the distal <math>a</math>'s on both sides before moving inwards:
                                          <math>S \Rightarrow aSa \Rightarrow aaSaa \Rightarrow aabSbaa \Rightarrow aabaabaa</math>
 
A SCFG model extendibility allows constraining structure prediction by incorporating expectations about different features of an RNA . Such expectation may reflect for example the propensity for assuming a certain structure by an RNA.<ref name="Dowell 2004" />  However incorporation of too much information may increase SCFG space and memory complexity and it is desirable that a SCFG-based model be as simple as possible.<ref name="Dowell 2004" /><ref name="Knudsen 2003" />
 
Every possible string <math>x</math> a grammar generates is assigned a probability weight <math>P(x|\theta)</math> given the SCFG model <math>\theta</math>. It follows that the sum of all probabilities to all possible grammar productions is  <math>\sum_{\text{x}}P(x|\theta)=1</math>. The scores for each paired and unpaired residue explain likelihood for secondary structure formations. Production rules also allow scoring loop lengths as well as the order of base pair stacking hence it is possible to explore the range of all possible generations including suboptimal structures from the grammar and accept or reject structures based on score thresholds.<ref name="Durbin 1998" /><ref name="Dowell 2004" />
 
====Implementations====
RNA secondary structure implementations based on SCFG approaches can be utilized in :
* Finding consensus structure by optimizing structure joint probabilities over MSA.<ref name="Knudsen 2003" /><ref name="Knudsen 1999" />
* Modeling base-pair covariation to detecting homology in database searches.<ref name="Eddy 1994" />
* pairwise simultaneous folding and alignment.<ref name="Rivas 2001" /><ref name="Holmes 2002" />
 
Different implementation of these approaches exist. For example Pfold is used in secondary structure prediction from a group of related RNA sequences,<ref name="Knudsen 2003" /> covariance models are used in searching databases for homologous sequences and RNA annotation and classification,<ref name="Eddy 1994" /><ref name="Gardner 2010" /> RNApromo, CMFinder and TEISER are used in finding stable structural motifs in RNAs.<ref name="Yao 2006" /><ref name="Rabani 2008" /><ref name="Goodarzi 2012" />
 
====Design condiserations====
SCFG design impacts the secondary structure prediction accuracy. Any useful structure prediction probabilistic model based on SCFG has to maintain simplicity without much compromise to prediction accuracy. Too complex a model of excellent performance on a single sequence may not scale.<ref name="Durbin 1998" /> A grammar based model should be able to:
* Find the optimal alignment between a sequence and the SCFG.
* Score the probability of the structures for the sequence and subsequences.
* Parameterize the model by training on sequences/structures.
* Finde the optimal grammar parse tree (CYK algorithm).
* Check for ambiguous grammar (Conditional Inside algorithm).
 
The resulting of multiple parse trees per grammar denotes grammar ambiguity. This may be useful in revealing all possible base-pair structures for a grammar. However an optimal structure is the one where there is one and only one correspondence between the parse tree and the secondary structure.
 
Two types of ambiguities can be distinguished. Parse tree ambiguity and structural ambiguity. Structural ambiguity does not affect thermodynamic approaches as the optimal structure selection is always on the basis of lowest free energy scores.<ref name="Dowell 2004" /> Parse tree ambiguity concerns the existence of multiple parse trees per sequence. Such an ambiguity can reveal all possible base-paired structures for the sequence by generating all possible parse trees then finding the optimal one.<ref name="Sipser 1996" /><ref name="Harrison 1978" /><ref name="Hopcroft 1979" /> In the case of structural ambiguity multiple parse trees describe the same secondary structure. This obscures the CYK algorithm decision on finding an optimal structure as the correspondence between the parse tree and the structure is not unique.<ref name="Giegerich 2000" /> Grammar ambiguity can be checked for by the conditional-inside algorithm.<ref name="Durbin 1998" /><ref name="Dowell 2004" />
 
====Building a SCFG model====
A stochastic context free grammar consists of terminal and nonterminal variables. Each feature to be modeled has a production rule that is assigned a probability estimated from a training set of RNA structures. Production rules are recursively applied until only terminal residues are left.
 
A starting non-terminal <math>\mathbf{\mathit{S}}</math> produces loops. The rest of the grammar proceeds with parameter <math>\mathbf{\mathit{L}}</math> that decide whether a loop is a start of a stem or a single stranded region <math>s</math> and parameter <math>\mathbf{\mathit{F}}</math> that produces paired bases.
 
The formalism of this simple SCFG looks like:
                                                <math>\mathit{S \to LS|L}</math>
                                                <math>\mathit{L \to s | dFd}</math>
                                                <math>\mathit{F \to dFd | LS}</math>
 
The application of SCFGs in predicting structures is a multi-step process. In addition the SCFG itself can be incorporated into probabilistic models that consider RNA evolutionary history or search homologous sequences in databases. In an evolutionary history context inclusion of prior distributions of RNA structures of a structural alignment in the production rules of the SCFG facilitates good prediction accuracy.<ref name="Knudsen 1999" />
 
A summary of general steps for utilizing SCFGs in various scenarios:
* Generate production rules for the sequences.
* Check ambiguity.
* Recursively generate parse trees of the possible structures using the grammar.
* Rank and score the parse trees for the most plausible sequence.<ref name="Durbin 1998" />
 
====Algorithms====
Several algorithms dealing with aspects of SCFG based probabilistic models in RNA structure prediction exist. For instance the inside-outside algorithm and the CYK algorithm. The inside-outside algorithm is a recursive dynamic programming scoring algorithm that can follow [[Expectation-maximization algorithm|expectation-maximization]] paradigms. It computes the total probability of all derivations that are consistent with a given sequence, based on some SCFG. The inside part scores the subtrees from a parse tree and therefore subsequences probabilities given an SCFG. The outside part scores the probability of the complete parse tree for a full sequence.<ref name="Lari and Young 1990" /><ref name="Lari and Young 1991" /> CYK modifies the inside-outside scoring. Note that the term 'CYK algorithm' describes the CYK variant of the inside algorithm that finds an optimal parse tree for a sequence using a SCFG. It extends the actual [[CYK algorithm]] used in non-stochastic CFGs.<ref name="Durbin 1998" />
 
The inside algorithm calculates <math>\alpha(i,j,v)</math> probabilities for all <math>i, j, v</math>  of a parse subtree rooted at <math>W_v</math> for subsequence <math>xi,...,xj</math>. Outside algorithm calculates <math>\beta(i,j,v)</math> probabilities of a complete parse tree for sequence <math>x</math> from root excluding the calculation of <math>xi,...,xj</math>. The variables <math>\alpha</math> and <math>\beta</math> refine the estimation of probability parameters of an SCFG. It is possible to reestimate the SCFG algorithm by finding the expected number of times a state is used in a derivation through summing all the products of <math>\alpha</math> and <math>\beta</math> divided by the probability for a sequence <math>x</math> given the model <math>P(x|\theta)</math>. It is also possible to find the expected number of times a production rule is used by an expectation-maximization that utilizes the values of <math>\alpha</math> and <math>\beta</math>.<ref name="Lari and Young 1990" /><ref name="Lari and Young 1991" /> The CYK algorithm calculates <math>\gamma(i, j, v)</math> to find the most probable parse tree <math>\hat{\pi}</math> and yields <math>\log P(x, \hat{\pi}|\theta)</math>.<ref name="Durbin 1998" />
 
Memory and time complexity for general SCFG algorithms in RNA structure predictions are <math>O(L^2M)</math> and <math>O(L^3M^3)</math> respectively. Restricting a SCFG may alter this requirement as is the case with database searches methods.
 
====SCFG in homology search====
Covariance models (CMs) are a special type of SCFGs with applications in database searches for homologs, annotation and RNA classification. Through CMs it is possible to build SCFG-based RNA profiles where related RNAs can be represented by a consensus secondary structure.<ref name ="Eddy 1994" /><ref name="Sakakibara 1994" /> The RNA analysis package Infernal uses such profiles in inference of RNA alignments.<ref name="Nawrocki 2013" /> The Rfam database also uses CMs in classifying RNAs into families based on their structure and sequence information.<ref name="Gardner 2010" />
 
CMs are designed from a consensus RNA structure. A CM allows [[indel]]s of unlimited length in the alignment. Terminals constitute states in the CM and the transition probabilities between the states is 1 if no indels are considered.<ref name="Durbin 1998" /> Grammars in a CM are as follows:
 
        <math>P \to aWb~~~~~~~~~~~\text{probabilities of pairwise interactions between 16 possible pairs}</math>
        <math>L \to aW~~~~~~~~~~\text{probabilities of generating 4 possible single bases on the left}</math>
        <math>R \to Wa~~~~~~~~~~~~~\text{probabilities of generating 4 possible single bases on the right}</math>
        <math>B \to SS~~~~~~~~~\text{bifurcation with a probability of 1}</math>
        <math>S \to W~~~~~~~~~~~\text{start with a probability of 1}</math>
        <math>E \to \epsilon ~~~~~~~~~~~~\text{end with a probability of 1}</math>
 
The model has 6 possible states and each state grammar includes different types of secondary structure probabilities of the non-terminals. The states are connected by transitions. Ideally current node states connect to all insert states and subsequent node states connect to non-insert states. In order to allow insertion of more than one base insert states connect to themselves.<ref name="Durbin 1998" />
 
In order to score a CM model the inside-outside algorithms are used. CMs use a slightly different implementation of CYK. Log-odds emission scores for the optimum parse tree - <math>log \hat{e}</math> - are calculated out of the emitting states <math>P,~L,~R</math>. Since these scores are a function of sequence length a more discriminative measure to recover an optimum parse tree probability score- <math>log\text{P}(x, \hat{\pi}|\theta)</math> - is reached by limiting the maximum length of the sequence to be aligned and calculating the log-odds relative to a null. The computation time of this step is linear to the database size and the algorithm has a memory complexity of <math>O(M_aD+M_bD^2)</math>.<ref name="Durbin 1998" />
 
==== Example: Using evolutionary information to guide structure prediction ====
The KH-99 algorithm by Knudsen and Hein lays the basis of the Pfold approach to predicting RNA secondary structure.<ref name="Knudsen 2003" /> In this approach the parameterization requires evolutionary history information derived from an alignment tree in addition to probabilities of columns and mutations. The grammar probabilities are observed from a training dataset.
 
===== Estimate column probabilities for paired and unpaired bases =====
In a structural alignment the probabilities of the unpaired bases columns and the paired bases columns are independent of other columns. By counting bases in single base positions and paired positions one obtains the frequencies of  bases in loops and stems.
For basepair <math>X</math> and <math>Y</math> an occurrence of <math>XY</math> is also counted as an occurrence of <math>YX</math>. Identical basepairs such as <math>XX</math> are counted twice.
 
===== Calculate mutation rates for paired and unpaired bases =====
By pairing sequences in all possible ways overall mutation rates are estimated. In order to recover plausible mutations a sequence identity threshold should be used so that the comparison is between similar sequences. This approach uses 85% identity threshold between pairing sequences.
First single base positions differences -except for gapped columns- between sequence pairs are counted such that if the same position in two sequences had different bases <math>X, Y</math> the count of the difference is incremented for each sequence.
 
<math>while X\ne Y</math>
              <math>        C_{\text{XY}} +1 ~~~~~~~~~~~~~\text{first sequence  pair}</math>
              <math>        C_{\text{YX}} +1 ~~~~~~~~~~~~~\text{second sequence pair}</math>
<math>\text{Calculate mutation rates.}</math>
                <math>\text{Let}~  r_{\text{XY}}= \text{mutation of base X to base Y}  = K~C_{\text{XY}}/P_{x}P_{s}</math>
                <math>\text{Let}~  r_{\text{XX}}= \text{the negative of the rate of X mutation to other bases}= - \sum r_{\text{XY}}</math>
                <math>P_{s} =\text{the probability that the base is not paired.}</math>
 
For unpaired bases a 4 X 4 mutation rate matrix is used that satisfies that the mutation flow from X to Y is reversible:<ref name="Tavaré 1986" />
                <math>PX^rXY = PY^rYX</math>
For basepairs a 16 X 16 rate distribution matrix is similarly generated.<ref name="Muse 1995" /><ref name="Schöniger  1994" />
The SCFG is used to predict the prior probability distribution of the structure whereas posterior probabilities are estimated by the inside-outside algorithm and the most likely structure is found by the CYK algorithm.<ref name="Knudsen 2003" />
 
===== Estimate alignment probabilities =====
After calculating the column prior probabilities the alignment probability is estimated by summing over all possible secondary structures. Any column <math>C</math>  in a secondary structure <math>\sigma</math> for a sequence <math>D</math> of length <math>l</math> such that <math>D=(C_1,~C_2, ...C_l )</math> can be scored with respect to the alignment tree <math>T</math> and the mutational model <math>M</math>. The prior distribution given by the SCFG is <math>P(\sigma|M)</math>. The phylogenetic tree, <math>T</math> can be calculated from the model by maximum likelihood estimation. Note that gaps are treated as unknown bases and the summation can be done through [[dynamic programming]].<ref name="Baker 1979" />
      <math>P(D|T,M)</math>
              <math>=\sum P (D, \sigma |T,M)</math>
              <math>=\sum P(D|\sigma, T, M) P(\sigma|T,M)</math>
              <math>=\sum P(D|\sigma,T,M) P(\sigma|M)</math>
 
===== Assign production probabilities to each rule in the grammar =====
Each structure in the grammar is assigned production probabilities devised from the structures of the training dataset. These prior probabilities give weight to predictions accuracy.<ref name="Knudsen 1999" /><ref name="Lari and Young 1990" /><ref name="Lari and Young 1991" /> The number of times each rule is used depends on the observations from the training dataset for that particular grammar feature. These probabilities are written in parenthesis in the grammar formalism and each rule will have a total of 100%.<ref name="Knudsen 2003" /> For instance:
 
              <math> S \to LS (80%) |L (20%)</math>
              <math>L \to s (70%) | dFd (30%)</math>
              <math>F \to dFd (60.4%)| LS (39.6%)</math>
 
===== Predict the structure likelihood =====
Given the prior alignment frequencies of the data the most likely structure from the ensemble predicted by the grammar can then be computed by maximizing <math>P(\sigma|D,T,M)</math> through the CYK algorithm. The structure with the highest predicted number of correct predictions is reported as the consensus structure.<ref name="Knudsen 2003" />
 
<math>\sigma^{MAP}= arg\underset{\sigma}max P(D| \sigma,T^ML, M) P(\sigma|M)</math>
 
===== Pfold improvements on the KH-99 algorithm =====
SCFG based approaches are desired to be scalable and general enough. Compromising speed for accuracy needs to as minimal as possible. Pfold addresses the limitations of the KH-99 algorithm with respect to scalability, gaps, speed and accuracy.<ref name="Knudsen 2003" />  
*In Pfold gaps are treated as unknown. In this sense the probability of a gapped column equals that of an ungapped one.
*In Pfold the tree <math>T</math> is calculated prior to structure prediction through neighbor joining and not by maximum likelihood through the SCFG grammar. Only the branch lengths are adjusted to maximum likelihood estimates.
*An assumption of Pfold is that all sequences have the same structure. Sequence identity threshold and allowing a 1% probability that any nucleotide becomes another limit the performance deterioration due to alignment errors.
 
== Linguistics and Psycholinguistics ==
With the publication of Gold's Theorem 1967 it was claimed that grammars for natural languages governed by deterministic rules could not be learned based on positive instances alone.<ref>{{cite journal |author=Gold, E. |title=Language identification in the limit |journal=Information and Control |volume=10 |issue= 5|pages=447–474 |year=1967 |doi= 10.1016/S0019-9958(67)91165-5|url=}}</ref> This was part of the argument from the [[poverty of stimulus]], presented in 1980<ref>{{cite book |author=Chomsky, N. |title=Rules and representations |publisher=Basil Blackwell |location=Oxford |year=1980 |isbn= |pages= |url= }}</ref>  and implicit since the early works by Chomsky of the 1950s. Among other arguments, this led to the [[psychological nativism|nativist]] view that a form of grammar, including a complete conceptual lexicon in certain versions, is hardwired from birth.
However, Gold's learnability result can easily be circumvented, if we either assume the learner learns a near-perfect approximation to the correct language or that the learner learns from typical input rather than arbitrarily devious input. In fact, it has been proven that simply receiving input from a speaker who produces positive instances randomly rather than according to a pre-specified plan leads to identifiability in the limit with probability 1.<ref>{{cite arXiv |author=Clark, A. |eprint=cs.CL/0212024 |title=Unsupervised Language Acquisition: Theory and Practice |class= cs.CL|year=2001 |format=PhD thesis}}</ref><ref>{{Cite thesis |degree=Ph.D. |title=A study of grammatical inference |author=Horning, J.J. |year=1969 |publisher=Computer Science Department, Stanford University }}</ref>
 
Recently, probabilistic grammars gained cognitive plausibility. There are degrees of difficulty in accessing different syntactic structures (e.g. the [[Accessibility Hierarchy]] for [[relative clause]]s). Probabilistic versions of [[Minimalist grammar|Minimalist Grammars]] have been used to compute information-theoretic [[entropy]] values that correlate well with psycholinguistic data on understandability and production difficulty.<ref>
{{cite journal
| author= John Hale
| title= Uncertainty About the Rest of the Sentence
| journal = Cognitive Science
| year = 2006
| volume = 30
| issue= 4
| pages = 643–672
| institution = Dept Linguistics, Michigan State University
| doi = 10.1207/s15516709cog0000_64
}}</ref>
 
==See also==
*[[Statistical parsing]]
*[[L-system]]
*[http://rfam.janelia.org/ Rfam Database]
*[http://infernal.janelia.org/ Infernal]
*[http://nlp.stanford.edu/software/lex-parser.shtml The Stanford Parser: A statistical parser]
 
==References==
<div style='display:none'>
 
<ref name="Chomsky_1956">{{cite journal |author=Chomsky, Noam|title= Three models for the description of language. |journal=IRE Transactions on Information Theory.|volume=2|pages=113–124|year=1956|doi= 10.1109/TIT.1956.1056813.}}</ref>
 
<ref name="Chomsky_1959">{{cite journal |author=Chomsky, Noam |title=On certain formal properties of grammars  |journal=Information and Control|volume=2|pages=2|year=137-167|doi=10.1016/S0019-9958(59)90362-6}}</ref>
 
<ref name="Eddy 1994">{{cite journal |author=Eddy S. R. and Durbin R.|title=RNA sequence analysis using covariance models.|journal=Nucleic Acids Research|volume=22|pages=2079–2088|year=1994|doi=10.1093/nar/22.11.2079}}</ref>
 
<ref name="Sakakibara 1994">{{cite journal |author=Sakakibara Y., Brown, M. Hughey, R., Mian, I. S., et al|title=Stochastic context-free grammars for tRNA modelling.  |journal=Nucleic Acids Research|volume=22|issue= 23|pages=5112–5120|year=1994|doi=10.1093/nar/22.23.5112}}</ref>
<ref name="Grat 1995">{{cite journal |author=Grat, L. 1995. |title=Automatic RNA secondary struture determination with stochastic context-free grammars. |journal=In Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T and Wodak, S. Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology,  AAAI Press|pages=236-144|year=1995}}</ref>
 
<ref name="Lefebvre 1995">{{cite journal |author=Lefebvre, F|title= An optimized parsing algorithm well suited to RNA folding  |journal=In Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T and Wodak, S. Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology. AAAI Press|pages=222–230|year=1995}}</ref>
<ref name="Lefebvre 1996">{{cite journal |author=Lefebvre, F.|title= A grammar-based unification of several alignment and folding algorithms. |journal=In States. D. J., Agarwal, P., Gaasterlan, T., Hunter, L and Smith R. F., eds., Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press|year=143-153}}</ref>
 
<ref name="Dowell 2004">{{cite journal |author=Dowell R and Eddy S|title=Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction.|journal=BMC Bioinformatics |volume=5|issue=71|doi=10.1186/1471-2105-5-71}}</ref>
<ref name="McCaskill 1990">{{cite journal |author=McCaskill JS.|title= The Equilibrium Partition Function and Base Pair Binding Probabilities for RNA Secondary Structure.  |journal=Biopolymers.|volume=29|pages=1105–19|year=1990}}</ref>
 
<ref name="Juan 1999">{{cite journal |author=Juan V, Wilson C.|title= RNA Secondary Structure Prediction Based on Free Energy and Phylogenetic Analysis.|journal=J Mol Biol.|volume= 289|pages=935–947.|year=1999|doi=10.1006/jmbi.1999.2801}}</ref>
 
<ref name="Zuker 2000">{{cite journal |author=Zuker M|title= Calculating Nucleic Acid Secondary Structure.|journal=Curr Opin Struct Biol.|volume=10|pages=303–310|year=2000|doi=10.1016/S0959-440X(00)00088-9}}</ref>
 
<ref name="Mathews 1999">{{cite journal |author=Mathews DH., Sabina J., Zuker M., Turner DH.|title= Expanded sequence dependence of thermodynamic parameterss improves prediction of RNA secondary structure.|journal=J Mol Biol. .|volume=288|pages=911–940|year=1999|doi=10.1006/jmbi.1999.2700}}</ref>
 
<ref name="Knudsen 2003">{{cite journal |author=B. Knudsen and J. Hein.|title= Pfold: RNA secondary structure prediction using stochastic context-free grammars. |journal=Nucleic Acids Research. |volume=31|issue=13|pages=3423–3428|year=2003|doi=10.1093/nar/gkg614}}</ref>
 
<ref name="Knudsen 1999">{{cite journal |author=Knudsen B, Hein J|title= RNA Secondary Structure Prediction Using Stochastic Context-Free Grammars and Evolutionary History.  |journal=Bioinformatics|volume=15|pages=446–454|year=1999|doi=10.1093/bioinformatics/15.6.446}}</ref>
 
<ref name="Rivas 2001">{{cite journal |author=Rivas E, Eddy SR|title= Noncoding RNA Gene Detection Using Comparative Sequence Analysis.  |journal=BMC Bioinformatics.|volume=2|year=2001|doi=10.1186/1471-2105-2-8 }}</ref>
 
<ref name="Durbin 1998">{{cite book |editor=R. Durbin, S. Eddy, A. Krogh, G. Mitchinson |title=Biological sequence analysis : probabilistic models of proteins and nucleic acids |url=http://books.google.com/books?id=R5P2GlJvigQC |publisher=Cambridge University Press |year=1998 |isbn=978-0-521-62971-3}}</ref>
 
<ref name="Holmes 2002">{{cite journal |author=Holmes I, Rubin GM|title= Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars.  |journal=In Pac Symp Biocomput 2002|volume=|pages=163–174|year=2002}}</ref>
 
<ref name="Chomsky_1957">{{cite book |editor=Noam Chomsky|title=Syntactic Structures.|publisher= Mouton & Co. Publishers, Den Haag, Netherlands,|year= 1957}}</ref>
 
<ref name="Nawrocki 2013">{{cite journal| author="Nawrocki E. P., and Eddy S. R."|title=Infernal 1.1:100-fold faster RNA homology searches.|journal=Bioinformatics.|volume=29|pages=2933–2935|year=2013|doi=10.1093/bioinformatics/btt509}}</ref>
 
<ref name="Gardner 2010">{{cite journal |author=P.P. Gardner, J. Daub, J. Tate, B.L. Moore, I.H. Osuch, S. Griffiths-Jones, R.D. Finn, E.P. Nawrocki, D.L. Kolbe, S.R. Eddy, A. Bateman.|title=Rfam: Wikipedia, clans and the "decimal" release. |journal=Nucleic Acids Research 2011.|volume=39|year=2010|doi=10.1093/nar/gkq1129}}</ref>
 
<ref name="Yao 2006">{{cite journal |author=Yao Z, Weinberg Z, Ruzzo WL.|title=CMfinder-a covariance model based RNA motif finding algorithm. |journal=Bioinformatics|volume=22.
|pages=445–452.|year=2006|doi=10.1093/bioinformatics/btk008}}</ref>
<ref name="Rabani 2008">{{cite journal |author=Rabani M, Kertesz M, Segal E.|title= Computational prediction of RNA structural motifs involved in post-transcriptional regulatory processes.  |journal=. Proc Natl Acad Sci USA|volume=105|pages=14885–14890|year=2008|doi=10.1073/pnas.0803169105
}}</ref>
 
<ref name="Goodarzi 2012">{{cite journal |author=Goodarzi H, Najafabadi HS, Oikonomou P, Greco TM, Fish L, Salavati R, Cristea IM, Tavazoie S.|title= Systematic discovery of structural elements governing stability of mammalian messenger RNAs. |journal=Nature|volume=485|pages=264–268|year=2012|doi=10.1038/nature11013}}</ref>
 
<ref name="Goodarzi 2012">{{cite journal |author=Goodarzi H. et al |title=Systematic discovery of structural elements governing stability of mammalian messenger RNAs|journal=Nature|volume=485 |issue=20  |year=2012| doi= 10.1038/nature1101}}</ref>
 
<ref name="Lari and Young 1990">{{cite journal |author=Lari K and Young, S. J.|title= The estimation of stochastic context-free grammars using the inside-outside algorithm |journal=Computer Speech and Language|volume=4|pages=35–56|year=1990|doi=10.1016/0885-2308(90)90022-X}}</ref>
<ref name="Sipser 1996">{{cite journal |author=Sipser M|title= Introduction to Theory of Computation|journal=Brooks Cole Pub Co.|volume=|year=1996}}</ref>
 
<ref name="Lari and Young 1991">{{cite journal |author=Lari K and Young, S. J.|title= Applications of stochastic context-free grammars using the inside-outside algorithm.|journal=Computer Speech and Language|volume=5|pages=237–257|year=1991|doi=10.1016/0885-2308(91)90009-F}}</ref>
 
<ref name="Harrison 1978">{{cite journal |author=Harrison MA: |title= Introduction to Formal Language Theory. |journal=Addison-Wesley;|year=1978}}</ref>
 
<ref name="Myers 1995">{{cite journal |author=Myers, G.|title= Approximately matching context-free languages |journal=Information Processing Letters|volume=54|pages=85–92|year=1995|doi=10.1016/0020-0190(95)00007-Y}}</ref>
 
<ref name="Hopcroft 1979">{{cite journal |author=Hopcroft JE, Ullman JD:|title= Introduction to Automata Theory, Languages, and Computation. |journal=Addison-Wesley;|year=1979}}</ref>
 
<ref name="Giegerich 2000">{{cite journal |author=Giegerich R|title=Explaining and Controlling Ambiguity in Dynamic Programming.|journal= In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching 1848 Edited by: Giancarlo R, Sankoff D. Montréal, Canada: Springer-Verlag, Berlin.|pages=46–59|year=2000}}</ref>
 
<ref name="Tavaré 1986">{{cite journal |author=Tavaré S.|title= Some probabilistic and statistical problems in the analysis of DNA sequences. |journal=Lectures on Mathematics in the Life Sciences. American Mathematical Society|volume=17|pages=57–86|year=1986}}</ref>
 
<ref name="Muse 1995">{{cite journal |author=Muse SV. |title=Evolutionary analyses of DNA sequences subject to constraints of secondary structure.|journal=Genetics.|volume=139 |issue=3 |pages=1429–1439|year=1995|pmc=1206468}}</ref>
 
<ref name="Baker 1979">{{cite journal |author=J.K. Baker. |title=Trainable grammars for speech recognition|journal=In D. Klatt and J. Wolf, editors, Speech Communication Papers for the 97th Meeting of the Acoustical Society of America|volume=36 |issue=20 |pages=545–550 |year=1979}}</ref>
 
<ref name="Schöniger 1994">{{cite journal |author=Schöniger M., von Haeseler A. |title=A stochastic model for the evolution of autocorrelated DNA sequences. |volume=3 |issue=3 |pages=240–7 |year=1994| journal = Mol Phylogenet Evol.|doi=10.1006/mpev.1994.1026}}</ref>
</div>
{{reflist|2}}
 
{{DEFAULTSORT:Stochastic Context-Free Grammar}}
[[Category:Bioinformatics]]
[[Category:Formal languages]]
[[Category:Natural language parsing]]
[[Category:Statistical natural language processing]]
[[Category:Probabilistic models]]

Latest revision as of 15:14, 8 December 2014

My name's Alvin Hankinson but everybody calls me Alvin. I'm from Norway. I'm studying at the college (1st year) and I play the Dobro for 10 years. Usually I choose songs from the famous films :D.
I have two sister. I like Record collecting, watching movies and Fishkeeping.

my blog post :: http://www.161991up.com