Conditional random field
Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.
Other applications of CRFs include labeling or parsing of sequential data for natural language processing or biological sequences,[1] part-of-speech tagging, shallow parsing,[2] named entity recognition,[3] gene finding, and peptide critical functional region finding,[4] as well as object recognition[5] and image segmentation in computer vision.[6]
Description
CRFs are a type of discriminative undirected probabilistic graphical model.
Lafferty, McCallum and Pereira[1] define a CRF on observations $\boldsymbol{X}$ and random variables $\boldsymbol{Y}$ as follows:

Let $G = (V, E)$ be a graph such that $\boldsymbol{Y} = (\boldsymbol{Y}_v)_{v \in V}$, so that $\boldsymbol{Y}$ is indexed by the vertices of $G$.
Then $(\boldsymbol{X}, \boldsymbol{Y})$ is a conditional random field when each random variable $\boldsymbol{Y}_v$, conditioned on $\boldsymbol{X}$, obeys the Markov property with respect to the graph; that is, its probability depends only on its neighbours in $G$:
$P(\boldsymbol{Y}_v \mid \boldsymbol{X}, \{\boldsymbol{Y}_w : w \neq v\}) = P(\boldsymbol{Y}_v \mid \boldsymbol{X}, \{\boldsymbol{Y}_w : w \sim v\})$, where $w \sim v$ means that $w$ and $v$ are neighbors in $G$.
What this means is that a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets $\boldsymbol{X}$ and $\boldsymbol{Y}$, the observed and output variables, respectively; the conditional distribution $p(\boldsymbol{Y} \mid \boldsymbol{X})$ is then modeled.
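For the linear-chain case emphasized below, this conditional distribution takes a log-linear form. The following display is a standard parameterization from the literature (not spelled out in this article's source), with $Z(x)$ the input-dependent normalizer and feature functions $f_k$ as described in the Examples section:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{k} \theta_k\, f_k(i, y_{i-1}, y_i, x) \right), \qquad Z(x) = \sum_{y'} \exp\left( \sum_{i=1}^{n} \sum_{k} \theta_k\, f_k(i, y'_{i-1}, y'_i, x) \right).$$

Because $Z(x)$ sums over all possible label sequences $y'$, computing it naively is exponential in the sequence length; the chain structure is what makes it tractable by dynamic programming.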
Inference
For general graphs, the problem of exact inference in CRFs is intractable. The inference problem for a CRF is essentially the same as for an MRF, and the same arguments hold.[7] However, there exist special cases for which exact inference is feasible:
- If the graph is a chain or a tree, message passing algorithms yield exact solutions. The algorithms used in these cases are analogous to the forward-backward and Viterbi algorithm for the case of HMMs (see the sketch after this list).
- If the CRF only contains pair-wise potentials and the energy is submodular, combinatorial min cut/max flow algorithms yield exact solutions.
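A minimal sketch of both chain recursions, assuming the model has already been reduced to per-position label scores and a label-transition score matrix, both in log space (these arrays are stand-ins for the weighted feature sums of a trained linear-chain CRF):

```python
import numpy as np
from scipy.special import logsumexp

def viterbi(node_scores, trans):
    """Most likely label sequence under a linear-chain CRF.

    node_scores: (n, L) array of per-position, per-label log-potentials.
    trans: (L, L) array of log-potentials for transitions (previous -> next).
    """
    n, L = node_scores.shape
    delta = node_scores[0].copy()        # best score of any path ending in each label
    back = np.zeros((n, L), dtype=int)   # back-pointers to the best previous label
    for i in range(1, n):
        cand = delta[:, None] + trans    # cand[p, q]: best path ending in p, then p -> q
        back[i] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + node_scores[i]
    path = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):        # follow back-pointers from the end
        path.append(int(back[i, path[-1]]))
    return path[::-1]

def log_partition(node_scores, trans):
    """log Z(x): the same recursion with log-sum-exp replacing max (forward algorithm)."""
    alpha = node_scores[0].copy()
    for i in range(1, node_scores.shape[0]):
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + node_scores[i]
    return logsumexp(alpha)
```

Both recursions run in $O(nL^2)$ time for $n$ positions and $L$ labels; swapping max for log-sum-exp is exactly the relationship between Viterbi and the forward algorithm in HMMs.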
If exact inference is impossible, several algorithms can be used to obtain approximate solutions. These include:
- Loopy belief propagation
- Alpha expansion
- Mean field inference
- Linear programming relaxations
Parameter learning
Learning the parameters $\theta$ is usually done by maximum likelihood learning for $p(Y_i \mid X_i; \theta)$. If all nodes have exponential family distributions and all nodes are observed during training, this optimization is convex.[7] It can be solved for example using gradient descent algorithms, or quasi-Newton methods such as the L-BFGS algorithm. On the other hand, if some variables are unobserved, the inference problem has to be solved for these variables. Exact inference is intractable in general graphs, so approximations have to be used.
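As an illustration of L-BFGS training in practice, the following sketch assumes the third-party sklearn-crfsuite package (a Python wrapper around CRFsuite); the feature dictionaries and labels shown are toy, hypothetical values:

```python
import sklearn_crfsuite

# Each sentence is a list of per-token feature dicts; labels are per-token strings.
X_train = [[{'word.lower()': 'the', 'is_title': False},
            {'word.lower()': 'eu', 'is_title': True}]]
y_train = [['DET', 'ORG']]

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',   # quasi-Newton optimization of the (convex) log-likelihood
    c1=0.1, c2=0.1,      # L1 and L2 regularization strengths
    max_iterations=100,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # most likely label sequence for each sentence
```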
Examples
In sequence modeling, the graph of interest is usually a chain graph. An input sequence of observed variables $X$ represents a sequence of observations and $Y$ represents a hidden (or unknown) state variable that needs to be inferred given the observations. The $Y_i$ are structured to form a chain, with an edge between each $Y_{i-1}$ and $Y_i$. As well as having a simple interpretation of the $Y_i$ as "labels" for each element in the input sequence, this layout admits efficient algorithms for:
- model training, learning the conditional distributions between the $Y_i$ and feature functions from some corpus of training data.
- decoding, determining the probability of a given label sequence $Y$ given $X$.
- inference, determining the most likely label sequence $Y$ given $X$.
The conditional dependency of each $Y_i$ on $X$ is defined through a fixed set of feature functions of the form $f(i, Y_{i-1}, Y_i, X)$, which can be thought of as measurements on the input sequence that partially determine the likelihood of each possible value for $Y_i$. The model assigns each feature a numerical weight and combines them to determine the probability of a certain value for $Y_i$.
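For example, in part-of-speech tagging a feature might fire when the current word is capitalized and the proposed label is a proper noun. A hypothetical indicator feature, sketched in Python:

```python
def f_cap_propn(i, y_prev, y_i, x):
    """Indicator feature: 1 if word i is capitalized and labeled PROPN, else 0."""
    return 1.0 if x[i][0].isupper() and y_i == 'PROPN' else 0.0
```

A trained model would multiply such indicators by their learned weights and sum them into the per-position scores used above.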
Linear-chain CRFs have many of the same applications as conceptually simpler hidden Markov models (HMMs), but relax certain assumptions about the input and output sequence distributions. An HMM can loosely be understood as a CRF with very specific feature functions that use constant probabilities to model state transitions and emissions. Conversely, a CRF can loosely be understood as a generalization of an HMM that makes the constant transition probabilities into arbitrary functions that vary across the positions in the sequence of hidden states, depending on the input sequence.
Notably, in contrast to HMMs, CRFs can contain any number of feature functions, the feature functions can inspect the entire input sequence at any point during inference, and the range of the feature functions need not have a probabilistic interpretation.
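To make the correspondence concrete, here is one standard way (a sketch, with hypothetical index symbols $u, v$ for states and $o$ for observations) to write an HMM as a linear-chain CRF with fixed weights:

$$f_{u,v}(i, y_{i-1}, y_i, x) = \mathbf{1}[y_{i-1} = u]\,\mathbf{1}[y_i = v] \quad \text{with weight} \quad \theta_{u,v} = \log p(y_i = v \mid y_{i-1} = u),$$
$$g_{v,o}(i, y_{i-1}, y_i, x) = \mathbf{1}[y_i = v]\,\mathbf{1}[x_i = o] \quad \text{with weight} \quad \theta_{v,o} = \log p(x_i = o \mid y_i = v).$$

With these choices the CRF's conditional distribution coincides with the HMM's posterior over state sequences; letting the weights vary freely, and letting the features depend on the whole input $x$, recovers the general linear-chain CRF.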
Variants
Higher-order CRFs and semi-Markov CRFs
CRFs can be extended into higher order models by making each $Y_i$ dependent on a fixed number $k$ of previous variables $Y_{i-k}, \dots, Y_{i-1}$. In conventional formulations of higher order CRFs, training and inference are only practical for small values of $k$ (such as $k \leq 5$),[8] since their computational cost increases exponentially with $k$.
However, a more recent advance addresses these issues by leveraging concepts and tools from Bayesian nonparametrics. Specifically, the CRF-infinity approach[9] is a CRF-type model that can learn infinitely long temporal dynamics in a scalable fashion. This is achieved by introducing a novel potential function for CRFs based on the Sequence Memoizer (SM), a nonparametric Bayesian model for learning infinitely long dynamics in sequential observations.[10] To render such a model computationally tractable, CRF-infinity employs a mean-field approximation[11] of the postulated novel potential functions (which are driven by an SM). This allows efficient approximate training and inference algorithms to be devised without undermining the model's capability to capture temporal dependencies of arbitrary length.
There exists another generalization of CRFs, the semi-Markov conditional random field (semi-CRF), which models variable-length segmentations of the label sequence $Y$.[12] This provides much of the power of higher-order CRFs to model long-range dependencies of the $Y_i$, at a reasonable computational cost.
Finally, large-margin models for structured prediction, such as the structured support vector machine, can be seen as an alternative training procedure to CRFs.
Latent-dynamic conditional random field
Latent-dynamic conditional random fields (LDCRF) or discriminative probabilistic latent variable models (DPLVM) are a type of CRF for sequence tagging tasks. They are latent variable models that are trained discriminatively.
In an LDCRF, like in any sequence tagging task, given a sequence of observations x = $x_1, \dots, x_n$, the main problem the model must solve is how to assign a sequence of labels y = $y_1, \dots, y_n$ from one finite set of labels Y. Instead of directly modeling P(y|x) as an ordinary linear-chain CRF would do, a set of latent variables h is "inserted" between x and y using the chain rule of probability:[13]

$$P(\mathbf{y} \mid \mathbf{x}) = \sum_\mathbf{h} P(\mathbf{y} \mid \mathbf{h}, \mathbf{x})\, P(\mathbf{h} \mid \mathbf{x})$$
This allows capturing latent structure between the observations and labels.[14] While LDCRFs can be trained using quasi-Newton methods, a specialized version of the perceptron algorithm called the latent-variable perceptron has been developed for them as well, based on Collins' structured perceptron algorithm.[13] These models find applications in computer vision, specifically gesture recognition from video streams[14] and shallow parsing.[13]
See also

- Hammersley–Clifford theorem
- Maximum entropy Markov model (MEMM)
References
- ^ a b Lafferty, J.; McCallum, A.; Pereira, F. (2001). "Conditional random fields: Probabilistic models for segmenting and labeling sequence data". Proc. 18th International Conf. on Machine Learning. Morgan Kaufmann. pp. 282–289.
- ^ Sha, F.; Pereira, F. (2003). "Shallow parsing with conditional random fields".
- ^ Settles, B. (2004). "Biomedical named entity recognition using conditional random fields and rich feature sets" (PDF). Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. pp. 104–107.
- ^ Chang KY; Lin T-p; Shih L-Y; Wang C-K (2015). "Analysis and Prediction of the Critical Regions of Antimicrobial Peptides Based on Conditional Random Fields". PLOS ONE. 10 (3): e0119490. Bibcode:2015PLoSO..1019490C. doi:10.1371/journal.pone.0119490. PMC 4372350. PMID 25803302.
- ^ J.R. Ruiz-Sarmiento; C. Galindo; J. Gonzalez-Jimenez (2015). "UPGMpp: a Software Library for Contextual Object Recognition". 3rd Workshop on Recognition and Action for Scene Understanding (REACTS).
- ^ He, X.; Zemel, R.S.; Carreira-Perpiñán, M.A. (2004). "Multiscale conditional random fields for image labeling". IEEE Computer Society. CiteSeerX 10.1.1.3.7826.
- ^ a b Sutton, Charles; McCallum, Andrew (2010). "An Introduction to Conditional Random Fields". arXiv:1011.4088v1 [stat.ML].
- ^ Lavergne, Thomas; Yvon, François (September 7, 2017). "Learning the Structure of Variable-Order CRFs: a Finite-State Perspective". Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics. p. 433.
- ^ Chatzis, Sotirios; Demiris, Yiannis (2013). "The Infinite-Order Conditional Random Field Model for Sequential Data Modeling". IEEE Transactions on Pattern Analysis and Machine Intelligence. 35 (6): 1523–1534. doi:10.1109/tpami.2012.208. hdl:10044/1/12614. PMID 23599063. S2CID 690627.
- ^ Gasthaus, Jan; Teh, Yee Whye (2010). "Improvements to the Sequence Memoizer" (PDF). Proc. NIPS.
- ^ Celeux, G.; Forbes, F.; Peyrard, N. (2003). "EM Procedures Using Mean Field-Like Approximations for Markov Model-Based Image Segmentation". Pattern Recognition. 36 (1): 131–144. Bibcode:2003PatRe..36..131C. CiteSeerX 10.1.1.6.9064. doi:10.1016/s0031-3203(02)00027-4.
- ^ Sarawagi, Sunita; Cohen, William W. (2005). "Semi-Markov conditional random fields for information extraction". In Lawrence K. Saul; Yair Weiss; Léon Bottou (eds.). Advances in Neural Information Processing Systems 17. Cambridge, MA: MIT Press. pp. 1185–1192. Archived from the original (PDF) on 2019-11-30. Retrieved 2015-11-12.
- ^ a b c Xu Sun; Takuya Matsuzaki; Daisuke Okanohara; Jun'ichi Tsujii (2009). Latent Variable Perceptron Algorithm for Structured Classification. IJCAI. pp. 1236–1242. Archived from the original on 2018-12-06. Retrieved 2018-12-06.
- ^ a b Morency, L. P.; Quattoni, A.; Darrell, T. (2007). "Latent-Dynamic Discriminative Models for Continuous Gesture Recognition" (PDF). 2007 IEEE Conference on Computer Vision and Pattern Recognition. p. 1. CiteSeerX 10.1.1.420.6836. doi:10.1109/CVPR.2007.383299. ISBN 978-1-4244-1179-5. S2CID 7117722.
Further reading
- McCallum, A.: Efficiently inducing features of conditional random fields. In: Proc. 19th Conference on Uncertainty in Artificial Intelligence. (2003)
- Wallach, H.M.: Conditional random fields: An introduction. Technical report MS-CIS-04-21, University of Pennsylvania (2004)
- Sutton, C., McCallum, A.: An Introduction to Conditional Random Fields for Relational Learning. In "Introduction to Statistical Relational Learning". Edited by Lise Getoor and Ben Taskar. MIT Press. (2006) Online PDF
- Klinger, R., Tomanek, K.: Classical Probabilistic Models and Conditional Random Fields. Algorithm Engineering Report TR07-2-013, Department of Computer Science, Dortmund University of Technology, December 2007. ISSN 1864-4503. Online PDF