Concept drift: Difference between revisions
{{Short description|Change of statistical properties over time}} |
|||
|||
In [[predictive analytics]], [[data science]], [[machine learning]] and related fields, '''concept drift''' or '''drift''' is an evolution of data that invalidates the [[data model]]. It happens when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes. '''Drift detection''' and '''drift adaptation''' are of paramount importance in the fields that involve dynamically changing data and data models. |
|||
==Predictive model decay== |
|||
The term ''concept'' refers to the quantity to be predicted. More generally, it can also refer to other phenomena of interest besides the target concept, such as an input, but in the context of concept drift the term commonly refers to the target variable.
|||
In machine learning and [[predictive analytics]], this drift phenomenon is called concept drift. In machine learning, a common element of a data model is the statistical properties, such as the [[probability distribution]], of the actual data. If they deviate from the statistical properties of the [[training data set]], then the learned predictions may become invalid if the drift is not addressed.<ref>{{Cite book|doi = 10.1007/978-981-16-8531-6_4|chapter = A Drift Aware Hierarchical Test Based Approach for Combating Social Spammers in Online Social Networks|title = Data Mining|series = Communications in Computer and Information Science|year = 2021|last1 = Koggalahewa|first1 = Darshika|last2 = Xu|first2 = Yue|last3 = Foo|first3 = Ernest|volume = 1504|pages = 47–61|isbn = 978-981-16-8530-9|s2cid = 245009299}}</ref><ref>{{Cite journal|doi = 10.1007/BF00116900|title = Learning in the presence of concept drift and hidden contexts|year = 1996|last1 = Widmer|first1 = Gerhard|last2 = Kubat|first2 = Miroslav|journal = Machine Learning|volume = 23|pages = 69–101|s2cid = 206767784|doi-access = free}}</ref><ref>{{Cite book|doi = 10.1007/978-3-030-64243-3_9|chapter = A Drift Detection Method Based on Diversity Measure and McDiarmid's Inequality in Data Streams|title = Green, Pervasive, and Cloud Computing|series = Lecture Notes in Computer Science|year = 2020|last1 = Xia|first1 = Yuan|last2 = Zhao|first2 = Yunlong|volume = 12398|pages = 115–122|isbn = 978-3-030-64242-6|s2cid = 227275380}}</ref><ref>{{Cite journal|doi=10.1109/TKDE.2018.2876857|title=Learning under Concept Drift: A Review|year=2018|last1=Lu|first1=Jie|last2=Liu|first2=Anjin|last3=Dong|first3=Fan|last4=Gu|first4=Feng|last5=Gama|first5=Joao|last6=Zhang|first6=Guangquan|journal=IEEE Transactions on Knowledge and Data Engineering|page=1|arxiv=2004.05785|s2cid=69449458}}</ref>
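One way to check whether recent data deviate from the training data is a two-sample test on a single numeric feature. The following is a minimal sketch, assuming the two-sample Kolmogorov–Smirnov statistic as the test; the function names are hypothetical, and the coefficient 1.358 is the usual large-sample value for a significance level of about 0.05.

```python
import bisect
import math


def ks_statistic(reference, recent):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical distribution functions of the two samples."""
    ref_sorted = sorted(reference)
    rec_sorted = sorted(recent)
    n, m = len(ref_sorted), len(rec_sorted)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in ref_sorted + rec_sorted:
        cdf_ref = bisect.bisect_right(ref_sorted, x) / n
        cdf_rec = bisect.bisect_right(rec_sorted, x) / m
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_rec))
    return largest_gap


def distribution_changed(reference, recent, coefficient=1.358):
    """Flag a change when the statistic exceeds the large-sample critical value.

    coefficient=1.358 is an assumed default (significance level of about 0.05).
    """
    n, m = len(reference), len(recent)
    critical = coefficient * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, recent) > critical


training_feature = list(range(100))
recent_feature = [x + 50 for x in range(100)]  # same shape, shifted upward
print(distribution_changed(training_feature, recent_feature))
```

A test of this kind looks at one input at a time and says nothing about the target variable itself, so it can only hint at drift rather than confirm it.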
|||
==Data configuration decay== |
|||
|||
Another important area is [[software engineering]], where three types of data drift affecting [[data fidelity]] may be recognized. Changes in the software environment ("infrastructure drift") may invalidate software infrastructure configuration. "Structural drift" happens when the data [[database schema|schema]] changes, which may invalidate databases. "Semantic drift" is a change in the meaning of data while the structure does not change. In many cases this may happen in complicated applications when many independent developers introduce changes without proper awareness of the effects of their changes in other areas of the software system.<ref name=devto>[https://dev.to/stack-labs/driftctl-and-terraform-they-re-two-of-a-kind-22p1 "Driftctl and Terraform, they're two of a kind!"]</ref><ref name=gipa>Girish Pancha, [https://www.cmswire.com/big-data/big-datas-hidden-scourge-data-drift/ Big Data's Hidden Scourge: Data Drift], ''CMSWire'', April 8, 2016</ref>
|||
In a [[fraud detection]] application the target concept may be a [[Binary numeral system|binary]] attribute FRAUDULENT with values "yes" or "no" that indicates whether a given transaction is fraudulent. Or, in a [[weather prediction]] application, there may be several target concepts such as TEMPERATURE, PRESSURE, and HUMIDITY. |
|||
For many application systems, the nature of the data on which they operate is subject to change for various reasons, e.g., changes in the business model, system updates, or switching the platform on which the system operates.<ref name=gipa/>
|||
|||
In [[cloud computing]], infrastructure drift that affects applications running in the cloud may be caused by updates of the cloud software.<ref name=devto/>
|||
|||
There are several types of detrimental effects of data drift on data fidelity. Data corrosion occurs when drifted data pass into the system undetected. Data loss happens when valid data are ignored due to non-conformance with the applied schema. Squandering occurs when new data fields are introduced upstream in the data processing pipeline but are absent somewhere downstream.<ref name=gipa/>
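A pipeline stage can make structural drift visible by comparing each incoming record with the schema it expects. The sketch below is illustrative only: the schema format (field name mapped to a Python type), the function name and the field names are assumptions, not part of any particular tool.

```python
def schema_drift_report(record, expected_schema):
    """Compare one record (a dict) with an expected schema.

    expected_schema maps a field name to the Python type expected for it
    (a hypothetical format chosen for this sketch).
    """
    missing = sorted(name for name in expected_schema if name not in record)
    unexpected = sorted(name for name in record if name not in expected_schema)
    wrong_type = sorted(
        name
        for name, expected_type in expected_schema.items()
        if name in record and not isinstance(record[name], expected_type)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


schema = {"customer_id": int, "amount": float}
record = {"customer_id": "42", "amount": 9.5, "currency": "EUR"}
print(schema_drift_report(record, schema))
```

In this reading, an unexpected field that is silently dropped would be a case of squandering, a record rejected for a type mismatch would be data loss, and a mismatch that passes through unreported would be data corrosion.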
|||
|||
==Inconsistent data== |
|||
"Data drift" may refer to the phenomenon when database records fail to match the real-world data due to changes in the latter over time. This is a common problem with databases involving people, such as customers, employees, citizens, residents, etc. Human data drift may be caused by unrecorded changes in personal data, such as place of residence or name, as well as by errors during data input.<ref>Matthew Magne, [https://www.informationweek.com/big-data/data-drift-happens-7-pesky-problems-with-people-data "Data Drift Happens: 7 Pesky Problems with People Data"], ''[[InformationWeek]]'', July 19, 2017</ref>

"Data drift" may also refer to inconsistency of data elements between several replicas of a database. The reasons can be difficult to identify. A simple way to detect such drift is to run a [[checksum]] regularly. However, the remedy may not be so easy.<ref>Daniel Nichter, ''Efficient MySQL Performance'', 2021, {{ISBN|1098105060}}, [https://books.google.com/books?id=CzZTEAAAQBAJ&pg=PA299 p. 299]</ref>
|||
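The checksum comparison between replicas can be sketched as follows. This is a simplified illustration that assumes the rows of a table have already been fetched into memory as tuples; the function names are hypothetical, and production tools typically compute checksums inside the database and in chunks.

```python
import hashlib


def table_checksum(rows):
    """Return a SHA-256 digest of the rows that does not depend on row order."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def replicas_in_sync(primary_rows, replica_rows):
    """Report whether two copies of a table hold the same rows."""
    return table_checksum(primary_rows) == table_checksum(replica_rows)


primary = [(1, "alice"), (2, "bob")]
replica = [(2, "bob"), (1, "alice")]      # same rows, different order
drifted = [(1, "alice"), (2, "bobby")]    # one value differs
print(replicas_in_sync(primary, replica), replicas_in_sync(primary, drifted))
```

A mismatch only shows that the replicas differ; finding which rows differ and which copy is correct is the harder part.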
==Examples== |
||
The behavior of the customers in an [[online shop]] may change over time. For example, suppose that weekly merchandise sales are to be predicted and that a [[predictive modelling|predictive model]] has been developed that works satisfactorily. The model may use inputs such as the amount of money spent on [[advertising]], [[Promotion (marketing)|promotions]] being run, and other metrics that may affect sales. The model is likely to become less and less accurate over time; this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality, which means that shopping behavior changes seasonally. For example, sales may be higher in the winter holiday season than during the summer. Concept drift generally occurs when the covariates that comprise the data set begin to explain the variation of the target set less accurately. Some [[confounding]] variables may have emerged that cannot be accounted for, which causes the model accuracy to decrease progressively with time. Generally, it is advised to perform health checks as part of the post-production analysis and to re-train the model with new assumptions upon signs of concept drift.
|||
* '''STAGGER''', J.C.Schlimmer, R.H.Granger, Incremental Learning from Noisy Data, Mach. Learn., vol.1, no.3, 1986. |
|||
* '''SEA concepts''', N.W.Street, Y.Kim, A streaming ensemble algorithm (SEA) for large-scale classification, KDD'01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 2001. [http://www.liaad.up.pt/~jgama/ales/ales_5.html/ Download] from J.Gama webpage. |
|||
* '''A toolbox''', A.Narasimhamurthy, L.I.Kuncheva, A framework for generating data to simulate changing environments, Proc. IASTED, Artificial Intelligence and Applications, Innsbruck, Austria, 2007. [http://www.bangor.ac.uk/~mas00a/activities/epsrc2006/changing_environments.html Access]. |
|||
===Data generation frameworks===

* Lindstrom P, SJ Delany & B MacNamee (2008) Autopilot: Simulating Changing Concepts in Real Data. In: Proceedings of the 19th Irish Conference on Artificial Intelligence & Cognitive Science, D Bridge, K Brown, B O'Sullivan & H Sorensen (eds.), p272-263 [http://www.comp.dit.ie/sjdelany/publications/aics08-pl.pdf PDF]

* Narasimhamurthy A., L.I. Kuncheva, A framework for generating data to simulate changing environments, Proc. IASTED, Artificial Intelligence and Applications, Innsbruck, Austria, 2007, 384-389 [http://www.bangor.ac.uk/~mas00a/papers/anlkAIA07.pdf PDF]

==Possible remedies==
|||
To prevent deterioration in [[prediction]] accuracy because of concept drift, ''reactive'' and ''tracking'' solutions can be adopted. Reactive solutions retrain the model in reaction to a triggering mechanism, such as a change-detection test,<ref>{{Cite book|first=Michele|last=Basseville|url=http://worldcat.org/oclc/876004326|title=Detection of abrupt changes: theory and application|date=1993|publisher=Prentice Hall|isbn=0-13-126780-9|oclc=876004326}}</ref><ref>{{cite book |last1=Alippi |first1=C. |last2=Roveri |first2=M. |chapter=Adaptive Classifiers in Stationary Conditions |chapter-url= |title=2007 International Joint Conference on Neural Networks |publisher=IEEE |date=2007 |isbn=978-1-4244-1380-5 |pages=1008–13 |doi=10.1109/ijcnn.2007.4371096|s2cid=16255206 }}</ref> to explicitly detect concept drift as a change in the statistics of the data-generating process. When concept drift is detected, the current model is no longer up-to-date and must be replaced by a new one to restore prediction accuracy.<ref>{{cite book |last1=Gama |first1=J. |last2=Medas |first2=P. |last3=Castillo |first3=G. |last4=Rodrigues |first4=P. |chapter=Learning with Drift Detection |chapter-url= |title=Advances in Artificial Intelligence – SBIA 2004 |publisher=Springer |date=2004 |isbn=978-3-540-28645-5 |pages=286–295 |doi=10.1007/978-3-540-28645-5_29|s2cid=2606652 }}</ref><ref>{{cite journal |last1=Alippi |first1=C. |last2=Boracchi |first2=G. |last3=Roveri |first3=M. |title=A just-in-time adaptive classification system based on the intersection of confidence intervals rule |journal=Neural Networks |volume=24 |issue=8 |pages=791–800 |date=2011 |doi=10.1016/j.neunet.2011.05.012 |pmid=21723706 |url=}}</ref> A shortcoming of reactive approaches is that performance may decay until the change is detected. Tracking solutions seek to track the changes in the concept by continually updating the model. 
Methods for achieving this include [[online machine learning]], frequent retraining on the most recently observed samples,<ref>{{cite journal |last1=Widmer |first1=G. |last2=Kubat |first2=M. |title=Learning in the presence of concept drift and hidden contexts |journal=Machine Learning |volume=23 |issue=1 |pages=69–101 |date=1996 |doi=10.1007/bf00116900 |s2cid=206767784 |url=|doi-access=free }}</ref> and maintaining an ensemble of classifiers where one new classifier is trained on the most recent batch of examples and replaces the oldest classifier in the ensemble.<ref>{{cite journal |last1=Elwell |first1=R. |last2=Polikar |first2=R. |title=Incremental Learning of Concept Drift in Nonstationary Environments |journal=IEEE Transactions on Neural Networks |volume=22 |issue=10 |pages=1517–31 |date=2011 |doi=10.1109/tnn.2011.2160459 |pmid=21824845 |s2cid=9136731 |url=}}</ref> |
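A reactive solution can be sketched as a change-detection test on the stream of prediction errors. The example below is a minimal illustration in the spirit of the error-rate monitoring described by Gama et al. (2004), not a reference implementation: the class name, the warm-up of 30 samples and the thresholds of two and three standard deviations are assumptions chosen for the sketch.

```python
import math


class ErrorRateDriftDetector:
    """Change-detection test on a stream of prediction outcomes."""

    def __init__(self, min_samples=30, warning_factor=2.0, drift_factor=3.0):
        self.min_samples = min_samples        # assumed warm-up length
        self.warning_factor = warning_factor  # assumed warning threshold
        self.drift_factor = drift_factor      # assumed drift threshold
        self.reset()

    def reset(self):
        """Forget all statistics, e.g. after the model has been replaced."""
        self.n = 0
        self.errors = 0
        self.best_rate = math.inf
        self.best_std = math.inf

    def update(self, prediction_was_wrong):
        """Record one outcome and return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += int(prediction_was_wrong)
        if self.n < self.min_samples:
            return "stable"
        rate = self.errors / self.n
        std = math.sqrt(rate * (1.0 - rate) / self.n)
        # Remember the lowest error level seen so far as the baseline.
        if rate + std < self.best_rate + self.best_std:
            self.best_rate, self.best_std = rate, std
        if rate + std > self.best_rate + self.drift_factor * self.best_std:
            self.reset()
            return "drift"
        if rate + std > self.best_rate + self.warning_factor * self.best_std:
            return "warning"
        return "stable"


detector = ErrorRateDriftDetector()
# 200 outcomes with a 10% error rate, followed by a run of errors only.
outcomes = [i % 10 == 0 for i in range(200)] + [True] * 100
signals = [detector.update(wrong) for wrong in outcomes]
print(signals.index("drift"))
```

When the detector reports drift, a reactive system would retrain or replace the model, for example using only the samples observed since the first warning. The delay between the actual change and the drift signal illustrates the shortcoming of reactive approaches noted above.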
|||
==Researchers working on concept drift problems== |
|||
* [http://www.lsi.upc.edu/~abifet/ Albert Bifet], Universitat Politècnica de Catalunya |
|||
* [http://www.mat.ua.pt/gladys Gladys Castillo], University of Aveiro, Portugal |
|||
* [http://www.liaad.up.pt/~jgama/ Joao Gama], University of Porto, Portugal |
|||
* [http://iaia.lcc.uma.es/~rfm Raúl Fidalgo], University of Málaga, Spain |
|||
* [http://users.auth.gr/~katak/ Ioannis Katakis], Aristotle University of Thessaloniki, Greece |
|||
* [http://www-ai.cs.uni-dortmund.de/PERSONAL/klinkenberg.html Ralf Klinkenberg], University of Dortmund, Germany |
|||
* [http://www.math.bas.bg/~koychev/ Ivan Koychev], Institute of Mathematics and Informatics, Bulgarian Academy of Science |
|||
* [http://www6.miami.edu/UMH/CDA/UMH_Main/1,1770,44604-1;45098-3,00.html Miroslav Kubat], University of Miami, USA |
|||
* [http://www.bangor.ac.uk/~mas00a/ Ludmila Kuncheva], University of Wales, Bangor, UK |
|||
* [http://www.cs.georgetown.edu/~maloof/ Mark Maloof], Georgetown University, USA |
|||
* [http://louisville.edu/~o0nasr01/ Olfa Nasraoui], University of Louisville, USA |
|||
* [http://lis2.huie.hokudai.ac.jp/~knishida/index_en.html Kyosuke Nishida], Hokkaido University, Japan |
|||
* [http://www.win.tue.nl/~mpechen/ Mykola Pechenizkiy], Eindhoven University of Technology, the Netherlands |
|||
* [http://users.rowan.edu/~polikar/RESEARCH Robi Polikar], Rowan University, Glassboro, NJ, USA |
|||
* [http://www.cp.jku.at/people/widmer/ Gerhard Widmer], Johannes Kepler University (JKU) Linz, Austria |
|||
* [http://zliobaite.googlepages.com/ Indre Zliobaite], Vilnius University, Lithuania |
|||
Contextual information, when available, can be used to better explain the causes of the concept drift: for instance, in the sales prediction application, concept drift might be compensated for by adding information about the season to the model. By providing information about the time of the year, the rate of deterioration of the model is likely to decrease, but concept drift is unlikely to be eliminated altogether. This is because actual shopping behavior does not follow any static, [[finite model]]. New factors that influence shopping behavior may arise at any time, and the influence of the known factors or their interactions may change.
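Information about the time of the year can be supplied to a model as additional inputs. The following sketch assumes a cyclical encoding, which is one common choice among several; the function and feature names are hypothetical.

```python
import math
from datetime import date


def seasonal_features(day):
    """Encode the time of year as a point on a circle.

    The sine/cosine pair keeps 31 December close to 1 January, which a raw
    day-of-year number would not.
    """
    day_of_year = day.timetuple().tm_yday
    angle = 2.0 * math.pi * (day_of_year - 1) / 365.25
    return {"season_sin": math.sin(angle), "season_cos": math.cos(angle)}


print(seasonal_features(date(2023, 7, 1)))
```

Such features let a model learn a recurring yearly pattern, but they cannot capture factors that have not yet appeared in the data.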
|||
Concept drift cannot be avoided for complex phenomena that are not governed by fixed [[Physical law|laws of nature]]. All processes that arise from human activity, such as [[socioeconomic]] processes, and [[biological processes]] are likely to experience concept drift. Therefore, periodic retraining, also known as refreshing, of any model is necessary.

== Bibliographic references ==

===Reviews ===
|||
* Kuncheva L.I. Classifier ensembles for detecting concept change in streaming data: Overview and perspectives, Proc. 2nd Workshop SUEMA 2008 (ECAI 2008), Patras, Greece, 2008, 5-10, [http://www.bangor.ac.uk/~mas00a/papers/lkSUEMA2008.pdf PDF] |
|||
* Gaber, M. M., Zaslavsky, A., and Krishnaswamy, S., Mining Data Streams: A Review, in ACM SIGMOD Record, Vol. 34, No. 2, June 2005, ISSN 0163-5808
|||
* Tsymbal, A., The problem of concept drift: Definitions and related work. Technical Report. 2004, Department of Computer Science, Trinity College: Dublin, Ireland. [https://www.cs.tcd.ie/publications/tech-reports/reports.04/TCD-CS-2004-15.pdf PDF] |
|||
* Kuncheva L.I., Classifier ensembles for changing environments, Proceedings 5th International Workshop on Multiple Classifier Systems, MCS2004, Cagliari, Italy, in F. Roli, J. Kittler and T. Windeatt (Eds.), Lecture Notes in Computer Science, Vol 3077, 2004, 1-15, [http://www.bangor.ac.uk/~mas00a/papers/lkMCS04.pdf PDF]. |
|||
===Other=== |
|||
* Carroll J. and Rosson M. B. The paradox of the active user. In J.M. Carroll (Ed.), Interfacing Thought: Cognitive Aspects of Human-Computer Interaction. Cambridge, MA, MIT Press, 1987. |
|||
* Crabtree I. and Soltysiak S. Identifying and Tracking Changing Interests. International Journal of Digital Libraries, Springer Verlag, vol. 2, 38-53.
|||
* Harries M. B., Sammut C., Horn K. Extracting Hidden Context, Machine Learning 32, 1998, pp. 101-126. |
|||
* Klinkenberg, Ralf: ''Learning Drifting Concepts: Example Selection vs. Example Weighting''. In Intelligent Data Analysis (IDA), Special Issue on Incremental Learning Systems Capable of Dealing with Concept Drift, Vol. 8, No. 3, pages 281--300, 2004. |
|||
* Klinkenberg, Ralf. Predicting Phases in Business Cycles Under Concept Drift. In Hotho, Andreas and Stumme, Gerd (editors), Proceedings of LLWA-2003 / FGML-2003, pages 3--10, Karlsruhe, Germany, 2003. |
|||
* Klinkenberg, Ralf and Rüping, Stefan: ''Concept Drift and the Importance of Examples''. In Franke, Jürgen and Nakhaeizadeh, Gholamreza and Renz, Ingrid (editors), Text Mining -- Theoretical Aspects and Applications, pages 55--77, Berlin, Germany, Physica-Verlag, 2003. |
|||
* Klinkenberg, Ralf: ''Using Labeled and Unlabeled Data to Learn Drifting Concepts''. In Kubat, Miroslav and Morik, Katharina (editors), Workshop notes of the IJCAI-01 Workshop on Learning from Temporal and Spatial Data, pages 16--24, IJCAI, Menlo Park, CA, USA, AAAI Press, 2001.
|||
* Klinkenberg, Ralf and Joachims, Thorsten: ''Detecting Concept Drift with Support Vector Machines''. In Langley, Pat (editor), Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 487--494, San Francisco, CA, USA, Morgan Kaufmann, 2000. |
|||
* Klinkenberg, Ralf and Renz, Ingrid: ''Adaptive Information Filtering: Learning in the Presence of Concept Drifts''. In Sahami, Mehran and Craven, Mark and Joachims, Thorsten and McCallum, Andrew (editors), Workshop Notes of the ICML/AAAI-98 Workshop Learning for Text Categorization, pages 33--40, Menlo Park, CA, USA, AAAI Press, 1998.
|||
* Kolter, J.Z. and Maloof, M.A. Dynamic Weighted Majority: A new ensemble method for tracking concept drift. Proceedings of the Third International IEEE Conference on Data Mining, pages 123-130, Los Alamitos, CA: IEEE Press, 2003. |
|||
* Kolter J.Z. and Maloof, M.A. Using additive expert ensembles to cope with concept drift. In Proceedings of the Twenty-second International Conference on Machine Learning, pages 449-456. New York, NY: ACM Press, 2005. |
|||
* Kolter, J.Z. and Maloof, M.A. [http://jmlr.csail.mit.edu/papers/volume8/kolter07a/kolter07a.pdf Dynamic Weighted Majority: An ensemble method for drifting concepts.] Journal of Machine Learning Research 8:2755--2790, 2007. |
|||
* Koychev I. Gradual Forgetting for Adaptation to Concept Drift. In Proceedings of ECAI 2000 Workshop Current Issues in Spatio-Temporal Reasoning. Berlin, Germany, 2000, pp. 101-106. |
|||
* Koychev I. and Schwab I., Adaptation to Drifting User’s Interests, Proc. of ECML2000 Workshop: Machine Learning in New Information Age, Barcelona, Spain, 2000, pp. 39-45. |
|||
* Maloof M.A. and Michalski R.S. Selecting examples for partial memory learning. Machine Learning, 41(11), 2000, pp. 27-52. |
|||
* Maloof M.A. and Michalski R.S. Incremental learning with partial instance memory. Artificial Intelligence 154, 2004, pp. 95-126. |
|||
* Mitchell T., Caruana R., Freitag D., McDermott, J. and Zabowski D. Experience with a Learning Personal Assistant. Communications of the ACM 37(7), 1994, pp. 81-91. |
|||
* [http://webmining.spd.louisville.edu/Websites/PAPERS/journal/Computer-Networks-Jnl-Spec-Issue-Web-Dynamics-2006-Mining-Evolving-Streams-Retrospective-Validation.pdf Nasraoui O., Rojas C., and Cardona C., "A Framework for Mining Evolving Trends in Web Data Streams using Dynamic Learning and Retrospective Validation", Journal of Computer Networks, Special Issue on Web Dynamics, 50(10), 1425-1652, July 2006]
|||
* [http://webmining.spd.louisville.edu/Websites/PAPERS/conference/CIKM-2006-Collaborative-Filtering-Recommender-Sys-in-Evolving-Web-Clickstreams.pdf Nasraoui O., Cerwinske J., Rojas C., and Gonzalez F., "Collaborative Filtering in Dynamic Usage Environments", in Proc. of CIKM 2006 – Conference on Information and Knowledge Management, Arlington VA, Nov. 2006]
|||
* [http://users.rowan.edu/~polikar/RESEARCH/PUBLICATIONS/icmlc07.pdf Muhlbaier M.D., and Polikar, R. "Multiple Classifiers Based Incremental Learning Algorithm for Learning in Nonstationary Environments," IEEE International Conference on Machine Learning and Cybernetics, Volume 6, Page(s):3618 - 3623, 19-22 August 2007.]
|||
* [http://users.rowan.edu/~polikar/RESEARCH/PUBLICATIONS/ijcnn08.pdf Karnick M., Ahiskali M., Muhlbaier, M.D., and Polikar R., "Learning Concept Drift in Nonstationary Environments Using an Ensemble of Classifiers Based Approach," World Congress on Computational Intelligence / IEEE International Joint Conference on Neural Networks, Hong Kong, 1-6 June 2008.] |
|||
* Núñez M., Fidalgo R., and Morales R., [http://www.jmlr.org/papers/volume8/nunez07a/nunez07a.pdf Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners], Journal of Machine Learning Research, 8, (2007) 2595-2628 |
|||
* Schlimmer J., and Granger R. Incremental Learning from Noisy Data, Machine Learning, 1(3), 1986, 317-357. |
|||
* Scholz, Martin and Klinkenberg, Ralf: ''Boosting Classifiers for Drifting Concepts''. In Intelligent Data Analysis (IDA), Special Issue on Knowledge Discovery from Data Streams, Vol. 11, No. 1, pages 3-28, March 2007. |
|||
* Scholz, Martin and Klinkenberg, Ralf: ''An Ensemble Classifier for Drifting Concepts''. In Gama, J. and Aguilar-Ruiz, J. S. (editors), Proceedings of the Second International Workshop on Knowledge Discovery in Data Streams, pages 53--64, Porto, Portugal, 2005. |
|||
* Schwab I., Pohl W. and Koychev I. Learning to Recommend from Positive Evidence, Proceedings of Intelligent User Interfaces 2000, ACM Press, 241 - 247. |
|||
* Widmer G. Tracking Changes through Meta-Learning, Machine Learning 27, 1997, pp. 256-286. |
|||
* Widmer G. and Kubat M. Learning in the presence of concept drift and hidden contexts. Machine Learning 23, 1996, pp. 69-101. |
|||
==See also==
* [[Data stream mining]]
* [[Data mining]]
* [[Snyk]], a company whose portfolio includes drift detection in software applications |
|||
* [[Machine learning]] |
|||
== Further reading == |
|||
Many papers have been published describing algorithms for concept drift detection. Only reviews, surveys and overviews are listed here:
|||
===Reviews=== |
|||
{{refbegin}} |
|||
*{{cite journal |last1=Souza |first1=V.M.A. |last2=Reis |first2=D.M. |last3=Maletzke |first3=A.G. |last4=Batista |first4=G.E.A.P.A. |title=Challenges in Benchmarking Stream Learning Algorithms with Real-world Data |journal=Data Mining and Knowledge Discovery |volume=34 |pages=1805–58 |date=2020 |issue=6 |doi=10.1007/s10618-020-00698-5 |arxiv=2005.00113 |s2cid=218470010 |url=https://link.springer.com/article/10.1007/s10618-020-00698-5}} |
|||
*{{cite journal |last1=Krawczyk |first1=B. |last2=Minku |first2=L.L. |last3=Gama |first3=J. |last4=Stefanowski |first4=J. |last5=Wozniak |first5=M. |title=Ensemble Learning for Data Stream Analysis: a survey |journal=Information Fusion |volume=37 |pages=132–156 |date=2017 |doi=10.1016/j.inffus.2017.02.004 |s2cid=1372281 |url=https://scholarscompass.vcu.edu/cmsc_pubs/39|hdl=2381/39321 |hdl-access=free }} |
|||
*{{cite book |last1=Dal Pozzolo |first1=A. |last2=Boracchi |first2=G. |last3=Caelen |first3=O. |last4=Alippi |first4=C. |last5=Bontempi |first5=G. |chapter=Credit card fraud detection and concept-drift adaptation with delayed supervised information |chapter-url=http://www.ulb.ac.be/di/map/adalpozz/pdf/IJCNN2015_final.pdf |title=2015 International Joint Conference on Neural Networks (IJCNN) |publisher=IEEE |date=2015 |pages=1–8 |doi=10.1109/IJCNN.2015.7280527|isbn=978-1-4799-1960-4 |s2cid=3947699 }} |
|||
*{{cite book |first=C. |last=Alippi |chapter=Learning in Nonstationary and Evolving Environments |chapter-url=https://link.springer.com/chapter/10.1007/978-3-319-05278-6_9 |title=Intelligence for Embedded Systems |publisher=Springer |date=2014 |isbn=978-3-319-05278-6 |pages=211–247 |doi=10.1007/978-3-319-05278-6_9 |url=}} |
|||
*{{Cite Q|Q58204632 |author1=Gama, J. |author2=Žliobaitė, I. |author3=Bifet, A. |author4=Pechenizkiy, M. |author5=Bouchachia, A. | mode=cs2}} |
|||
*{{cite journal |first1=C. |last1=Alippi |first2=R. |last2=Polikar |title=Guest Editorial Learning in Nonstationary and Evolving Environments |journal=IEEE Transactions on Neural Networks and Learning Systems |volume=25 |issue=1 |pages= 9–11|date=January 2014 |doi=10.1109/TNNLS.2013.2283547 |pmid=24806640 |s2cid=16547472 |url=https://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=6684327&punumber=5962385}} |
|||
*{{cite journal |last1=Dal Pozzolo |first1=A. |last2=Caelen |first2=O. |last3=Le Borgne |first3=Y.A. |last4=Waterschoot |first4=S. |last5=Bontempi |first5=G. |title=Learned lessons in credit card fraud detection from a practitioner perspective |journal=Expert Systems with Applications |volume=41 |issue=10 |pages=4915–28 |date=2014 |doi=10.1016/j.eswa.2014.02.026 |s2cid=12656644 |url=http://www.ulb.ac.be/di/map/adalpozz/pdf/FraudDetectionPaper_8.pdf}} |
|||
*{{cite web |first=J. |last=Jiang |title=A Literature Survey on Domain Adaptation of Statistical Classifiers |date=2008 |publisher=School of Computing and Information Systems, Singapore Management University |url=http://www.mysmu.edu/faculty/jingjiang/papers/da_survey.pdf}} |
|||
*{{cite book |author-link=Ludmila Kuncheva |last=Kuncheva |first=L.I. |chapter=Classifier ensembles for detecting concept change in streaming data: Overview and perspectives |chapter-url=https://lucykuncheva.co.uk/papers/lkSUEMA2008.pdf |title=Proceedings of the 2nd Workshop SUEMA 2008 (ECAI 2008) |publisher= |date=2008 |isbn= |pages= |url=}} |
|||
*{{cite journal |last1=Gaber |first1=M.M. |last2=Zaslavsky |first2=A. |last3=Krishnaswamy |first3=S. |title=Mining Data Streams: A Review |journal=ACM SIGMOD Record |volume=34 |issue=2 |pages=18–26 |date=June 2005 |doi=10.1145/1083784.1083789 |s2cid=705946 |url=http://www09.sigmod.org/sigmod/record/issues/0506/p18-survey-gaber.pdf}} |
|||
*{{cite book |author-link=Ludmila Kuncheva |last=Kuncheva |first=L.I. |chapter=Classifier ensembles for changing environments |chapter-url=https://lucykuncheva.co.uk/papers/kuncheva_slides_mcs04.PDF |title=Multiple Classifier Systems. MCS 2004 |publisher=Springer |series=Lecture Notes in Computer Science |volume=3077 |date=2004 |isbn=978-3-540-25966-4 |pages= 1–15|doi=10.1007/978-3-540-25966-4_1}} |
|||
*{{cite tech report |first=A. |last=Tsymbal |title=The problem of concept drift: Definitions and related work |date=2004 |id=TCD-CS-2004-15 |publisher=Department of Computer Science, Trinity College |location=Dublin, Ireland |url=https://www.cs.tcd.ie/publications/tech-reports/reports.04/TCD-CS-2004-15.pdf}} |
|||
{{refend}} |
|||
== External links == |
|||
|||
=== Software === |
|||
* [https://github.com/IFCA-Advanced-Computing/frouros Frouros]: An open-source [[Python (programming language)|Python]] library for drift detection in [[machine learning]] systems.<ref>{{Cite journal |last=Céspedes Sisniega |first=Jaime |last2=López García |first2=Álvaro |date=2024 |title=Frouros: An open-source Python library for drift detection in machine learning systems |url=https://www.softxjournal.com/action/showPdf?pii=S2352-7110%2824%2900104-3 |format=PDF |journal=SoftwareX |publisher=Elsevier |volume=26 |page=101733 |doi=10.1016/j.softx.2024.101733|doi-access=free |hdl=10261/358367 |hdl-access=free }}</ref> |
|||
* [https://www.nannyml.com/ NannyML]: An open-source [[Python (programming language)|Python]] library for detecting [[Univariate (statistics)|univariate]] and [[multivariate distribution]] drift and estimating [[machine learning]] model performance without ground truth labels. |
|||
* [[RapidMiner]]: Formerly ''Yet Another Learning Environment'' (YALE): free open-source software for knowledge discovery, data mining, and machine learning, also featuring data stream mining, learning time-varying concepts, and tracking drifting concepts. It is used in combination with its data stream mining plugin (formerly concept drift plugin).
|||
* EDDM ([https://web.archive.org/web/20070322063617/http://iaia.lcc.uma.es/Members/mbaena/papers/eddm/ Early Drift Detection Method]): free open-source implementation of drift detection methods in [[Weka (machine learning)|Weka]]. |
|||
* [[MOA (Massive Online Analysis)]]: free open-source software specifically for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius based functions. MOA supports bi-directional interaction with [[Weka (machine learning)|Weka]].
|||

=== Datasets ===

==== Real ====

* '''USP Data Stream Repository''', 27 real-world stream datasets with concept drift compiled by Souza et al. (2020). [https://sites.google.com/view/uspdsrepository Access]
* '''Airline''', approximately 116 million flight arrival and departure records (cleaned and sorted) compiled by E. Ikonomovska. Reference: Data Expo 2009 Competition [http://stat-computing.org/dataexpo/2009/]. [http://kt.ijs.si/elena_ikonomovska/data.html Access]
* '''Chess.com''' (online games) and '''Luxembourg''' (social survey) datasets compiled by I. Zliobaite. [https://sites.google.com/site/zliobaite/resources-1 Access]
* '''ECUE spam''', two datasets, each consisting of more than 10,000 emails collected by an individual over a period of approximately 2 years. [https://web.archive.org/web/20110513025937/http://www.comp.dit.ie/sjdelany/dataset.htm Access] from S.J. Delany's webpage
* '''Elec2''', electricity demand, 2 classes, 45,312 instances. Reference: M. Harries, Splice-2 comparative evaluation: Electricity pricing, Technical report, The University of New South Wales, 1999. [http://www.inescporto.pt/~jgama/ales/ales_5.html Access] from J.Gama webpage. [[arxiv:1301.3524|Comment on applicability]].
* '''PAKDD'09 competition''' data represents a credit evaluation task and was collected over a five-year period. Unfortunately, the true labels have been released only for the first part of the data. [https://web.archive.org/web/20150315224049/http://sede.neurotech.com.br/PAKDD2009/ Access]
* '''Sensor stream''' and '''Power supply stream''' datasets are available from X. Zhu's Stream Data Mining Repository. [http://www.cse.fau.edu/~xqzhu/stream.html Access]
* '''SMEAR''', a benchmark data stream with many missing values, consisting of environmental observation data collected over 7 years; the task is to predict cloudiness. [https://github.com/zliobaite/paper-missing-values Access]
* '''Text mining''', a collection of [[text mining]] datasets with concept drift, maintained by I. Katakis. [https://web.archive.org/web/20100704072013/http://mlkd.csd.auth.gr/concept_drift.html Access]
* '''Gas Sensor Array Drift Dataset''', a collection of 13,910 measurements from 16 chemical sensors used for drift compensation in a discrimination task involving 6 gases at various concentration levels. [https://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset Access]

==== Other ====

* '''KDD'99 competition''' data contains ''simulated'' intrusions in a military network environment. It is often used as a benchmark to evaluate handling concept drift. [http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html Access]

==== Synthetic ====

* '''Extreme verification latency benchmark''' {{cite book |last1=Souza |first1=V.M.A. |last2=Silva |first2=D.F. |last3=Gama |first3=J. |last4=Batista |first4=G.E.A.P.A. |chapter=Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency |chapter-url=https://epubs.siam.org/doi/abs/10.1137/1.9781611974010.98 |doi=10.1137/1.9781611974010.98 |title=Proceedings of the 2015 SIAM International Conference on Data Mining (SDM) |publisher=SIAM |date=2015 |isbn=9781611974010 |pages=873–881 |s2cid=19198944 |url=http://repositorio.inesctec.pt/handle/123456789/5325}} [https://sites.google.com/site/nonstationaryarchive/ Access] from Nonstationary Environments – Archive.
* '''Sine, Line, Plane, Circle and Boolean Data Sets''' {{cite journal |first1=L.L. |last1=Minku |first2=A.P. |last2=White |first3=X. |last3=Yao |title=The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept Drift |journal=IEEE Transactions on Knowledge and Data Engineering |volume=22 |issue=5 |pages=730–742 |date=2010 |doi=10.1109/TKDE.2009.156 |s2cid=16592739 |url=http://cs.bham.ac.uk/~xin/papers/MinkuWhiteYaoTKDE09.pdf}} [https://www.cs.bham.ac.uk/~minkull/opensource/ArtificialConceptDriftDataSets.zip Access] from L.Minku webpage.
* '''SEA concepts''' {{cite book |first1=N.W. |last1=Street |first2=Y. |last2=Kim |chapter=A streaming ensemble algorithm (SEA) for large-scale classification |chapter-url=https://dollar.biz.uiowa.edu/~street/research/kdd01.pdf |doi=10.1145/502512.502568 |title=KDD'01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining |publisher= |date=2001 |isbn=978-1-58113-391-2 |pages=377–382 |s2cid=11868540 |url=}} [https://web.archive.org/web/20080315131143/http://www.liaad.up.pt/~jgama/ales/ales_5.html Access] from J.Gama webpage.
* '''STAGGER''' {{cite journal |first1=J.C. |last1=Schlimmer |first2=R.H. |last2=Granger |title=Incremental Learning from Noisy Data |journal=Mach. Learn. |volume=1 |pages=317–354 |date=1986 |issue=3 |doi=10.1007/BF00116895 |s2cid=33776987 |doi-access=free }}
* '''Mixed''' {{cite book |first1=J. |last1=Gama |first2=P. |last2=Medas |first3=G. |last3=Castillo |first4=P. |last4=Rodrigues |chapter=Learning with drift detection |chapter-url=https://link.springer.com/chapter/10.1007/978-3-540-28645-5_29 |doi=10.1007/978-3-540-28645-5_29 |title=Brazilian symposium on artificial intelligence |publisher=Springer |date=2004 |isbn=978-3-540-28645-5 |pages=286–295 |s2cid=2606652 |url=}}

==== Data generation frameworks ====

* {{harvnb|Minku|White|Yao|2010}} [https://www.cs.bham.ac.uk/~minkull/opensource/DriftsGenerator.zip Download] from L.Minku webpage.
*{{cite book |last1=Lindstrom |first1=P. |first2=S.J. |last2=Delany |first3=B. |last3=MacNamee |chapter=Autopilot: Simulating Changing Concepts in Real Data |chapter-url=https://core.ac.uk/download/pdf/301312613.pdf |editor= |title=Proceedings of the 19th Irish Conference on Artificial Intelligence & Cognitive Science |publisher= |location= |date=2008 |isbn= |pages=272–263 |url=}}
*{{cite book |last1=Narasimhamurthy |first1=A. |author2-link=Ludmila Kuncheva |first2=L.I. |last2=Kuncheva |chapter=A framework for generating data to simulate changing environments |chapter-url=https://dl.acm.org/doi/abs/10.5555/1295303.1295369 |title=AIAP'07: Proceedings of the 25th IASTED International Multi-Conference: artificial intelligence and applications |publisher= |location= |date=2007 |isbn= |pages=384–389 |url=}} [http://pages.bangor.ac.uk/~mas00a/EPSRC_simulation_framework/changing_environments_stage1a.htm Code]

=== Projects ===

* [http://www.infer.eu/ INFER]: Computational Intelligence Platform for Evolving and Robust Predictive Systems (2010–2014), Bournemouth University (UK), Evonik Industries (Germany), Research and Engineering Centre (Poland)
* [http://www.win.tue.nl/~mpechen/projects/hacdais/ HaCDAIS]: Handling Concept Drift in Adaptive Information Systems (2008–2012), Eindhoven University of Technology (the Netherlands)
* [http://www.liaad.up.pt/~kdus/ KDUS]: Knowledge Discovery from Ubiquitous Streams, INESC Porto and Laboratory of Artificial Intelligence and Decision Support (Portugal)
* [http://www.cs.man.ac.uk/~gbrown/adept/ ADEPT]: Adaptive Dynamic Ensemble Prediction Techniques, University of Manchester (UK), University of Bristol (UK)
* [https://web.archive.org/web/20090309132402/http://www.aladdinproject.org/ ALADDIN]: autonomous learning agents for decentralised data and information networks (2005–2010)
* [https://github.com/greenfish77/gaenari GAENARI]: C++ incremental decision tree algorithm that minimizes damage from concept drift (2022)

=== Benchmarks ===

* [https://github.com/numenta/NAB NAB]: The Numenta Anomaly Benchmark, a benchmark for evaluating algorithms for anomaly detection in streaming, real-time applications. (2014–2018)

=== Meetings ===

*2014
** Special Session on "Concept Drift, Domain Adaptation & Learning in Dynamic Environments" at IEEE IJCNN 2014
*2013
** [https://sites.google.com/site/realstream2013/ RealStream] Real-World Challenges for Data Stream Mining Workshop-Discussion at the [[ECML PKDD]] 2013, Prague, Czech Republic.
** [https://web.archive.org/web/20150908134145/http://aiai2013.cut.ac.cy/leaps-2013/ LEAPS 2013] The 1st International Workshop on Learning stratEgies and dAta Processing in nonStationary environments
*2011
** [http://www.icmla-conference.org/icmla11/LEE.htm LEE 2011] Special Session on Learning in evolving environments and its application on real-world problems at ICMLA'11
** [http://wwwis.win.tue.nl/hacdais2011/ HaCDAIS 2011] The 2nd International Workshop on Handling Concept Drift in Adaptive Information Systems
** [https://web.archive.org/web/20101031152019/http://icais.uni-klu.ac.at/cfp.php ICAIS 2011] Track on Incremental Learning
** [https://web.archive.org/web/20110128002602/http://www.ijcnn2011.org/special_section.php IJCNN 2011] Special Session on Concept Drift and Learning Dynamic Environments
** [http://www.soft-computing.de/CIDUE2011.html CIDUE 2011] Symposium on Computational Intelligence in Dynamic and Uncertain Environments
*2010
** [http://wwwis.win.tue.nl/hacdais2010/ HaCDAIS 2010] International Workshop on Handling Concept Drift in Adaptive Information Systems: Importance, Challenges and Solutions
** [http://www.icmla-conference.org/icmla10/CFP_SpecialSession9.html ICMLA10] Special Session on Dynamic learning in non-stationary environments
** [https://web.archive.org/web/20100425011804/http://www.liaad.up.pt/~jgama/SAC10/ SAC 2010] Data Streams Track at ACM Symposium on Applied Computing
** [https://web.archive.org/web/20100418214526/http://www.ornl.gov/sci/knowledgediscovery/SensorKDD-2010/ SensorKDD 2010] International Workshop on Knowledge Discovery from Sensor Data
** [https://web.archive.org/web/20100419123949/http://lyle.smu.edu/cse/dbgroup/IDA/StreamKDD2010/ StreamKDD 2010] Novel Data Stream Pattern Mining Techniques
** Concept Drift and Learning in Nonstationary Environments at [http://www.wcci2010.org/ IEEE World Congress on Computational Intelligence]
** [http://cig.iet.unipi.it/isda2010/files/MLMD.pdf MLMDS’2010] Special Session on Machine Learning Methods for Data Streams at the 10th International Conference on Intelligent Systems Design and Applications, ISDA’10

== References ==

{{reflist}}

[[Category:Data mining]]
[[Category:Machine learning]]
[[Category:Data analysis]]
Latest revision as of 07:01, 15 September 2024
In predictive analytics, data science, machine learning and related fields, concept drift or drift is an evolution of data that invalidates the data model. It happens when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes. Drift detection and drift adaptation are of paramount importance in the fields that involve dynamically changing data and data models.
Predictive model decay
In machine learning and predictive analytics this drift phenomenon is called concept drift. In machine learning, a common element of a data model is its statistical properties, such as the probability distribution of the actual data. If these deviate from the statistical properties of the training data set, the learned predictions may become invalid unless the drift is addressed.[1][2][3][4]
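As a rough illustration of comparing the statistical properties of new data with those of the training data set, the following Python sketch computes a two-sample Kolmogorov–Smirnov statistic, i.e. the largest gap between two empirical cumulative distribution functions. The function name `ks_statistic` and the sample values are assumptions made for this example and are not taken from the cited works; a real application would also need a decision threshold or significance test.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    for x in ref + cur:
        # Fraction of each sample that is less than or equal to x.
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


# Hypothetical target values seen at training time and later in production.
training_target = [10, 12, 11, 13, 12, 11, 10, 12]
recent_same = [11, 12, 10, 13, 12, 11, 12, 10]      # same values, new order
recent_shifted = [20, 22, 21, 23, 22, 21, 20, 22]   # no overlap with training

print(ks_statistic(training_target, recent_same))
print(ks_statistic(training_target, recent_shifted))
```

A value near 0 suggests the two samples follow a similar distribution, while a value near 1 suggests the distribution has moved away from what the model was trained on.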
Data configuration decay
Another important area is software engineering, where three types of data drift affecting data fidelity may be recognized. Changes in the software environment ("infrastructure drift") may invalidate software infrastructure configuration. "Structural drift" happens when the data schema changes, which may invalidate databases. "Semantic drift" is a change in the meaning of data while the structure does not change. In many cases this may happen in complicated applications when many independent developers introduce changes without proper awareness of the effects of their changes on other areas of the software system.[5][6]
For many application systems, the nature of the data on which they operate is subject to change for various reasons, e.g., changes in the business model, system updates, or switching the platform on which the system operates.[6]
In the case of cloud computing, infrastructure drift that may affect applications running in the cloud may be caused by updates of the cloud software.[5]
There are several types of detrimental effects of data drift on data fidelity. Data corrosion is the passing of drifted data into the system undetected. Data loss happens when valid data are ignored due to non-conformance with the applied schema. Squandering is the phenomenon when new data fields are introduced upstream in the data processing pipeline, but somewhere downstream these data fields are absent.[6]
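The structural side of these effects can be sketched in a few lines of Python. The example below checks one incoming record against an expected schema and reports fields that are missing, have changed type, or are new and would be silently dropped downstream. The schema, the field names and the function `check_record` are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical schema that a downstream component was written against.
EXPECTED_FIELDS = {"customer_id": int, "amount": float}


def check_record(record, expected=EXPECTED_FIELDS):
    """Return a list of structural-drift findings for one record."""
    findings = []
    for name, expected_type in expected.items():
        if name not in record:
            findings.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            findings.append(f"type changed: {name}")
    for name in record:
        if name not in expected:
            # A new upstream field that downstream code would ignore.
            findings.append(f"unexpected field: {name}")
    return findings


print(check_record({"customer_id": 1, "amount": 9.5}))
print(check_record({"customer_id": "1", "amount": 9.5, "currency": "EUR"}))
```

Reporting the findings instead of discarding non-conforming records is one way to avoid both undetected drifted data and ignored valid data; semantic drift, where the structure is unchanged, cannot be caught by a check of this kind.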
Inconsistent data
"Data drift" may refer to the phenomenon when database records fail to match the real-world data due to changes in the latter over time. This is a common problem with databases involving people, such as customers, employees, citizens, residents, etc. Human data drift may be caused by unrecorded changes in personal data, such as place of residence or name, as well as by errors during data input.[7]
"Data drift" may also refer to inconsistency of data elements between several replicas of a database. The reasons can be difficult to identify. A simple way to detect such drift is to run checksums regularly. However, the remedy may not be so easy.[8]
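A checksum comparison between replicas can be sketched as follows. This is a minimal illustration that assumes the rows of a table have already been fetched into memory as tuples; the function names are hypothetical, and production tools typically checksum data in chunks inside the database.

```python
import hashlib


def table_checksum(rows):
    """Order-independent SHA-256 checksum over an iterable of row tuples."""
    digest = hashlib.sha256()
    for row in sorted(rows):  # sort so that row order does not matter
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def replicas_in_sync(primary_rows, replica_rows):
    """True when both copies of the table hash to the same checksum."""
    return table_checksum(primary_rows) == table_checksum(replica_rows)


primary = [(1, "alice"), (2, "bob")]
replica_ok = [(2, "bob"), (1, "alice")]        # same rows, different order
replica_drifted = [(1, "alice"), (2, "bobby")]  # one value differs

print(replicas_in_sync(primary, replica_ok))
print(replicas_in_sync(primary, replica_drifted))
```

A mismatch only signals that drift exists; it does not say which rows differ or which replica holds the correct value.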
Examples
The behavior of the customers in an online shop may change over time. For example, suppose weekly merchandise sales are to be predicted, and a predictive model has been developed that works satisfactorily. The model may use inputs such as the amount of money spent on advertising, promotions being run, and other metrics that may affect sales. The model is likely to become less and less accurate over time – this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality, which means that shopping behavior changes seasonally. Perhaps there will be higher sales in the winter holiday season than during the summer, for example. Concept drift generally occurs when the covariates that comprise the data set begin to explain the variation of the target variable less accurately; confounding variables may have emerged that one simply cannot account for, which causes the model accuracy to decrease progressively with time. Generally, it is advised to perform health checks as part of the post-production analysis and to re-train the model with new assumptions upon signs of concept drift.
Possible remedies
To prevent deterioration in prediction accuracy because of concept drift, reactive and tracking solutions can be adopted. Reactive solutions retrain the model in reaction to a triggering mechanism, such as a change-detection test,[9][10] to explicitly detect concept drift as a change in the statistics of the data-generating process. When concept drift is detected, the current model is no longer up-to-date and must be replaced by a new one to restore prediction accuracy.[11][12] A shortcoming of reactive approaches is that performance may decay until the change is detected. Tracking solutions seek to track the changes in the concept by continually updating the model. Methods for achieving this include online machine learning, frequent retraining on the most recently observed samples,[13] and maintaining an ensemble of classifiers where one new classifier is trained on the most recent batch of examples and replaces the oldest classifier in the ensemble.[14]
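A reactive trigger can be sketched as a monitor of the model's running error rate. The Python example below is loosely modeled on the error-rate monitoring idea of "Learning with Drift Detection":[11] it remembers the lowest error level seen so far and signals a warning or a drift when the current level rises well above it. The class name, the default thresholds (two and three standard deviations, at least 30 samples) and the simulated outcome stream are assumptions of this sketch, not a reference implementation.

```python
import math


class ErrorRateDriftDetector:
    """Signals drift when the running error rate rises well above its minimum."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = math.inf  # lowest error rate observed so far
        self.s_min = math.inf  # its standard deviation

    def update(self, is_error):
        """Feed one prediction outcome; return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()  # the caller would replace or retrain the model here
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


# Simulated outcomes: 10% errors for 100 predictions, then every one wrong.
outcomes = [i % 10 == 0 for i in range(100)] + [True] * 30
detector = ErrorRateDriftDetector()
statuses = [detector.update(outcome) for outcome in outcomes]
print(statuses.index("drift"))  # position at which drift is first signalled
```

The delay between the change at position 100 and the first drift signal illustrates the shortcoming noted above: performance decays until the change is detected.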
Contextual information, when available, can be used to better explain the causes of the concept drift: for instance, in the sales prediction application, concept drift might be compensated for by adding information about the season to the model. By providing information about the time of the year, the rate of deterioration of the model is likely to decrease, but concept drift is unlikely to be eliminated altogether. This is because actual shopping behavior does not follow any static, finite model. New factors that influence shopping behavior may arise at any time, and the influence of the known factors or their interactions may change.
Concept drift cannot be avoided for complex phenomena that are not governed by fixed laws of nature. All processes that arise from human activity, such as socioeconomic processes, and biological processes are likely to experience concept drift. Therefore, periodic retraining, also known as refreshing, of any model is necessary.
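The tracking approach of retraining on the most recently observed samples can be illustrated with a deliberately simple learner. In the Python sketch below, the "model" is just the mean of a sliding window of recent target values; the class name, the window size and the sales figures are hypothetical, and a real system would refit an actual predictive model on the window instead.

```python
from collections import deque


class SlidingWindowMean:
    """Tracking learner: predicts the mean of the most recent observations."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)

    def predict(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def learn(self, value):
        self.window.append(value)  # the oldest value is dropped automatically


# Hypothetical weekly sales: the level doubles halfway through the stream.
sales = [100.0] * 50 + [200.0] * 50

tracking_model = SlidingWindowMean(window_size=10)
for value in sales:
    tracking_model.learn(value)

static_prediction = sum(sales) / len(sales)  # a model fitted once on everything
print(tracking_model.predict(), static_prediction)
```

The window size is a trade-off: a short window adapts quickly to a new concept but is noisier, while a long window is more stable but slower to follow the drift.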
See also
- Data stream mining
- Data mining
- Snyk, a company whose portfolio includes drift detection in software applications
Further reading
Many papers have been published describing algorithms for concept drift detection. Only reviews, surveys and overviews are listed here:
Reviews
- Souza, V.M.A.; Reis, D.M.; Maletzke, A.G.; Batista, G.E.A.P.A. (2020). "Challenges in Benchmarking Stream Learning Algorithms with Real-world Data". Data Mining and Knowledge Discovery. 34 (6): 1805–58. arXiv:2005.00113. doi:10.1007/s10618-020-00698-5. S2CID 218470010.
- Krawczyk, B.; Minku, L.L.; Gama, J.; Stefanowski, J.; Wozniak, M. (2017). "Ensemble Learning for Data Stream Analysis: a survey". Information Fusion. 37: 132–156. doi:10.1016/j.inffus.2017.02.004. hdl:2381/39321. S2CID 1372281.
- Dal Pozzolo, A.; Boracchi, G.; Caelen, O.; Alippi, C.; Bontempi, G. (2015). "Credit card fraud detection and concept-drift adaptation with delayed supervised information" (PDF). 2015 International Joint Conference on Neural Networks (IJCNN). IEEE. pp. 1–8. doi:10.1109/IJCNN.2015.7280527. ISBN 978-1-4799-1960-4. S2CID 3947699.
- Alippi, C. (2014). "Learning in Nonstationary and Evolving Environments". Intelligence for Embedded Systems. Springer. pp. 211–247. doi:10.1007/978-3-319-05278-6_9. ISBN 978-3-319-05278-6.
- Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. (1 March 2014), "A survey on concept drift adaptation" (PDF), ACM Computing Surveys, 46 (4): 1–37, doi:10.1145/2523813, ISSN 0360-0300, Zbl 1305.68141, Wikidata Q58204632
- Alippi, C.; Polikar, R. (January 2014). "Guest Editorial Learning in Nonstationary and Evolving Environments". IEEE Transactions on Neural Networks and Learning Systems. 25 (1): 9–11. doi:10.1109/TNNLS.2013.2283547. PMID 24806640. S2CID 16547472.
- Dal Pozzolo, A.; Caelen, O.; Le Borgne, Y.A.; Waterschoot, S.; Bontempi, G. (2014). "Learned lessons in credit card fraud detection from a practitioner perspective" (PDF). Expert Systems with Applications. 41 (10): 4915–28. doi:10.1016/j.eswa.2014.02.026. S2CID 12656644.
- Jiang, J. (2008). "A Literature Survey on Domain Adaptation of Statistical Classifiers" (PDF). School of Computing and Information Systems, Singapore Management University.
- Kuncheva, L.I. (2008). "Classifier ensembles for detecting concept change in streaming data: Overview and perspectives" (PDF). Proceedings of the 2nd Workshop SUEMA 2008 (ECAI 2008).
- Gaber, M.M.; Zaslavsky, A.; Krishnaswamy, S. (June 2005). "Mining Data Streams: A Review" (PDF). ACM SIGMOD Record. 34 (2): 18–26. doi:10.1145/1083784.1083789. S2CID 705946.
- Kuncheva, L.I. (2004). "Classifier ensembles for changing environments" (PDF). Multiple Classifier Systems. MCS 2004. Lecture Notes in Computer Science. Vol. 3077. Springer. pp. 1–15. doi:10.1007/978-3-540-25966-4_1. ISBN 978-3-540-25966-4.
- Tsymbal, A. (2004). The problem of concept drift: Definitions and related work (PDF) (Technical report). Dublin, Ireland: Department of Computer Science, Trinity College. TCD-CS-2004-15.
External links
Software
- Frouros: An open-source Python library for drift detection in machine learning systems.[15]
- NannyML: An open-source Python library for detecting univariate and multivariate distribution drift and estimating machine learning model performance without ground truth labels.
- RapidMiner (formerly Yet Another Learning Environment, YALE): free open-source software for knowledge discovery, data mining, and machine learning, also featuring data stream mining, learning time-varying concepts, and tracking drifting concepts. It is used in combination with its data stream mining plugin (formerly the concept drift plugin).
- EDDM (Early Drift Detection Method): free open-source implementation of drift detection methods in Weka.
- MOA (Massive Online Analysis): free open-source software specifically for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius based functions. MOA supports bi-directional interaction with Weka.
Datasets
Real
- USP Data Stream Repository, 27 real-world stream datasets with concept drift compiled by Souza et al. (2020). Access
- Airline, approximately 116 million flight arrival and departure records (cleaned and sorted) compiled by E. Ikonomovska. Reference: Data Expo 2009 Competition [1]. Access
- Chess.com (online games) and Luxembourg (social survey) datasets compiled by I. Zliobaite. Access
- ECUE spam, two datasets, each consisting of more than 10,000 emails collected by an individual over a period of approximately 2 years. Access from S.J. Delany's webpage
- Elec2, electricity demand, 2 classes, 45,312 instances. Reference: M. Harries, Splice-2 comparative evaluation: Electricity pricing, Technical report, The University of New South Wales, 1999. Access from J.Gama webpage. Comment on applicability.
- PAKDD'09 competition data represents a credit evaluation task and was collected over a five-year period. Unfortunately, the true labels have been released only for the first part of the data. Access
- Sensor stream and Power supply stream datasets are available from X. Zhu's Stream Data Mining Repository. Access
- SMEAR, a benchmark data stream with many missing values, consisting of environmental observation data collected over 7 years; the task is to predict cloudiness. Access
- Text mining, a collection of text mining datasets with concept drift, maintained by I. Katakis. Access
- Gas Sensor Array Drift Dataset, a collection of 13,910 measurements from 16 chemical sensors used for drift compensation in a discrimination task involving 6 gases at various concentration levels. Access
Other
- KDD'99 competition data contains simulated intrusions in a military network environment. It is often used as a benchmark to evaluate handling concept drift. Access
Synthetic
- Extreme verification latency benchmark Souza, V.M.A.; Silva, D.F.; Gama, J.; Batista, G.E.A.P.A. (2015). "Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency". Proceedings of the 2015 SIAM International Conference on Data Mining (SDM). SIAM. pp. 873–881. doi:10.1137/1.9781611974010.98. ISBN 9781611974010. S2CID 19198944. Access from Nonstationary Environments – Archive.
- Sine, Line, Plane, Circle and Boolean Data Sets Minku, L.L.; White, A.P.; Yao, X. (2010). "The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept Drift" (PDF). IEEE Transactions on Knowledge and Data Engineering. 22 (5): 730–742. doi:10.1109/TKDE.2009.156. S2CID 16592739. Access from L.Minku webpage.
- SEA concepts Street, N.W.; Kim, Y. (2001). "A streaming ensemble algorithm (SEA) for large-scale classification" (PDF). KDD'01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 377–382. doi:10.1145/502512.502568. ISBN 978-1-58113-391-2. S2CID 11868540. Access from J.Gama webpage.
- STAGGER Schlimmer, J.C.; Granger, R.H. (1986). "Incremental Learning from Noisy Data". Mach. Learn. 1 (3): 317–354. doi:10.1007/BF00116895. S2CID 33776987.
- Mixed Gama, J.; Medas, P.; Castillo, G.; Rodrigues, P. (2004). "Learning with drift detection". Brazilian symposium on artificial intelligence. Springer. pp. 286–295. doi:10.1007/978-3-540-28645-5_29. ISBN 978-3-540-28645-5. S2CID 2606652.
Data generation frameworks
- Minku, White & Yao 2010 Download from L.Minku webpage.
- Lindstrom, P.; Delany, S.J.; MacNamee, B. (2008). "Autopilot: Simulating Changing Concepts in Real Data" (PDF). Proceedings of the 19th Irish Conference on Artificial Intelligence & Cognitive Science. pp. 272–263.
- Narasimhamurthy, A.; Kuncheva, L.I. (2007). "A framework for generating data to simulate changing environments". AIAP'07: Proceedings of the 25th IASTED International Multi-Conference: artificial intelligence and applications. pp. 384–389. Code
Projects
- INFER: Computational Intelligence Platform for Evolving and Robust Predictive Systems (2010–2014), Bournemouth University (UK), Evonik Industries (Germany), Research and Engineering Centre (Poland)
- HaCDAIS: Handling Concept Drift in Adaptive Information Systems (2008–2012), Eindhoven University of Technology (the Netherlands)
- KDUS: Knowledge Discovery from Ubiquitous Streams, INESC Porto and Laboratory of Artificial Intelligence and Decision Support (Portugal)
- ADEPT: Adaptive Dynamic Ensemble Prediction Techniques, University of Manchester (UK), University of Bristol (UK)
- ALADDIN: autonomous learning agents for decentralised data and information networks (2005–2010)
- GAENARI: C++ incremental decision tree algorithm that minimizes damage from concept drift (2022)
Benchmarks
- NAB: The Numenta Anomaly Benchmark, a benchmark for evaluating algorithms for anomaly detection in streaming, real-time applications. (2014–2018)
Meetings
- 2014
- Special Session on "Concept Drift, Domain Adaptation & Learning in Dynamic Environments" at IEEE IJCNN 2014
- 2013
- RealStream Real-World Challenges for Data Stream Mining Workshop-Discussion at the ECML PKDD 2013, Prague, Czech Republic.
- LEAPS 2013 The 1st International Workshop on Learning stratEgies and dAta Processing in nonStationary environments
- 2011
- LEE 2011 Special Session on Learning in evolving environments and its application on real-world problems at ICMLA'11
- HaCDAIS 2011 The 2nd International Workshop on Handling Concept Drift in Adaptive Information Systems
- ICAIS 2011 Track on Incremental Learning
- IJCNN 2011 Special Session on Concept Drift and Learning Dynamic Environments
- CIDUE 2011 Symposium on Computational Intelligence in Dynamic and Uncertain Environments
- 2010
- HaCDAIS 2010 International Workshop on Handling Concept Drift in Adaptive Information Systems: Importance, Challenges and Solutions
- ICMLA10 Special Session on Dynamic learning in non-stationary environments
- SAC 2010 Data Streams Track at ACM Symposium on Applied Computing
- SensorKDD 2010 International Workshop on Knowledge Discovery from Sensor Data
- StreamKDD 2010 Novel Data Stream Pattern Mining Techniques
- Concept Drift and Learning in Nonstationary Environments at IEEE World Congress on Computational Intelligence
- MLMDS’2010 Special Session on Machine Learning Methods for Data Streams at the 10th International Conference on Intelligent Systems Design and Applications, ISDA’10
References
- ^ Koggalahewa, Darshika; Xu, Yue; Foo, Ernest (2021). "A Drift Aware Hierarchical Test Based Approach for Combating Social Spammers in Online Social Networks". Data Mining. Communications in Computer and Information Science. Vol. 1504. pp. 47–61. doi:10.1007/978-981-16-8531-6_4. ISBN 978-981-16-8530-9. S2CID 245009299.
- ^ Widmer, Gerhard; Kubat, Miroslav (1996). "Learning in the presence of concept drift and hidden contexts". Machine Learning. 23: 69–101. doi:10.1007/BF00116900. S2CID 206767784.
- ^ Xia, Yuan; Zhao, Yunlong (2020). "A Drift Detection Method Based on Diversity Measure and McDiarmid's Inequality in Data Streams". Green, Pervasive, and Cloud Computing. Lecture Notes in Computer Science. Vol. 12398. pp. 115–122. doi:10.1007/978-3-030-64243-3_9. ISBN 978-3-030-64242-6. S2CID 227275380.
- ^ Lu, Jie; Liu, Anjin; Dong, Fan; Gu, Feng; Gama, Joao; Zhang, Guangquan (2018). "Learning under Concept Drift: A Review". IEEE Transactions on Knowledge and Data Engineering: 1. arXiv:2004.05785. doi:10.1109/TKDE.2018.2876857. S2CID 69449458.
- ^ a b "Driftctl and Terraform, they're two of a kind!"
- ^ a b c Girish Pancha, Big Data's Hidden Scourge: Data Drift, CMSWire, April 8, 2016
- ^ Matthew Magne, "Data Drift Happens: 7 Pesky Problems with People Data", InformationWeek, July 19, 2017
- ^ Daniel Nichter, Efficient MySQL Performance, 2021, ISBN 1098105060, p. 299
- ^ Basseville, Michele (1993). Detection of abrupt changes: theory and application. Prentice Hall. ISBN 0-13-126780-9. OCLC 876004326.
- ^ Alippi, C.; Roveri, M. (2007). "Adaptive Classifiers in Stationary Conditions". 2007 International Joint Conference on Neural Networks. IEEE. pp. 1008–13. doi:10.1109/ijcnn.2007.4371096. ISBN 978-1-4244-1380-5. S2CID 16255206.
- ^ Gama, J.; Medas, P.; Castillo, G.; Rodrigues, P. (2004). "Learning with Drift Detection". Advances in Artificial Intelligence – SBIA 2004. Springer. pp. 286–295. doi:10.1007/978-3-540-28645-5_29. ISBN 978-3-540-28645-5. S2CID 2606652.
- ^ Alippi, C.; Boracchi, G.; Roveri, M. (2011). "A just-in-time adaptive classification system based on the intersection of confidence intervals rule". Neural Networks. 24 (8): 791–800. doi:10.1016/j.neunet.2011.05.012. PMID 21723706.
- ^ Widmer, G.; Kubat, M. (1996). "Learning in the presence of concept drift and hidden contexts". Machine Learning. 23 (1): 69–101. doi:10.1007/bf00116900. S2CID 206767784.
- ^ Elwell, R.; Polikar, R. (2011). "Incremental Learning of Concept Drift in Nonstationary Environments". IEEE Transactions on Neural Networks. 22 (10): 1517–31. doi:10.1109/tnn.2011.2160459. PMID 21824845. S2CID 9136731.
- ^ Céspedes Sisniega, Jaime; López García, Álvaro (2024). "Frouros: An open-source Python library for drift detection in machine learning systems" (PDF). SoftwareX. 26. Elsevier: 101733. doi:10.1016/j.softx.2024.101733. hdl:10261/358367.