Concept drift

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Inzl2555 at 12:12, 27 February 2009 (Benchmark datasets). It may differ significantly from the current revision.

In predictive analytics and machine learning, concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes.

The term concept refers to the quantity you are looking to predict. More generally, it can also refer to other phenomena of interest besides the target concept, such as an input, but, in the context of concept drift, the term commonly refers to the target variable.
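The informal definition above is often written in terms of probability distributions. The notation below is an assumption introduced here for illustration, not taken from the article: X denotes the inputs, y the target variable, and P_t the joint distribution of the data at time t. Under that notation, drift between two points in time t_0 and t_1 can be stated as:

```latex
\[
\exists\, X :\; P_{t_0}(X, y) \neq P_{t_1}(X, y),
\qquad \text{where} \qquad
P_t(X, y) = P_t(X)\, P_t(y \mid X).
\]
```

A change in P_t(y | X) means that the same inputs now call for a different prediction, which is the case that degrades a fixed model.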

Examples

In a fraud detection application, the target concept may be a binary attribute FRAUDULENT with values "yes" or "no" that indicates whether a given transaction is fraudulent. In a weather prediction application, there may be several target concepts, such as TEMPERATURE, PRESSURE, and HUMIDITY.

The behavior of customers in an online shop may change over time. Suppose you want to predict weekly merchandise sales and have developed a predictive model that works to your satisfaction. The model may use inputs such as the amount of money spent on advertising, the promotions you are running, and other metrics that may affect sales. You are likely to find that the model becomes less and less accurate over time; this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality: shopping behavior changes with the season, so sales will likely be higher in the winter holiday season than during the summer.
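The loss of accuracy described above can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not a model of any real shop: the `weekly_sales` process, its coefficients, and the gradual weakening of the advertising effect are all hypothetical and chosen only to make the drift visible. A line is fitted once on the first 20 weeks and then evaluated, unchanged, on later weeks.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def weekly_sales(ad_spend, week, rng):
    """Hypothetical process: the payoff of advertising shrinks every week."""
    effect = 3.0 - 0.02 * week  # assumed drifting relationship
    return 50.0 + effect * ad_spend + rng.gauss(0.0, 1.0)


def mean_abs_error(model, weeks, rng):
    """Average absolute prediction error of a fixed model over some weeks."""
    intercept, slope = model
    errors = []
    for week in weeks:
        spend = rng.uniform(10.0, 20.0)
        predicted = intercept + slope * spend
        errors.append(abs(predicted - weekly_sales(spend, week, rng)))
    return sum(errors) / len(errors)


rng = random.Random(0)

# Train once on weeks 0-19 and never update the model afterwards.
train_weeks = range(0, 20)
train_x = [rng.uniform(10.0, 20.0) for _ in train_weeks]
train_y = [weekly_sales(x, w, rng) for x, w in zip(train_x, train_weeks)]
model = fit_line(train_x, train_y)

early_error = mean_abs_error(model, range(20, 40), rng)
late_error = mean_abs_error(model, range(80, 100), rng)
print(f"mean absolute error, weeks 20-39: {early_error:.2f}")
print(f"mean absolute error, weeks 80-99: {late_error:.2f}")
```

The model itself never changes; only the process generating the data does, which is why the error on the later weeks is larger.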

Possible remedies

To prevent deterioration in prediction accuracy over time, the model has to be refreshed periodically. One approach is to retrain the model using only the most recently observed samples (Widmer and Kubat, 1996). Another approach is to add new inputs that may better explain the causes of the concept drift. In the sales prediction application, you may be able to reduce concept drift by adding information about the season to your model. Providing the time of year will likely reduce the rate at which the model deteriorates, but it is unlikely ever to prevent concept drift altogether, because actual shopping behavior does not follow any static, finite model. New factors that influence shopping behavior may arise at any time, and the influence of the known factors or their interactions may change.
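The first remedy, retraining on only the most recent samples, can be sketched as follows. This is an illustrative fixed-size sliding window, not the specific algorithm of Widmer and Kubat; the `drifting_sales` process and the `WINDOW` size of 20 weeks are assumptions made for the example. Each week, a model refitted on the last `WINDOW` observations is compared with a model fitted once and then left alone.

```python
import random
from collections import deque


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(model, x):
    intercept, slope = model
    return intercept + slope * x


def drifting_sales(ad_spend, week, rng):
    """Hypothetical process whose advertising effect weakens over time."""
    effect = 3.0 - 0.02 * week  # assumed drift
    return 50.0 + effect * ad_spend + rng.gauss(0.0, 1.0)


WINDOW = 20  # number of most recent weeks used for retraining (assumed)

rng = random.Random(1)
history = deque(maxlen=WINDOW)  # old samples fall out automatically
static_model = None
static_errors, refreshed_errors = [], []

for week in range(120):
    spend = rng.uniform(10.0, 20.0)
    sales = drifting_sales(spend, week, rng)

    if len(history) == WINDOW:
        # Refit on the most recent WINDOW samples before seeing this week.
        xs, ys = zip(*history)
        refreshed_model = fit_line(xs, ys)
        if static_model is None:
            static_model = refreshed_model  # frozen after the first fit
        static_errors.append(abs(predict(static_model, spend) - sales))
        refreshed_errors.append(abs(predict(refreshed_model, spend) - sales))

    history.append((spend, sales))

static_mae = sum(static_errors) / len(static_errors)
refreshed_mae = sum(refreshed_errors) / len(refreshed_errors)
print(f"never retrained:         {static_mae:.2f}")
print(f"retrained on last {WINDOW} weeks: {refreshed_mae:.2f}")
```

The window size is a trade-off: a short window adapts quickly but fits on few samples, while a long window is more stable but keeps outdated samples for longer.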

Concept drift cannot be avoided when predicting a complex phenomenon that is not governed by fixed laws of nature. All processes that arise from human activity, such as socioeconomic processes, as well as biological processes, are likely to experience concept drift. Therefore, periodic retraining, also known as refreshing, of the model is inescapable.

Benchmark datasets

Real

  • Elec2, electricity demand, 2 classes, 45312 instances. Reference: M. Harries, Splice-2 comparative evaluation: Electricity pricing, Technical report, The University of New South Wales, 1999. Download from J. Gama's webpage.
  • Text mining, a collection of text mining datasets with concept drift, maintained by I. Katakis.

Artificial

  • STAGGER, J.C. Schlimmer, R.H. Granger, Incremental Learning from Noisy Data, Mach. Learn., vol. 1, no. 3, 1986.
  • SEA concepts, N.W. Street, Y. Kim, A streaming ensemble algorithm (SEA) for large-scale classification, KDD'01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 2001. Download from J. Gama's webpage.
  • A toolbox, A. Narasimhamurthy, L.I. Kuncheva, A framework for generating data to simulate changing environments, Proc. IASTED, Artificial Intelligence and Applications, Innsbruck, Austria, 2007.
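To give a sense of what an artificial drift benchmark looks like, the sketch below generates a stream in the style of the SEA concepts listed above. It follows the description commonly given for that dataset (three attributes drawn uniformly from [0, 10], a label that depends on whether the first two sum to at most a threshold, and a threshold that changes from one block of the stream to the next). The default thresholds, the 10% label noise, and the function name `sea_stream` are assumptions of this example and should be checked against the paper before being used for comparisons.

```python
import random


def sea_stream(n_per_concept, thresholds=(8.0, 9.0, 7.0, 9.5),
               noise=0.10, seed=0):
    """Yield (concept_index, features, label) for a SEA-style stream.

    The threshold values and noise rate are assumed defaults; pass your
    own values to match a particular experimental setup.
    """
    rng = random.Random(seed)
    for concept, theta in enumerate(thresholds):
        for _ in range(n_per_concept):
            # Three attributes; only the first two determine the label.
            features = [rng.uniform(0.0, 10.0) for _ in range(3)]
            label = int(features[0] + features[1] <= theta)
            if rng.random() < noise:
                label = 1 - label  # flip the label to simulate noise
            yield concept, features, label


stream = list(sea_stream(n_per_concept=1000))

# The share of positive labels shifts whenever the threshold changes.
positive_rate = {}
for concept in range(4):
    labels = [y for c, _, y in stream if c == concept]
    positive_rate[concept] = sum(labels) / len(labels)
    print(f"concept {concept}: positive rate {positive_rate[concept]:.3f}")
```

Because the change points are known exactly, a stream like this makes it possible to measure how quickly a learner recovers after each change, which is harder to do with real data.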

Data generation frameworks

  • Lindstrom P., S.J. Delany & B. MacNamee, Autopilot: Simulating Changing Concepts in Real Data, in: Proceedings of the 19th Irish Conference on Artificial Intelligence & Cognitive Science, D. Bridge, K. Brown, B. O'Sullivan & H. Sorensen (eds.), 2008, p272-263
  • Narasimhamurthy A., L.I. Kuncheva, A framework for generating data to simulate changing environments, Proc. IASTED, Artificial Intelligence and Applications, Innsbruck, Austria, 2007, 384-389

Bibliographic references

Reviews

  • Kuncheva, L.I., Classifier ensembles for detecting concept change in streaming data: Overview and perspectives, Proc. 2nd Workshop SUEMA 2008 (ECAI 2008), Patras, Greece, 2008, 5-10.
  • Gaber, M.M., Zaslavsky, A., and Krishnaswamy, S., Mining Data Streams: A Review, ACM SIGMOD Record, Vol. 34, No. 1, June 2005, ISSN: 0163-5808.
  • Tsymbal, A., The problem of concept drift: Definitions and related work, Technical Report, Department of Computer Science, Trinity College Dublin, Ireland, 2004.
  • Kuncheva, L.I., Classifier ensembles for changing environments, Proceedings 5th International Workshop on Multiple Classifier Systems, MCS2004, Cagliari, Italy, in F. Roli, J. Kittler and T. Windeatt (Eds.), Lecture Notes in Computer Science, Vol. 3077, 2004, 1-15.
