Sequential pattern mining

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 146.169.4.7 (talk) at 17:41, 26 January 2012. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Sequence mining is concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. The values are usually presumed to be discrete; time series mining is therefore closely related, but is usually considered a different activity. Sequence mining is a special case of structured data mining.

There are two different kinds of sequence mining: string mining and itemset mining.

String Mining

String mining is widely used in biology to examine gene and protein sequences, and is primarily concerned with sequences that have a single member at each position. A variety of prominent algorithms exist for aligning a query sequence with those held in databases. The alignment may either match a query against one subject (e.g. BLAST) or match multiple query sequences with each other (e.g. ClustalW).
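
As an illustration of aligning a query against a single subject, the following is a minimal sketch of local alignment scoring by dynamic programming (the Smith-Waterman recurrence). It is not the heuristic that BLAST uses; the function name and the scoring values (match, mismatch, gap) are assumptions chosen only for this example.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two strings (Smith-Waterman).

    The scoring parameters are illustrative assumptions, not values
    taken from any particular tool.
    """
    rows, cols = len(query) + 1, len(subject) + 1
    # score[i][j] = best score of an alignment ending at query[i-1], subject[j-1]
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = query[i - 1] == subject[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap    # gap in the subject
            left = score[i][j - 1] + gap  # gap in the query
            # Flooring at 0 lets an alignment start anywhere (local alignment).
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best
```

With these assumed scores, two sequences that share only the substring "ACG" score 6, and sequences with no symbol in common score 0.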

String Mining in Bioinformatics

A taxonomy of the key algorithms for sequence comparison in bioinformatics is presented in [1]:

  • Alignment problems: Global, semi-global and local sequence alignment and biological database search methods.
  • Repeat-related problems: Exact and approximate methods for finding dispersed fixed length and maximal length repeats, finding tandem repeats, and finding unique subsequences and missing (un-spelled) subsequences.
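
One of the repeat-related problems listed above, finding exact tandem repeats of a fixed unit length, can be sketched with a brute-force left-to-right scan. This is only an illustration under simplifying assumptions (exact matches, a unit length given in advance); the function name and return format are hypothetical, and practical tools use far more efficient index structures.

```python
def find_tandem_repeats(seq, unit_len, min_copies=2):
    """Return (start, unit, copies) tuples for exact tandem repeats.

    Scans left to right; once a run is reported, the scan resumes
    after it, so shifted phases of the same run are not reported again.
    """
    if unit_len < 1:
        raise ValueError("unit_len must be at least 1")
    found = []
    i = 0
    while i + unit_len * min_copies <= len(seq):
        unit = seq[i:i + unit_len]
        copies = 1
        j = i + unit_len
        # Count how many times the unit repeats back to back.
        while seq[j:j + unit_len] == unit:
            copies += 1
            j += unit_len
        if copies >= min_copies:
            found.append((i, unit, copies))
            i = j  # skip past the whole run
        else:
            i += 1
    return found
```

For example, scanning "GGACACACTT" with a unit length of 2 reports a single run: the unit "AC" starting at index 2, repeated 3 times.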


Itemset Mining

Itemset mining is used more often in marketing and CRM applications, and is concerned with multiple symbols at each position. Itemset mining is also a popular approach to text mining. Two common techniques applied to sequence databases for frequent itemset mining are the influential Apriori algorithm and the more recent FP-Growth technique. However, nothing in these techniques restricts them to sequences per se.
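
The level-wise idea behind Apriori can be sketched as follows: frequent itemsets of size k are built only from frequent itemsets of size k-1, because every subset of a frequent itemset must itself be frequent. This is a minimal, unoptimized sketch; the function name, the use of an absolute support count, and the shopping-basket data in the usage note are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support_count} for every frequent itemset.

    min_support is an absolute count of transactions (an assumption of
    this sketch; many presentations use a fraction instead).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count the support of each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent
```

Given the four hypothetical baskets {bread, milk}, {bread, butter}, {bread, milk, butter} and {milk} with a minimum support of 2, the sketch reports the three single items plus {bread, milk} and {bread, butter}. The pair {milk, butter} occurs only once, so the triple is never generated as a candidate.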

Challenges

There are several key problems within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members.
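
As one concrete instance of comparing sequences for similarity, the sketch below computes the Levenshtein edit distance between two sequences: the minimum number of single-symbol insertions, deletions and substitutions needed to turn one into the other. Edit distance is only one of many possible similarity measures and is chosen here as an assumption for illustration; the function name is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (unit costs)."""
    # prev[j] = distance between the first i-1 symbols of a and the first j of b
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning the first i symbols of a into "" takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x with y, or keep a match
            ))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second sequence rather than with the product of both lengths.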


References

  1. M. Abouelhoda and M. Ghanem. "String Mining in Bioinformatics". In M. M. Gaber (ed.), Scientific Data Mining and Knowledge Discovery. Springer, 2009. ISBN 3642027873.