Sequential pattern mining: Difference between revisions
m Disambiguating links to Blast (link changed to BLAST (biotechnology)) using DisamAssist. |
|||
(40 intermediate revisions by 30 users not shown) | |||
Line 1: | Line 1: | ||
'''Sequential Pattern mining''' is a topic of [[data mining]] concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.<ref>{{cite doi|10.1145/1824795.1824798 }}</ref> It is usually presumed that the values are discrete, and thus [[time series]] mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of [[structured data mining]]. |
|||
'''Sequential pattern mining''' is a topic of [[data mining]] concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.<ref>{{Cite journal | last1 = Mabroukeh | first1 = N. R. | last2 = Ezeife | first2 = C. I. | doi = 10.1145/1824795.1824798 | title = A taxonomy of sequential pattern mining algorithms | journal = ACM Computing Surveys | volume = 43 | pages = 1–41 | year = 2010 | citeseerx = 10.1.1.332.4745 | s2cid = 207180619 }}</ref><ref>{{Cite journal | last1 = Bechini | first1 = A. | last2 = Bondielli | first2 = A. | last3 = Dell'Oglio | first3 = P.| last4 = Marcellonii | first4 = F. | doi = 10.3934/aci.2023004 | title = From basic approaches to novel challenges and applications in Sequential Pattern Mining | journal = Applied Computing and Intelligence | volume = 3 | issue = 1 | pages = 44–78 | year = 2023 | url = https://www.aimspress.com/article/id/63e21839ba35de2c6e2b0152 | doi-access = free }}</ref> It is usually presumed that the values are discrete, and thus [[time series]] mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of [[structured data mining]]. |
|||
⚫ | There are several key traditional computational problems addressed within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members. In general, sequence mining problems can be classified as ''string mining'' which is typically based on [[ |
||
⚫ | There are several key traditional computational problems addressed within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for [[similarity measure|similarity]], and recovering missing sequence members. In general, sequence mining problems can be classified as ''string mining'' which is typically based on [[String (computer science)|string processing algorithms]] and ''itemset mining'' which is typically based on [[association rule learning]]. ''Local process models'' <ref>{{Cite journal | last1 = Tax| first1 = N. | last2 = Sidorova | first2 = N. | last3 = Haakma | first3 = R. | last4 = van der Aalst | first4 = Wil M. P. | doi = 10.1016/j.jides.2016.11.001 | title = Mining Local Process Models | journal = Journal of Innovation in Digital Ecosystems | volume = 3 |issue=2 | pages = 183–196 | year = 2016 | arxiv = 1606.06066 | s2cid = 10872379 }}</ref> extend sequential pattern mining to more complex patterns that can include (exclusive) choices, loops, and concurrency constructs in addition to the sequential ordering construct. |
||
⚫ | |||
⚫ | String mining typically deals with a limited [[alphabet]] for items that appear in a [[sequence]], but the sequence itself may be typically very long. Examples of an alphabet can be those in the [[ASCII]] character set used in natural language text, [[nucleotide]] bases 'A', 'G', 'C' and 'T' in [[DNA sequences]], or [[amino acids]] for [[protein sequences]]. In [[biology]] applications analysis of the arrangement of the alphabet in strings can be used to examine [[gene]] and [[protein]] sequences to determine their properties. |
||
⚫ | |||
⚫ | A survey and taxonomy of the key algorithms for sequence comparison for bioinformatics is presented by Abouelhoda & Ghanem (2010), which include:<ref>{{cite book | |
||
⚫ | String mining typically deals with a limited [[alphabet]] for items that appear in a [[sequence]], but the sequence itself may be typically very long. Examples of an alphabet can be those in the [[ASCII]] character set used in natural language text, [[nucleotide]] bases 'A', 'G', 'C' and 'T' in [[DNA sequences]], or [[amino acids]] for [[protein sequences]]. In [[biology]] applications analysis of the arrangement of the alphabet in strings can be used to examine [[gene]] and [[protein]] sequences to determine their properties. Knowing the sequence of letters of a [[DNA]] or a [[protein]] is not an ultimate goal in itself. Rather, the major task is to understand the sequence, in terms of its structure and [[Function (biology)|biological function]]. This is typically achieved first by identifying individual regions or structural units within each sequence and then assigning a function to each structural unit. In many cases this requires comparing a given sequence with previously studied ones. The comparison between the strings becomes complicated when [[insertion (genetics)|insertions]], [[Deletion (genetics)|deletions]] and [[mutations]] occur in a string. |
||
⚫ | A survey and taxonomy of the key algorithms for sequence comparison for bioinformatics is presented by Abouelhoda & Ghanem (2010), which include:<ref>{{cite book |first1=M. |last1=Abouelhoda |first2=M. |last2=Ghanem |chapter=String Mining in Bioinformatics |editor-first=M. M. |editor-last=Gaber |title=Scientific Data Mining and Knowledge Discovery |publisher=Springer |year=2010 |isbn=978-3-642-02787-1 |doi=10.1007/978-3-642-02788-8_9 }}</ref> |
||
* '''Repeat-related problems:''' that deal with operations on single sequences and can be based on [[String searching algorithm|exact string matching]] or [[approximate string matching]] methods for finding dispersed fixed length and maximal length repeats, finding tandem repeats, and finding unique subsequences and missing (un-spelled) subsequences. |
* '''Repeat-related problems:''' that deal with operations on single sequences and can be based on [[String searching algorithm|exact string matching]] or [[approximate string matching]] methods for finding dispersed fixed length and maximal length repeats, finding tandem repeats, and finding unique subsequences and missing (un-spelled) subsequences. |
||
⚫ | * '''Alignment problems:''' that deal with comparison between strings by first aligning one or more sequences; examples of popular methods include [[BLAST (biotechnology)|BLAST]] for comparing a single sequence with multiple sequences in a database, and [[ClustalW]] for multiple alignments. Alignment algorithms can be based on either exact or approximate methods, and can also be classified as global alignments, semi-global alignments and local alignment. See [[sequence alignment]]. |
||
⚫ | |||
⚫ | * '''Alignment problems:''' that deal with comparison between strings by first aligning one or more sequences; examples of popular methods include [[BLAST]] for comparing a single sequence with multiple sequences in a database, and [[ClustalW]] for multiple alignments. Alignment algorithms can be based on either exact or approximate methods, and can also be classified as global alignments, semi-global alignments and local alignment. See [[sequence alignment]]. |
||
⚫ | Some problems in sequence mining lend themselves to discovering frequent itemsets and the order they appear, for example, one is seeking rules of the form "if a {customer buys a car}, he or she is likely to {buy insurance} within 1 week", or in the context of stock prices, "if {Nokia up and Ericsson up}, it is likely that {Motorola up and Samsung up} within 2 days". Traditionally, itemset mining is used in marketing applications for discovering regularities between frequently co-occurring items in large transactions. For example, by analysing transactions of customer shopping baskets in a supermarket, one can produce a rule which reads "if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat in the same transaction". |
||
⚫ | A survey and taxonomy of the key algorithms for item set mining is presented by Han et al. (2007).<ref>{{cite journal |first1=J. |last1=Han |first2=H. |last2=Cheng |first3=D. |last3=Xin |first4=X. |last4=Yan |title=Frequent pattern mining: current status and future directions |journal=Data Mining and Knowledge Discovery |year=2007 |volume=15 |issue=1 |pages=55–86 |doi=10.1007/s10618-006-0059-1 |doi-access=free }}</ref> |
||
⚫ | |||
⚫ | Some problems in sequence mining lend themselves discovering frequent itemsets and the order they appear, for example, one is seeking rules of the form "if a {customer buys a car}, he or she is likely to {buy insurance} within 1 week", or in the context of stock prices, "if {Nokia up and Ericsson |
||
⚫ | |||
⚫ | A survey and taxonomy of the key algorithms for item set mining is presented by Han et al. (2007).<ref>{{cite journal | |
||
==Applications== |
|||
⚫ | |||
⚫ | With a great variation of products and user buying behaviors, shelf on which products are being displayed is one of the most important resources in retail environment. Retailers can not only increase their profit but, also decrease cost by proper management of shelf space allocation and products display. To solve this problem, George and Binu (2013) have proposed an approach to mine user [[buying pattern]]s using PrefixSpan algorithm and place the products on shelves based on the order of mined purchasing patterns.<ref>{{cite journal |first1=A. |last1=George |first2=D. |last2=Binu |title=An Approach to Products Placement in Supermarkets Using PrefixSpan Algorithm |journal=Journal of King Saud University-Computer and Information Sciences |volume=25 |issue=1 |year=2013 |pages=77–87 |doi=10.1016/j.jksuci.2012.07.001 |doi-access=free }}</ref> |
||
==Application== |
|||
⚫ | With a great variation of products and user buying behaviors, shelf on which products are being displayed is one of the most important resources in retail environment. Retailers can not only increase their profit but, also decrease cost by proper management of shelf space allocation and products display. To solve this problem, George and Binu (2013) have proposed an approach to mine user buying |
||
==Algorithms== |
==Algorithms== |
||
Commonly used algorithms include: |
Commonly used algorithms include: |
||
* [[GSP |
* [[GSP algorithm]] |
||
* |
* Sequential Pattern Discovery using Equivalence classes (SPADE) |
||
⚫ | |||
* [[Apriori algorithm]] |
|||
⚫ | |||
⚫ | |||
⚫ | * MAPres<ref>{{cite journal|last=Ahmad|first=Ishtiaq|author2=Qazi, Wajahat M. |author3=Khurshid, Ahmed |author4=Ahmad, Munir |author5=Hoessli, Daniel C. |author6=Khawaja, Iffat |author7=Choudhary, M. Iqbal |author8=Shakoori, Abdul R. |author9= Nasir-ud-Din |title=MAPRes: Mining association patterns among preferred amino acid residues in the vicinity of amino acids targeted for post-translational modifications|journal=Proteomics|date=1 May 2008|volume=8|issue=10|pages=1954–1958|doi=10.1002/pmic.200700657|pmid=18491291|s2cid=22362167}}</ref> |
||
⚫ | |||
* Seq2Pat (for constraint-based sequential pattern mining)<ref name="hosseininasab2019">{{cite journal | doi = 10.1609/aaai.v33i01.33011495 |vauthors=Hosseininasab A, van Hoeve WJ, Cire AA | year = 2019 | title = Constraint-Based Sequential Pattern Mining with Decision Diagrams | url = https://www.aaai.org/ojs/index.php/AAAI/article/view/3962 | journal = Proceedings of the AAAI Conference on Artificial Intelligence | volume = 33 | pages = 1495–1502 |arxiv=1811.06086 |s2cid=53427299 | doi-access = free }}</ref><ref>{{Cite web|url=https://github.com/fidelity/seq2pat|title = Seq2Pat: Sequence-to-Pattern Generation Library|website = [[GitHub]]|date = 9 April 2022}}</ref> |
|||
⚫ | * |
||
==See also== |
==See also== |
||
* {{annotated link|Collocation extraction}} |
|||
* [[Association rule learning]] |
|||
⚫ | |||
* [[Data Mining]] |
|||
* {{annotated link|Sequence analysis}} |
|||
⚫ | |||
* |
* {{annotated link|Sequence analysis in social sciences}} |
||
* |
* {{annotated link|Sequence clustering}} |
||
* |
* {{annotated link|Sequence labeling}} |
||
* [[string (computer science)]] |
|||
* [[Sequence alignment]] |
|||
* [[Time series]] |
|||
==References== |
==References== |
||
Line 47: | Line 43: | ||
==External links== |
==External links== |
||
* [http://www.philippe-fournier-viger.com/spmf/ SPMF] includes open-source implementations of GSP, PrefixSpan, SPADE, SPAM |
* [http://www.philippe-fournier-viger.com/spmf/ SPMF] includes open-source implementations of GSP, PrefixSpan, SPADE, SPAM and many others. |
||
{{Strings |state=collapsed}} |
|||
{{DEFAULTSORT:Sequential Pattern Mining}} |
{{DEFAULTSORT:Sequential Pattern Mining}} |
Latest revision as of 18:28, 11 December 2023
Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.[1][2] It is usually presumed that the values are discrete, and thus time series mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of structured data mining.
There are several key traditional computational problems addressed within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members. In general, sequence mining problems can be classified as string mining which is typically based on string processing algorithms and itemset mining which is typically based on association rule learning. Local process models [3] extend sequential pattern mining to more complex patterns that can include (exclusive) choices, loops, and concurrency constructs in addition to the sequential ordering construct.
String mining
[edit]String mining typically deals with a limited alphabet for items that appear in a sequence, but the sequence itself may be typically very long. Examples of an alphabet can be those in the ASCII character set used in natural language text, nucleotide bases 'A', 'G', 'C' and 'T' in DNA sequences, or amino acids for protein sequences. In biology applications analysis of the arrangement of the alphabet in strings can be used to examine gene and protein sequences to determine their properties. Knowing the sequence of letters of a DNA or a protein is not an ultimate goal in itself. Rather, the major task is to understand the sequence, in terms of its structure and biological function. This is typically achieved first by identifying individual regions or structural units within each sequence and then assigning a function to each structural unit. In many cases this requires comparing a given sequence with previously studied ones. The comparison between the strings becomes complicated when insertions, deletions and mutations occur in a string.
A survey and taxonomy of the key algorithms for sequence comparison for bioinformatics is presented by Abouelhoda & Ghanem (2010), which include:[4]
- Repeat-related problems: that deal with operations on single sequences and can be based on exact string matching or approximate string matching methods for finding dispersed fixed length and maximal length repeats, finding tandem repeats, and finding unique subsequences and missing (un-spelled) subsequences.
- Alignment problems: that deal with comparison between strings by first aligning one or more sequences; examples of popular methods include BLAST for comparing a single sequence with multiple sequences in a database, and ClustalW for multiple alignments. Alignment algorithms can be based on either exact or approximate methods, and can also be classified as global alignments, semi-global alignments and local alignment. See sequence alignment.
Itemset mining
[edit]Some problems in sequence mining lend themselves to discovering frequent itemsets and the order they appear, for example, one is seeking rules of the form "if a {customer buys a car}, he or she is likely to {buy insurance} within 1 week", or in the context of stock prices, "if {Nokia up and Ericsson up}, it is likely that {Motorola up and Samsung up} within 2 days". Traditionally, itemset mining is used in marketing applications for discovering regularities between frequently co-occurring items in large transactions. For example, by analysing transactions of customer shopping baskets in a supermarket, one can produce a rule which reads "if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat in the same transaction".
A survey and taxonomy of the key algorithms for item set mining is presented by Han et al. (2007).[5]
The two common techniques that are applied to sequence databases for frequent itemset mining are the influential apriori algorithm and the more-recent FP-growth technique.
Applications
[edit]With a great variation of products and user buying behaviors, shelf on which products are being displayed is one of the most important resources in retail environment. Retailers can not only increase their profit but, also decrease cost by proper management of shelf space allocation and products display. To solve this problem, George and Binu (2013) have proposed an approach to mine user buying patterns using PrefixSpan algorithm and place the products on shelves based on the order of mined purchasing patterns.[6]
Algorithms
[edit]Commonly used algorithms include:
- GSP algorithm
- Sequential Pattern Discovery using Equivalence classes (SPADE)
- FreeSpan
- PrefixSpan
- MAPres[7]
- Seq2Pat (for constraint-based sequential pattern mining)[8][9]
See also
[edit]- Collocation extraction – Computational technique to find word sequences
- Process mining – Data mining technique using event logs
- Sequence analysis – Identification and study of genomic sequences
- Sequence analysis in social sciences – Analysis of sets of categorical sequences
- Sequence clustering – algorithm
- Sequence labeling – pattern recognition
References
[edit]- ^ Mabroukeh, N. R.; Ezeife, C. I. (2010). "A taxonomy of sequential pattern mining algorithms". ACM Computing Surveys. 43: 1–41. CiteSeerX 10.1.1.332.4745. doi:10.1145/1824795.1824798. S2CID 207180619.
- ^ Bechini, A.; Bondielli, A.; Dell'Oglio, P.; Marcellonii, F. (2023). "From basic approaches to novel challenges and applications in Sequential Pattern Mining". Applied Computing and Intelligence. 3 (1): 44–78. doi:10.3934/aci.2023004.
- ^ Tax, N.; Sidorova, N.; Haakma, R.; van der Aalst, Wil M. P. (2016). "Mining Local Process Models". Journal of Innovation in Digital Ecosystems. 3 (2): 183–196. arXiv:1606.06066. doi:10.1016/j.jides.2016.11.001. S2CID 10872379.
- ^ Abouelhoda, M.; Ghanem, M. (2010). "String Mining in Bioinformatics". In Gaber, M. M. (ed.). Scientific Data Mining and Knowledge Discovery. Springer. doi:10.1007/978-3-642-02788-8_9. ISBN 978-3-642-02787-1.
- ^ Han, J.; Cheng, H.; Xin, D.; Yan, X. (2007). "Frequent pattern mining: current status and future directions". Data Mining and Knowledge Discovery. 15 (1): 55–86. doi:10.1007/s10618-006-0059-1.
- ^ George, A.; Binu, D. (2013). "An Approach to Products Placement in Supermarkets Using PrefixSpan Algorithm". Journal of King Saud University-Computer and Information Sciences. 25 (1): 77–87. doi:10.1016/j.jksuci.2012.07.001.
- ^ Ahmad, Ishtiaq; Qazi, Wajahat M.; Khurshid, Ahmed; Ahmad, Munir; Hoessli, Daniel C.; Khawaja, Iffat; Choudhary, M. Iqbal; Shakoori, Abdul R.; Nasir-ud-Din (1 May 2008). "MAPRes: Mining association patterns among preferred amino acid residues in the vicinity of amino acids targeted for post-translational modifications". Proteomics. 8 (10): 1954–1958. doi:10.1002/pmic.200700657. PMID 18491291. S2CID 22362167.
- ^ Hosseininasab A, van Hoeve WJ, Cire AA (2019). "Constraint-Based Sequential Pattern Mining with Decision Diagrams". Proceedings of the AAAI Conference on Artificial Intelligence. 33: 1495–1502. arXiv:1811.06086. doi:10.1609/aaai.v33i01.33011495. S2CID 53427299.
- ^ "Seq2Pat: Sequence-to-Pattern Generation Library". GitHub. 9 April 2022.
External links
[edit]- SPMF includes open-source implementations of GSP, PrefixSpan, SPADE, SPAM and many others.