Scikit-learn: Associative Learning Algorithms

Created on 13 Dec 2013  ·  32 Comments  ·  Source: scikit-learn/scikit-learn

I noticed that there were no Associative Learning Algorithms such as:

Apriori Algorithm
Eclat (Equivalence Class Transformation)
PrefixSpan
FP-Growth

All of them are used to detect combinations of patterns in a dataset.

Some of them are somewhat difficult to implement; I would estimate about 200 lines of code each.
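For a sense of scale, the core of Apriori really is small. A minimal, self-contained sketch in plain Python (illustrative only, not a proposed sklearn API; the candidate-pruning step that checks all subsets is omitted for brevity):

```python
from itertools import combinations

def apriori(transactions, min_support=0.5):
    """Return frequent itemsets (as frozensets) mapped to their support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every individual item
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        # Count each candidate's support in one pass over the data
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join step: combine frequent k-itemsets into (k+1)-candidates
        keys = list(level)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        k += 1
    return frequent

baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]
result = apriori(baskets, min_support=0.5)
```

With this toy data, every single item and every pair is frequent at support 0.5, while the full triple falls below the threshold.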

New Feature


All 32 comments

I'm not sure item set mining is in the scope of sklearn. I only know the apriori algorithm, but I know there are more advanced ones. I guess one could fit them into the API using sparse indicator matrices, but somehow they seem very disjoint from the rest of sklearn.

They can be used as a precursor for the CBA algorithm, a decision tree algorithm for categorical data.

There are no decision trees (or any other algorithm) for categorical data without a one-hot-transform in sklearn.

I think frequent itemset mining should be considered off-topic. None of the core developers works in that area, so any submitted code is likely to become orphaned. We've been trying to reduce the scope of the library for this very reason.

Also, I believe that the kinds of code patterns involved will be very different
from what we currently have.

Not saying that it is not interesting, just saying the tool should be a
different one.

Hi,

I have some knowledge of Apriori and FP growth algorithm. I'd like to work on this issue. Is there anyone else already working on it, and if so I'd like to help with that too.

Closing this issue. I think association learning should be prototyped in a separate package; if it turns out that the code and interfaces are similar enough to ours, we can consider the code for merging into scikit-learn.

Sad decision!

Very reasonable decision :)

:+1: for focus
:-1: for excluding a whole class of well known unsupervised learning algorithms

@joernhees could you explain how this formulation of unsupervised learning even fits into the scikit-learn API? If not easily, then it probably belongs in scope of a different project that can establish its own API. I think @larsmans made that quite clear above, and it doesn't deserve a snide response.

Sorry if this came across as snide; that wasn't my intention.

I originally arrived here searching for association rule learning algorithms and simply expected to find them in sklearn (as it's a pretty awesome collection of machine learning algorithms, and I usually find most things I need in it; big thank you for that).

After reading this thread I was both pleased and disappointed, and wanted to voice both:

  • Pleased to see that you made the good software engineering decision to focus (which is difficult).
  • Disappointed that association rule mining isn't part of it, and that there's another person out there who misses it. As I said, it can be seen as its own class of unsupervised learning algorithms, and it's quite successful (Amazon). Maybe it's a bit too much data mining and a bit too little machine learning for sklearn, but twist it a bit and you get rule learning, which is quite useful for the explainable prediction of, for example, the next action an actor might take.

You're right that association rule mining doesn't fully fit into the current API. Conceptually I see it somewhere between dimensionality reduction techniques and hierarchical clustering; API-wise it's probably closest to hierarchical clustering.

As two lines were probably too short to express that in a friendly way, please accept my apologies.

No problem. There are definitely Python implementations of Apriori.
Building a good library that collects together alternatives, and gives them
a consistent (scikit-learn-like) API, seems like a nice project. I think
classifiers based on association rule mining may well be in scope for
scikit-learn, but unless they are sufficiently popular and standardised
already, it runs the risk of becoming code without a maintainer.


I think this would be worthwhile; the article "Comparing Association Rules and Decision Trees for Disease Prediction" demonstrates clear advantages in comparison with decision trees.

This blog post includes Python code for A-Priori; it might be interesting to have a go at implementing these algorithms sometime. Is there any work on a separate prototyping package?

None so far. Maybe you can try to gather support for this on the mailing list?

I am, for one, disappointed that these algorithms are not implemented in sklearn. My advisor is Jiawei Han, the author of FP-growth and PrefixSpan, and the number of citations for both of those papers ("Mining frequent patterns without candidate generation" and "Mining sequential patterns by pattern-growth") is proof that both of those algorithms have a place in sklearn.

Just because scikit-learn has a popularity criterion for included
algorithms, that doesn't mean every popular algorithm should be included.
Scikit-learn needs to have limited scope, and this is simply too far from
classification and regression-like problems (although I'd be interested to
see a successful association-based classifier implemented).

Feel free to be disappointed, but I strongly doubt that ARL techniques will
be directly included in scikit-learn in the foreseeable future (although
another project may provide them with a scikit-learn-like API). There are
other projects where these algorithms are more appropriate, but if you're
disappointed with them too, go make your own.


Association learning algorithms are simply too far from classification- and regression-like problems. We could, however, consider frequent itemset/pattern mining as a feature generation step, like CountVectorizer and TfidfVectorizer: the frequent patterns could be used as input features for any classifier, which would be quite intuitive and somewhat different from applying information-gain-based decision tree learning.
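To illustrate the suggestion, here is a hypothetical transformer-style sketch (the class name and API are assumptions for illustration, not anything in scikit-learn), limited to itemsets of size at most 2 for brevity:

```python
from itertools import combinations

class FrequentItemsetFeatures:
    """Hypothetical transformer: one binary feature per frequent itemset.

    Mirrors the fit/transform convention of CountVectorizer, but is not
    part of scikit-learn; itemsets of size 1 and 2 only, for brevity.
    """

    def __init__(self, min_support=0.3):
        self.min_support = min_support

    def fit(self, transactions):
        n = len(transactions)
        sets = [frozenset(t) for t in transactions]
        items = sorted({i for t in sets for i in t})
        # Candidates: all single items, then all unordered pairs
        candidates = [frozenset([i]) for i in items]
        candidates += [frozenset(p) for p in combinations(items, 2)]
        # Keep only candidates meeting the minimum support
        self.itemsets_ = [c for c in candidates
                          if sum(1 for t in sets if c <= t) / n >= self.min_support]
        return self

    def transform(self, transactions):
        sets = [frozenset(t) for t in transactions]
        # One column per frequent itemset: 1 if the row contains it
        return [[int(c <= t) for c in self.itemsets_] for t in sets]

X = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
fe = FrequentItemsetFeatures(min_support=0.6).fit(X)
features = fe.transform(X)
```

The resulting 0/1 matrix could then be fed to any classifier, which is the "feature generation" reading of itemset mining described above.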

That's an option. Kudo and Matsumoto show how to sample a subset of the polykernel with PrefixSpan.

I can look up and check the scikit-learn documentation, but I will ask you directly: is this option (Kudo and Matsumoto) available in scikit-learn?

No. I'm just saying it could be.

+1 for the Apriori algorithm

Note that there are ML algorithms which depend upon frequent item lists as input. For example, see Cynthia Rudin's Bayesian Rule Lists (cf. http://www.stat.washington.edu/research/reports/2012/tr609%20-%20old.pdf).

Consider a data set with a response variable to be predicted for which all the features are binary indicators (perhaps as a result of one-hot-encoding). We can consider a training set row to be a 'basket' and the presence of a feature for that training set row to be an 'item' within the basket. Thus, fairly generic data sets could be operated upon by apriori, FP-growth, and other frequent itemset mining techniques.

In the Bayesian Rule List algorithm, the frequent itemsets are evaluated and eventually an if-then-else structure is created from them. See the referenced paper for more details.

The point is that having frequent itemset mining approaches available could support classifiers and regressors --- already within the scope of sklearn --- not just market basket analysis.
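The basket/item framing above can be sketched concretely (the feature names and data are made up for illustration):

```python
# One-hot feature matrix: rows are samples, columns are binary indicators.
feature_names = ["is_red", "is_round", "is_sweet"]
X = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]

# Each row becomes a 'basket' of the features present in it (the 'items').
baskets = [
    {name for name, flag in zip(feature_names, row) if flag}
    for row in X
]

# Any itemset miner can now count co-occurring feature combinations,
# e.g. the support of {"is_red", "is_round"}:
target = {"is_red", "is_round"}
support = sum(1 for b in baskets if target <= b) / len(baskets)
```

From here, apriori or FP-growth would enumerate all itemsets above a support threshold instead of checking a single one.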

That's motivation for such algorithms to be available in scipy, perhaps. Of
course, if a classifier or similar that meets scikit-learn's inclusion
guidelines were implemented with itemset mining, it's got a good chance of
inclusion, apriori and all.


I don't know how much of sklearn has changed since this conversation started but there's an entire "cluster" package that's not regression/classification either. I think a good implementation of the latest algorithms for association rules and frequent itemsets would be welcome by many in sklearn.

Clustering is much like classification, but unsupervised, and has long been part of scikit-learn. Association rule mining remains outside the primary tasks scikit-learn focuses on, and does not neatly fit its API, but might be relevant in the context of an association-based classifier.

"latest algorithms" isn't what scikit-learn is about. See our FAQ.

It would be nice not to have to repeat myself.

@actsasgeek if you want to implement association rule mining in a scikit-learn compatible way, we'd be happy to include it into scikit-learn-contrib: https://github.com/scikit-learn-contrib/scikit-learn-contrib/blob/master/README.md

I hope my repetitive question does not bother you, as I sense some opposition toward adding association rule mining to such a great library as scikit-learn. I just want an update: has any frequent itemset mining been implemented in scikit-learn in the three years since this thread was created?

Association rule mining is outside of the scope of machine learning, and
certainly out of the scope of scikit-learn.

Classification based on association rules is the only context in which we
would consider it, and then it would still need to be a hard sell.


For those who are interested,

A library called mlxtend implements the Apriori algorithm:
http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/

Yes, everybody needs it, so it would be great to have in scikit-learn.
One more link on using it in ML:
http://www2.cs.uh.edu/~ordonez/pdfwww/w-2006-HIKM-ardtmed.pdf
("Comparing Association Rules and Decision Trees for Disease Prediction")

That is pattern mining, not ML.
