Scikit-learn: Support for multi-class roc_auc scores

Created on 19 Jun 2014  ·  47 comments  ·  Source: scikit-learn/scikit-learn

Low priority feature request: support for multi-class roc_auc score calculation in sklearn.metrics using the one against all methodology would be incredibly useful.

New Feature


All 47 comments

I am not certain what that means. Do you have a reference for it?


Here's a pretty decent explanation, along with references: https://www.cs.bris.ac.uk/~flach/ICML04tutorial/ROCtutorialPartIII.pdf

Hmm, what is a recommended scorer while multi-class AUC is not implemented?

support for multi-class roc_auc score calculation in sklearn.metrics using the one against all methodology would be incredibly useful

Are you talking about what those slides consider an approximation to volume under surface in which the frequency-weighted average of AUC for each class is taken? This would seem to be identical to using the current roc_auc_score with a binarized representation and average='weighted'. (@arjoly, why do these curve-based scores disallow multiclass?)

Otherwise, those slides, and most references I can find to "multi-class ROC", focus on multi-class calibration of OvR, not on an evaluation metric. Is this what you're interested in? I have no idea how widespread this technique is, whether it's worth having such available in scikit-learn, and whether the greedy optimisation should be improved upon.
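
That is, roughly the following (a toy sketch with made-up scores, only to illustrate the "binarized representation plus average='weighted'" reading, not a statement about what the final API should be):

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

y_true = np.array([0, 1, 2, 2, 1, 0])
y_score = np.array([[0.7, 0.2, 0.1],     # made-up per-class scores, one column per class
                    [0.1, 0.6, 0.3],
                    [0.2, 0.2, 0.6],
                    [0.1, 0.3, 0.6],
                    [0.3, 0.5, 0.2],
                    [0.6, 0.3, 0.1]])

y_true_bin = label_binarize(y_true, classes=[0, 1, 2])          # (n_samples, n_classes) indicator matrix
print(roc_auc_score(y_true_bin, y_score, average="weighted"))   # frequency-weighted per-class AUC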

(@arjoly, why do these curve-based scores disallow multiclass?)

Whenever one class is missing from y_true, it's not possible to compute the score. I didn't want to add magic for class inference and get users into trouble.

It's possible that we're not dealing appropriately with the case of y_pred having a label that y_true does not. That label probably shouldn't participate in anything like a macro average (in accordance with Weka, too), or an ROC score.


@jnothman @arjoly there has been a lot of progress on the averaging front. How hard is it to implement this now?

it could perhaps be similar to the R function from the pROC package
http://www.inside-r.org/packages/cran/pROC/docs/multiclass.roc

Hi, I implemented a draft of the macro-averaged ROC/AUC score, but I am unsure whether it is a good fit for sklearn.

Here is the code:

from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelBinarizer

def multiclass_roc_auc_score(truth, pred, average="macro"):

    # Binarize the true and predicted labels into one-vs-rest indicator
    # matrices, then reuse the existing (multilabel) roc_auc_score.
    lb = LabelBinarizer()
    lb.fit(truth)

    truth = lb.transform(truth)
    pred = lb.transform(pred)

    return roc_auc_score(truth, pred, average=average)

Could it be as simple as this?
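
For illustration, a quick usage sketch of the helper above with made-up hard label predictions (note these are predicted labels, not probabilities):

import numpy as np

y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 0])             # hard class predictions from some classifier

print(multiclass_roc_auc_score(y_true, y_pred))   # macro-averaged one-vs-rest AUC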

@fbrundu perhaps, if this is the standard meaning. It is certainly one possible interpretation.

There is a nice summary here:
http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf

The pROC package implements Hand and Till:
https://link.springer.com/article/10.1023/A:1010920819831

The Hand and Till version seems to be generally accepted, and I vote we implement that.
There is also a version by Provost and Domingos, which I probably should root for given that Provost is currently my director, but that hasn't caught on.
The Provost-Domingos version is what @fbrundu said, only with average='weighted'.

TL;DR: a PR for Hand and Till is welcome, optionally also Provost and Domingos with an option to change the averaging.

Hi, has there been any progress on implementing this?
What I've seen in most other libraries (e.g. WEKA) is that they use the weighted average. I would think this is what @fbrundu proposed, using average='micro'?

@joaquinvanschoren R uses the Hand and Till approach. I'd prefer that one, too. I have a student who will work on this soon.

@amueller I can work on this :)

@kchen17 thanks!

We discussed this at OpenML quite a bit. For multiclass AUC there is no guarantee that one approach (macro-averaging, micro-averaging, weighted averaging, ...) is better than another. In R you can find at least 5 different approaches (all also available in MLR now).
When implementing this in scikit-learn, it would be great if there were at least the possibility to choose the one that makes the most sense for your application, even if you use Hand-Till as the default. Hand-Till is a non-weighted approach, by the way; it does not take label imbalance into account.

I'm happy to have multiple versions. Non-weighted and "not taking label imbalance into account" are two different things ;) Do you have a list and references?

What's micro-averaging in this case?

Note that we already have micro- and macro-averaged ROC AUC for multiclass problems implemented in this example:

http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#multiclass-settings

Actually, I think that documentation is incorrect and should say multilabel...


In micro-averaging, your true positive rate (TPR) is calculated by taking the sum of all TPs of all the classes, and dividing by the sum of all TPs and FNs of all the classes, i.e. for a 3-class problem:
TPR = (TP1+TP2+TP3)/(TP1+TP2+TP3+FN1+FN2+FN3)

Example confusion matrix:
[[1,2,3],
[4,5,6],
[7,8,9]]
TPR = (1+5+9)/(1+5+9+(2+3)+(4+6)+(7+8))
Do the same for the false positive rate and you can compute AUC.

Macro averaging just computes the TPR for each class separately and averages them (weighted by the number of examples in that class or not):
TPR = (1/3) * (TP1/(TP1+FN1) + TP2/(TP2+FN2) + TP3/(TP3+FN3))

With the same example:
TPR = (1/3)* (1/(1+(2+3)) + 5/(5+(4+6)) + 9/(9+(7+8)))

Maybe this helps (this uses precision, but the idea is the same):
http://stats.stackexchange.com/questions/156923/should-i-make-decisions-based-on-micro-averaged-or-macro-averaged-evaluation-mea

I would personally never use an unweighted macro-average, but I'll see if I can find the papers that studied this.
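
To make the arithmetic above concrete, a small numpy sketch using the same example confusion matrix:

import numpy as np

cm = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]])          # rows = true class, columns = predicted class

tp = np.diag(cm)                    # per-class true positives: [1, 5, 9]
fn = cm.sum(axis=1) - tp            # per-class false negatives: [5, 10, 15]

micro_tpr = tp.sum() / (tp.sum() + fn.sum())   # (1+5+9) / (1+5+9+5+10+15) ≈ 0.33
macro_tpr = np.mean(tp / (tp + fn))            # mean of [1/6, 5/15, 9/24] ≈ 0.29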

Hi! I was able to start looking into this issue last week, and I wanted to post a quick update/some questions, just to make sure I am on the right track.

  • So far: I am starting off with an implementation of a function multiclass_roc_auc_score which will, by default, have an average parameter set to None. This default will use the Hand-Till algorithm (as discussed, this doesn't take label imbalance into account).
  • Would the method accept the same parameters as those in roc_auc_score?
  • And going off of that, the difference would then be that y_true could have more than 2 classes of labels. Hand-Till would involve finding all possible pairs of labels, computing roc_auc_score for each of these pairs, and then taking the mean of these (see the sketch after this comment).

Let me know what corrections/suggestions you may have!
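
For reference, a rough sketch of the pairwise (Hand and Till style) averaging described in the plan above, built on the existing binary roc_auc_score; the function name hand_till_auc and its signature are illustrative, not the proposed API:

from itertools import combinations

import numpy as np
from sklearn.metrics import roc_auc_score

def hand_till_auc(y_true, y_score, labels):
    # Unweighted mean of pairwise one-vs-one AUCs (Hand & Till style sketch).
    # y_score has shape (n_samples, n_classes); column k scores class labels[k].
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pair_aucs = []
    for i, j in combinations(range(len(labels)), 2):
        # keep only samples whose true class is labels[i] or labels[j]
        mask = np.isin(y_true, [labels[i], labels[j]])
        auc_i_vs_j = roc_auc_score(y_true[mask] == labels[i], y_score[mask, i])
        auc_j_vs_i = roc_auc_score(y_true[mask] == labels[j], y_score[mask, j])
        pair_aucs.append((auc_i_vs_j + auc_j_vs_i) / 2)
    return np.mean(pair_aucs)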

Ordinarily, we would avoid creating another function if reusing roc_auc_score is reasonably feasible. I think leaving the default as 'macro' is acceptable.

One key thing you should be thinking about is how to test these changes, including changing the traits of roc_auc_score in metrics/tests/test_common.py

Yeah, we should update the docs. I think the multi-class part is not properly documented; there is no multi-class support currently.

@joaquinvanschoren interestingly, that paper didn't discuss any of the multi-class AUC papers mentioned above, in particular not the Fawcett paper from 2005... hm, I guess it's a renormalization of the 1-vs-1 multi-class approach?

So currently we only have multi-label, and we want to add multi-class with 1-vs-1 and 1-vs-rest, each of which has weighted and unweighted variants.
I don't really understand how the sample and micro averaging work for AUC :(

So... I propose we add a multi-class parameter to AUC that can be ovo or ovr, and that will consider the weighting parameter. I'm not sure we want to allow sample and micro, as those don't really make sense to me.

@arjoly so micro and sample averaging operate on the rows rather than the columns of the matrix? Are there any papers about that? I didn't find that in the ROC literature.

The problem with that is that to make the Hand-Till measure the default we'd have to do weighted average OvO, and we can't really change the weighting option. So maybe we do OvR by default, explain in the narrative documentation that OvO with weighting is also a good choice, and add a reference?

The summary of the paper @joaquinvanschoren cited also says that all the AUC versions give pretty much the same results.

@amueller: Had a chance to read your comment again, and I'm a little confused about this part:

The problem with that is that to make the hand-till measure default we'd have to do weighted average OvO and we can't really change the weighting option. So maybe we do OVR by default and explain in the narrative that OvO with weighting is also a good choice and add a reference?

I was going to modify the roc_auc_score to incorporate a multiclass=['ovo', 'ovr'] parameter as per your response. If OvR is default (roc_auc_score(y_true, y_score, multiclass="ovo" ... )), but Hand & Till is OvO, what do I do w.r.t. addressing the OvR part of the implementation? (i.e. if I detect that y_true is multiclass, just raise an error if "ovr" is unimplemented and instruct users to pass in "ovo"?)

Sorry, I was expecting you to implement both ovo and ovr ;) I think that should be fairly straightforward.

@amueller: Noted and that will be incorporated as well! Also wanted to ask: is there any advice on how to detect the difference between multiclass and multilabel? At first, I was just checking the dimensions of y_score but very quickly realized this would not be sufficient. (i.e. just checking that the labels are only 0s and 1s?)

Multilabel means that multiple labels are predicted at once: you get a vector of predictions per instance. Multiclass means you get a single prediction, but that prediction can have more than two values (it is not binary).

Sometimes people solve the multiclass case by binarizing the output, hence you get multiple binary values per instance (hence multilabel), and this often causes confusion.

Hi, I hope type_of_target could serve the purpose of differentiating between multi-label and multi-class output. HTH

Using type_of_target is a good idea, though in scikit-learn the dimensionality of y is actually the indicator of whether we want to do multi-label or multi-target. If you binarize the output as @joaquinvanschoren suggested, scikit-learn will always assume multi-label.
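
For reference, type_of_target lives in sklearn.utils.multiclass; a quick sketch of what it returns for the shapes discussed here:

from sklearn.utils.multiclass import type_of_target

print(type_of_target([0, 2, 1, 1]))                          # 'multiclass'
print(type_of_target([[1, 0, 1], [0, 1, 0]]))                # 'multilabel-indicator'
print(type_of_target([[0.2, 0.5, 0.3], [0.6, 0.1, 0.3]]))    # 'continuous-multioutput' (e.g. a y_score matrix)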

type_of_target is fine to distinguish between the y_trues, @amueller


Hi all, I just wanted to let you know that I submitted a "preliminary" PR. I am interested in hearing some feedback on implementation (e.g. I'm sure there are ways to leverage numpy/etc. in a better way than I am doing right now), along with best practices for adding new tests, documentation wording, etc.

Thank you for all of the help so far!

Any progress on adding multiclass support for AUC?

@joaquinvanschoren: working on revisions after a code review by @jnothman in #7663. Will likely submit another update on that next week when I've finished with midterms.

Hi @kathyxchen, @jnothman,

Any updates on the PR?

Just checking in to see if there is any progress on adding multiclass support for AUC?

We have trouble determining what is both an accepted and principled formulation of ROC AUC for multiclass. See https://github.com/scikit-learn/scikit-learn/pull/7663#issuecomment-307566895 and below.

So, fellows, is there any progress with the multiclass AUC score? I found the official documentation code with the iris dataset very confusing, because this method shows that my model predicts random numbers fairly well.

This is almost done, we need to decide on an API detail before merging: https://github.com/scikit-learn/scikit-learn/pull/12789#discussion_r295693965
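
For context, this API detail became the multi_class parameter of roc_auc_score; assuming the interface that eventually shipped in scikit-learn 0.22, usage looks roughly like this:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)

print(roc_auc_score(y_test, proba, multi_class="ovr", average="macro"))   # one-vs-rest
print(roc_auc_score(y_test, proba, multi_class="ovo", average="macro"))   # one-vs-one (Hand & Till style)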

@trendsearcher can you provide an example please? It's now merged but I'd like to see the issue you experienced.

Glad to help. How can I give an example (it has lots of code and may not be intuitive)? Maybe I can write it in plain text?


(Quoting @fbrundu's multiclass_roc_auc_score draft from earlier in the thread.)

@fbrundu Thank you for sharing! I tried your code, but when I called this function I ran into a problem saying "Multioutput target data is not supported with label binarization". Then I removed the line pred = lb.transform(pred) from the function. However, I ran into another problem: "Found input variables with inconsistent numbers of samples: [198, 4284]".

May I ask if you could help me solve this? Thank you!

@Junting-Wang

 I ran into a problem saying "Multioutput target data is not supported with label binarization".

You have to use predict instead of predict_proba.
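
In other words, with the helper above (clf, X_test and y_test are hypothetical placeholders for an already-fitted classifier and held-out data):

y_pred = clf.predict(X_test)          # 1-D array of class labels: LabelBinarizer can handle this
# clf.predict_proba(X_test) returns a 2-D (n_samples, n_classes) array, which LabelBinarizer
# rejects with "Multioutput target data is not supported with label binarization".
print(multiclass_roc_auc_score(y_test, y_pred))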

@fbrundu is your implementation correct? I use it and it works.
