Scikit-learn: Support for multi-class roc_auc scores

Created on 19 Jun 2014  ·  47 comments  ·  Source: scikit-learn/scikit-learn

Low priority feature request: support for multi-class roc_auc score calculation in sklearn.metrics using the one against all methodology would be incredibly useful.

New Feature


All 47 comments

I am not certain what that means. Do you have a reference for it?


Here's a pretty decent explanation, along with references: https://www.cs.bris.ac.uk/~flach/ICML04tutorial/ROCtutorialPartIII.pdf

Hmm, what is a recommended scorer while multi-class AUC is not implemented?

support for multi-class roc_auc score calculation in sklearn.metrics using the one against all methodology would be incredibly useful

Are you talking about what those slides consider an approximation to volume under surface in which the frequency-weighted average of AUC for each class is taken? This would seem to be identical to using the current roc_auc_score with a binarized representation and average='weighted'. (@arjoly, why do these curve-based scores disallow multiclass?)

Otherwise, those slides, and most references I can find to "multi-class ROC", focus on multi-class calibration of OvR, not on an evaluation metric. Is this what you're interested in? I have no idea how widespread this technique is, whether it's worth having such available in scikit-learn, and whether the greedy optimisation should be improved upon.
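
That is, roughly the following (a toy sketch with made-up scores, only to illustrate the "binarized representation plus average='weighted'" reading, not a statement about what the final API should be):

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

y_true = np.array([0, 1, 2, 2, 1, 0])
y_score = np.array([[0.7, 0.2, 0.1],     # made-up per-class scores, one column per class
                    [0.1, 0.6, 0.3],
                    [0.2, 0.2, 0.6],
                    [0.1, 0.3, 0.6],
                    [0.3, 0.5, 0.2],
                    [0.6, 0.3, 0.1]])

y_true_bin = label_binarize(y_true, classes=[0, 1, 2])          # (n_samples, n_classes) indicator matrix
print(roc_auc_score(y_true_bin, y_score, average="weighted"))   # frequency-weighted per-class AUC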

(@arjoly, why do these curve-based scores disallow multiclass?)

Whenever one class is missing from y_true, it's not possible to compute the score. I didn't want to add magic for class inference and get users into trouble.

It's possible that we're not dealing appropriately with the case of y_pred having a label that y_true does not. That label probably shouldn't participate in anything like a macro average (in accordance with Weka, too), or an ROC score.


@jnothman @arjoly there has been a lot of progress on the averaging front. How hard is it to implement this now?

it could perhaps be similar to the R function from the pROC package
http://www.inside-r.org/packages/cran/pROC/docs/multiclass.roc

Hi, I implemented a draft of the macro-averaged ROC/AUC score, but I am unsure whether it is a good fit for sklearn.

Here is the code:

from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelBinarizer

def multiclass_roc_auc_score(truth, pred, average="macro"):

    # Binarize the true and predicted labels into one-vs-rest indicator
    # matrices, then reuse the existing (multilabel) roc_auc_score.
    lb = LabelBinarizer()
    lb.fit(truth)

    truth = lb.transform(truth)
    pred = lb.transform(pred)

    return roc_auc_score(truth, pred, average=average)

Could it be as simple as this?
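
For illustration, a quick usage sketch of the helper above with made-up hard label predictions (note these are predicted labels, not probabilities):

import numpy as np

y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 0])             # hard class predictions from some classifier

print(multiclass_roc_auc_score(y_true, y_pred))   # macro-averaged one-vs-rest AUC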

@fbrundu perhaps, if this is the standard meaning. It is certainly one possible interpretation.

There is a nice summary here:
http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf

The pROC package implements Hand and Till:
https://link.springer.com/article/10.1023/A:1010920819831

The Hand and Till version seems to be generally accepted, and I vote we implement that.
There is also a version by Provost and Domingos, which I probably should root for given that Provost is currently my director, but that hasn't caught on.
The Provost-Domingos version is what @fbrundu said, only with average='weighted'.

TL;DR: a PR for Hand and Till is welcome, optionally also Provost and Domingos with an option to change the averaging.

Hi, has there been any progress on implementing this?
What I've seen in most other libraries (e.g. WEKA) is that they use the weighted average. I would think this is what @fbrundu proposed, using average='micro'?

@joaquinvanschoren R uses the Hand and Till approach. I'd prefer that one, too. I have a student who will work on this soon.

@amueller I can work on this :)

@kchen17 thanks!

We discussed this at OpenML quite a bit. For multiclass AUC there is no guarantee that one approach (macro-averaging, micro-averaging, weighted averaging, ...) is better than another. In R you can find at least 5 different approaches (all also available in MLR now).
When implementing this in scikit-learn, it would be great if there were at least the possibility to choose the one that makes the most sense for your application, even if you use Hand-Till as the default. Hand-Till is a non-weighted approach, by the way; it does not take label imbalance into account.

I'm happy to have multiple versions. Non-weighted and "not taking label imbalance into account" are two different things ;) Do you have a list and references?

What's micro-averaging in this case?

Note that we already have micro- and macro-averaged ROC AUC for multiclass problems implemented in this example:

http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#multiclass-settings

Actually, I think that documentation is incorrect and should say multilabel...


In micro-averaging, your true positive rate (TPR) is calculated by taking the sum of all TPs of all the classes, and dividing by the sum of all TPs and FNs of all the classes, i.e. for a 3-class problem:
TPR = (TP1+TP2+TP3)/(TP1+TP2+TP3+FN1+FN2+FN3)

Example confusion matrix:
[[1,2,3],
[4,5,6],
[7,8,9]]
TPR = (1+5+9)/(1+5+9+(2+3)+(4+6)+(7+8))
Do the same for the false positive rate and you can compute AUC.

Macro averaging just computes the TPR for each class separately and averages them (weighted by the number of examples in that class or not):
TPR = (1/3) * (TP1/(TP1+FN1) + TP2/(TP2+FN2) + TP3/(TP3+FN3))

With the same example:
TPR = (1/3)* (1/(1+(2+3)) + 5/(5+(4+6)) + 9/(9+(7+8)))

Maybe this helps (this uses precision, but the idea is the same):
http://stats.stackexchange.com/questions/156923/should-i-make-decisions-based-on-micro-averaged-or-macro-averaged-evaluation-mea

I would personally never use an unweighted macro-average, but I'll see if I can find the papers that studied this.
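
To make the arithmetic above concrete, a small numpy sketch using the same example confusion matrix:

import numpy as np

cm = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]])          # rows = true class, columns = predicted class

tp = np.diag(cm)                    # per-class true positives: [1, 5, 9]
fn = cm.sum(axis=1) - tp            # per-class false negatives: [5, 10, 15]

micro_tpr = tp.sum() / (tp.sum() + fn.sum())   # (1+5+9) / (1+5+9+5+10+15) ≈ 0.33
macro_tpr = np.mean(tp / (tp + fn))            # mean of [1/6, 5/15, 9/24] ≈ 0.29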

Hi! I was able to start looking into this issue last week, and I wanted to post a quick update/some questions, just to make sure I am on the right track.

  • So far: I am starting off with an implementation of a function multiclass_roc_auc_score which will, by default, have an average parameter set to None. This default will use the Hand-Till algorithm (as discussed, this doesn't take label imbalance into account).
  • Would the method accept the same parameters as those in roc_auc_score?
  • And going off of that, the difference would then be that y_true could have more than 2 classes of labels. Hand-Till would involve finding all possible pairs of labels, computing roc_auc_score for each of these pairs, and then taking the mean of these (see the sketch after this comment).

Let me know what corrections/suggestions you may have!
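
For reference, a rough sketch of the pairwise (Hand and Till style) averaging described in the plan above, built on the existing binary roc_auc_score; the function name hand_till_auc and its signature are illustrative, not the proposed API:

from itertools import combinations

import numpy as np
from sklearn.metrics import roc_auc_score

def hand_till_auc(y_true, y_score, labels):
    # Unweighted mean of pairwise one-vs-one AUCs (Hand & Till style sketch).
    # y_score has shape (n_samples, n_classes); column k scores class labels[k].
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pair_aucs = []
    for i, j in combinations(range(len(labels)), 2):
        # keep only samples whose true class is labels[i] or labels[j]
        mask = np.isin(y_true, [labels[i], labels[j]])
        auc_i_vs_j = roc_auc_score(y_true[mask] == labels[i], y_score[mask, i])
        auc_j_vs_i = roc_auc_score(y_true[mask] == labels[j], y_score[mask, j])
        pair_aucs.append((auc_i_vs_j + auc_j_vs_i) / 2)
    return np.mean(pair_aucs)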

Ordinarily, we would avoid creating another function if reusing roc_auc_score is reasonably feasible. I think leaving the default as 'macro' is acceptable.

One key thing you should be thinking about is how to test these changes, including changing the traits of roc_auc_score in metrics/tests/test_common.py

Yeah, we should update the docs. I think the multi-class part is not properly documented; there is no multi-class support currently.

@joaquinvanschoren interestingly, that paper didn't discuss any of the multi-class AUC papers mentioned above, in particular not the Fawcett paper from 2005... hm, I guess it's a renormalization of the 1-vs-1 multi-class approach?

So currently we only have multi-label, and we want to add multi-class with 1-vs-1 and 1-vs-rest, each of which has weighted and unweighted variants.
I don't really understand how the sample and micro averaging work for AUC :(

So... I propose we add a multi-class parameter to AUC that can be ovo or ovr, and that will consider the weighting parameter. I'm not sure we want to allow sample and micro, as those don't really make sense to me.

@arjoly so micro and sample averaging operate on the rows rather than the columns of the matrix? Are there any papers about that? I didn't find that in the ROC literature.

The problem with that is that to make the Hand-Till measure the default we'd have to do weighted average OvO, and we can't really change the weighting option. So maybe we do OvR by default, explain in the narrative documentation that OvO with weighting is also a good choice, and add a reference?

The summary of the paper @joaquinvanschoren cited also says that all the AUC versions give pretty much the same results.

@amueller: Had a chance to read your comment again, and I'm a little confused about this part:

The problem with that is that to make the hand-till measure default we'd have to do weighted average OvO and we can't really change the weighting option. So maybe we do OVR by default and explain in the narrative that OvO with weighting is also a good choice and add a reference?

I was going to modify the roc_auc_score to incorporate a multiclass=['ovo', 'ovr'] parameter as per your response. If OvR is default (roc_auc_score(y_true, y_score, multiclass="ovo" ... )), but Hand & Till is OvO, what do I do w.r.t. addressing the OvR part of the implementation? (i.e. if I detect that y_true is multiclass, just raise an error if "ovr" is unimplemented and instruct users to pass in "ovo"?)

Sorry, I was expecting you to implement both ovo and ovr ;) I think that should be fairly straightforward.

@amueller: Noted and that will be incorporated as well! Also wanted to ask: is there any advice on how to detect the difference between multiclass and multilabel? At first, I was just checking the dimensions of y_score but very quickly realized this would not be sufficient. (i.e. just checking that the labels are only 0s and 1s?)

Multilabel means that multiple labels are predicted at once: you get a vector of predictions per instance. Multiclass means you get a single prediction, but that prediction can have more than two values (it is not binary).

Sometimes people solve the multiclass case by binarizing the output, hence you get multiple binary values per instance (hence multilabel), and this often causes confusion.

Hi, I hope type_of_target could serve the purpose of differentiating between multi-label and multi-class output. HTH

Using type_of_target is a good idea, though in scikit-learn the dimensionality of y is actually the indicator of whether we want to do multi-label or multi-target. If you binarize the output as @joaquinvanschoren suggested, scikit-learn will always assume multi-label.
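
For reference, type_of_target lives in sklearn.utils.multiclass; a quick sketch of what it returns for the shapes discussed here:

from sklearn.utils.multiclass import type_of_target

print(type_of_target([0, 2, 1, 1]))                          # 'multiclass'
print(type_of_target([[1, 0, 1], [0, 1, 0]]))                # 'multilabel-indicator'
print(type_of_target([[0.2, 0.5, 0.3], [0.6, 0.1, 0.3]]))    # 'continuous-multioutput' (e.g. a y_score matrix)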

type_of_target is fine to distinguish between the y_trues, @amueller


Hi all, I just wanted to let you know that I submitted a "preliminary" PR. I am interested in hearing some feedback on implementation (e.g. I'm sure there are ways to leverage numpy/etc. in a better way than I am doing right now), along with best practices for adding new tests, documentation wording, etc.

Thank you for all of the help so far!

Any progress on adding multiclass support for AUC?

@joaquinvanschoren: working on revisions after a code review by @jnothman in #7663. Will likely submit another update on that next week when I've finished with midterms.

Hi @kathyxchen, @jnothman,

Any updates on the PR?

Just checking in to see if there is any progress on adding multiclass support for AUC?

We have trouble determining what is both an accepted and principled formulation of ROC AUC for multiclass. See https://github.com/scikit-learn/scikit-learn/pull/7663#issuecomment-307566895 and below.

So, fellows, is there any progress with the multiclass AUC score? I found the official documentation code with the iris dataset very confusing, because this method shows that my model predicts random numbers fairly well.

This is almost done, we need to decide on an API detail before merging: https://github.com/scikit-learn/scikit-learn/pull/12789#discussion_r295693965
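
For context, this API detail became the multi_class parameter of roc_auc_score; assuming the interface that eventually shipped in scikit-learn 0.22, usage looks roughly like this:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)

print(roc_auc_score(y_test, proba, multi_class="ovr", average="macro"))   # one-vs-rest
print(roc_auc_score(y_test, proba, multi_class="ovo", average="macro"))   # one-vs-one (Hand & Till style)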

@trendsearcher can you provide an example please? It's now merged but I'd like to see the issue you experienced.

Glad to help. How can I give an example (it has lots of code and may not be intuitive)? Maybe I can write it in plain text?


(Quoting @fbrundu's multiclass_roc_auc_score draft from earlier in the thread.)

@fbrundu Thank you for sharing! I tried your code, but when I called this function I ran into a problem saying "Multioutput target data is not supported with label binarization". Then I removed the line pred = lb.transform(pred) from the function. However, I ran into another problem: "Found input variables with inconsistent numbers of samples: [198, 4284]".

May I ask if you could help me solve this? Thank you!

@Junting-Wang

 I ran into a problem saying "Multioutput target data is not supported with label binarization".

You have to use predict instead of predict_proba.
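
In other words, with the helper above (clf, X_test and y_test are hypothetical placeholders for an already-fitted classifier and held-out data):

y_pred = clf.predict(X_test)          # 1-D array of class labels: LabelBinarizer can handle this
# clf.predict_proba(X_test) returns a 2-D (n_samples, n_classes) array, which LabelBinarizer
# rejects with "Multioutput target data is not supported with label binarization".
print(multiclass_roc_auc_score(y_test, y_pred))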

@fbrundu is your implementation correct? I use it and it works.
