Scikit-learn: Implement Gower Similarity Coefficient

Created on 19 Nov 2015  ·  51 Comments  ·  Source: scikit-learn/scikit-learn

As suggested by @lesshaste

Paper - http://cbio.ensmp.fr/~jvert/svn/bibli/local/Gower1971general.pdf

I can implement this if there is sufficient interest.

@jnothman @amueller @agramfort

New Feature

Most helpful comment

Hi,

In order to contribute somehow, I implemented the Gower function according to the original paper, along with the adaptations needed in the pdist module, because internally pdist makes several numerical transformations that will fail if you use a matrix with mixed data.

The results I have obtained with this so far are the same as those from R's daisy function.

The source code is available in this Jupyter notebook: https://sourceforge.net/projects/gower-distance-4python/files/

Feel free to use it.

All 51 comments

Thanks.

The documentation for R's daisy function might also be relevant, https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html, as it is a popular implementation of the Gower coefficient.

suggested where? in what context?

@agramfort I suggested it on gitter. The main interest in this coefficient is when the variables have mixed types (that is, categorical, numerical, ordinal). One popular use case is the daisy() function in the R cluster package mentioned before, for clustering data with mixed types (see page 27 of https://cran.r-project.org/web/packages/cluster/cluster.pdf). More generally, http://www.clustan.talktalk.net/gower_similarity.html claims "Gower's General Similarity Coefficient is one of the most popular measures of proximity for mixed data types.", which seems like a plausible claim.
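For readers unfamiliar with the coefficient, here is a minimal NumPy sketch of the per-feature scores described in Gower (1971): numeric features contribute 1 - |x_ik - x_jk| / range_k, categorical features contribute 1 if the values match and 0 otherwise, and the similarity is the average of these scores (the function and variable names below are purely illustrative, not from any proposed scikit-learn API):

    import numpy as np

    def gower_similarity(xi, xj, is_categorical, ranges):
        """Gower (1971) similarity between two samples -- illustrative sketch.

        xi, xj         : 1-D arrays of mixed values (object dtype)
        is_categorical : boolean mask, True where a feature is categorical
        ranges         : per-feature range (max - min) of the numeric features,
                         computed over the whole data set
        """
        scores = np.empty(len(xi))
        for k, cat in enumerate(is_categorical):
            if cat:
                # categorical feature: 1 if the two values match, 0 otherwise
                scores[k] = 1.0 if xi[k] == xj[k] else 0.0
            else:
                # numeric feature: 1 minus the range-normalised absolute difference
                r = ranges[k]
                scores[k] = 1.0 - (abs(float(xi[k]) - float(xj[k])) / r if r != 0 else 0.0)
        return scores.mean()  # the Gower *distance* is then 1 - similarity

    # toy data: (age, income, marital status)
    X = np.array([[30, 1000.0, "single"],
                  [45, 3000.0, "married"]], dtype=object)
    is_cat = np.array([False, False, True])
    ranges = np.array([40.0, 5000.0, 0.0])  # pretend these come from the full data set
    print(gower_similarity(X[0], X[1], is_cat, ranges))  # ~0.408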

is there a benchmark or convincing example that would motivate this?

@agramfort I think it's more that we currently have no other way of calculating a dissimilarity coefficient for mixed data types, and this appears to be the standard one. I can find lots of examples and questions/answers online where people explain what the Gower coefficient is or suggest its use for mixed data types, but nothing I could call a benchmark yet. The original paper has been cited 2298 times according to Google Scholar.

ok I am convinced :)

@agramfort Great! This change would nicely complement https://github.com/scikit-learn/scikit-learn/pull/4899, which introduces native categorical variable support for trees.

Having said that, I now realise that scikit-learn currently has no native support for ordinals at all, so this part of my suggestion would be slightly ahead of its time. I suppose one could regard it positively as the first step towards support for ordinal features.

@amueller To be tagged with [New Feature]...

Hi,

In order to contribute somehow, I implemented the Gower function according to the original paper, along with the adaptations needed in the pdist module, because internally pdist makes several numerical transformations that will fail if you use a matrix with mixed data.

The results I have obtained with this so far are the same as those from R's daisy function.

The source code is available in this Jupyter notebook: https://sourceforge.net/projects/gower-distance-4python/files/

Feel free to use it.

I was just wondering if there was any update on this? Plus, is the issue noted by @marcelobeckmann still relevant?

@ashimb9 it seems we need someone to integrate the code from @marcelobeckmann

@agramfort Hmm, in that case I am going to have a go when I have some free time. By the way, do you happen to know anything about the current state of the issue noted above, that "internally pdist makes several numerical transformations that will fail if you use a matrix with mixed data"?

Hi, there are some private functions (e.g., _convert_to_double, _copy_array_if_base_present) in pdist that assume the underlying data is completely numeric, which is not true when you have a DataFrame with categorical data.

I volunteer to integrate this code and make it available in a fork, you can assign this ticket to me.

The GitHub assign feature only works for team members.


No worries, I'll fork it and you can get the code later. For me the important thing is to contribute. I'll let you know when it's done.

Thanks @marcelobeckmann for taking this up. While you are at it (and if it is feasible for you), I was wondering whether you would also consider adding support for Gower calculation on data with NaN values, as implemented in the daisy package in R (which you also referenced above)?
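For context on what that would involve (this is only a sketch of the idea used by daisy and in Gower's paper, not the code from the fork): a missing value simply gives that feature a weight of zero for the pair, and the similarity is divided by the sum of the weights instead of the number of features.

    import numpy as np

    def _is_missing(v):
        # treat None and float NaN as missing, for numeric and categorical alike
        return v is None or (isinstance(v, float) and np.isnan(v))

    def gower_similarity_nan(xi, xj, is_categorical, ranges):
        """Sketch: per-feature scores averaged with weight 0 whenever either
        value is missing (the delta_ijk term in Gower 1971)."""
        scores, weights = [], []
        for k, cat in enumerate(is_categorical):
            a, b = xi[k], xj[k]
            if _is_missing(a) or _is_missing(b):
                scores.append(0.0)
                weights.append(0.0)   # this comparison does not count
                continue
            weights.append(1.0)
            if cat:
                scores.append(1.0 if a == b else 0.0)
            else:
                r = ranges[k]
                scores.append(1.0 - (abs(float(a) - float(b)) / r if r != 0 else 0.0))
        total = sum(weights)
        # if every feature is missing for this pair, the similarity is undefined
        return np.nan if total == 0 else float(np.dot(scores, weights) / total)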

I have finished the integration of Gower into sklearn.metrics.pairwise (also observing the treatment of NaN values). I'm going to prepare some unit tests before submitting my forked code.

@marcelobeckmann Great! Thank you so much, especially for including NaN support! :)

PS: If I may suggest, you might want to consider initiating a pull request so the reviewers can begin looking at your code while you work on the unit tests and so forth.

I made a pull request some days ago, b5884.

Yes, it's in the queue to be reviewed.


I made the changes required by CI, and all the checks have passed.

@marcelobeckmann great work! You might want to change line 659 to something like:

    ranges_of_numeric[col] = 1 - min / max if max != 0 else 0.0

otherwise I'm getting division-by-zero warnings in your second test case.
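For anyone hitting the same warning on whole columns at once, the vectorised equivalent is to divide only where the denominator is non-zero, e.g. with np.divide's where mask (a sketch of the pattern, not the code in the PR):

    import numpy as np

    mins = np.array([0.0, 1.0])   # per-column minima of the numeric features
    maxs = np.array([0.0, 5.0])   # per-column maxima; the first column is all zeros

    # divide only where the denominator is non-zero, so no RuntimeWarning is raised
    ratio = np.divide(mins, maxs, out=np.zeros_like(mins), where=maxs != 0)
    ranges_of_numeric = np.where(maxs != 0, 1.0 - ratio, 0.0)   # [0.0, 0.8]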

Hi, I changed the code to avoid warnings as proposed by Pierre Wessman, and CI is green. I need someone to review my code.

@marcelobeckmann and potentially others.

Hi Marcelo (or potentially others), I've got a few quick questions regarding your implementation of the Gower coefficient, which you have placed here: https://sourceforge.net/projects/gower-distance-4python/files/.

  1. Do I need a pandas DataFrame for feeding the original data into the function, or can I use a NumPy array too?

  2. I am importing my data into a NumPy array. All columns are numerical real numbers apart from the first column, which is a unique ID. I am getting two issues:

  • Firstly, when I run the function it raises a DataConversionWarning saying the dtype U7 was converted to object. I assumed this was because the array entries for some reason appear in quotation marks and hence are strings, so I cast the array entries to int32, for example, and it still gives the conversion warning, saying int32 was converted to object.

  • Secondly, and probably linked to the above, every time I run the function and plot the result I get a different visualisation (a different spread of the points).

Would you be able to advise me on the above, please?

Thanks very much

Hi Ali,

Thanks for your interest in this implementation of Gower distance.

While the code I made a pull request for has not yet been approved by the scikit-learn committers (CI is green and it is just waiting for review), I have pushed the newest, stable implementation to: https://sourceforge.net/projects/gower-distance-4python/files/gower_function-v3.ipynb/download

Let's go through your questions:

  1. Do I need a pandas DataFrame for feeding the original data into the function, or can I use a NumPy array too?

Answer: You can use a DataFrame or a NumPy array in this new version 3. Sparse matrices are also supported.

  2. I am importing my data into a NumPy array. All columns are numerical real numbers apart from the first column, which is a unique ID. I am getting two issues:

  • Firstly, when I run the function it raises a DataConversionWarning saying the dtype U7 was converted to object; casting the array entries to int32 still gives the warning, now saying int32 was converted to object.

Answer: This new version supports numeric categorical attributes; there is an extra parameter, categorical_features, which you can set to an array with False (for numerical attributes) or True (for the categorical ones).

  • Secondly, and probably linked to the above, every time I run the function and plot the result I get a different visualisation (a different spread of the points).

Answer: The new version I pushed solves this problem.
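To make the categorical_features parameter concrete, the call is roughly as follows (the data below are invented, and gower_distance is assumed to be the function defined in the notebook linked above, so it has to be imported or pasted into your session first):

    import numpy as np
    # gower_distance is the function from gower_function-v3.ipynb (not on PyPI);
    # define or paste it into your session before running this sketch.

    X = np.array([[22, 1500.0, 0],    # last column: marital status encoded as 0/1/2
                  [35, 2300.0, 1],
                  [63, 5200.0, 2]], dtype=object)

    # numerically-encoded categorical columns must be flagged explicitly,
    # otherwise they would be treated as numeric and scaled by their range
    D = gower_distance(X, categorical_features=[False, False, True])
    print(D.shape)   # (3, 3) pairwise distance matrix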

Note that I do intend to review this PR, but it is not the very highest priority at the moment.

@marcelobeckmann

Hi Marcelo (or potentially others), another quick question regarding your implementation of the Gower coefficient, which you have placed here: https://sourceforge.net/projects/gower-distance-4python/files/

  1. Is gower_single_function-v2.ipynb the final version, and does it deal with NaN as well?

  2. More importantly, does this implementation allow you to get the similarities within one single data sample? In most cases what you need is the Gower distance between each pair of observations within a single dataset, as opposed to comparing two different datasets.

Thanks very much

Hi Ali,

  1. The latest one is gower_function-v3.ipynb, which is a copy of the one I pushed to scikit-learn, and yes, it deals with NaN propagation.

  2. You can use gower_distance(X) on its own if your categorical attributes are not numeric, or gower_distance(X, categorical_features=[False, True, False, ...]) if your categorical attributes are represented as numbers.

Please let me know in private if you have any problems, because this implementation that I pushed to the internet should not be the concern of scikit-learn; they have a lot to do, and this is not the best place to discuss something that is outside the scikit-learn project.

@marcelobeckmann Hello Marcelo,
Should the categorical_features parameter's value be True or False if we have the categorical variables encoded into a numeric format?

I also get the following error:
ValueError: Found array with 0 sample(s) (shape=(0, 0)) while a minimum of 1 is required by check_pairwise_arrays.

It worked successfully on the same data previously, but now it gives this error. Why could that be?
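For reference, that particular ValueError comes from scikit-learn's input validation and is raised whenever the array reaching check_pairwise_arrays has ended up empty, for example after a filter that matched no rows; a minimal reproduction of the message (using check_array, the validator those checks rely on):

    import numpy as np
    from sklearn.utils import check_array

    X = np.empty((0, 0))    # an input that became empty somewhere upstream
    try:
        check_array(X)      # same validation that check_pairwise_arrays applies
    except ValueError as e:
        print(e)            # Found array with 0 sample(s) (shape=(0, 0)) while a
                            # minimum of 1 is required...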

Hi @bendiste,

If you represent True and False as 1 and 0 you will get the same results.

Are you using the newest notebook, gower_function-v6.4.ipynb, from https://sourceforge.net/projects/gower-distance-4python/files/?

I'm finishing writing an article; hopefully this month I'll be able to make the requested changes so my implementation can be accepted into the scikit-learn master branch.

Hi @marcelobeckmann, thank you for your reply. And yes, I am using the newest version you indicated. When I re-downloaded it, it worked successfully. I would like to ask a couple of things, since I am a newbie in machine learning:
1- Can I use KPCA to reduce the dimensionality of the data before feeding it into a hierarchical clustering algorithm?
2- Or do I have to use the whole high-dimensional dataset as the input to hierarchical clustering?
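As an aside on the hierarchical-clustering part of that question: once a Gower distance matrix has been precomputed (by whichever implementation), it can be passed straight to AgglomerativeClustering. A rough sketch, with random data standing in for the real distance matrix (the parameter is called metric in recent scikit-learn releases and affinity in older ones):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # D stands in for an (n_samples, n_samples) Gower distance matrix
    rng = np.random.RandomState(0)
    D = rng.rand(10, 10)
    D = (D + D.T) / 2.0           # symmetrise
    np.fill_diagonal(D, 0.0)      # zero self-distances

    model = AgglomerativeClustering(
        n_clusters=3,
        metric="precomputed",     # "affinity" in scikit-learn < 1.2
        linkage="average",        # "ward" cannot be used with precomputed distances
    )
    labels = model.fit_predict(D)
    print(labels)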

Hi @marcelobeckmann,
thank you for this implementation!

I have tried the gower_function-v6.4 version.
I can see that the distances in your unit tests are the same whether or not you specify the categorical columns. I have also tried it with my own data, where it also does not affect the result.

Is this correct?

Thank you!

Hi @annelaura,

Sorry about the delay in replying. Yes, that is correct: that test was just to check that the categorical_features=[0, 1] parameter does not affect the results when the non-numeric columns can also be identified as objects. The input data is the same, so the results must be the same.

Now that I have finished some papers, I'm back to work on finally proposing my implementation for the scikit-learn master branch! :)

@marcelobeckmann any news regarding this? :)

Hi Alex, I've finished all the modifications the reviewers have asked for so far in the pull request, and CI is green. I have also pinged the reviewers to check whether they are happy, so that we can close this pull request and push it into a release.

Any updates? @marcelobeckmann

Work in progress after review.

Has the PR been approved? @marcelobeckmann

Not yet, work is in progress after some recent code review.

Too bad, I need it.

Is the function itself available somewhere, so I can use it on my own (for research purposes)?

Thanks

You can take the latest commit of this function in this PR:
https://github.com/scikit-learn/scikit-learn/pull/9555

I managed to make it work locally. Thanks!

Just a quick +1 on this ticket! Thanks for all the work on this.

Bump. This would be a great addition. I can't believe it has taken 4 years for a relatively simple calculation to make it into sklearn!!

Or you could say: thanks for your dedicated persistence over four years of volunteered effort!

Or you could say: thanks for your dedicated persistence over four years of volunteered effort!

You are right, sorry. I didn't mean to come across as rude. I greatly appreciate the effort. I've been using this locally for a while now, and it would be great to see it added. It's the only distance metric that I know of for mixed data types.

Aside from the volunteer effort, and the fact that the core devs have not considered this urgent, there are indeed challenges around how to handle mixed types, and around how to perform the scaling in a train-test setup.
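To make the train-test scaling point concrete, one possibility is to learn the per-feature ranges on the training data only and reuse them when computing distances involving test samples. A sketch with invented names (this is the kind of API decision the PR still has to settle, not its actual design):

    import numpy as np

    class GowerRanges:
        """Sketch: learn numeric mins/ranges on the training data only,
        then reuse them for any later (train-test) Gower computation."""

        def fit(self, X_train, is_categorical):
            self.is_categorical_ = np.asarray(is_categorical)
            num = X_train[:, ~self.is_categorical_].astype(float)
            self.mins_ = num.min(axis=0)
            self.ranges_ = num.max(axis=0) - self.mins_
            self.ranges_[self.ranges_ == 0] = 1.0   # constant columns: avoid 0-division
            return self

        def scale_numeric(self, X):
            # scale numeric columns with the *training* statistics; test values
            # outside the training range simply fall outside [0, 1]
            num = X[:, ~self.is_categorical_].astype(float)
            return (num - self.mins_) / self.ranges_

    # usage: ranges fitted on X_train, applied unchanged to X_test
    X_train = np.array([[20, 1000.0, "a"], [60, 5000.0, "b"]], dtype=object)
    X_test = np.array([[40, 9000.0, "a"]], dtype=object)
    scaler = GowerRanges().fit(X_train, [False, False, True])
    print(scaler.scale_numeric(X_test))   # [[0.5, 2.0]]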

Looking forward to it in sklearn.

Someone who claims to have "borrowed ideas" from this thread has released a package on GitHub to calculate the Gower distance (similarity, technically). Speaking of distance and similarity, the example is identical to the one from @marcelobeckmann. I've only glanced at the code so far, but here's a glimpse:

From @marcelobeckmann's notebook:

    # This is to normalize the numeric values between 0 and 1.
    X_num = np.divide(X_num ,max_of_numeric,out=np.zeros_like(X_num), where=max_of_numeric!=0)

From "Michael Yan":

    # This is to normalize the numeric values between 0 and 1.
    Z_num = np.divide(Z_num ,num_max,out=np.zeros_like(Z_num), where=num_max!=0)

Hi guys, thanks for keeping an eye on this.

I'm glad people are taking the code and trying to improve it; that is the purpose of being open source, although some credit is appreciated.

Hopefully this code will be part of scikit-learn, if PR #9555 is accepted.

Best regards,

Marcelo Beckmann

Good luck in the process!!
