Scikit-learn: min_weight_fraction_leaf suggested improvements

Created on 28 Jun 2016  ·  3 Comments  ·  Source: scikit-learn/scikit-learn

Description

I've been using the min_weight_fraction_leaf parameter of DecisionTreeClassifier and RandomForestClassifier incorrectly and I think it's likely other people are doing the same thing as me.

For example, the documentation for min_weight_fraction_leaf in DecisionTreeClassifier says

The minimum weighted fraction of the input samples required to be at a leaf node.

It was really unclear to me what the docs meant by "weighted fraction of the input samples". Initially I thought it was a weighting based on the size of the classes or the values given by class_weight. I think a slight change in the parameter description could clear up this confusion. Perhaps something like

The minimum weighted fraction of the input samples required to be at a leaf node where weights are determined by sample_weight in the fit() method.

Furthermore, it appears that min_weight_fraction_leaf only applies if sample_weight is provided in the call to fit(). If sample_weight is not provided, min_weight_fraction_leaf is silently ignored. I think min_weight_fraction_leaf should still apply in that case, under the assumption that all samples are equally weighted, or a warning should be raised stating that min_weight_fraction_leaf will not be used because sample_weight was not provided.
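
To make the proposed behaviour concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: the helper names min_weight_leaf and split_is_allowed are hypothetical, and the sketch assumes that the threshold is the fraction multiplied by the total sample weight and that missing weights are treated as uniform.

```python
def min_weight_leaf(min_weight_fraction_leaf, n_samples, sample_weight=None):
    """Absolute weight every leaf must hold (hypothetical helper).

    Assumption: when sample_weight is None, every sample has weight 1.0,
    which is the behaviour proposed in this issue.
    """
    if sample_weight is None:
        sample_weight = [1.0] * n_samples
    if len(sample_weight) != n_samples:
        raise ValueError("sample_weight must have one entry per sample")
    return min_weight_fraction_leaf * sum(sample_weight)


def split_is_allowed(left_idx, right_idx, threshold, sample_weight=None):
    """Return True if both children carry at least `threshold` total weight."""
    def weight(indices):
        if sample_weight is None:
            # Uniform weights: a child's weight is just its sample count.
            return float(len(indices))
        return sum(sample_weight[i] for i in indices)

    return weight(left_idx) >= threshold and weight(right_idx) >= threshold


# Ten samples, no weights: each leaf needs 20% of the samples.
uniform_threshold = min_weight_leaf(0.2, n_samples=10)

# Same data, but the first sample carries most of the weight.
weights = [10.0] + [1.0] * 9
weighted_threshold = min_weight_leaf(0.2, n_samples=10, sample_weight=weights)

# A split that isolates sample 0 from the other nine samples.
left, right = [0], list(range(1, 10))
allowed_uniform = split_is_allowed(left, right, uniform_threshold)
allowed_weighted = split_is_allowed(left, right, weighted_threshold, weights)
```

Under these assumptions the same split is rejected with uniform weights (one sample is less than 20% of ten) but accepted once sample 0 is given a large weight, which is why the parameter only makes sense in relation to sample_weight.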

Versions

Darwin-15.5.0-x86_64-i386-64bit
Python 3.5.1 |Continuum Analytics, Inc.| (default, Dec 7 2015, 11:24:55)
[GCC 4.2.1 (Apple Inc. build 5577)]
NumPy 1.11.0
SciPy 0.17.1
Scikit-Learn 0.17.1

Also, I would love to make the changes I suggested (if they're deemed worthy), but I have little experience contributing to open-source libraries. I might need a bit of hand-holding if someone is willing to help me out.

Bug

All 3 comments

Please submit a PR

I think if min_weight_fraction_leaf is set and no sample_weight is provided, it should either raise an error or assume uniform weights. In that case it's a bit redundant with min_samples_leaf, but I think assuming uniform weights would still be better.

I think this is similar to min_samples_leaf: instead of requiring an absolute number of samples in each leaf node, min_weight_fraction_leaf lets you require a fraction of the samples (or of the total weight) in each leaf. Whether the model uses weights for the samples depends on class_weight.
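
The overlap with min_samples_leaf mentioned in the comments can be worked out with a little arithmetic. The following is only a sketch, assuming every sample has weight 1; the helper name equivalent_min_samples_leaf is hypothetical and not part of scikit-learn.

```python
import math


def equivalent_min_samples_leaf(min_weight_fraction_leaf, n_samples):
    """Smallest sample count per leaf that satisfies the fraction constraint.

    Assumption: all samples have weight 1, so a leaf's weight equals its
    sample count and the total weight equals n_samples.
    """
    # Round before taking the ceiling so that floating-point noise such as
    # 0.1 * 30 == 3.0000000000000004 does not push the result up by one.
    required = round(min_weight_fraction_leaf * n_samples, 9)
    # A leaf always holds at least one sample.
    return max(1, math.ceil(required))


# With 200 unit-weight samples, a fraction of 0.1 means 20 samples per leaf.
per_leaf = equivalent_min_samples_leaf(0.1, 200)
```

The two parameters only coincide in this uniform case. Once sample_weight is not constant, a leaf with few heavy samples can satisfy the weight constraint while violating a count constraint, and the other way around.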
