Scikit-learn: Make random_state descriptions more informative and refer to Glossary

Created on 29 Jan 2018  ·  60 Comments  ·  Source: scikit-learn/scikit-learn

We recently added a Glossary to our documentation, which describes common parameters among other things. We should now replace descriptions of random_state parameters to make them more concise and informative (see #10415). For example, instead of

    random_state : int, RandomState instance or None, optional, default: None
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

in both KMeans and MiniBatchKMeans, we might have:

KMeans:
    random_state : int, RandomState instance, default=None
        Determines random number generation for centroid initialization.
        Pass an int for reproducible results across multiple function calls.
        See :term:`Glossary <random_state>`.


MiniBatchKMeans:
    random_state : int, RandomState instance, default=None
        Determines random number generation for centroid initialization and
        random reassignment.
        Pass an int for reproducible results across multiple function calls.
        See :term:`Glossary <random_state>`.

The description should therefore focus on how random_state affects the algorithm, rather than repeating what the Glossary already explains.
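As a rough illustration of the proposed docstring style, here is a minimal sketch on a toy function. Both `resolve_random_state` and `init_centroids` are hypothetical names invented for this example; they are not scikit-learn functions. The helper follows the three cases spelled out in the old docstring (int, RandomState instance, None), and `np.random.mtrand._rand` is a private NumPy attribute used here only to stand in for "the RandomState instance used by `np.random`".

```python
import numbers

import numpy as np


def resolve_random_state(random_state=None):
    """Hypothetical helper mirroring the three cases of the old docstring."""
    if random_state is None:
        # Assumption: NumPy's global RandomState singleton (private attribute).
        return np.random.mtrand._rand
    if isinstance(random_state, numbers.Integral):
        # An int is used as the seed of a fresh generator.
        return np.random.RandomState(random_state)
    if isinstance(random_state, np.random.RandomState):
        # An existing generator is used as-is.
        return random_state
    raise ValueError(f"{random_state!r} cannot be used as a random_state")


def init_centroids(X, n_clusters, random_state=None):
    """Pick n_clusters distinct rows of X as initial centroids (toy example).

    Parameters
    ----------
    random_state : int, RandomState instance, default=None
        Determines random number generation for centroid initialization.
        Pass an int for reproducible results across multiple function calls.
        See :term:`Glossary <random_state>`.
    """
    rng = resolve_random_state(random_state)
    idx = rng.choice(X.shape[0], size=n_clusters, replace=False)
    return X[idx]


X = np.arange(20, dtype=float).reshape(10, 2)
first = init_centroids(X, 3, random_state=42)
second = init_centroids(X, 3, random_state=42)
```

With the same int passed twice, `first` and `second` contain the same rows, which is the behaviour the sentence "Pass an int for reproducible results across multiple function calls" describes.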

Contributors interested in making this change should initially take on one module at a time.

The list of estimators to be modified is the following:

List of files to modify using kwinata script

  • [x] [sklearn/dummy.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/dummy.py) - 59
  • [x] [sklearn/multioutput.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/multioutput.py) - 578, 738
  • [ ] [sklearn/kernel_approximation.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/kernel_approximation.py) - 41, 143, 470
  • [ ] [sklearn/multiclass.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/multiclass.py) - 687
  • [x] [sklearn/random_projection.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/random_projection.py) - 178, 245, 464, 586
  • [x] [sklearn/feature_extraction/image.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/image.py) - 368, 502
  • [x] [sklearn/utils/random.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/random.py) - 39 open PR
  • [x] [sklearn/utils/extmath.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py) - 185, 297
  • [x] [sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py) - 736, 918
  • [x] [sklearn/ensemble/_hist_gradient_boosting/binning.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_hist_gradient_boosting/binning.py) - 37, 112

  • [x] [sklearn/ensemble/_bagging.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_bagging.py) - 503, 902

  • [x] [sklearn/ensemble/_gb.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_gb.py) - 887, 1360
  • [x] [sklearn/ensemble/_forest.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_forest.py) - 965, 1282, 1559, 1868, 2103
  • [x] [sklearn/ensemble/_iforest.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_iforest.py) - 109
  • [ ] [sklearn/ensemble/_base.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_base.py) - 52
  • [x] [sklearn/ensemble/_weight_boosting.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_weight_boosting.py) - 188, 324, 479, 900, 1022
  • [x] [sklearn/decomposition/_truncated_svd.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/_truncated_svd.py) - 59 Open PR
  • [x] [sklearn/decomposition/_kernel_pca.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/_kernel_pca.py) - 79 Open PR
  • [x] [sklearn/decomposition/_dict_learning.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/_dict_learning.py) - 364, 485, 692, 1135, 1325 Open PR
  • [x] [sklearn/decomposition/_fastica.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/_fastica.py) - 205, 344 Open PR
  • [x] [sklearn/decomposition/_nmf.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/_nmf.py) - 290, 475, 966, 1159 Open PR
  • [x] [sklearn/decomposition/_pca.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/_pca.py) - 192 Open PR
  • [x] [sklearn/decomposition/_sparse_pca.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/_sparse_pca.py) - 82, 285 Open PR
  • [x] [sklearn/decomposition/_lda.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/_lda.py) - 60, 79, 225 Open PR
  • [x] [sklearn/decomposition/_factor_analysis.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/_factor_analysis.py) - 92 Open PR
  • [x] [sklearn/cluster/_kmeans.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/_kmeans.py) - 56, 241, 380, 583, 700, 1150, 1370
  • [x] [sklearn/cluster/_spectral.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/_spectral.py) - 41, 197, 313
  • [x] [sklearn/cluster/_bicluster.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/_bicluster.py) - 236, 383
  • [x] [sklearn/cluster/_mean_shift.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/_mean_shift.py) - 48
  • [x] [sklearn/preprocessing/_data.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/_data.py) - 2178, 2607
  • [x] [sklearn/impute/_iterative.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/impute/_iterative.py) - 125
  • [x] [sklearn/linear_model/_ransac.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_ransac.py) - 152 Open PR
  • [x] [sklearn/linear_model/_coordinate_descent.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_coordinate_descent.py) - 580, 860, 1313, 1487, 1665, 1851, 2016, 2192 Open PR
  • [x] [sklearn/linear_model/_sag.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_sag.py) - 154 Open PR
  • [x] [sklearn/linear_model/_perceptron.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_perceptron.py) - 55 Open PR
  • [x] [sklearn/linear_model/_passive_aggressive.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_passive_aggressive.py) - 76, 322 Open PR
  • [x] [sklearn/linear_model/_logistic.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_logistic.py) - 587, 924, 1100, 1658 Open PR
  • [x] [sklearn/linear_model/_base.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_base.py) - 65
  • [x] [sklearn/linear_model/_stochastic_gradient.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_stochastic_gradient.py) - 369, 811, 1419 Open PR
  • [x] [sklearn/linear_model/_theil_sen.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_theil_sen.py) - 243 Open PR
  • [x] [sklearn/linear_model/_ridge.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_ridge.py) - 325, 693, 853 Open PR
  • [x] [sklearn/tree/_classes.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_classes.py) - 653, 1033, 1322, 1552
  • [x] [sklearn/feature_selection/_mutual_info.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_selection/_mutual_info.py) - 226, 335, 414
  • [x] [sklearn/metrics/cluster/_unsupervised.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/cluster/_unsupervised.py) - 80
  • [x] [sklearn/svm/_classes.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/_classes.py) - 90, 312, 546, 752 Open PR
  • [x] [sklearn/svm/_base.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/_base.py) - 853 Open PR
  • [x] [sklearn/inspection/_permutation_importance.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/inspection/_permutation_importance.py) - 81
  • [x] [sklearn/gaussian_process/_gpr.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/gaussian_process/_gpr.py) - 109, 382
  • [x] [sklearn/gaussian_process/_gpc.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/gaussian_process/_gpc.py) - 110, 537
  • [x] [sklearn/manifold/_spectral_embedding.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/manifold/_spectral_embedding.py) - 171, 387
  • [x] [sklearn/manifold/_locally_linear.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/manifold/_locally_linear.py) - 146, 252, 584
  • [x] [sklearn/manifold/_t_sne.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/manifold/_t_sne.py) - 558
  • [x] [sklearn/manifold/_mds.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/manifold/_mds.py) - 51, 198, 314
  • [x] [sklearn/utils/_testing.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/_testing.py) - 521
  • [x] [sklearn/utils/__init__.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/__init__.py) - 478, 623
  • [x] [sklearn/datasets/_kddcup99.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_kddcup99.py) - 79
  • [x] [sklearn/datasets/_covtype.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_covtype.py) - 69
  • [x] [sklearn/datasets/_rcv1.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_rcv1.py) - 114
  • [x] [sklearn/datasets/_samples_generator.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_samples_generator.py) - 127, 323, 440, 531, 618, 688, 767, 904, 965, 1030, 1106, 1159, 1218, 1258, 1307, 1368, 1420, 1483, 1571, 1662
  • [x] [sklearn/datasets/_olivetti_faces.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_olivetti_faces.py) - 64
  • [x] [sklearn/datasets/_base.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_base.py) - 157
  • [x] [sklearn/datasets/_twenty_newsgroups.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_twenty_newsgroups.py) - 187
  • [x] [sklearn/mixture/_bayesian_mixture.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/_bayesian_mixture.py) - 166
  • [x] [sklearn/mixture/_base.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/_base.py) - 139
  • [x] [sklearn/mixture/_gaussian_mixture.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/_gaussian_mixture.py) - 504
  • [x] [sklearn/model_selection/_validation.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_validation.py) - 1006, 1176
  • [x] [sklearn/model_selection/_split.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_split.py) - 382, 588, 1091, 1196, 1250, 1390, 1492, 1605, 2049 Open PR
  • [x] [sklearn/model_selection/_search.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_search.py) - 207, 1299
  • [x] [sklearn/neural_network/_multilayer_perceptron.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/_multilayer_perceptron.py) - 782, 1174
  • [x] [sklearn/neural_network/_rbm.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/_rbm.py) - 59
  • [x] [sklearn/neighbors/_kde.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/_kde.py) - 233
  • [x] [sklearn/neighbors/_nca.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/_nca.py) - 112
  • [x] [sklearn/covariance/_robust_covariance.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/covariance/_robust_covariance.py) - 63, 233, 328, 545
  • [x] [sklearn/covariance/_elliptic_envelope.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/covariance/_elliptic_envelope.py) - 40
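The script used to generate the checklist above is not shown in this thread. A minimal sketch of how such a list of files and line numbers could be produced, assuming the docstrings follow the numpydoc `random_state : ...` convention, might look like the following. The function names `find_random_state_docs` and `format_checklist` are hypothetical, and the regular expression is deliberately crude.

```python
import re
import tempfile
from pathlib import Path

# Crude match for a numpydoc parameter line such as
# "    random_state : int, RandomState instance or None".
PARAM_LINE = re.compile(r"^\s*random_state\s*:")


def find_random_state_docs(root):
    """Return {relative path: [1-based line numbers]} for matches under root."""
    root = Path(root)
    hits = {}
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        numbers = [
            i for i, line in enumerate(lines, start=1) if PARAM_LINE.match(line)
        ]
        if numbers:
            hits[path.relative_to(root).as_posix()] = numbers
    return hits


def format_checklist(hits):
    """Render the hits as unchecked checklist entries."""
    return [
        f"[ ] {name} - {', '.join(map(str, nums))}" for name, nums in hits.items()
    ]


# Small self-contained demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dummy.py").write_text(
        "def f(random_state=None):\n"
        '    """Example.\n'
        "\n"
        "    Parameters\n"
        "    ----------\n"
        "    random_state : int, RandomState instance or None\n"
        "        Seed.\n"
        '    """\n',
        encoding="utf-8",
    )
    demo_hits = find_random_state_docs(tmp)
    demo_lines = format_checklist(demo_hits)
```

Pointing `find_random_state_docs` at a checkout of the `sklearn` package directory would give a starting list; each hit still needs a human to decide what the description should say.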
Documentation Moderate Sprint good first issue

All 60 comments

Hi @jnothman, Can I take this issue? Thanks

Claim a module/subpackage and have a go...

@jnothman I am sorry for being naive but can you elaborate about the module/submodule? I mean are you referring to a sub-package like Kmeans for instance?

I think what @jnothman means is just start with one file, for example sklearn/cluster/k_means_.py, update the random_state docstring as in the top post and open a PR.

a subpackage is something like sklearn.cluster

Thanks. Will do that and open a PR.

Hi! @jnothman

Would you also like to replace the following comments as seen in grid_search.py? They have an extra line as compared to the one shared by you.

random_state : int, RandomState instance or None, optional (default=None)
        Pseudo random number generator state used for random uniform sampling
        from lists of possible values instead of scipy.stats distributions.
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

I can take grid_search.py and k_means.py(KMeans).

Leave grid_search.py alone; it is deprecated. The idea is to minimise the
content that is repeated and already available in the glossary, so that we can
give users the most informative description of random_state's role in
the particular estimator.

Thanks @jnothman. Will I need to understand these algorithms before I can replace this random_state information?

You will need to understand the algorithms broadly, but not every detail of
their implementation. You will need to be able to find where random_state
is used, if the randomisation in the algorithm is not completely obvious.
In some cases, it may be appropriate to not even give much more detail than
just linking to the glossary; we'll have to see how it goes.
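One possible way to locate where random_state is actually consumed, sketched here with the standard-library `ast` module. The `SOURCE` snippet and the function `random_state_uses` are hypothetical and exist only for this illustration; real estimators may pass the value through several layers, so this is a starting point rather than a complete analysis.

```python
import ast

# Toy source standing in for a module under review (not real scikit-learn code).
SOURCE = '''
def make_blobs(n, random_state=None):
    rng = check_random_state(random_state)
    return rng.normal(size=n)

class Toy:
    def fit(self, X):
        rng = check_random_state(self.random_state)
        return self

def helper(x):
    return x + 1
'''


def _mentions_random_state(node):
    """True for a bare `random_state` name or an `obj.random_state` attribute."""
    if isinstance(node, ast.Name):
        return node.id == "random_state"
    if isinstance(node, ast.Attribute):
        return node.attr == "random_state"
    return False


def random_state_uses(source):
    """Map each function name to the lines in its body that read random_state."""
    uses = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines = sorted(
                {n.lineno for n in ast.walk(node) if _mentions_random_state(n)}
            )
            if lines:
                uses[node.name] = lines
    return uses


uses = random_state_uses(SOURCE)
```

Function parameters are `ast.arg` nodes, so the signature line itself is not reported; only the places where the value is read show up.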

Okay, thank you. I will start going through the algorithms slowly.

Regards,
Shivam Rastogi

I have opened a pull request #10614

Since @aby0 has not claimed the sklearn.cluster module yet, I would like to claim the whole module. Please let me know if I can work on it or if I should work on something else.

Any update guys? It is a long holiday for us so let me know if I can pick this.

I'll take the datasets module since I'm already poking around in the docstrings there for #10731.

I'm claiming the linear_model module. will raise a PR soon. #11900 raised.

Claiming decomposition module next.

Checklist of modules where this needs to be done:

  • [ ] developers
  • [ ] covariance
  • [x] decomposition
  • [ ] dummy.py
  • [ ] ensemble
  • [ ] feature_extraction
  • [ ] feature_selection
  • [ ] gaussian_process
  • [ ] kernel_approximation.py
  • [x] linear_model via #11900
  • [ ] manifold
  • [ ] metrics
  • [ ] mixture
  • [ ] model_selection
  • [ ] multiclass.py
  • [ ] multioutput.py
  • [ ] neighbors
  • [ ] neural_network
  • [ ] preprocessing
  • [ ] random_projection.py
  • [ ] svm
  • [ ] tree
  • [ ] utils

We had some trouble reaching consensus on how to strike the right balance
here, iirc

So do pay attention to the prior PRs merged above

@jnothman thanks! will update the PRs to mention the reproducibility when passing an int.

willing to take up all the other modules in another PR, once these have been reviewed...

I'm claiming covariance.

@BlackTeaAndCoffee please be aware, the doc string format is not yet finalised, discussions have been happening on the other PRs listed here. So you might wanna have a look too.

I am claiming feature_extraction

@jnothman, @NicolasHug, I just discovered #15222 and a number of PRs related to it that I hadn't taken into account in summarizing this one... some of them have never been reviewed... :(
To make things clear for sprints, I'm wondering if we can close one of those two issues: if yes, which one? That way I can avoid duplicating information. Thanks for your collaboration.

I wasn't aware of this issue (should have checked better), I'm happy to close https://github.com/scikit-learn/scikit-learn/issues/15222 in favor of this one

Following @jnothman's comment, maybe this issue could deserve a 'Moderate' label?

We want to work on ensemble/_hist_gradient_boosting/binning.
@mojc and me.

#wimlds

@anaisabeldhero and me want to work on manifold/*
#wimlds #SciKitLearnSprint

@daphn3k and I will work on sklearn/gaussian_process/

#wimlds #SciKitLearnSprint

We want to work on sklearn/preprocessing/_data.py - 2178, 2607
@rachelcjordan and @fabi-cast

#wimlds #SciKitLearnSprint

Me and @Malesche want to take the sklearn/inspection/_permutation_importance.py

WiMLDS

claiming sklearn/metrics/cluster/_unsupervised.py file! #wimlds

@daphn3k and I take also the covariance/* and neighbors/* #wimlds

claim:
sklearn/dummy.py - 59
sklearn/multioutput.py - 578, 738
sklearn/kernel_approximation.py - 41, 143, 470
sklearn/multiclass.py - 687
sklearn/random_projection.py - 178, 245, 464, 586

PSA: please use the original sentence

Pass an int for reproducible results across multiple function calls.

instead of what I'm seeing in PRs at the moment:

Use an int to make the randomness deterministic

which isn't correct, since the RNG is always deterministic regardless of what is passed

CC @adrinjalali since I think you're at the sprint
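To make the distinction concrete, here is a small sketch using NumPy only. The function `draw` is a hypothetical stand-in for any randomized routine and is not part of scikit-learn. It shows that an int gives the same result on every call, while a shared RandomState instance gives different results on successive calls, yet the whole sequence is still deterministic when the program is rerun from the same seed.

```python
import numpy as np


def draw(n, k, random_state):
    """Hypothetical randomized routine: pick k distinct indices out of n."""
    if isinstance(random_state, np.random.RandomState):
        # Reuse the caller's generator; its internal state advances per call.
        rng = random_state
    else:
        # Build a fresh generator from the int seed on every call.
        rng = np.random.RandomState(random_state)
    return rng.choice(n, size=k, replace=False)


# Passing an int: every call starts from the same state.
a1 = draw(1000, 5, 0)
a2 = draw(1000, 5, 0)

# Passing one RandomState instance: successive calls consume the stream.
rng = np.random.RandomState(0)
b1 = draw(1000, 5, rng)
b2 = draw(1000, 5, rng)

# Rebuilding the instance from the same seed replays the same stream.
rng_again = np.random.RandomState(0)
c1 = draw(1000, 5, rng_again)
c2 = draw(1000, 5, rng_again)
```

Here `a1` equals `a2`, `b1` differs from `b2`, and `(c1, c2)` reproduces `(b1, b2)`. In both cases the generator is deterministic; what changes is whether repeated calls give the same result.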

working on the neural network and mixture

Hi @NicolasHug this was meant to comment a PR I suppose... which one? :)

going to work on scikit-learn/sklearn/model_selection/_validation.py

@cmarmo That was a general comment for all PRs. I saw one and commented there, then saw a second one and figured out it was a pattern that would be better addressed at the source

Sorry @NicolasHug, my bad, I haven't found the comment easy to trace.

@NicolasHug Original sentence has been corrected in the commits from @anaisabeldhero and me

Me and @Olks claim sklearn/utils/extmath.py - 185, 297

Claim sklearn/ensemble/_iforest.py - 109

Claim sklearn/neural_network/_multilayer_perceptron.py - 782, 1174

Claim sklearn/ensemble/_weight_boosting.py - 188, 324, 479, 900, 1022

Claim sklearn/multioutput.py - 578, 738

Claim :
sklearn/mixture/_bayesian_mixture.py - 166
sklearn/mixture/_base.py - 139
sklearn/mixture/_gaussian_mixture.py - 504

Claim sklearn/ensemble/_gb.py - 887, 1360

Claim sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py - 736, 918

Claim sklearn/neural_network/_rbm.py - 59

Claim :

sklearn/svm/_classes.py - 90, 312, 546, 752
sklearn/svm/_base.py - 853

Claim:

sklearn/feature_selection/_mutual_info.py - 226, 335, 414
sklearn/metrics/cluster/_unsupervised.py - 80
sklearn/utils/_testing.py - 521
sklearn/utils/__init__.py - 478, 623

Claim :

sklearn/dummy.py - 59
sklearn/random_projection.py - 178, 245, 464, 586

@DatenBiene @GregoireMialon Thanks for all your contributions during last sprint. There are only 3 modules left unchecked !

Would you be interested / have time / have motivation to tackle those (no pressure !) ?

Hi Jérémie ! I'll try to have a look at it soon

Hi @jeremiedbb! I will try to finish the 3 remaining modules today 😃

Claim:

sklearn/kernel_approximation.py - 41, 143, 470
sklearn/multiclass.py - 687
sklearn/ensemble/_base.py - 52

Hi @jnothman and @jeremiedbb, it looks like all the files were modified. I would be happy to help if you find any remaining issues.

Thanks a lot @DatenBiene and all the contributors that worked to close this issue!
I think we can close this huge one!
Feel free to open new specific issues if something is still missing about random_state description.
