Scikit-learn: Add support for dropping collinear variables

Created on 3 Feb 2020  ·  3 Comments  ·  Source: scikit-learn/scikit-learn

Describe the workflow you want to enable

Can we add a feature to LinearRegression that removes exact collinearity from the data?

Describe your proposed solution

My proposal is to add an extra argument such as remove_collinearity. If the user sets it, we could remove exactly collinear variables using the rank of the matrix, or collinear variables using the variance inflation factor (VIF). This could save some time compared with switching to Ridge regression.
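As a rough illustration of the rank-based part of this proposal, here is a minimal NumPy sketch. The function name `drop_exactly_collinear` is hypothetical and not part of scikit-learn; it greedily keeps a column only if adding it increases the rank of the columns kept so far. It does not cover the VIF variant.

```python
import numpy as np


def drop_exactly_collinear(X, tol=None):
    """Hypothetical helper (not a scikit-learn API).

    Keeps a column only if it raises the rank of the columns kept so far,
    so any column that is an exact linear combination of earlier kept
    columns is dropped.
    """
    X = np.asarray(X, dtype=float)
    kept = []
    rank = 0
    for j in range(X.shape[1]):
        candidate = kept + [j]
        new_rank = np.linalg.matrix_rank(X[:, candidate], tol=tol)
        if new_rank > rank:
            kept.append(j)
            rank = new_rank
    return X[:, kept], kept


# Column 2 is column 0 + column 1, and column 3 is 2 * column 0.
X = np.array(
    [
        [1, 0, 1, 2],
        [0, 1, 1, 0],
        [1, 1, 2, 2],
        [2, 1, 3, 4],
    ]
)
X_reduced, kept = drop_exactly_collinear(X)
print(kept)
```

Which columns survive depends on column order, since the earliest column of each collinear group is the one kept. Recomputing the rank for every column is simple but not fast on wide data.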

New Feature

All 3 comments

It might be better to have this as a preprocessor in sklearn.feature_selection; that way it could be applied to multiple estimators. I'm not sure that exact collinearity is a frequent issue, though. Maybe an estimator with a user-defined feature correlation threshold?
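To make the idea concrete, a transformer along these lines might look like the sketch below. The class name `CorrelationThreshold`, its `threshold` parameter, and the `support_` attribute are all assumptions for illustration; this is not an existing scikit-learn estimator and not necessarily how the linked work implements it.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CorrelationThreshold(TransformerMixin, BaseEstimator):
    """Hypothetical transformer, not part of scikit-learn.

    Drops a feature when its absolute Pearson correlation with an
    earlier, already kept feature exceeds `threshold`.
    """

    def __init__(self, threshold=0.95):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Absolute correlation between columns; atleast_2d covers the
        # single-feature case, where corrcoef returns a scalar.
        corr = np.atleast_2d(np.abs(np.corrcoef(X, rowvar=False)))
        n_features = X.shape[1]
        keep = np.ones(n_features, dtype=bool)
        for j in range(n_features):
            # Compare feature j only against earlier features still kept.
            # Constant columns give NaN correlations, which never exceed
            # the threshold, so they are left in place.
            if np.any(corr[j, :j][keep[:j]] > self.threshold):
                keep[j] = False
        self.support_ = keep
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]


a = np.arange(10.0)
b = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0])
c = 2 * a + 1  # perfectly correlated with a
X = np.column_stack([a, b, c])

selector = CorrelationThreshold(threshold=0.95)
X_selected = selector.fit_transform(X)
print(selector.support_)
```

Because it follows the fit/transform convention, such a step could sit in a Pipeline in front of any estimator, which is the advantage over a LinearRegression-only argument.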

I'm not sure if it's something that is often done, as opposed to, say, feature clustering. The latter can be done in scikit-learn with cluster.FeatureAgglomeration, though the interface with a required n_clusters may not be ideal.
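For reference, a small sketch of the feature clustering route on synthetic data (the data and the choice of n_clusters=2 are assumptions for illustration). With default settings, FeatureAgglomeration groups features by Ward linkage on Euclidean distance and replaces each group with its mean.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)

# Two pairs of nearly identical features.
X = np.column_stack(
    [
        a,
        a + 0.01 * rng.normal(size=100),
        b,
        b + 0.01 * rng.normal(size=100),
    ]
)

# n_clusters has to be chosen up front, which is the interface concern.
agglo = FeatureAgglomeration(n_clusters=2)
X_reduced = agglo.fit_transform(X)

print(agglo.labels_)    # cluster index assigned to each input feature
print(X_reduced.shape)  # one pooled column per cluster
```

Unlike dropping columns, this pools each cluster into a new feature, so the outputs are averages rather than a subset of the original variables.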

cc @glemaitre

This is being worked on as a feature selection transformer here: https://github.com/scikit-learn/scikit-learn/pull/14698

Indeed, thanks. Closing this issue as a duplicate of https://github.com/scikit-learn/scikit-learn/issues/13405 then. If you have other comments or suggestions, @divyaprabha123, please comment there.
