Scikit-learn: DecisionTreeClassifier unknown label type: 'continuous-multioutput'

Created on 31 Oct 2016 · 18 Comments · Source: scikit-learn/scikit-learn

Description

DecisionTreeClassifier crashes with unknown label type: 'continuous-multioutput'. I've tried loading the CSV file using csv.reader, pandas.read_csv, and a few other approaches such as parsing it line by line.

Steps/Code to Reproduce

import os

import pandas as pd
from sklearn import tree

# _PATH is the directory that holds features.txt and target.txt (defined elsewhere)
feature_df = pd.read_csv(os.path.join(_PATH, 'features.txt'))
target_df = pd.read_csv(os.path.join(_PATH, 'target.txt'))
# Keep numeric columns only, then replace missing values with 0
feature_df = feature_df._get_numeric_data().fillna(0)
target_df = target_df._get_numeric_data().fillna(0)
clf = tree.DecisionTreeClassifier()
clf_o = clf.fit(feature_df, target_df)  # raises the ValueError shown under Actual Results

features.txt
target.txt

Expected Results

The error thrown should tell the user what is REALLY wrong, e.g. that their data set does not follow the assumptions (and what those assumptions are).

Actual Results

Traceback (most recent call last):
  File "D:\Piotr\Documents\uni\bap\BAPFingerprintLocalisation\main.py", line 19, in <module>
    decision_tree.treeClassification()
  File "D:\Piotr\Documents\uni\bap\BAPFingerprintLocalisation\code\decision_tree.py", line 56, in treeClassification
    clf_o = clf.fit(feature_df, target_df)
  File "C:\Python35\lib\site-packages\sklearn\tree\tree.py", line 182, in fit
    check_classification_targets(y)
  File "C:\Python35\lib\site-packages\sklearn\utils\multiclass.py", line 172, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous-multioutput'

Versions

Windows-10-10.0.14393-SP0
Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)]
NumPy 1.11.0
SciPy 0.17.1
Scikit-Learn 0.18

Update:

I've reduced the number of target variables to one, just to simplify things:

clf_o = clf.fit(feature_df, target_df.iloc[:, 1])

Output: Unknown label type: 'continuous'
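
Both messages can be reproduced without the attached files. The following is a minimal sketch that assumes synthetic random data in place of features.txt and target.txt; the array shapes and variable names are illustrative and do not come from the original report.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for the CSV contents (an assumption, not the real data)
rng = np.random.RandomState(0)
X = rng.rand(20, 3)        # 20 samples, 3 numeric features
y_multi = rng.rand(20, 2)  # two real-valued target columns
y_single = y_multi[:, 1]   # a single real-valued target column

messages = []
for y in (y_multi, y_single):
    try:
        DecisionTreeClassifier().fit(X, y)
    except ValueError as exc:
        # The classifier rejects real-valued targets in both cases
        messages.append(str(exc))
        print(exc)
```

The exact wording of the message may differ between scikit-learn versions, but the reported target type should be 'continuous-multioutput' for the two-column case and 'continuous' for the single column.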

Source

KamodaP

👍1

All 18 comments

You should be using DecisionTreeRegressor

jnothman on 31 Oct 2016

👍11 🎉1
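
As a rough illustration of this suggestion, the sketch below fits a DecisionTreeRegressor on real-valued targets with two columns. The data is synthetic and the shapes are assumptions made for the example only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 50 samples, 4 features, 2 real-valued targets per sample
rng = np.random.RandomState(0)
X = rng.rand(50, 4)
y = rng.rand(50, 2)

# The regressor accepts continuous, multi-output targets directly
reg = DecisionTreeRegressor(random_state=0)
reg.fit(X, y)

pred = reg.predict(X[:5])
print(pred.shape)  # one row per sample, one column per target
```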

Again, the documentation lacks information on how many classes classification can handle. I can see that my dataset has way too many classes, but your error message mentioned 'labels', which was confusing enough that I forgot what the dataset actually looks like and started meddling with the methods of passing datasets.
I've updated the issue and ask you to reopen it.

KamodaP on 1 Nov 2016

Classification targets should be represented as integers or as strings. You can ask Pandas to read the target data in as a string and you'll be fine.

jnothman on 1 Nov 2016

👎5 👍1
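
A minimal sketch of the "read the target as a string" route follows. The inline CSV text, the column names f1, f2 and floor, and the values are hypothetical placeholders for the reporter's files.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical in-memory files standing in for features.txt and target.txt
features_csv = io.StringIO("f1,f2\n0.1,1.0\n0.9,0.2\n0.2,0.8\n0.8,0.1\n")
target_csv = io.StringIO("floor\n1.5\n2.5\n1.5\n2.5\n")

feature_df = pd.read_csv(features_csv)
# dtype=str keeps "1.5" and "2.5" as text, so they are treated as class labels
target_df = pd.read_csv(target_csv, dtype=str)

y = target_df["floor"].to_numpy(dtype=object)
clf = DecisionTreeClassifier(random_state=0).fit(feature_df, y)

print(clf.classes_)
print(clf.predict(feature_df))
```

Note that every distinct string becomes its own class, so this only makes sense when the target really is categorical.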

Or use a DecisionTreeRegressor

jnothman on 1 Nov 2016

👎11 👍1

That's not my problem

KamodaP on 1 Nov 2016

See 'Expected Results' section of my issue

KamodaP on 1 Nov 2016

You're right that the error message could be more useful, but the documentation for fit does say "class labels in classification". Feel free to submit a clearer issue about needing to document the expected data type for classification ys, and another for raising appropriate error messages when float data is passed as y to a classifier.

jnothman on 1 Nov 2016

👎2

Let me cite the whole section of documentation documenting parameter y of function fit in class DecisionTreeClassifier

The target values (class labels in classification, real numbers in regression). In the regression case, use dtype=np.float64 and order='C' for maximum efficiency.

That does not say that classes have a cap. What makes a target variable get labeled continuous? How many classes have to be there for it to be considered a regression-type target variable? If it talks about regression, then can I do regression using DecisionTreeClassifier? Why not? Etc...

As for your previous comment:

Classification targets should be represented as integers or as strings. You can ask Pandas to read the target data in as a string and you'll be fine.

Does that mean that classes can't be represented as floats? Or as dicts? Lists? Tuples? Longs? Doubles? bytes? I know it is logical to represent classes as integers or strings, since they should not be plenty. But do they have to? What are the limitations?

And as to creating a new ticket, isn't that useless since we've had quite a talk in here? Creating a new ticket just to explain the same thing to someone else?

KamodaP on 1 Nov 2016

It's not the number of classes. It's the use of non-integers and non-strings.

I like the issue descriptions to be focused. Your concern as raised here seemed to be more of a usage problem.

And please don't hassle me about what I suggest. This isn't the only issue I'm dealing with.

jnothman on 1 Nov 2016
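
The distinction drawn in this comment can be checked with sklearn.utils.multiclass.type_of_target, the function that produces the type string in the error. The sketch below uses made-up arrays; as far as I can tell, floats that hold only whole numbers are still treated as discrete, and it is fractional values that trigger 'continuous'.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Illustrative targets (not taken from the issue's data)
examples = {
    "integers": np.array([0, 1, 2, 1]),
    "strings": np.array(["a", "b", "c", "a"]),
    "whole-number floats": np.array([0.0, 1.0, 2.0, 1.0]),
    "fractional floats": np.array([0.5, 1.5, 2.5, 1.5]),
    "2-D fractional floats": np.array([[0.5, 1.5], [2.5, 3.5]]),
}

# Map each example to the target type scikit-learn infers for it
kinds = {name: type_of_target(y) for name, y in examples.items()}
for name, kind in kinds.items():
    print("%s: %s" % (name, kind))
```

The number of distinct values does not enter into the decision; only the presence of non-integer floats does.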

You don't have to solve it today. I'm only trying to get the problem of poor error descriptions and poor documentation for the tree classifier and regressor onto the agenda as a task for future releases.

KamodaP on 1 Nov 2016

For the error message would "Unsupported output type: 'continuous-multioutput'" be better? That is the _real_ issue. Also see #7809 for the docstring.

amueller on 1 Nov 2016

🎉2 👍2
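
Until the library message changes, a user-side check can give a more explicit error. The helper below, explain_classification_target, is hypothetical and not part of scikit-learn; it is only a sketch built on type_of_target, with wording chosen for this example.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target


def explain_classification_target(y):
    """Hypothetical helper: return the target type or raise a descriptive error."""
    kind = type_of_target(y)
    if kind.startswith("continuous"):
        raise ValueError(
            "Unsupported target type for classification: %r. "
            "The target variable y contains non-integer floats; "
            "use a regressor or convert y to discrete class labels." % kind
        )
    return kind


# Discrete targets pass through and report their type
discrete_kind = explain_classification_target(np.array([0, 1, 0, 1]))
print(discrete_kind)

# Real-valued targets raise the descriptive error
try:
    explain_classification_target(np.array([0.5, 1.5, 2.5]))
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```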

That's better. But I still don't understand why you won't name it as it is. The literature mostly calls these 'target' variables, and 'output' could be mistaken for function output. The exception was thrown from the function 'check_classification_targets', so even you call it a 'target' variable, and yet you want to call it 'label' or 'output'. I'm not a member of the scikit-learn team, so you will do as you please, but I would recommend using the words 'Target variable' in the docstring and the error message. And I ask you to describe somewhere the rules that the input data (or target) should follow. A short sentence would do: 'Target variable (parameter y) has to be int or str'.

KamodaP on 2 Nov 2016

👍1

Maybe it's worth mentioning in/alongside the new section (45cb11d / #7519) on multiclass and multilabel fitting in the tutorial. Or maybe this all belongs in a section of the user guide on data representation conventions, describing input/output formats for all standard methods...?

jnothman on 2 Nov 2016

I've scanned the document and it seems a good place to mention those conventions. Also, if you don't want to clutter the error messages too much, then putting that information in the user guide isn't a bad idea either.
Well, the final solution (if any) will be as you wish it to be. I'm just saying that the idea seems OK, but you have your conventions. I won't make you do anything.

KamodaP on 2 Nov 2016

'Target variable (parameter y) has to be int or str' is not right, because we support multi-label and multi-output (multi-target) problems.

amueller on 17 Nov 2016

Also, arbitrary objects that are not floats are supported as class labels, they don't have to be integers or strings.

amueller on 17 Nov 2016

Passing training_data_X and training_scores_Y to the fit method as they are causes this error. To avoid it, we can convert and encode the labels:

from sklearn import preprocessing
from sklearn.utils.multiclass import type_of_target

# y_train holds the original (float) training targets
print(type_of_target(y_train))  # target type before encoding
lab_enc = preprocessing.LabelEncoder()
y_train = lab_enc.fit_transform(y_train)  # each distinct value becomes an integer class
print(y_train)
print(type_of_target(y_train))  # target type after encoding
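
For completeness, here is a self-contained sketch of the same encoding idea carried through to prediction, with the predicted codes mapped back to the original values via inverse_transform. The tiny X_train and y_train arrays are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Invented training data: one feature, float-valued targets used as classes
X_train = np.array([[0.1], [0.2], [0.8], [0.9]])
y_train = np.array([1.5, 1.5, 2.5, 2.5])

# Encode each distinct target value as an integer class code
lab_enc = LabelEncoder()
y_encoded = lab_enc.fit_transform(y_train)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_encoded)

# Predict integer codes, then map them back to the original values
pred_codes = clf.predict(np.array([[0.15], [0.85]]))
pred_values = lab_enc.inverse_transform(pred_codes)
print(pred_values)
```

This treats every distinct float as a separate category and ignores the ordering between values, so it is a fit only when the target is genuinely categorical; otherwise the regressor suggested earlier in the thread is the more natural choice.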