Scikit-learn: Can't get feature names after ColumnTransformer

Created on 6 Nov 2018  ·  13 comments  ·  Source: scikit-learn/scikit-learn

When I use ColumnTransformer with pipelines to preprocess different kinds of columns (numeric, categorical, text), I cannot get the feature names of the final transformed data, which makes debugging hard.

Here is the code:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')

data = pd.read_csv(titanic_url)

target = data.pop('survived')

numeric_columns = ['age', 'sibsp', 'parch']
category_columns = ['pclass', 'sex', 'embarked']
text_columns = ['name', 'home.dest']

numeric_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
category_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
text_transformer = Pipeline(steps=[
    ('cntvec', CountVectorizer())
])

preprocesser = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_columns),
    ('category', category_transformer, category_columns),
    # only a single column name is passed here, not the whole text_columns list
    ('text', text_transformer, text_columns[0])
])

preprocesser.fit_transform(data)
  1. preprocesser.get_feature_names() raises an error:
    AttributeError: Transformer numeric (type Pipeline) does not provide get_feature_names.
  2. In ColumnTransformer, text_transformer can only process a single string column name (e.g. 'Sex'), not a list of strings such as text_columns.

All 13 comments

This is not an issue with ColumnTransformer.

  1. This one is about Pipeline. eli5 implements a feature-names function that can support Pipeline.

Re 2. You are probably right that it is unfriendly to have no clean way of applying a text vectorizer to each column. I'm not sure how that could be achieved cleanly, unless we simply start supporting multiple input columns in CountVectorizer etc.

Thanks for your kind reply!
As far as I know, when a column is preprocessed with a method that can turn one column into several, such as OneHotEncoder or CountVectorizer, the new column names can be obtained by calling get_feature_names on the transformer in the last step of the pipeline. For methods that do not create new columns, the raw column names can simply be used.

import numpy as np
from sklearn.pipeline import Pipeline

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    # the last transformer is ColumnTransformer's 'remainder', so skip it
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            # for a Pipeline, use its last step
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names()
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names, np.ndarray):  # ndarray of names: convert to a list
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    return col_name

μœ„μ˜ μ½”λ“œλ₯Ό μ‚¬μš©ν•˜μ—¬ preprocesser 의 μ—΄ 이름을 얻을 수 μžˆμŠ΅λ‹ˆλ‹€.
이 μ½”λ“œκ°€ 이 μ§ˆλ¬Έμ„ ν•΄κ²°ν•©λ‹ˆκΉŒ?
eli5 ν˜„μž¬ ν•΄λ‹Ή κΈ°λŠ₯을 찾을 수 μ—†μŠ΅λ‹ˆλ‹€. eli5의 λͺ…μ‹œμ  예제 λ˜λŠ” API에 λŒ€ν•œ 링크λ₯Ό μ œκ³΅ν•  수 μžˆμŠ΅λ‹ˆκΉŒ?
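One cheap way to probe whether a helper like this can be trusted is to compare the number of names it returns with the number of columns in the transformed matrix. The sketch below is an illustration rather than anything proposed in the thread: check_feature_names is a hypothetical helper, and the shape tuple stands in for the .shape of whatever preprocesser.fit_transform(data) returns.

```python
def check_feature_names(names, transformed_shape):
    """Raise if the number of names differs from the number of output columns.

    `transformed_shape` is the (n_rows, n_columns) shape of the transformed
    data, e.g. the `.shape` of whatever `fit_transform` returned.
    """
    n_columns = transformed_shape[1]
    if len(names) != n_columns:
        raise ValueError(
            f"{len(names)} feature names for {n_columns} transformed columns"
        )
    return list(names)


# Stand-in values for illustration only (not real output from the Titanic data).
names = ["age", "sibsp", "parch", "x0_1", "x0_2", "x0_3"]
checked = check_feature_names(names, (10, 6))
```

A mismatch shows that the names are wrong, but a match does not prove they are right: a step that reorders or recombines columns keeps the count while changing what each column means.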

Regarding eli5, see transform_feature_names (used in explain_weights).

1 is a duplicate of #6425, right? I'm losing sleep over that one.
I think supporting multiple text columns is pretty easy with ColumnTransformer. It's not the prettiest code, but you can just add a CountVectorizer for each text column.

And your snippet doesn't actually solve the issue, because a transformer lacking get_feature_names doesn't mean you can just use the column names.
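The "one CountVectorizer per text column" idea could look roughly like this. per_column_transformers is a hypothetical helper, not part of scikit-learn; it only builds the (name, transformer, column) tuples that ColumnTransformer accepts, and it takes the vectorizer class as an argument so that the sketch itself has no dependencies.

```python
def per_column_transformers(make_vectorizer, text_columns, prefix="text"):
    """Build one (name, transformer, column) tuple per text column.

    Each column is given as a single string rather than a list, and each
    tuple gets its own freshly created vectorizer instance.
    """
    return [
        (f"{prefix}_{column}", make_vectorizer(), column)
        for column in text_columns
    ]


class DummyVectorizer:
    """Stand-in used only to keep this sketch runnable without scikit-learn."""


entries = per_column_transformers(DummyVectorizer, ["name", "home.dest"])
for name, transformer, column in entries:
    print(name, type(transformer).__name__, column)
```

With scikit-learn installed, passing CountVectorizer as make_vectorizer and appending the result to the transformers list from the question should vectorize name and home.dest separately. A text column that contains missing values would presumably need to be filled first, since the vectorizer expects strings.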

Yes, it would be good to get the feature names, so that after feeding a pandas DataFrame to the preprocessing pipeline we know exactly what happened in the generated data.

OK, closing as a duplicate.

I made a small improvement so that one-hot columns get names of the form rawname_value.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    # the last transformer is ColumnTransformer's 'remainder', so skip it
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(raw_col_name, str):
            raw_col_name = [raw_col_name]
        if isinstance(transformer_in_columns[1], Pipeline):
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names()
            if isinstance(transformer, OneHotEncoder):
                # default one-hot names look like 'x0_male': swap the 'x<i>' prefix
                # for the raw column name; split once so values may contain '_'
                exchange_name = [name.split("_", 1) for name in names]
                names = [raw_col_name[int(pre_name[1:])] + "_" + value
                         for pre_name, value in exchange_name]
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names, np.ndarray):  # ndarray of names: convert to a list
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    return col_name
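The renaming step on its own, separated from any scikit-learn objects, might be sketched as follows. It assumes the encoder's default names have the form x<i>_<category>, where <i> is the position of the input column; rename_onehot_features is a hypothetical name and the category values are made up.

```python
def rename_onehot_features(default_names, raw_columns):
    """Replace the 'x<i>' prefix of each one-hot name with the raw column name."""
    renamed = []
    for name in default_names:
        # Split only on the first underscore so that categories such as
        # 'New_York' keep their own underscores.
        prefix, _, category = name.partition("_")
        column_index = int(prefix[1:])  # 'x2' -> 2
        renamed.append(f"{raw_columns[column_index]}_{category}")
    return renamed


default_names = ["x0_1", "x0_2", "x1_female", "x1_male", "x2_New_York"]
print(rename_onehot_features(default_names, ["pclass", "sex", "embarked"]))
# ['pclass_1', 'pclass_2', 'sex_female', 'sex_male', 'embarked_New_York']
```

Splitting on every underscore would break as soon as a category value itself contains one, which is why the sketch splits only once.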

νŒŒμ΄ν”„λΌμΈμ—μ„œ add_indicator와 ν•¨κ»˜ simpleimputerλ₯Ό μ μš©ν•˜λ©΄ μ–΄λ–»κ²Œ λ κΉŒμš”? 이 방법은 μž‘λ™ν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€.

νŒŒμ΄ν”„λΌμΈμ—μ„œ add_indicator와 ν•¨κ»˜ simpleimputerλ₯Ό μ μš©ν•˜λ©΄ μ–΄λ–»κ²Œ λ κΉŒμš”? 이 방법은 μž‘λ™ν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€.

It would be nice to have a get_feature_names method for this configuration.

νŒŒμ΄ν”„λΌμΈμ—μ„œ add_indicator와 ν•¨κ»˜ simpleimputerλ₯Ό μ μš©ν•˜λ©΄ μ–΄λ–»κ²Œ λ κΉŒμš”? 이 방법은 μž‘λ™ν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€.

λ‹€μŒμ€ 단기 μ†”λ£¨μ…˜μ— λŒ€ν•œ μ €μ˜ κΈ°μ—¬μž…λ‹ˆλ‹€. λ‹€λ₯Έ λͺ¨λ“  λ°°μ—΄ μœ ν˜•μ„ λͺ©λ‘μœΌλ‘œ κ°•μ œ λ³€ν™˜ν•˜κ³  SimpleImputer(add_indicate=True)의 경우λ₯Ό μ²˜λ¦¬ν•©λ‹ˆλ‹€. λ˜ν•œ 쑰금 더 μž₯ν™©ν•©λ‹ˆλ‹€.

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []

    # the last transformer is ColumnTransformer's 'remainder', so skip it
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        print('\n\ntransformer: ', transformer_in_columns[0])

        raw_col_name = transformer_in_columns[2]
        # a single column name must not be split into characters by list()
        raw_col_name = [raw_col_name] if isinstance(raw_col_name, str) else list(raw_col_name)

        if isinstance(transformer_in_columns[1], Pipeline):
            # if pipeline, get the last transformer
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]

        try:
            if isinstance(transformer, OneHotEncoder):
                names = list(transformer.get_feature_names(raw_col_name))

            elif isinstance(transformer, SimpleImputer) and transformer.add_indicator:
                # indices of the input columns that get a missing-value indicator
                missing_indicator_indices = transformer.indicator_.features_
                missing_indicators = [raw_col_name[idx] + '_missing_flag'
                                      for idx in missing_indicator_indices]

                names = raw_col_name + missing_indicators

            else:
                names = list(transformer.get_feature_names())

        except AttributeError:
            names = raw_col_name

        print(names)

        col_name.extend(names)

    return col_name
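The naming rule for SimpleImputer(add_indicator=True) can be isolated into a small function. This is a sketch under the same assumption the snippet above makes, namely that the imputer outputs the original columns first and then one indicator column per index in indicator_.features_; imputer_output_names is a hypothetical name.

```python
def imputer_output_names(raw_columns, indicator_indices, suffix="_missing_flag"):
    """Names for the imputed columns followed by their missing-value indicators."""
    raw_columns = list(raw_columns)
    indicators = [raw_columns[idx] + suffix for idx in indicator_indices]
    return raw_columns + indicators


# Suppose only 'age' (index 0) had missing values when the imputer was fitted.
print(imputer_output_names(["age", "sibsp", "parch"], [0]))
# ['age', 'sibsp', 'parch', 'age_missing_flag']
```

The indices refer to positions within the columns handed to this one transformer, not to positions in the full DataFrame.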

FYI, I wrote some code and a blog post about how to extract feature names from complex Pipelines and ColumnTransformers. The code is an improvement over the previous post. https://towardsdatascience.com/extracting-plotting-feature-names-importance-from-scikit-learn-pipelines-eb5bfa6a31f4

@kylegilde Thanks for the great article and code. It works like a charm. I wrestled for hours with KernelSHAP and alibi for global explanations, but the one-hot transformer did not work without handle_unknown='ignore'.

λ‹€μŒμ€ μ•Œλ¦Όμ˜ 열을 ν¬ν•¨ν•˜λŠ” @pjgao 의 μŠ€λ‹ˆνŽ«μ˜ λ‹€λ₯Έ λ²„μ „μž…λ‹ˆλ‹€.

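import numpy as np
from sklearn.pipeline import Pipeline

def get_columns_from_transformer(column_transformer, input_columns):
    col_name = []

    # the last transformer is ColumnTransformer's 'remainder', handled separately below
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names(raw_col_name)
        except TypeError:  # e.g. CountVectorizer.get_feature_names() takes no arguments
            names = transformer.get_feature_names()
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names, np.ndarray):  # ndarray of names: convert to a list
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)

    # the remainder entry holds the indices of the columns no transformer claimed
    _, remainder_transformer, remainder_columns = column_transformer.transformers_[-1]

    # dropped remainder columns do not appear in the output
    if not (isinstance(remainder_transformer, str) and remainder_transformer == 'drop'):
        for col_idx in remainder_columns:
            col_name.append(input_columns[col_idx])

    return col_name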
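All of the snippets in this thread slice off the last entry of transformers_ on the assumption that it is the remainder. That entry appears to be present only when some columns are left over, so a more defensive sketch looks for the name 'remainder' instead of relying on position. split_remainder and remainder_names are hypothetical helpers that work on any list of (name, transformer, columns) tuples.

```python
def split_remainder(fitted_transformers):
    """Separate the regular entries from an optional trailing 'remainder' entry."""
    entries = list(fitted_transformers)
    if entries and entries[-1][0] == "remainder":
        return entries[:-1], entries[-1]
    return entries, None


def remainder_names(remainder, input_columns):
    """Raw names of the remainder columns; empty if absent or dropped."""
    if remainder is None:
        return []
    _, transformer, column_indices = remainder
    if isinstance(transformer, str) and transformer == "drop":
        return []
    return [input_columns[idx] for idx in column_indices]


# Plain tuples standing in for a fitted ColumnTransformer's `transformers_`.
fitted = [("numeric", object(), ["age"]), ("remainder", "passthrough", [1, 2])]
regular, remainder = split_remainder(fitted)
print([name for name, _, _ in regular])                       # ['numeric']
print(remainder_names(remainder, ["age", "ticket", "fare"]))  # ['ticket', 'fare']
```

If every input column is claimed by some transformer, slicing with [:-1] would silently discard the last real transformer, whereas this check leaves it in place.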

What do you think about adding similar functionality to the core codebase?

이 νŽ˜μ΄μ§€κ°€ 도움이 λ˜μ—ˆλ‚˜μš”?
0 / 5 - 0 λ“±κΈ‰