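I am using the sklearn-pandas DataFrameMapper in an sklearn pipeline. To evaluate feature contributions in a feature-union pipeline, I would like to measure the coefficients of the estimator (logistic regression). In the code example below, the three text columns a, b and c are vectorized and selected for X_train: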
import pandas as pd
import numpy as np
import pickle
from sklearn_pandas import DataFrameMapper
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
np.random.seed(1)
data = pd.read_csv('https://pastebin.com/raw/WZHwqLWr')
#data.columns
X = data.copy()
y = data.result
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
mapper = DataFrameMapper([
    ('a', CountVectorizer()),
    ('b', CountVectorizer()),
    ('c', CountVectorizer())
])
pipeline = Pipeline([
    ('featurize', mapper),
    ('clf', LogisticRegression(random_state=1))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(abs(pipeline.named_steps['clf'].coef_))
#array([[0.3567311 , 0.3567311 , 0.46215153, 0.10542043, 0.3567311 ,
# 0.46215153, 0.46215153, 0.3567311 , 0.3567311 , 0.3567311 ,
# 0.3567311 , 0.46215153, 0.46215153, 0.3567311 , 0.46215153,
# 0.3567311 , 0.3567311 , 0.3567311 , 0.3567311 , 0.46215153,
# 0.46215153, 0.46215153, 0.3567311 , 0.3567311 ]])
print(len(pipeline.named_steps['clf'].coef_[0]))
#24
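Unlike a regular analysis with several features, which typically returns as many coefficients as there are features, the DataFrameMapper pipeline returns a larger coefficient matrix.
a) How should the 24 coefficients in total be interpreted? b) What is the best way to access the coef_ value for each feature ("a", "b", "c")?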
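Expected output:
a: coef_score (float)
b: coef_score (float)
c: coef_score (float)
Solution: Although your original DataFrame does contain only the three feature columns a, b and c, the DataFrameMapper() class applies sklearn's CountVectorizer() separately to the word corpus of each of the columns a, b and c. Each distinct word in a column becomes its own feature, so a total of 24 features is created and passed on to your LogisticRegression() classifier. That is why you get an unlabeled list of 24 values when you access the classifier's .coef_ attribute.
However, it is fairly straightforward to match each of those 24 coef_ scores to the original column (a, b or c) it came from, and then to compute the mean coefficient score for each column.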
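The matching and averaging step could be sketched roughly as follows. This is a minimal illustration, not the only way to do it: the helper name mean_abs_coef_per_column and the toy numbers are made up for this example, and it assumes the mapper concatenates the vectorized columns in the order they are listed (a, then b, then c), so that each column owns one contiguous block of coefficients.

```python
import numpy as np


def mean_abs_coef_per_column(coef, feature_counts):
    """Average |coef| over the block of features generated from each column.

    coef: 1-D sequence of coefficients, e.g. clf.coef_[0]
    feature_counts: list of (column_name, n_features) pairs, given in the
        order in which the feature blocks were concatenated (an assumption
        of this sketch)
    """
    coef = np.abs(np.asarray(coef, dtype=float))
    total = sum(n for _, n in feature_counts)
    if total != coef.shape[0]:
        raise ValueError(f"expected {total} coefficients, got {coef.shape[0]}")

    scores, start = {}, 0
    for name, n in feature_counts:
        # Slice out this column's block and average its absolute coefficients.
        scores[name] = float(coef[start:start + n].mean())
        start += n
    return scores


# Toy numbers for illustration only (not the values from the question).
toy_coef = [0.5, -0.5, 1.0, 2.0, -1.0, 3.0]
toy_counts = [("a", 2), ("b", 1), ("c", 3)]
print(mean_abs_coef_per_column(toy_coef, toy_counts))
# {'a': 0.5, 'b': 1.0, 'c': 2.0}
```

For the pipeline in the question, coef would be pipeline.named_steps['clf'].coef_[0], and the number of features per column should equal the vocabulary size of that column's fitted CountVectorizer, i.e. len(vectorizer.vocabulary_). One way to get at those sizes is to keep a reference to each vectorizer before passing it to the mapper, assuming the mapper fits the instances you pass in rather than copies. The three counts should add up to 24.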







