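I am using the sklearn-pandas DataFrameMapper in an sklearn pipeline. To evaluate how much each feature contributes in the feature-union pipeline, I would like to look at the coefficients of the estimator (logistic regression). In the code example below, the three text columns a, b and c are vectorized and selected for X_train: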
import pandas as pd
import numpy as np
import pickle
from sklearn_pandas import DataFrameMapper
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
np.random.seed(1)
data = pd.read_csv('https://pastebin.com/raw/WZHwqLWr')
#data.columns
X = data.copy()
y = data.result
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
mapper = DataFrameMapper([
('a', CountVectorizer()),
('b', CountVectorizer()),
('c', CountVectorizer())
])
pipeline = Pipeline([
('featurize', mapper),
('clf', LogisticRegression(random_state=1))
])
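# Fit the featurizer and the classifier on the training split
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# Absolute coefficient of every vectorized feature
# (coef_ has a single row for a binary target, one row per class otherwise)
print(abs(pipeline.named_steps['clf'].coef_))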







