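Step 5: Split the corpus into training and test sets. For this we need the train_test_split function from sklearn.model_selection (the older sklearn.cross_validation module was deprecated and then removed in scikit-learn 0.20). The split can be 70/30, 80/20, 85/15 or 75/25; here I choose 75/25 via "test_size".
X is the bag of words, and y is the 0/1 label marking each text as positive or negative.
# Splitting the dataset into
# the Training set and Test set
from sklearn.model_selection import train_test_split
# experiment with "test_size"
# to get better results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)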
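To see what a 75/25 split actually produces, here is a minimal sketch on made-up data. The arrays X_toy and y_toy are hypothetical stand-ins for the real bag of words and labels, and the random_state and stratify arguments are optional extras added here only to make the result repeatable and class-balanced; they are not part of the step above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 documents, 3 vocabulary words,
# and a balanced set of 0/1 labels (not the article's corpus).
X_toy = np.array([[1, 0, 2], [0, 1, 0], [3, 0, 1], [0, 2, 0],
                  [1, 1, 0], [0, 0, 1], [2, 0, 0], [0, 1, 1]])
y_toy = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# test_size=0.25 keeps 25% of the rows for testing.
# random_state fixes the shuffle; stratify keeps the
# class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)

print(X_tr.shape, X_te.shape)
```

With 8 rows, 6 go to training and 2 to testing. On a small or imbalanced corpus, stratifying is worth trying so the test set does not end up with only one class.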
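Step 6: Fit a predictive model (here, a random forest)
- Since random forest is an ensemble model (made up of many trees) from sklearn.ensemble, import the RandomForestClassifier class
- Use 501 trees ("n_estimators") with "entropy" as the criterion
- Fit the model with the .fit() method, passing X_train and y_train as arguments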
# Fitting Random Forest Classification
# to the Training set
from sklearn.ensemble import RandomForestClassifier
# n_estimators can be said as number of
# trees, experiment with n_estimators
# to get better results
model = RandomForestClassifier(n_estimators = 501,
                               criterion = 'entropy')
model.fit(X_train, y_train)
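The same fit can be tried end to end on a tiny made-up dataset. This is only a sketch: X_toy, y_toy and the name forest are hypothetical, the two columns stand in for counts of one "positive" and one "negative" word, and random_state is an optional addition to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy bag of words: column 0 counts a "positive" word,
# column 1 counts a "negative" word.
X_toy = np.array([[2, 0], [3, 0], [1, 0], [0, 2], [0, 3], [0, 1]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

# Same settings as the step above, plus a fixed seed.
forest = RandomForestClassifier(n_estimators=501,
                                criterion='entropy',
                                random_state=0)
forest.fit(X_toy, y_toy)

# After fitting, the forest holds one decision tree per estimator.
print(len(forest.estimators_))
print(forest.classes_)

# Classify two unseen count vectors.
preds = forest.predict(np.array([[4, 0], [0, 4]]))
print(preds)
```

An odd number of trees such as 501 avoids ties in a two-class majority vote, although scikit-learn averages the per-tree class probabilities, so the exact count matters less than simply having enough trees.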







