7.1 投票分类器

硬投票器:

聚合每个分类器的预测,然后将得票最多的结果作为预测类别。

当预测器尽可能互相独立时,集成方法的效果最优。获得多种分类器的方法之一就是使用不同的算法进行训练。这会增加它们犯不同类型错误的机会,从而提升集成的准确率。

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
plt.style.use("ggplot")
[3]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
[4]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
len(X_train), len(X_test)
[4]:
(375, 125)
[5]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
[6]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting='hard')

voting_clf.fit(X_train, y_train)
[6]:
VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])
[8]:
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
LogisticRegression 0.864
RandomForestClassifier 0.88
SVC 0.896
VotingClassifier 0.904

如果所有分类器都能够估算出类别的概率(即有predict_proba()方法),那么你可以将概率在所有单个分类器上平均,然后让Scikit-Learn给出平均概率最高的类别作为预测。这被称为软投票法

默认情况下,SVC类是不行的,所以你需要将其超参数probability设置为True(这会导致SVC使用交叉验证来估算类别概率,减慢训练速度,并会添加predict_proba()方法)。如果修改上面代码为使用软投票,你会发现投票分类器的准确率达到91.2%以上!

[9]:
# Soft Voting
log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma='scale', probability=True, random_state=42)

voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting='soft')
voting_clf.fit(X_train, y_train)
[9]:
VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
                             ('rf', RandomForestClassifier(random_state=42)),
                             ('svc', SVC(probability=True, random_state=42))],
                 voting='soft')
[10]:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92
[ ]: