7.7 Q&A

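If you have trained five different models on the exact same training data, and they all achieve 95% accuracy, is there any chance that you can combine these models to get better results? If so, how? If not, why?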

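You can try combining them into a voting ensemble, which will often give better results. It works better if the five models are very different from each other. It is better still if they are trained on different training instances (that is the whole point of bagging and pasting ensembles). Even if they are not, the ensemble will usually still help as long as the models are very different.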
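As a rough illustration of why combining can help, the sketch below computes the accuracy of a majority vote over five classifiers. It rests on idealized assumptions that are not part of the exercise: a binary task, each classifier correct 95% of the time, and fully independent errors. Models trained on the same data make correlated errors, so real gains are usually smaller.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent binary classifiers is correct."""
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


ensemble_acc = majority_vote_accuracy(5, 0.95)
print(f"{ensemble_acc:.4f}")  # 0.9988
```

The function name `majority_vote_accuracy` is a hypothetical helper introduced only for this illustration.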

What is the difference between hard and soft voting classifiers?

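A hard voting classifier simply counts the votes of each classifier in the ensemble and picks the class that gets the most votes. A soft voting classifier computes the average estimated probability for each class across all classifiers and picks the class with the highest average. Soft voting often performs better than hard voting because it gives more weight to highly confident votes. However, it only works if every classifier can estimate class probabilities (for example, Scikit-Learn's SVC class must be created with probability=True).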
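The difference is easiest to see on a single instance where the two schemes disagree. The following is a minimal pure-Python sketch; the class names and probabilities are made up for illustration, and `hard_vote` and `soft_vote` are hypothetical helpers, not Scikit-Learn functions.

```python
from collections import Counter

# Hypothetical class probabilities from three classifiers for one instance
# (classes are "A" and "B"; the numbers are made up for illustration).
probas = [
    {"A": 0.55, "B": 0.45},
    {"A": 0.60, "B": 0.40},
    {"A": 0.05, "B": 0.95},  # one very confident classifier
]


def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(p, key=p.get) for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas):
    """Average the probabilities per class; the highest average wins."""
    classes = probas[0].keys()
    averages = {c: sum(p[c] for p in probas) / len(probas) for c in classes}
    return max(averages, key=averages.get)


print(hard_vote(probas))  # A (two votes to one)
print(soft_vote(probas))  # B (average 0.60 vs. 0.40)
```

Two lukewarm votes for A outnumber one confident vote for B under hard voting, while soft voting lets the confident classifier tip the average.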

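Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, random forests, or stacking ensembles?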

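For a bagging ensemble, distributing training across multiple servers speeds it up effectively, because each predictor in the ensemble is independent of the others. The same holds for pasting ensembles and random forests. In a boosting ensemble, however, each predictor is built on the results of the previous one, so training is necessarily sequential and you gain nothing by distributing it across servers. In a stacking ensemble, the predictors within a given layer are independent of each other, so they can be trained in parallel on multiple servers, but the predictors in one layer can only be trained after all the predictors in the previous layer have been trained.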
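The dependency structure can be sketched with toy code. In the example below, threads stand in for servers and each "predictor" is just the mean of a bootstrap sample, so `train_predictor` and the shrinkage loop are hypothetical stand-ins rather than real bagging or boosting implementations.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train_predictor(seed, data):
    """Toy 'predictor': the mean of a bootstrap sample (stands in for a real model)."""
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]  # sampling with replacement
    return sum(sample) / len(sample)


data = [float(x) for x in range(100)]

# Bagging, pasting, random forests: predictors are independent, so they can be
# trained concurrently (here in threads; the same idea applies to servers).
with ThreadPoolExecutor(max_workers=4) as pool:
    bagged = list(pool.map(lambda s: train_predictor(s, data), range(8)))

# Boosting: predictor i needs the residual left by predictors 0..i-1,
# so this loop cannot be parallelized across predictors.
residual, boosted = list(data), []
for _ in range(8):
    step = 0.5 * sum(residual) / len(residual)  # fit half of the mean residual
    boosted.append(step)
    residual = [r - step for r in residual]

print(len(bagged), len(boosted))  # 8 8
```

Because every bagged predictor depends only on its own seed and the data, the parallel result is identical to training them one after another.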

What is the benefit of out-of-bag evaluation?

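With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated on the instances it was not trained on (they were held out). This gives a fairly unbiased evaluation of the ensemble without the need for an additional validation set. As a result, more instances are available for training, and the ensemble can perform slightly better.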
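To get a feel for how many instances each predictor leaves out, the sketch below draws one bootstrap sample (sampling with replacement, as in bagging) and measures the share of instances that were never picked. The helper `oob_fraction` is hypothetical, and the sample size of 10,000 is an arbitrary choice.

```python
import random


def oob_fraction(n_instances: int, seed: int = 42) -> float:
    """Fraction of instances never drawn in one bootstrap sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_instances) for _ in range(n_instances)}
    return 1 - len(drawn) / n_instances


frac = oob_fraction(10_000)
print(f"{frac:.3f}")  # close to 1/e (about 0.368); the exact value depends on the seed
```

Roughly a third of the training set is therefore available to evaluate each predictor at no extra cost.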

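What makes Extra-Trees more random than regular random forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular random forests?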

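When a tree in a random forest is grown, only a random subset of the features is considered for splitting at each node. This is true for Extra-Trees as well, but they go one step further: rather than searching for the best possible threshold for each feature, as regular decision trees do, they use a random threshold for each feature. This extra randomness acts like a form of regularization: if a random forest overfits the training data, Extra-Trees may perform better. Moreover, since Extra-Trees do not search for the best possible thresholds, they are much faster to train than random forests. When making predictions, however, they are neither faster nor slower than random forests.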
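The two ways of choosing a threshold for a single feature can be contrasted in a few lines. This is a simplified sketch with made-up data and hypothetical helper names; it covers one feature at one node only, whereas a real Extra-Trees implementation draws a random threshold for each candidate feature and then keeps the best of those random splits.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Regular tree: try every midpoint between sorted distinct values, keep the best."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))


def random_threshold(values, rng):
    """Extra-Trees style: draw one threshold uniformly within the feature's range."""
    return rng.uniform(min(values), max(values))


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]
rng = random.Random(42)

t_best = best_threshold(values, labels)
t_rand = random_threshold(values, rng)
print(t_best, split_impurity(values, labels, t_best))  # 3.5 0.0
```

The exhaustive search evaluates every candidate split, while the random draw costs a single call, which is where the training speedup comes from. The random split is never purer than the best one.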

If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak, and how?

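If your AdaBoost ensemble underfits the training data, you can try increasing the number of estimators or reducing the regularization hyperparameters of the base estimator. You can also try slightly increasing the learning rate.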
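In Scikit-Learn these knobs map onto `n_estimators`, `learning_rate`, and the hyperparameters of the base estimator. The sketch below uses a synthetic moons dataset and arbitrary values chosen only for illustration. The base estimator is passed positionally because its keyword name differs between Scikit-Learn versions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# Baseline that is deliberately too weak: few stumps and a small learning rate.
weak = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=5, learning_rate=0.1, random_state=42,
).fit(X, y)

# Against underfitting: more estimators, a less constrained base estimator
# (deeper trees), and a somewhat higher learning rate.
stronger = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=200, learning_rate=0.5, random_state=42,
).fit(X, y)

print(weak.score(X, y), stronger.score(X, y))  # training accuracy of each
```

Training accuracy is the relevant measure here because the question is about underfitting; a validation set is still needed to check that the stronger settings do not overfit.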

If your gradient boosting ensemble overfits the training set, should you increase or decrease the learning rate?

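If your gradient boosting ensemble overfits the training set, you should try decreasing the learning rate. You could also use early stopping to find the right number of predictors (you probably have too many).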
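Both remedies can be combined in Scikit-Learn's `GradientBoostingClassifier`, which supports early stopping through `n_iter_no_change` and `validation_fraction`. The dataset and the specific values below are assumptions picked for a quick demonstration, not recommended settings.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, used only for illustration.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

gbrt = GradientBoostingClassifier(
    learning_rate=0.05,       # lower than the default of 0.1
    n_estimators=500,         # upper bound; early stopping may use fewer
    n_iter_no_change=10,      # stop after 10 iterations without improvement
    validation_fraction=0.2,  # held-out share of the training data used for the check
    random_state=42,
)
gbrt.fit(X, y)

print(gbrt.n_estimators_)  # number of trees actually fitted
```

A lower learning rate generally needs more trees to fit the data, so setting a generous `n_estimators` and letting early stopping choose the cutoff is a convenient pairing.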

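Load the MNIST dataset (introduced in Chapter 3) and split it into a training set, a validation set, and a test set (for example, use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train several classifiers, such as a random forest classifier, an Extra-Trees classifier, and an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?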

[1]:

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('ggplot')

[3]:

from sklearn.datasets import fetch_openml

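# as_frame=False returns NumPy arrays (recent scikit-learn versions default to a DataFrame)
mnist = fetch_openml('mnist_784', version=1, as_frame=False)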

[5]:

mnist.details

[5]:

{'id': '554',
 'name': 'mnist_784',
 'version': '1',
 'format': 'ARFF',
 'upload_date': '2014-09-29T03:28:38',
 'licence': 'Public',
 'url': 'https://www.openml.org/data/v1/download/52667/mnist_784.arff',
 'file_id': '52667',
 'default_target_attribute': 'class',
 'tag': ['AzurePilot',
  'OpenML-CC18',
  'OpenML100',
  'study_1',
  'study_123',
  'study_41',
  'study_99',
  'vision'],
 'visibility': 'public',
 'status': 'active',
 'processing_date': '2018-10-03 21:23:30',
 'md5_checksum': '0298d579eb1b86163de7723944c7e495'}

[8]:

mnist.target = mnist.target.astype(np.uint8)
mnist.target

[8]:

array([5, 0, 4, ..., 4, 5, 6], dtype=uint8)

[9]:

from sklearn.model_selection import train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=10000, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)

len(X_train), len(X_val), len(X_test)

[9]:

(50000, 10000, 10000)

[10]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, random_state=42)
mlp_clf = MLPClassifier(random_state=42)
tree_clf = DecisionTreeClassifier(max_leaf_nodes=16, random_state=42)

[11]:

estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf, tree_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(random_state=42)
Training the ExtraTreesClassifier(random_state=42)
Training the LinearSVC(max_iter=100, random_state=42, tol=20)
Training the MLPClassifier(random_state=42)
Training the DecisionTreeClassifier(max_leaf_nodes=16, random_state=42)

[12]:

# Show each classifier's accuracy on the validation set
for estimator in estimators:
    print(estimator, estimator.score(X_val, y_val))

RandomForestClassifier(random_state=42) 0.9692
ExtraTreesClassifier(random_state=42) 0.9715
LinearSVC(max_iter=100, random_state=42, tol=20) 0.859
MLPClassifier(random_state=42) 0.9629
DecisionTreeClassifier(max_leaf_nodes=16, random_state=42) 0.64

[13]:

from sklearn.ensemble import VotingClassifier

named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
    ("decision_tree_clf", tree_clf),
]

voting_clf = VotingClassifier(named_estimators)

[14]:

voting_clf.fit(X_train, y_train)

[14]:

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(random_state=42)),
                             ('extra_trees_clf',
                              ExtraTreesClassifier(random_state=42)),
                             ('svm_clf',
                              LinearSVC(max_iter=100, random_state=42, tol=20)),
                             ('mlp_clf', MLPClassifier(random_state=42)),
                             ('decision_tree_clf',
                              DecisionTreeClassifier(max_leaf_nodes=16,
                                                     random_state=42))])

[15]:

voting_clf.score(X_val, y_val)

[15]:

0.9687

[16]:

[_estimator.score(X_val, y_val) for _estimator in voting_clf.estimators_]

[16]:

[0.9692, 0.9715, 0.859, 0.9629, 0.64]

[17]:

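# Mark the DecisionTreeClassifier for removal. This only updates `estimators`;
# the already fitted `estimators_` list is unchanged (see the next cells).
# Newer scikit-learn releases expect the string "drop" instead of None.
voting_clf.set_params(decision_tree_clf=None)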

[17]:

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(random_state=42)),
                             ('extra_trees_clf',
                              ExtraTreesClassifier(random_state=42)),
                             ('svm_clf',
                              LinearSVC(max_iter=100, random_state=42, tol=20)),
                             ('mlp_clf', MLPClassifier(random_state=42)),
                             ('decision_tree_clf', None)])

[18]:

voting_clf.estimators

[18]:

[('random_forest_clf', RandomForestClassifier(random_state=42)),
 ('extra_trees_clf', ExtraTreesClassifier(random_state=42)),
 ('svm_clf', LinearSVC(max_iter=100, random_state=42, tol=20)),
 ('mlp_clf', MLPClassifier(random_state=42)),
 ('decision_tree_clf', None)]

[19]:

voting_clf.estimators_

[19]:

[RandomForestClassifier(random_state=42),
 ExtraTreesClassifier(random_state=42),
 LinearSVC(max_iter=100, random_state=42, tol=20),
 MLPClassifier(random_state=42),
 DecisionTreeClassifier(max_leaf_nodes=16, random_state=42)]

[20]:

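# Index 2 is LinearSVC, so this removes the SVM (not the decision tree) from the
# fitted estimators. LinearSVC has no predict_proba, so soft voting needs it gone.
del voting_clf.estimators_[2]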

[22]:

voting_clf.estimators_

[22]:

[RandomForestClassifier(random_state=42),
 ExtraTreesClassifier(random_state=42),
 MLPClassifier(random_state=42),
 DecisionTreeClassifier(max_leaf_nodes=16, random_state=42)]

[23]:

# Hard voting accuracy, measured on the training set
voting_clf.score(X_train, y_train)

[23]:

0.9993

[24]:

# Switch to soft voting (no refit needed: the fitted estimators are reused)
voting_clf.voting = "soft"

[25]:

voting_clf.score(X_train, y_train)

[25]:

0.9997

[26]:

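# Soft voting is slightly ahead on the training set (0.9997 vs. 0.9993),
# so evaluate the soft voting ensemble on the test set
voting_clf.voting = "soft"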
voting_clf.score(X_test, y_test)

[26]:

0.9682

[27]:

[_estimator.score(X_test, y_test) for _estimator in voting_clf.estimators_]

[27]:

[0.9645, 0.9691, 0.9603, 0.6402]

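Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the first-layer classifiers it forms a stacking ensemble. Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. How does it compare to the voting classifier you trained earlier?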

[28]:

len(estimators)

[28]:

5

[29]:

X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.uint8)

for index, _estimator in enumerate(estimators):
    X_val_predictions[:, index] = _estimator.predict(X_val)

[30]:

X_val_predictions

[30]:

array([[5, 5, 5, 5, 5],
       [8, 8, 8, 8, 8],
       [2, 2, 3, 2, 1],
       ...,
       [7, 7, 7, 7, 4],
       [6, 6, 6, 6, 6],
       [7, 7, 7, 7, 7]], dtype=uint8)

[31]:

rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_predictions, y_val)

[31]:

RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)

[32]:

rnd_forest_blender.oob_score_

[32]:

0.9702

At this point you could fine-tune this blender, or use cross-validation to pick a better one.

[34]:

mlp_blender = MLPClassifier(random_state=42)
mlp_blender.fit(X_val_predictions, y_val)

[34]:

MLPClassifier(random_state=42)

[35]:

from sklearn.model_selection import cross_val_score
scores_rnd = cross_val_score(rnd_forest_blender, X_val_predictions, y_val, cv=3)
scores_mlp = cross_val_score(mlp_blender, X_val_predictions, y_val, cv=3)

scores_rnd, scores_mlp

[35]:

(array([0.96880624, 0.96879688, 0.96789679]),
 array([0.96130774, 0.95529553, 0.95949595]))

[36]:

np.mean(scores_rnd), np.mean(scores_mlp)

[36]:

(0.9684999693730619, 0.9586997392000748)

The cross-validation scores above show that rnd_forest_blender performs better.

Now that a blender has been trained, we can use it to make predictions with the stacking approach.

[38]:

X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, _estimator in enumerate(estimators):
    X_test_predictions[:, index] = _estimator.predict(X_test)
X_test_predictions

[38]:

array([[8., 8., 8., 8., 8.],
       [4., 4., 4., 4., 9.],
       [8., 8., 8., 6., 9.],
       ...,
       [3., 3., 3., 3., 2.],
       [8., 8., 3., 8., 5.],
       [3., 3., 3., 3., 3.]], dtype=float32)

[39]:

from sklearn.metrics import accuracy_score

y_pred = rnd_forest_blender.predict(X_test_predictions)
y_pred

[39]:

array([8, 4, 8, ..., 3, 8, 3], dtype=uint8)

[40]:

accuracy_score(y_test, y_pred)

[40]:

0.9667
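As an alternative to building the blender by hand, Scikit-Learn (version 0.22 and later) provides `StackingClassifier`. The sketch below is not the approach used above: it trains the blender on cross-validated predictions rather than on a separate validation set, and by default it feeds class probabilities instead of predicted labels to the blender. The small 8x8 digits dataset, the two base estimators, and the logistic regression blender are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small 8x8 digits dataset so the sketch runs quickly (not the full MNIST).
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ],
    # The blender is trained on out-of-fold predictions of the base estimators.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stacking_clf.fit(X_tr, y_tr)
print(stacking_clf.score(X_te, y_te))  # test accuracy of the stacked ensemble
```

Using out-of-fold predictions lets the base estimators see the whole training set, at the cost of fitting each of them several times.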