8.7 Exercise 9

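Load the MNIST dataset and split it into a training set and a test set (use the first 60,000 instances for training and the remaining 10,000 for testing). Train a random forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%. Train a new random forest classifier on the reduced dataset and see how long it takes. Was training faster? Next, evaluate this classifier on the test set. How does it compare to the previous classifier?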

[1]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
[2]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)
[3]:
X_train = mnist['data'][:60000]
y_train = mnist['target'][:60000]

X_test = mnist['data'][60000:]
y_test = mnist['target'][60000:]
[4]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
[5]:
import time

t0 = time.time()
rnd_clf.fit(X_train, y_train)
t1 = time.time()
[6]:
print("Training time: {:.2f}s".format(t1 - t0))
Training time: 42.47s
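The `t0`/`t1` pattern above is repeated for every model in this exercise. As a minimal sketch, it could be wrapped in a small helper. The names `timed` and `timings` are hypothetical, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals. The summation is only a stand-in for a call such as `rnd_clf.fit(X_train, y_train)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the elapsed seconds in results[label]."""
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    try:
        yield
    finally:
        # Record the duration even if the timed block raises an exception.
        results[label] = time.perf_counter() - start


timings = {}  # hypothetical container: label -> elapsed seconds

with timed("demo", timings):
    # Stand-in workload; in the notebook this would be a model's fit() call.
    total = sum(i * i for i in range(100_000))

print("Training time: {:.2f}s".format(timings["demo"]))
```

Collecting the durations in one dictionary makes it easy to compare the models side by side at the end of the exercise.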
[7]:
from sklearn.metrics import accuracy_score
y_pred = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred)
[7]:
0.9705
[8]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)
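When `n_components` is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below shows one way to inspect what was kept. It assumes scikit-learn's small bundled digits dataset (`load_digits`, 64 features) as a stand-in for MNIST so that it runs quickly. The names `X_digits` and `pca_demo` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small 8x8 digits dataset bundled with scikit-learn, used only as a stand-in for MNIST.
X_digits, _ = load_digits(return_X_y=True)

# Keep enough principal components to preserve at least 95% of the variance.
pca_demo = PCA(n_components=0.95)
X_digits_reduced = pca_demo.fit_transform(X_digits)

# n_components_ is the number of components PCA actually kept.
print("features:", X_digits.shape[1], "->", pca_demo.n_components_)

# The kept components together should explain at least 95% of the variance.
preserved = pca_demo.explained_variance_ratio_.sum()
print("variance preserved: {:.4f}".format(preserved))
```

Running the same two checks on `pca` after `fit_transform(X_train)` would show how far the 784 MNIST features were reduced.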
[11]:
rnd_clf2 = RandomForestClassifier(n_estimators=100, random_state=42)
t0 = time.time()
rnd_clf2.fit(X_train_reduced, y_train)
t1 = time.time()

print("Training time: {:.2f}s".format(t1 - t0))
Training time: 102.57s

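As you can see, training did not get faster. It took more than twice as long (102.57 s versus 42.47 s). Dimensionality reduction does not always lead to faster training: training time depends on the dataset, the model, and the training algorithm.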

[12]:
X_test_reduced = pca.transform(X_test)

y_pred = rnd_clf2.predict(X_test_reduced)

accuracy_score(y_test, y_pred)
[12]:
0.9481

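Because dimensionality reduction discards some useful information, a slight drop in performance is normal. Here, however, the drop is substantial (accuracy falls from 0.9705 to 0.9481): PCA not only slowed training down, it also reduced the model's performance.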

Next, let's try softmax regression:

[15]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", random_state=42)

t0 = time.time()
log_clf.fit(X_train, y_train)
t1 = time.time()

print("Training time: {:.2f}s".format(t1 - t0))
Training time: 19.54s
/Users/zhangxiaomin/Works/A05-Developments/PythonWorkplace/ml_scikit_torch/venv/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
[17]:
y_pred = log_clf.predict(X_test)

accuracy_score(y_test, y_pred)
[17]:
0.9255
[18]:
log_clf2 = LogisticRegression(multi_class="multinomial", solver="lbfgs", random_state=42)

t0 = time.time()
log_clf2.fit(X_train_reduced, y_train)
t1 = time.time()

print("Training time: {:.2f}s".format(t1 - t0))
Training time: 5.25s
/Users/zhangxiaomin/Works/A05-Developments/PythonWorkplace/ml_scikit_torch/venv/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
[19]:
y_pred = log_clf2.predict(X_test_reduced)
accuracy_score(y_test, y_pred)
[19]:
0.9201

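Accuracy dropped only slightly (0.9255 to 0.9201), while training was about 3.7 times faster (19.54 s versus 5.25 s). Note that both runs stopped at lbfgs's iteration limit (see the ConvergenceWarning), so these timings compare the same capped number of iterations rather than fully converged models. In short: PCA can speed up training considerably, but, as the random forest showed, it is not guaranteed to.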
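Both logistic regression runs above ended with a ConvergenceWarning, and the warning text itself suggests two remedies: raising `max_iter` or scaling the data. The sketch below combines the two in a pipeline. It assumes scikit-learn's bundled digits dataset as a fast stand-in for MNIST, and `max_iter=1000` is an arbitrary illustrative value, not a tuned one. `multi_class` is left at its default here, on the assumption that recent scikit-learn versions already fit a multinomial model with lbfgs for multiclass targets.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 1,797 8x8 digit images bundled with scikit-learn.
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42
)

# Standardize the features, then fit with a higher iteration budget
# (max_iter=1000 is an assumed value for illustration).
scaled_log_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42),
)
scaled_log_clf.fit(X_tr, y_tr)

test_accuracy = scaled_log_clf.score(X_te, y_te)
print("Test accuracy: {:.4f}".format(test_accuracy))
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no test-set statistics leak into training. On MNIST, a higher iteration budget would also change the measured training times, so the comparison with and without PCA would need to be rerun.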