5.5 课后练习10¶

使用SVM回归预测California的房价

[1]:

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing.keys()

[1]:

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

[2]:

housing['DESCR']

[2]:

'.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 20640\n\n    :Number of Attributes: 8 numeric, predictive attributes and the target\n\n    :Attribute Information:\n        - MedInc        median income in block\n        - HouseAge      median house age in block\n        - AveRooms      average number of rooms\n        - AveBedrms     average number of bedrooms\n        - Population    block population\n        - AveOccup      average house occupancy\n        - Latitude      house block latitude\n        - Longitude     house block longitude\n\n    :Missing Attribute Values: None\n\nThis dataset was obtained from the StatLib repository.\nhttp://lib.stat.cmu.edu/datasets/\n\nThe target variable is the median house value for California districts.\n\nThis dataset was derived from the 1990 U.S. census, using one row per census\nblock group. A block group is the smallest geographical unit for which the U.S.\nCensus Bureau publishes sample data (a block group typically has a population\nof 600 to 3,000 people).\n\nIt can be downloaded/loaded using the\n:func:`sklearn.datasets.fetch_california_housing` function.\n\n.. topic:: References\n\n    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,\n      Statistics and Probability Letters, 33 (1997) 291-297\n'

[3]:

# 拆分为训练集和测试集
from sklearn.model_selection import train_test_split
X = housing['data']
y = housing['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[4]:

# 不要忘记归一化
from sklearn.preprocessing import StandardScaler

[5]:

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

先使用LinearSVR试试

[6]:

from sklearn.svm import LinearSVR

[7]:

lin_svr = LinearSVR(random_state=42)
lin_svr.fit(X_train_scaled, y_train)

[7]:

LinearSVR(random_state=42)

[8]:

from sklearn.metrics import mean_squared_error

[9]:

y_pred = lin_svr.predict(X_train_scaled)
mse = mean_squared_error(y_train, y_pred)
mse

[9]:

0.9641780189948642

看看RMSE的值

[10]:

import numpy as np
np.sqrt(mse)

[10]:

0.9819256687727764

在训练集中，targets是千为单位的。RMSE可以粗略的看出期望的误差：在这个模型中我们快成看出期望的误差是1000美金。模型的表现可以说是相当的烂。下面我们看看RBF核是不是可以表现的好一点

[11]:

from sklearn.svm import SVR
from scipy.stats import reciprocal, uniform
from sklearn.model_selection import RandomizedSearchCV

[12]:

param_distributions = {'gamma':reciprocal(0.001, 0.1), 'C':uniform(1, 10)}
rnd_search_cv = RandomizedSearchCV(SVR(), param_distributions=param_distributions, n_iter=10, verbose=2, cv=3, random_state=42)
rnd_search_cv.fit(X_train_scaled, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] C=4.745401188473625, gamma=0.07969454818643928 ..................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ... C=4.745401188473625, gamma=0.07969454818643928, total=   8.2s
[CV] C=4.745401188473625, gamma=0.07969454818643928 ..................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    8.2s remaining:    0.0s
[CV] ... C=4.745401188473625, gamma=0.07969454818643928, total=   9.3s
[CV] C=4.745401188473625, gamma=0.07969454818643928 ..................
[CV] ... C=4.745401188473625, gamma=0.07969454818643928, total=   8.9s
[CV] C=8.31993941811405, gamma=0.015751320499779724 ..................
[CV] ... C=8.31993941811405, gamma=0.015751320499779724, total=   8.0s
[CV] C=8.31993941811405, gamma=0.015751320499779724 ..................
[CV] ... C=8.31993941811405, gamma=0.015751320499779724, total=   7.7s
[CV] C=8.31993941811405, gamma=0.015751320499779724 ..................
[CV] ... C=8.31993941811405, gamma=0.015751320499779724, total=   7.1s
[CV] C=2.560186404424365, gamma=0.002051110418843397 .................
[CV] .. C=2.560186404424365, gamma=0.002051110418843397, total=   6.8s
[CV] C=2.560186404424365, gamma=0.002051110418843397 .................
[CV] .. C=2.560186404424365, gamma=0.002051110418843397, total=   6.6s
[CV] C=2.560186404424365, gamma=0.002051110418843397 .................
[CV] .. C=2.560186404424365, gamma=0.002051110418843397, total=   6.2s
[CV] C=1.5808361216819946, gamma=0.05399484409787431 .................
[CV] .. C=1.5808361216819946, gamma=0.05399484409787431, total=   7.1s
[CV] C=1.5808361216819946, gamma=0.05399484409787431 .................
[CV] .. C=1.5808361216819946, gamma=0.05399484409787431, total=   7.2s
[CV] C=1.5808361216819946, gamma=0.05399484409787431 .................
[CV] .. C=1.5808361216819946, gamma=0.05399484409787431, total=   6.2s
[CV] C=7.011150117432088, gamma=0.026070247583707663 .................
[CV] .. C=7.011150117432088, gamma=0.026070247583707663, total=   7.1s
[CV] C=7.011150117432088, gamma=0.026070247583707663 .................
[CV] .. C=7.011150117432088, gamma=0.026070247583707663, total=   7.3s
[CV] C=7.011150117432088, gamma=0.026070247583707663 .................
[CV] .. C=7.011150117432088, gamma=0.026070247583707663, total=   7.3s
[CV] C=1.2058449429580245, gamma=0.0870602087830485 ..................
[CV] ... C=1.2058449429580245, gamma=0.0870602087830485, total=   6.3s
[CV] C=1.2058449429580245, gamma=0.0870602087830485 ..................
[CV] ... C=1.2058449429580245, gamma=0.0870602087830485, total=   6.4s
[CV] C=1.2058449429580245, gamma=0.0870602087830485 ..................
[CV] ... C=1.2058449429580245, gamma=0.0870602087830485, total=   6.4s
[CV] C=9.324426408004218, gamma=0.0026587543983272693 ................
[CV] . C=9.324426408004218, gamma=0.0026587543983272693, total=   6.5s
[CV] C=9.324426408004218, gamma=0.0026587543983272693 ................
[CV] . C=9.324426408004218, gamma=0.0026587543983272693, total=   6.6s
[CV] C=9.324426408004218, gamma=0.0026587543983272693 ................
[CV] . C=9.324426408004218, gamma=0.0026587543983272693, total=   6.4s
[CV] C=2.818249672071006, gamma=0.0023270677083837795 ................
[CV] . C=2.818249672071006, gamma=0.0023270677083837795, total=   6.7s
[CV] C=2.818249672071006, gamma=0.0023270677083837795 ................
[CV] . C=2.818249672071006, gamma=0.0023270677083837795, total=   6.4s
[CV] C=2.818249672071006, gamma=0.0023270677083837795 ................
[CV] . C=2.818249672071006, gamma=0.0023270677083837795, total=   6.7s
[CV] C=4.042422429595377, gamma=0.011207606211860567 .................
[CV] .. C=4.042422429595377, gamma=0.011207606211860567, total=   6.8s
[CV] C=4.042422429595377, gamma=0.011207606211860567 .................
[CV] .. C=4.042422429595377, gamma=0.011207606211860567, total=   7.7s
[CV] C=4.042422429595377, gamma=0.011207606211860567 .................
[CV] .. C=4.042422429595377, gamma=0.011207606211860567, total=   7.4s
[CV] C=5.319450186421157, gamma=0.003823475224675185 .................
[CV] .. C=5.319450186421157, gamma=0.003823475224675185, total=   5.9s
[CV] C=5.319450186421157, gamma=0.003823475224675185 .................
[CV] .. C=5.319450186421157, gamma=0.003823475224675185, total=   6.1s
[CV] C=5.319450186421157, gamma=0.003823475224675185 .................
[CV] .. C=5.319450186421157, gamma=0.003823475224675185, total=   6.0s
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  3.5min finished

[12]:

RandomizedSearchCV(cv=3, estimator=SVR(),
                   param_distributions={'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fc770caf810>,
                                        'gamma': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fc770cafad0>},
                   random_state=42, verbose=2)

[16]:

rnd_search_cv.best_estimator_

[16]:

SVR(C=4.745401188473625, gamma=0.07969454818643928)

[17]:

y_pred = rnd_search_cv.best_estimator_.predict(X_train_scaled)
mse = mean_squared_error(y_train, y_pred)
np.sqrt(mse)

[17]:

0.5727524770785359

看起来要比LinearSVR好些了，在测试集上测试一下

[18]:

y_pred = rnd_search_cv.best_estimator_.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
np.sqrt(mse)

[18]:

0.5929168385528734

[ ]: