2.2. Machine Learning Crash Course Jam

1. Basic information
2. Google Machine Learning Course
1. Supervised learning
2. Regression: predicting continuous data (e.g., house prices)
(1) Algebraic: y = wx + b -> y = w1x1 + w2x2 + b
(2) Loss: measures how well the model predicts; the best case is a loss of 0
Mean squared error (MSE)
Root mean squared error (RMSE)
3. Reducing loss
The optimal learning rate differs from dataset to dataset
Step size: the learning rate can be updated after some number of iterations
Stochastic gradient descent (SGD) picks a random point to compute the next gradient; compared with plain GD, SGD is more efficient (a small sketch follows)
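A minimal NumPy sketch of one-feature linear regression trained by full-batch gradient descent on MSE; the toy data and the learning rate are made up for illustration:

import numpy as np

np.random.seed(0)
x = np.random.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + np.random.normal(0, 1, 100)  # toy data: y ~ 3x + 2 plus noise

w, b = 0.0, 0.0
learning_rate = 0.01  # the best value varies from dataset to dataset

for step in range(1000):
    error = (w * x + b) - y
    # Gradients of MSE = mean(error^2) with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # SGD would instead compute these from one randomly chosen example per step.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print("w=%.2f  b=%.2f  MSE=%.2f" % (w, b, np.mean(((w * x + b) - y) ** 2)))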
3. Review Pandas
Colab is an environment much like a Jupyter notebook
Pandas data structures
Series
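The notes list only Series here; the other core Pandas structure is the DataFrame, a table whose columns are Series. A quick illustration with made-up values:

import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([23, 31, 47], name="Age")

# A DataFrame is a two-dimensional table; each column is a Series.
people = pd.DataFrame({"Name": ["Ann", "Ben", "Cleo"], "Age": ages})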
Reading external data
Use head to check that the data loaded correctly
import pandas as pd

accidents = pd.read_csv('https://raw.githubusercontent.com/kristenchan/Sharing/master/104_accident.csv')
accidents.head(5)
Checking the data types
accidents.dtypes
Summary statistics for numeric columns (count, min, max, std, mean)
accidents.describe()
Data types in statistics
Categorical data
Continuous data
Converting data types
accidents['Year'] = accidents.Year.astype('str')
Selecting data
accidents['Age']
accidents.iloc[:, 9]
Shuffling the rows (see the sketch below)
Adding a new column (see the sketch below)
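The notes give no code for these last two items; a sketch against the accidents DataFrame loaded above (the IsSenior column and the 65-year cutoff are made-up examples, and Age is assumed to be numeric):

# Shuffle the rows: sample the whole frame in random order, then reset the index.
accidents = accidents.sample(frac=1).reset_index(drop=True)

# Add a new column derived from an existing one.
accidents['IsSenior'] = accidents['Age'] > 65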
4. TensorFlow
1. Import libraries
import math
from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset
tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
2. Reading external data
california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")
3. Randomize the data to ensure there are no pathological ordering effects (which could hurt stochastic gradient descent)
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))
4. Scale large values down (change the units)
california_housing_dataframe["median_house_value"] /= 1000.0
5. Summary statistics for the numeric columns
california_housing_dataframe.describe()
6. Build the first model (linear regression)
(1) Plot the data first to get a feel for it
import seaborn as sns

# Plot total rooms against median house value.
sns.lmplot("total_rooms", "median_house_value",
           data=california_housing_dataframe, fit_reg=False)
(2) Define the input feature (x)
# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]

# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]
(3) Define the label (y)
# Define the label.
targets = california_housing_dataframe["median_house_value"]
(4) Configure the LinearRegressor
# Use gradient descent as the optimizer for training the model.
# Set a learning rate of 0.0000001 for gradient descent.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

# Configure the linear regression model with our feature columns and optimizer.
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)
(5) Train the model
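The training call below relies on an input function my_input_fn that these notes never define; in the MLCC first-steps exercise it looks roughly like this (TF 1.x Dataset API, using the Dataset import from step 1):

def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Feeds the feature/target DataFrames to the LinearRegressor in batches."""
    # Convert the pandas features into a dict of NumPy arrays.
    features = {key: np.array(value) for key, value in dict(features).items()}

    # Construct a dataset, then configure batching and repeating.
    ds = Dataset.from_tensor_slices((features, targets))
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)

    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels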
_ = linear_regressor.train(
    input_fn=lambda: my_input_fn(my_feature, targets),
    steps=100
)
(6) Predict: see how the model's predictions turn out
# Create an input function for predictions.
# Note: Since we're making just one prediction for each example, we don't
# need to repeat or shuffle the data here.
prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)

# Call predict() on the linear_regressor to make predictions.
predictions = linear_regressor.predict(input_fn=prediction_input_fn)
(7) Compute the MSE and RMSE
# Format predictions as a NumPy array, so we can calculate error metrics.
predictions = np.array([item['predictions'][0] for item in predictions])

# Print Mean Squared Error and Root Mean Squared Error.
mean_squared_error = metrics.mean_squared_error(predictions, targets)
root_mean_squared_error = math.sqrt(mean_squared_error)
print("Mean Squared Error (on training data): %0.3f" % mean_squared_error)
print("Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error)

min_house_value = california_housing_dataframe["median_house_value"].min()
max_house_value = california_housing_dataframe["median_house_value"].max()
min_max_difference = max_house_value - min_house_value

print("Min. Median House Value: %0.3f" % min_house_value)
print("Max. Median House Value: %0.3f" % max_house_value)
print("Difference between Min. and Max.: %0.3f" % min_max_difference)
print("Root Mean Squared Error: %0.3f" % root_mean_squared_error)
(8) Compare the predictions with the targets
calibration_data = pd.DataFrame()
calibration_data["predictions"] = pd.Series(predictions)
calibration_data["targets"] = pd.Series(targets)
calibration_data.describe()
(9) Take a uniform random sample of the data
sample = california_housing_dataframe.sample(n=300)
(10) Plot the sample and the learned regression line
# Get the min and max total_rooms values.
x_0 = sample["total_rooms"].min()
x_1 = sample["total_rooms"].max()

# Retrieve the final weight and bias generated during training.
weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

# Get the predicted median_house_values for the min and max total_rooms values.
y_0 = weight * x_0 + bias
y_1 = weight * x_1 + bias

# Plot our regression line from (x_0, y_0) to (x_1, y_1).
plt.plot([x_0, x_1], [y_0, y_1], c='r')

# Label the graph axes.
plt.ylabel("median_house_value")
plt.xlabel("total_rooms")

# Plot a scatter plot from our data sample.
plt.scatter(sample["total_rooms"], sample["median_house_value"])

# Display graph.
plt.show()
5. Build a model that predicts well
1. How do we fit a model?
It should predict unseen data accurately enough
Possible states of a model:
Overfitting
Optimum
Underfitting
Where does the "future" data come from?
Split the data into a training set and a testing set
How do we carve out the testing set?
80/20 rule: hold out 1/5 of the data as the testing set (a split sketch follows this list)
A validation set may be held out as well
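A minimal sketch of the 80/20 split with pandas, reusing the dataframe from section 4; shuffling first and the exact split points are common conventions, not prescribed by these notes:

import numpy as np

# Shuffle so the split is not biased by any ordering in the file.
df = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))

n = len(df)
train_set = df.iloc[:int(0.8 * n)]   # 80% for training
test_set = df.iloc[int(0.8 * n):]    # the remaining 20% for testing
# A validation set is often carved out of the training portion:
validation_set = train_set.iloc[int(0.75 * len(train_set)):]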
2. How do we choose suitable features?
(1) Feature engineering
Pick appropriate features
(2) Unit conversion
(3) A feature should not take on "magic values" (see the sketch after this list)
Don't use -1 to mark unusable data
Add an extra flag column instead
(4) Don't record a feature's properties as opaque codes
(5) Understand the data and remove outliers
(6) Non-continuous variables can be represented as value ranges (buckets); see the sketch after this list
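A pandas sketch of points (3) and (6); the column name, the -1 marker, and the bin edges are all made-up examples:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, -1, 40, 67, -1, 13]})  # -1 used as a magic value

# (3) Replace the magic value with a proper missing marker and add a flag column.
df["age_missing"] = df["age"] == -1
df["age"] = df["age"].replace(-1, np.nan)

# (6) Represent the variable as buckets instead of raw values.
df["age_bucket"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                          labels=["child", "young", "middle", "senior"])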
3. Classification
Probabilistic models
Logistic regression: transforms a linear model into a probabilistic model
The exponential exp(b + w1x1 + ... + wpxp) is always > 0
p = 1 / (1 + exp(-(b + w1x1 + ... + wpxp))), which always satisfies 0 < p < 1
Logistic regression is the simplest neural network (NN)
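The transformation above is the sigmoid (logistic) function; a tiny NumPy sketch of it:

import numpy as np

def sigmoid(z):
    # Squash the linear output z = b + w1*x1 + ... + wp*xp into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5: a linear output of 0 maps to probability 0.5
print(sigmoid(4.0))   # close to 1
print(sigmoid(-4.0))  # close to 0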
The choice of threshold depends on how much risk you are willing to bear
The same model needs a different threshold in different scenarios, and the chosen threshold affects the resulting decisions
True positive (TP): the model predicts positive and the label really is positive
Besides precision, TP/(TP + FP), we also need to consider recall, TP/(TP + FN); see the sketch below
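A minimal sketch of how the threshold trades precision against recall, using sklearn.metrics; the labels and scores are made up for illustration:

import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                  # made-up ground truth
y_prob = np.array([0.1, 0.4, 0.8, 0.3, 0.9, 0.2, 0.6, 0.7])  # made-up model scores

# Raising the threshold usually raises precision and lowers recall.
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    print("threshold %.1f  precision %.2f  recall %.2f" % (
        threshold,
        precision_score(y_true, y_pred),   # TP / (TP + FP)
        recall_score(y_true, y_pred)))     # TP / (TP + FN)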
Validation Colab
1. Import libraries
import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
2. Reading external data
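The notes leave this step blank; presumably it is the same read as in section 4:

california_housing_dataframe = pd.read_csv(
    "https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")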
3. Data preprocessing
def preprocess_features(california_housing_dataframe):
  """Prepares input features from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
  """
  selected_features = california_housing_dataframe[
    ["latitude",
     "longitude",
     "housing_median_age",
     "total_rooms",
     "total_bedrooms",
     "population",
     "households",
     "median_income"]]
  processed_features = selected_features.copy()
  # Create a synthetic feature.
  processed_features["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] /
    california_housing_dataframe["population"])
  return processed_features

def preprocess_targets(california_housing_dataframe):
  """Prepares target features (i.e., labels) from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the target feature.
  """
  output_targets = pd.DataFrame()
  # Scale the target to be in units of thousands of dollars.
  output_targets["median_house_value"] = (
    california_housing_dataframe["median_house_value"] / 1000.0)
  return output_targets
4. Select the training set
training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_examples.describe()
training_targets = preprocess_targets(california_housing_dataframe.head(12000))
training_targets.describe()
5. Select the validation set
validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_examples.describe()
validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))
validation_targets.describe()
Key things to observe in the Colab
1. The training data and the validation data should have similar characteristics
So remember to shuffle the data first so that the two splits are drawn evenly
from sklearn.utils import shuffle

california_housing_dataframe = pd.read_csv(
    "https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")
california_housing_dataframe = shuffle(california_housing_dataframe)
2. Overfitting
Is it caused by putting in too many features?
We want the loss to be small, but we also don't want the model to be too complex
How do we define model complexity?
y = b + w1x1 + ... + wpxp
The larger the weights w or the number of features p, the more complex the model
Model complexity can be measured by the squared L2 norm of the weights:
||w||^2 = w1^2 + ... + wp^2
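A sketch of how this term enters the training objective as a regularized loss; the weighting factor lam is a hypothetical hyperparameter, not from these notes:

import numpy as np

def l2_regularized_mse(y_true, y_pred, w, lam=0.01):
    # loss = MSE + lam * ||w||^2: the penalty discourages large weights.
    mse = np.mean((y_true - y_pred) ** 2)
    l2 = np.sum(np.square(w))  # ||w||^2 = w1^2 + ... + wp^2
    return mse + lam * l2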