Python
  • Introduction
  • Chapter 1.Notes from research
    • 1.Introduction of Python
    • 2. Build developer environment
      • 2.1.Sublime Text 3
      • 2.2.Jupyter(IPython notebook)
        • 2.2.1.Introduction
        • 2.2.2.Basic usage
        • 2.2.3.Some common operations
      • 2.3.Github
        • 2.3.1.Create Github account
        • 2.3.2.Create a new repository
        • 2.3.3.Basic operations: config, clone, push
      • 2.4.Install Python 3.4 in Windows
    • 3. Write Python code
      • 3.1.Hello Python
      • 3.2.Basic knowledge
      • 3.3.Writing a standalone Python program
      • 3.4.Arguments parser
      • 3.5.Class
      • 3.6.Sequence
    • 4. Web crawler
      • 4.1.Introduction
      • 4.2.requests
      • 4.3.BeautifulSoup4
      • 4.4.A little web crawler
    • 5. Software testing
      • 5.1. Robot Framework
        • 5.1.1.Introduction
        • 5.1.2.What is a test-automation framework?
        • 5.1.3.Robot Framework Architecture
        • 5.1.4.Robot Framework Library
        • 5.1.5.Reference
    • 6. encode/ decode
      • 6.1.Basic concepts of encoding and decoding
      • 6.2.Common encoding/decoding error messages and their meanings
      • 6.3.Working with text files
    • 7. module
      • 7.1.Write a module
      • 7.2.Common module
        • 7.2.1.sched
        • 7.2.2.threading
    • 8. Integrate IIS with Django
      • 8.1.Integrate IIS with Django
  • Chapter 2.Courses
    • 2.1.Python for Data Science and Machine Learning Bootcamp
      • 2.1.1.Virtual Environment
      • 2.1.2.Python crash course
      • 2.1.3.Python for Data Analysis - NumPy
        • 2.1.3.1.Numpy arrays
        • 2.1.3.2.Numpy Array Indexing
        • 2.1.3.3.Numpy Operations
      • 2.1.4.Python for Data Analysis - Pandas
        • 2.1.4.1.Introduction
        • 2.1.4.2.Series
        • 2.1.4.3.DataFrames
        • 2.1.4.4.Missing Data
        • 2.1.4.5.GroupBy
        • 2.1.4.6.Merging, Joining, and Concatenating
        • 2.1.4.7.Data input and output
      • 2.1.5.Python for Data Visualization - Pandas Built-in Data Visualization
      • 2.1.6.Python for Data Visualization - Matplotlib
        • 2.1.6.1.Introduction of Matplotlib
        • 2.1.6.2.Matplotlib
      • 2.1.7.Python for Data Visualization - Seaborn
        • 2.1.7.1.Introduction to Seaborn
        • 2.1.7.2.Distribution Plots
        • 2.1.7.3.Categorical Plots
        • 2.1.7.4.Matrix Plots
        • 2.1.7.5.Grids
        • 2.1.7.6.Regression Plots
      • 2.1.8. Python for Data Visualization - Plotly and Cufflinks
        • 2.1.8.1.Introduction to Plotly and Cufflinks
        • 2.1.8.2.Plotly and Cufflinks
      • 2.1.9. Python for Data Visualization - Geographical plotting
        • 2.1.9.1.Choropleth Maps - USA
        • 2.1.9.2.Choropleth Maps - World
      • 2.1.10.Combine data analysis and visualization to tackle real world data sets
        • 911 calls capstone project
      • 2.1.11.Linear regression
        • 2.1.11.1.Introduction to Scikit-learn
        • 2.1.11.2.Linear regression with Python
      • 2.1.12.Logistic regression
        • 2.1.12.1.Logistic regression Theory
        • 2.1.12.2.Logistic regression with Python
      • 2.1.13.K Nearest Neighbors
        • 2.1.13.1.KNN Theory
        • 2.1.13.2.KNN with Python
      • 2.1.14.Decision trees and random forests
        • 2.1.14.1.Introduction of tree methods
        • 2.1.14.2.Decision trees and Random Forests with Python
      • 2.1.15.Support Vector Machines
      • 2.1.16.K means clustering
      • 2.1.17.Principal Component Analysis
    • 2.2. Machine Learning Crash Course Jam

2.2. Machine Learning Crash Course Jam

1.Basic information

2.Google Machine Learning Course

    • (1)Algebraic: the linear model y = wx + b generalizes to y = w1x1 + w2x2 + b with more features

    • (2)Loss: measures how well the model predicts; in the best case the loss is 0

      • Mean squared error (MSE)

      • Root mean squared error (RMSE)

    • (3)Reducing loss

      • The optimal learning rate varies from dataset to dataset

        • step size: update the learning rate after some number of iterations

        • Stochastic gradient descent (SGD) computes each step's gradient at one randomly chosen example; compared with plain GD, SGD is more efficient (see the sketch below)
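
A minimal sketch of these ideas (the toy data, learning rate, and step count are all illustrative): fit y = wx + b by gradient descent and report the loss.

    import numpy as np

    # Toy data that roughly follows y = 2x.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 4.2, 5.9, 8.1])

    w, b = 0.0, 0.0
    learning_rate = 0.05  # too large diverges, too small converges slowly

    for step in range(500):
        error = (w * x + b) - y
        mse = np.mean(error ** 2)               # Mean squared error (the loss)
        # Gradients of the MSE with respect to w and b.
        w -= learning_rate * 2 * np.mean(error * x)
        b -= learning_rate * 2 * np.mean(error)

    print("RMSE:", np.sqrt(mse))                # Root mean squared error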

3.Review Pandas

  • Pandas Structure

    • Series

      • Reading external data

        • Use head to check that the data was loaded

          accidents = pd.read_csv('https://raw.githubusercontent.com/kristenchan/Sharing/master/104_accident.csv')
          accidents.head(5)
        • Check the column data types

          accidents.dtypes
        • Summary statistics for numeric columns (count, min, max, std, mean)

          accidents.describe()
        • Kinds of data in statistics

          • Categorical data

          • Continuous data

            • Converting a column's type

              accidents['Year'] = accidents.Year.astype('str')
      • Selecting data

        accidents['Age']
        accidents.iloc[:, 9]
      • Reshuffling the rows

      • Adding new columns (both shown in the sketch below)
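
      A minimal sketch of the last two operations, reusing the accidents DataFrame from above (the IsYoung column name is illustrative):

        # Reshuffle all rows, then rebuild the index.
        accidents = accidents.sample(frac=1).reset_index(drop=True)

        # Add a new column derived from an existing one.
        accidents['IsYoung'] = accidents['Age'] < 30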

4.Tensorflow

  • 1.Import the libraries

import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
  • 2.Read the external data

california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")
  • 3.Randomize the data to make sure there are no pathological ordering effects (which could harm the performance of stochastic gradient descent)

california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))
  • 4.Scale large values down (change the units)

california_housing_dataframe["median_house_value"] /= 1000.0
  • 5.Summary statistics for the numeric data

california_housing_dataframe.describe()
  • 6.Build the first model (Linear regression)

    • (1)Plot the data first to get a feel for it

      import seaborn as sns
      # Plot total_rooms against median_house_value.
      sns.lmplot("total_rooms", "median_house_value", data=california_housing_dataframe, fit_reg=False)
    • (2)Define the input feature (x)

      # Define the input feature: total_rooms.
      my_feature = california_housing_dataframe[["total_rooms"]]
      
      # Configure a numeric feature column for total_rooms.
      feature_columns = [tf.feature_column.numeric_column("total_rooms")]
    • (3)Define the target (y)

      # Define the label.
      targets = california_housing_dataframe["median_house_value"]
    • (4)Configure the LinearRegressor

      # Use gradient descent as the optimizer for training the model.
      my_optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
      my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
      # Configure the linear regression model with our feature columns and optimizer.
      # Set a learning rate of 0.0000001 for Gradient Descent.
      linear_regressor = tf.estimator.LinearRegressor(
        feature_columns=feature_columns,
        optimizer=my_optimizer
      )
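
      Note: the train and predict calls in steps (5) and (6) rely on my_input_fn, which these notes never define. A sketch of its definition, based on the MLCC colab (TF 1.x Dataset API; Dataset and np are imported in step 1):

      def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
          """Feeds one feature and the targets to the LinearRegressor in batches."""
          # Convert the pandas data into a dict of numpy arrays.
          features = {key: np.array(value) for key, value in dict(features).items()}

          # Construct a dataset, and configure batching/repeating.
          ds = Dataset.from_tensor_slices((features, targets))
          ds = ds.batch(batch_size).repeat(num_epochs)

          # Shuffle the data, if specified.
          if shuffle:
              ds = ds.shuffle(buffer_size=10000)

          # Return the next batch of data.
          features, labels = ds.make_one_shot_iterator().get_next()
          return features, labels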
    • (5)Train the model

      _ = linear_regressor.train(
        input_fn = lambda:my_input_fn(my_feature, targets),
        steps=100
      )
    • (6)Predict: see what the model's predictions look like

      # Create an input function for predictions.
      # Note: Since we're making just one prediction for each example, we don't 
      # need to repeat or shuffle the data here.
      prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)
      
      # Call predict() on the linear_regressor to make predictions.
      predictions = linear_regressor.predict(input_fn=prediction_input_fn)
    • (7)Compute the MSE and RMSE

      # Format predictions as a NumPy array, so we can calculate error metrics.
      predictions = np.array([item['predictions'][0] for item in predictions])
      
      # Print Mean Squared Error and Root Mean Squared Error.
      mean_squared_error = metrics.mean_squared_error(predictions, targets)
      root_mean_squared_error = math.sqrt(mean_squared_error)
      print "Mean Squared Error (on training data): %0.3f" % mean_squared_error
      print "Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error
      
      min_house_value = california_housing_dataframe["median_house_value"].min()
      max_house_value = california_housing_dataframe["median_house_value"].max()
      min_max_difference = max_house_value - min_house_value
      
      print "Min. Median House Value: %0.3f" % min_house_value
      print "Max. Median House Value: %0.3f" % max_house_value
      print "Difference between Min. and Max.: %0.3f" % min_max_difference
      print "Root Mean Squared Error: %0.3f" % root_mean_squared_error
    • (8)Examine the difference between the predictions and the targets

      calibration_data = pd.DataFrame()
      calibration_data["predictions"] = pd.Series(predictions)
      calibration_data["targets"] = pd.Series(targets)
      calibration_data.describe()
    • (9)Draw a uniform random sample

      sample = california_housing_dataframe.sample(n=300)
    • (10)Draw the scatter plot with the fitted regression line

      # Get the min and max total_rooms values.
      x_0 = sample["total_rooms"].min()
      x_1 = sample["total_rooms"].max()
      
      # Retrieve the final weight and bias generated during training.
      weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
      bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')
      
      # Get the predicted median_house_values for the min and max total_rooms values.
      y_0 = weight * x_0 + bias 
      y_1 = weight * x_1 + bias
      
      # Plot our regression line from (x_0, y_0) to (x_1, y_1).
      plt.plot([x_0, x_1], [y_0, y_1], c='r')
      
      # Label the graph axes.
      plt.ylabel("median_house_value")
      plt.xlabel("total_rooms")
      
      # Plot a scatter plot from our data sample.
      plt.scatter(sample["total_rooms"], sample["median_house_value"])
      
      # Display graph.
      plt.show()

5.Build a predictable model

    • 1.How do you fit a model?

      • It should predict unseen data accurately enough

      • Possible states of the model

        • Overfitting

        • Optimum

        • Underfitting

      • Where does the "future" data come from?

        • Split the data into a training set and a testing set

        • How do you carve out the testing set?

          • 80/20 rule: hold out 1/5 of the data as the testing set (see the sketch below)

        • validation set
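
          A minimal sketch of the 80/20 split (assuming the california_housing_dataframe loaded in section 4):

            from sklearn.model_selection import train_test_split

            # Hold out 1/5 of the rows as the testing set; rows are shuffled by default.
            train_df, test_df = train_test_split(california_housing_dataframe, test_size=0.2)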

    • 2.How do you pick suitable features?

      • (1)Feature engineering

        • Pick features appropriate for the problem

      • (2)Unit conversion

      • (3)Features should not take on "magic values"

        • Don't use -1 to mark unusable data

        • Add an extra flag column instead

      • (4)Don't record a feature's properties with opaque codes

      • (5)Understand the data and remove outliers

      • (6)A non-continuous variable can be represented as value ranges (see the sketch below)
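
      A minimal sketch of points (3) and (6); the DataFrame and column names are illustrative:

        import numpy as np
        import pandas as pd

        df = pd.DataFrame({'age': [25, -1, 47, 33]})

        # (3) Don't keep the magic value: record it in a flag column, then replace -1 with NaN.
        df['age_missing'] = df['age'] == -1
        df['age'] = df['age'].replace(-1, np.nan)

        # (6) Represent a non-continuous variable as value ranges (bins).
        df['age_bucket'] = pd.cut(df['age'], bins=[0, 30, 60, 120])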

    • 3.Classification

      • Probabilistic models

        • Logistic regression: converts a linear model into a probabilistic model

          • Linear score: z = b + w1x1 + ... + wpxp

          • Probability: p = 1 / (1 + exp(-z)), which squashes the score into the range (0, 1)

      • Logistic regression is the simplest neural network (NN)

      • The choice of threshold depends on how much risk you can bear

        • The same model needs different thresholds in different scenarios, which affects the resulting decisions

        • True positive (TP): the model predicts positive and the example actually is positive

        • Besides precision, TP / (TP + FP), you also need to consider recall, TP / (TP + FN); see the sketch below
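
      A minimal sketch of the classification ideas above (the weights, data, labels, and the 0.5 threshold are all illustrative):

        import numpy as np

        def sigmoid(z):
            # Squash a linear score into a probability between 0 and 1.
            return 1.0 / (1.0 + np.exp(-z))

        b = -1.0
        w = np.array([0.8, 1.5])
        x = np.array([[0.2, 0.1], [1.0, 0.9], [0.5, 2.0]])

        p = sigmoid(b + x @ w)      # logistic regression probabilities
        predicted = p >= 0.5        # threshold chosen according to the risk involved

        # Precision = TP / (TP + FP); recall = TP / (TP + FN).
        actual = np.array([False, True, True])
        tp = (predicted & actual).sum()
        precision = tp / predicted.sum()
        recall = tp / actual.sum()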

Validation colab

  • 1.Import the libraries (the same imports and display options as in section 4 above)

  • 2.Read the external data

  • 3.Preprocess the data

def preprocess_features(california_housing_dataframe):
  """Prepares input features from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
  """
  selected_features = california_housing_dataframe[
    ["latitude",
     "longitude",
     "housing_median_age",
     "total_rooms",
     "total_bedrooms",
     "population",
     "households",
     "median_income"]]
  processed_features = selected_features.copy()
  # Create a synthetic feature.
  processed_features["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] /
    california_housing_dataframe["population"])
  return processed_features

def preprocess_targets(california_housing_dataframe):
  """Prepares target features (i.e., labels) from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the target feature.
  """
  output_targets = pd.DataFrame()
  # Scale the target to be in units of thousands of dollars.
  output_targets["median_house_value"] = (
    california_housing_dataframe["median_house_value"] / 1000.0)
  return output_targets

  • 4.Choose the training set

training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_examples.describe()

training_targets = preprocess_targets(california_housing_dataframe.head(12000))
training_targets.describe()

  • 5.Choose the validation set

validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_examples.describe()

validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))
validation_targets.describe()

Key observations from the colab

  • 1.Training data and validation data should have similar characteristics

    • So remember to reshuffle the data first so that the split is even

      from sklearn.utils import shuffle
      
      california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")
      
      california_housing_dataframe = shuffle(california_housing_dataframe)
    • Overfitting

      • Is it the result of using too many features?

      • We want the loss to be small, but not at the cost of an overly complex model

    • How do you define model complexity?

      • y = b + w1x1 + ... + wpxp

        • The larger the weights w, or the more terms p, the more complex the model

      • Model complexity, measured by the L2 term (see the sketch below)

        • ||w||^2 = w1^2 + ... + wp^2
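
        A minimal sketch of the complexity term (the weights are illustrative):

          import numpy as np

          w = np.array([0.5, -1.2, 3.0])
          l2 = np.sum(w ** 2)  # ||w||^2 = w1^2 + ... + wp^2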

Links

  • Supervised learning
  • Regression: predicting continuous data (e.g., housing prices)
  • Colab: an environment much like a Jupyter notebook
  • https://developers.google.com/machine-learning/crash-course/reducing-loss/check-your-understanding
  • https://developers.google.com/machine-learning/crash-course/fitter/graph
  • Review Pandas slides
  • Build a predictable model slides
  • Shared notes
  • Google MLCC materials
  • TensorFlow colab
  • Validation colab
  • Logistic regression colab