2.1.12.2.Logistic regression with Python

1. Titanic: Machine Learning from Disaster

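On the competition's data page, click Download to get the CSV files

The same page describes each data field

2. Data exploration and analysis

Import the basic libraries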

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Embed plots directly in the notebook
```
%matplotlib inline
```
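Load the data and get to know it
- Load the data
  # Kaggle's Titanic training file; adjust the path to wherever you saved it
  train = pd.read_csv('train.csv')
- Check each column's dtype and non-null count with train.info(). The raw Kaggle file has 12 columns; the 9-column summary (2 float64, 4 int64, 3 uint8, or bool in pandas 2.0 and later) is what train.info() reports after the cleaning and one-hot encoding in section 3
  train.info()
- Look at the first few rows to see what the data looks like, for example the first 10
  train.head(10)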
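As a complement to `train.info()`, the sketch below tallies column dtypes and per-column null counts on a small made-up frame. The frame `toy` and all of its values are invented for illustration; they are not rows from the Kaggle file.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame that mimics a few Titanic columns (values are made up).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [None, "C85", None, None],
})

# How many columns there are of each dtype (the same tally info() prints last).
dtype_counts = toy.dtypes.astype(str).value_counts().to_dict()

# Number of missing values per column.
null_counts = toy.isnull().sum().to_dict()

print(dtype_counts)
print(null_counts)
```

Running the same two lines on `train` gives a numeric view of the gaps that the heatmap in the next step shows graphically.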
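Visual analysis
- Seaborn heatmap: draw a map of the missing values in the data
  - Age and Cabin turn out to contain null values
    sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap='viridis')
- Seaborn countplot: see how many passengers survived
  - Survived = 1 marks a survivor, 0 a passenger who did not survive
  sns.set_style('whitegrid')
  sns.countplot(x = 'Survived', data = train)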
- Seaborn countplot: the male/female split among survivors and non-survivors
  sns.countplot(x = 'Survived', hue = 'Sex', data = train, palette='RdBu_r')
- Seaborn countplot: the port of embarkation among survivors and non-survivors
  sns.countplot(x = 'Survived', hue = 'Embarked', data = train, palette='RdBu_r')
- Seaborn countplot: the relationship between survival and passenger class
  sns.countplot(x = 'Survived', hue = 'Pclass', data = train, palette='RdBu_r')
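- Seaborn distplot: the age distribution of the passengers on board
  # distplot is deprecated since seaborn 0.11; sns.histplot(train['Age'].dropna(), bins = 30) is the replacement
  sns.distplot(train['Age'].dropna(), kde = False, bins = 30)
- Seaborn countplot: whether passengers had siblings or a spouse aboard: most travelled without any (SibSp = 0)
  sns.countplot(x = 'SibSp', data = train)
- pandas plot: the distribution of ticket fares
  train['Fare'].plot.hist(bins=40, figsize = (10, 4))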
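- Plotly and Cufflinks: build an interactive chart
  import cufflinks as cf
  cf.go_offline()
  # kind='hist' is needed for bins to take effect
  train['Fare'].iplot(kind='hist', bins=40)
- Seaborn boxplot: shows the age distribution of passengers in each passenger class
  plt.figure(figsize=(10,7))
  sns.boxplot(x='Pclass', y='Age',data=train)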
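The count plots above can also be read as plain tables. The following is a minimal sketch, assuming a made-up six-row sample rather than the real passenger list, of how `pd.crosstab` and a grouped mean give the numbers behind the bars.

```python
import pandas as pd

# Made-up sample; not real passenger records.
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1, 0],
    "Sex": ["male", "male", "female", "female", "male", "female"],
})

# Rows: Survived (0/1); columns: Sex. Each cell is a passenger count,
# i.e. the bar heights that countplot(x='Survived', hue='Sex') would draw.
counts = pd.crosstab(toy["Survived"], toy["Sex"])
print(counts)

# Survival rate per group: the mean of a 0/1 column is the share of 1s.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

Swapping `"Sex"` for `"Embarked"` or `"Pclass"` on the real `train` frame would give the tables for the other two count plots.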

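3. Categorizing and handling missing data

Before training, check the raw data for gaps and find a way to fill them

Start with Age: the goal is to fill in the NA values

Define a function that replaces the null values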

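# Replace a null Age with a stand-in value chosen by passenger class
# (roughly the typical age of each class in the box plot above)
def impute_age(cols):
    # Label-based access; positional cols[0] is deprecated on a labeled Series
    Age = cols['Age']
    Pclass = cols['Pclass']

    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis = 1)

# Redraw the heatmap
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap='viridis')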
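The hard-coded ages in `impute_age` could instead be computed from the data. The sketch below is one possible alternative, not the method used in this walkthrough: it fills each gap with the median age of the passenger's class via `groupby(...).transform`. The frame `toy` and its ages are made up.

```python
import numpy as np
import pandas as pd

# Made-up ages; NaN marks a missing value.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 36.0, np.nan, 30.0, np.nan, 20.0, 26.0, np.nan],
})

# Median age within each class, broadcast back to every row of that class.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Keep existing ages; fill only the gaps with the class median.
toy["Age"] = toy["Age"].fillna(class_median)
print(toy)
```

This avoids reading values off a plot by eye, at the cost of the fill values changing whenever the data changes.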

Next handle Cabin: most of its values are missing, so discard it

Drop the Cabin column

train.drop('Cabin', axis = 1, inplace = True)
# A few NA values remain after the drop
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap='viridis')

Drop the rows that still contain NA

train.dropna(inplace = True)
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap='viridis')
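The order of the two steps matters: `dropna()` removes rows, so calling it while the sparse Cabin column is still present would discard almost everything. A small sketch on made-up data (the frame `toy` is hypothetical) illustrates the difference.

```python
import pandas as pd

# Made-up frame: Cabin is mostly missing, Embarked is missing once.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05, 13.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# Dropping rows first would throw away every row that lacks a Cabin.
rows_if_dropna_first = len(toy.dropna())

# Dropping the sparse column first, then the remaining incomplete rows,
# keeps far more data.
cleaned = toy.drop("Cabin", axis=1).dropna()

print(rows_if_dropna_first, len(cleaned))
```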

Categorical features such as Sex and Embarked can be converted to a one-hot encoding, which turns them into numeric columns that the later training step can use
- Encode Sex
  # Sex, one-hot encoded
  sex = pd.get_dummies(train['Sex'], drop_first=True)
  sex.head()
- Encode Embarked
  # Port of embarkation
  embark = pd.get_dummies(train['Embarked'], drop_first=True)
  embark.head()
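To show what `drop_first=True` does, here is a minimal sketch on made-up values. With every category kept, the indicator columns always sum to 1, so one of them is redundant; dropping the first category removes that redundancy while keeping the same information.

```python
import pandas as pd

# Made-up values for the two categorical columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Full encoding: one indicator column per category.
full = pd.get_dummies(toy["Embarked"])

# drop_first=True removes the first category (alphabetically, "C" here);
# a row with Q == 0 and S == 0 therefore means "C".
reduced = pd.get_dummies(toy["Embarked"], drop_first=True)

# Sex has two categories, so a single "male" column is enough.
sex = pd.get_dummies(toy["Sex"], drop_first=True)

print(list(full.columns), list(reduced.columns), list(sex.columns))
```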

Concatenate the one-hot columns onto the original data

train = pd.concat([train, sex, embark], axis = 1)
train.head()

Remove the original Sex and Embarked columns, along with Name, Ticket, and PassengerId, which are not needed

train.drop(['Sex', 'Embarked', 'Name', 'Ticket', 'PassengerId'], axis = 1, inplace = True)
train.head()

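4. Training a model to predict whether a passenger survives

Next, use every column except Survived to predict Survived. In a real project, a separate batch of data could serve as the test set. This walkthrough uses scikit-learn (sklearn)

Start with train_test_split, which randomly partitions the data into a training set and a test set

X = train.drop('Survived', axis = 1)
Y = train['Survived']
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=101)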
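The effect of `test_size` and `random_state` is easier to see on a tiny array. This sketch uses made-up data (10 rows, 2 features) rather than the Titanic frame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.4 holds out 40% of the rows; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101
)

# Repeating the call with the same random_state gives the same split.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.4, random_state=101)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the evaluation numbers reproducible from run to run; any integer works, 101 is simply the value used above.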

Then train scikit-learn's logistic regression model and use it to predict on the test set

from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)
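To connect `predict` with the underlying model, the sketch below fits a logistic regression on made-up one-dimensional data and checks that applying the sigmoid to the linear score reproduces `predict_proba`, and that thresholding at 0.5 reproduces `predict`. The data and the names `p_manual` and `labels_manual` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 1-D data with two overlapping groups (not Titanic data).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

model = LogisticRegression().fit(X, y)

# Linear score z = w * x + b for every sample.
z = model.decision_function(X)

# The sigmoid maps the score to an estimate of P(y = 1 | x).
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = model.predict_proba(X)[:, 1]

# predict() returns class 1 where that probability exceeds 0.5.
labels_manual = (p_manual > 0.5).astype(int)

print(np.allclose(p_manual, p_sklearn))
```

On the Titanic model, `logmodel.predict_proba(X_test)` would give the estimated survival probability per passenger instead of only the 0/1 label.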

Evaluate model

from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
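The report lists precision, recall, and F1 for each class. As a rough illustration of what those columns mean, the following computes them by hand for the positive class on made-up labels (the lists `y_true` and `y_pred` are invented).

```python
# Made-up labels for illustration (1 = survived, 0 = did not survive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count outcomes for the positive class (label 1).
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

# Precision: of those predicted to survive, how many did.
precision = tp / (tp + fp)
# Recall: of those who survived, how many were found.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```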

Confusion matrix

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)
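In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so for binary labels the cells read `[[TN, FP], [FN, TP]]`. A short sketch on made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (1 = survived, 0 = did not survive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# Flatten row by row: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of predictions on the diagonal.
accuracy = (tp + tn) / cm.sum()

print(cm)
print(accuracy)
```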

