# 911 calls capstone project

![](https://1184108162-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4M0G8SFgkeUaGo4vl-%2F-M4M0HrDfjWeZX2tGCNv%2F-M4M0T7_bBGYNUsLOU6Y%2Fertt.png?generation=1586302957541966\&alt=media)

## 1.Kaggle

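* [Kaggle](https://zh.wikipedia.org/wiki/Kaggle) is a platform for data-modeling and data-analysis competitions. Companies and researchers can publish data on it
* Many strategies can be applied to almost any predictive-modeling problem, and researchers cannot know at the outset which method will work best for a particular problem. Kaggle aims to solve this through crowdsourcing, turning data science into a sport
* [Acquired by Google in 2017](https://www.inside.com.tw/2017/03/09/kaggle-joins-google-cloud)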

## 2.Emergency - 911 Calls projects

* On the [dataset page](https://www.kaggle.com/mchirico/montcoalert), click Download to get the CSV file

![](https://1184108162-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M4M0G8SFgkeUaGo4vl-%2F-M4M0HrDfjWeZX2tGCNv%2F-M4M0T7boLG8DUJ--Wqv%2Fertt.png?generation=1586302957322240\&alt=media)

## 3.Read and understand the data

1. First, import [pandas](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/214python-for-data-analysis-pandas.html) and [numpy](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/213python-for-data-analysis-numpy.html)

   ```
   import numpy as np
   import pandas as pd
   ```
2. Next, if you are using Jupyter Notebook, embed matplotlib charts directly in the notebook, and import the visualization packages [seaborn](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn.html) and [matplotlib](https://jenhsuan.gitbooks.io/python/content/wqew.html)

   ```
   import matplotlib.pyplot as plt
   import seaborn as sns
   sns.set_style('whitegrid')
   %matplotlib inline
   ```
3. Read and inspect the data

   ```
   df = pd.read_csv('911.csv')
   ```

   * Start by checking the dtype of each column. df.info() shows that this dataset has 9 columns: 3 are float64, 1 is int64, and 5 are object (in pandas, object columns usually hold Python strings)

   ```
   df.info()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-09%20%E4%B8%8B%E5%8D%889.26.05.png)

   * Look at the first few rows to see what the data contains, for example the first 10

   ```
   df.head(10)
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-09%20%E4%B8%8B%E5%8D%889.49.17.png)

   * Get basic summary statistics for the numeric columns: count, mean, standard deviation, quartiles, and so on

   ```
   df.describe()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%8811.21.24.png)
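
The inspection steps above can also be tried without the CSV. The following is a minimal sketch on a tiny stand-in DataFrame: the nine column names are assumed to match the Kaggle file, and every value (including the addresses and descriptions) is made up for illustration only.

```python
import pandas as pd

# Hypothetical sample rows: column names are assumed to match 911.csv,
# values are invented and do not come from the real dataset.
sample = pd.DataFrame({
    'lat': [40.30, 40.26, 40.12],
    'lng': [-75.58, -75.26, -75.35],
    'desc': ['placeholder description a', 'placeholder description b', 'placeholder description c'],
    'zip': [19525.0, 19446.0, None],          # a missing zip becomes NaN (float64)
    'title': ['EMS: BACK PAINS/INJURY', 'EMS: DIABETIC EMERGENCY', 'Fire: GAS-ODOR/LEAK'],
    'timeStamp': ['2015-12-10 17:40:00', '2015-12-10 17:40:00', '2015-12-10 17:40:00'],
    'twp': ['NEW HANOVER', 'HATFIELD TOWNSHIP', 'NORRISTOWN'],
    'addr': ['placeholder address a', 'placeholder address b', 'placeholder address c'],
    'e': [1, 1, 1],
})

# dtypes gives the same per-column type information that info() prints
print(sample.dtypes)

# describe() summarizes only the numeric columns by default
summary = sample.describe()
print(summary)
```

Note that describe() skips the text columns unless asked otherwise, which is why only the numeric columns appear in the summary.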

## 4.Basic analysis

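1. What are the most frequent values in a column? For example, list the 5 most common zip codes

   * value\_counts returns a Series sorted in descending order by default; pass ascending=True to sort in ascending order instead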

   ```
   df['zip'].value_counts().head(5)
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%887.59.06.png)
2. How many unique values does a column have? For example, how many distinct titles are there

```
df['title'].nunique()
```
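
As a small illustration of the two calls above, the sketch below runs value\_counts and nunique on a hand-made Series. The zip codes are hypothetical values chosen for the example, and one missing entry is included to show that both methods ignore NaN unless dropna=False is passed.

```python
import pandas as pd

# Hypothetical zip codes, invented for illustration; None becomes NaN
zips = pd.Series([19401, 19401, 19401, 19464, 19464, 19403, None])

top = zips.value_counts()                      # most frequent first (default)
bottom = zips.value_counts(ascending=True)     # least frequent first
with_nan = zips.value_counts(dropna=False)     # also counts the missing entry

n_unique = zips.nunique()                      # NaN is not counted
n_unique_with_nan = zips.nunique(dropna=False)

print(top.head(2))
print(bottom.head(2))
print(n_unique, n_unique_with_nan)
```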

## 5.Create new features

1. Each title starts with a Reason/Department followed by a colon. In EMS: BACK PAINS/INJURY, for example, the Reason is EMS. Use a [lambda expression with apply()](https://my.oschina.net/lionets/blog/187067) to create a new Reason column

   ```
   df['Reason'] = df['title'].apply(lambda title: title.split(':')[0])
   ```
2. What are the most common reasons for a 911 call?

   ```
   df['Reason'].value_counts()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%888.19.14.png)
3. Use [seaborn](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn.html) to draw a countplot

   * [countplot](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2173categorical-plots.html): set x to the column you want to count and data to the source DataFrame

   ```
   sns.set_style('whitegrid')
   sns.countplot(x='Reason', data=df)
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%888.48.53.png)
4. Get the hour from timeStamp

   * The timeStamp Series is stored as strings in the DataFrame, so it must first be converted to datetime

   ```
   df['timeStamp'] = pd.to_datetime(df['timeStamp'])
   df['timeStamp'].iloc[0].hour
   ```
5. Store the hour, month, and dayofweek from timeStamp in new columns

   * For each element of df\['timeStamp'], the hour, month, and dayofweek attributes give the hour, the month, and the day of the week

   ```
   df['Hour'] = df['timeStamp'].apply(lambda time: time.hour)
   df['Month'] = df['timeStamp'].apply(lambda time: time.month)
   df['Day of Week'] = df['timeStamp'].apply(lambda time: time.dayofweek)
   ```

   * Convert Day of Week from a number to a string (dayofweek counts from 0 for Monday)

   ```
   dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
   df['Day of Week'] = df['Day of Week'].map(dmap)
   ```

   * Draw a countplot of Day of Week: set x to 'Day of Week' and hue to 'Reason'

   ```
   sns.countplot(x='Day of Week', hue='Reason', data=df)

   # Move the legend outside the plot area
   plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%889.21.59.png)

   * Use [groupby](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/214python-for-data-analysis-pandas/2145groupby.html) to count the non-null entries of every column for each month

   ```
   byMonth = df.groupby('Month').count()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%889.41.58.png)

   * Plot the number of calls per month as a line chart

     * [pandas' built-in plot](https://jenhsuan.gitbooks.io/python/content/2323.html) can be used

     ```
     byMonth['twp'].plot()
     ```

     ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%889.52.31.png)
6. Use seaborn to draw a linear fit of the number of calls per month with [sns.lmplot](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2176regression-plots.html)

   * Set x to Month, y to twp, and data to the result of [reset\_index()](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/214python-for-data-analysis-pandas/2143dataframes.html), which turns the Month index back into a regular column

   ```
   sns.lmplot(x='Month',y='twp',data=byMonth.reset_index())
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%8810.02.14.png)
7. Store the date from timeStamp in a new column

   * For each element of df\['timeStamp'], call [date()](http://www.wklken.me/posts/2015/03/03/python-base-datetime.html#1-datetime) to get the date

   ```
   df['timeStamp'].iloc[0].date()
   df['Date'] = df['timeStamp'].apply(lambda t: t.date())
   ```

   * Plot the number of calls per day as a line chart

   ```
   df.groupby('Date').count()['twp'].plot()
   plt.tight_layout()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%8810.28.06.png)
8. For each call reason, plot the number of calls per day: first [filter](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/214python-for-data-analysis-pandas/2143dataframes.html) the original data by condition, then plot

   * Daily call counts for Traffic, with a title

   ```
   df[df['Reason']=='Traffic'].groupby('Date').count()['twp'].plot()
   plt.title('Traffic')
   plt.tight_layout()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%8810.35.33.png)

   * Daily call counts for Fire, with a title

   ```
   df[df['Reason']=='Fire'].groupby('Date').count()['twp'].plot()
   plt.title('Fire')
   plt.tight_layout()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%8810.38.38.png)

   * Daily call counts for EMS, with a title

   ```
   df[df['Reason']=='EMS'].groupby('Date').count()['twp'].plot()
   plt.title('EMS')
   plt.tight_layout()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%8810.39.54.png)
9. [Heatmap analysis](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2174matrix-plots.html)

   * First get the **per-column counts for each hour of each day of the week**

     ```
     df.groupby(by=['Day of Week','Hour']).count()
     ```
   * Take the Reason column, unstack Hour into columns, and draw the heatmap

   ```
   dayHour = df.groupby(by=['Day of Week','Hour']).count()['Reason'].unstack()
   plt.figure(figsize=(12,6))
   sns.heatmap(dayHour,cmap='viridis')
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%8810.56.36.png)
10. [Clustermap analysis](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2174matrix-plots.html)

```
# clustermap creates its own figure, so the size is passed through figsize
sns.clustermap(dayHour, cmap='viridis', figsize=(12,6))
```

![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202018-06-10%20%E4%B8%8A%E5%8D%8810.59.58.png)
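
The feature-engineering steps in this section use apply() with a lambda. As an alternative, not the approach taken above, pandas also offers the vectorized .str and .dt accessors, which can express the same transformations without a Python-level function per row. The sketch below shows this on a few hypothetical rows; the titles and timestamps are invented for illustration, and only the column names follow the text.

```python
import pandas as pd

# Hypothetical rows, invented for illustration; column names follow the text
calls = pd.DataFrame({
    'title': ['EMS: BACK PAINS/INJURY', 'Fire: GAS-ODOR/LEAK', 'Traffic: VEHICLE ACCIDENT -'],
    'timeStamp': ['2015-12-10 17:40:00', '2015-12-11 08:05:00', '2016-01-03 23:59:00'],
})

# Text before the first colon is the Reason
calls['Reason'] = calls['title'].str.split(':').str[0]

# Strings must be parsed to datetime before the .dt accessor can be used
calls['timeStamp'] = pd.to_datetime(calls['timeStamp'])
calls['Hour'] = calls['timeStamp'].dt.hour
calls['Month'] = calls['timeStamp'].dt.month
calls['Date'] = calls['timeStamp'].dt.date

# dayofweek counts from 0 (Monday) to 6 (Sunday)
dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
calls['Day of Week'] = calls['timeStamp'].dt.dayofweek.map(dmap)

print(calls[['Reason', 'Hour', 'Month', 'Day of Week', 'Date']])
```

Both styles should give the same columns; the accessor form is usually shorter, while apply() is more flexible when the per-row logic does not map onto a built-in method.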

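The dayHour matrix used for the heatmap and clustermap comes from a groupby on two keys followed by unstack(). The sketch below rebuilds such a matrix on hypothetical data (the rows are invented for illustration) and makes two assumptions that differ from the code above: it uses size() instead of count()\['Reason'], and it reindexes the rows so the days appear in calendar order rather than alphabetical order.

```python
import pandas as pd

# Hypothetical calls, invented for illustration
calls = pd.DataFrame({
    'Day of Week': ['Mon', 'Mon', 'Tue', 'Sun', 'Sun', 'Sun'],
    'Hour':        [8,     8,     17,    0,     8,     8],
})

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

day_hour = (
    calls.groupby(['Day of Week', 'Hour']).size()  # rows per (day, hour) pair
    .unstack(fill_value=0)                         # Hour values become columns
    .reindex(day_order, fill_value=0)              # calendar order, zero rows for absent days
)

print(day_hour)
```

A matrix built this way can be passed to sns.heatmap in place of dayHour. For sns.clustermap the row order matters less, since clustering reorders rows and columns anyway.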