# 911 calls capstone project

![](/files/-M4M0T7_bBGYNUsLOU6Y)

## 1.Kaggle

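* [Kaggle](https://zh.wikipedia.org/wiki/Kaggle) is a platform for data modeling and data analysis competitions. Companies and researchers can publish data on it
* Many strategies can be applied to almost any predictive modeling problem, and researchers cannot know at the outset which method will work best for a particular problem. Kaggle aims to solve this through crowdsourcing, turning data science into a sport
* [Acquired by Google in 2017](https://www.inside.com.tw/2017/03/09/kaggle-joins-google-cloud)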

## 2.Emergency - 911 Calls projects

* On the [dataset page](https://www.kaggle.com/mchirico/montcoalert), click Download to get the CSV file

![](/files/-M4M0T7boLG8DUJ--Wqv)

## 3.Read and understand the data

1. First, import [pandas](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/214python-for-data-analysis-pandas.html) and [numpy](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/213python-for-data-analysis-numpy.html)

   ```
   import numpy as np
   import pandas as pd
   ```
2. Next, import the visualization packages [seaborn](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn.html) and [matplotlib](https://jenhsuan.gitbooks.io/python/content/wqew.html), and embed matplotlib charts directly in the notebook (if you are using Jupyter Notebook)

   ```
   import matplotlib.pyplot as plt
   import seaborn as sns
   sns.set_style('whitegrid')
   %matplotlib inline
   ```
3. Read the data and inspect it

   ```
   df = pd.read_csv('911.csv')
   ```

   * First check the columns and their data types. df.info() shows that this dataset has 9 columns: 3 are float64, 1 is int64, and 5 are object (object columns usually hold strings)

   ```
   df.info()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-09%20下午9.26.05.png)

   * Check what the first few rows look like, for example the first 10

   ```
   df.head(10)
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-09%20下午9.49.17.png)

   * Get basic summary statistics of the numeric columns: count, mean, standard deviation, quartiles, and so on

   ```
   df.describe()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午11.21.24.png)
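The three inspection calls above can be tried without downloading the dataset. The following is a minimal sketch that assumes a tiny in-memory CSV with the same nine column names; the rows, the `SAMPLE_CSV` constant, and the `sample` variable are invented for illustration and are not taken from the real 911.csv.

```python
import io

import pandas as pd

# Hypothetical rows that only mimic the column layout of 911.csv;
# the values are invented for illustration.
SAMPLE_CSV = """lat,lng,desc,zip,title,timeStamp,twp,addr,e
40.10,-75.30,MAIN ST & 1ST AVE,19401,EMS: BACK PAINS/INJURY,2015-12-10 17:40:00,TOWN A,MAIN ST & 1ST AVE,1
40.20,-75.40,OAK RD & 2ND AVE,19446,Fire: GAS-ODOR/LEAK,2015-12-10 17:45:00,TOWN B,OAK RD & 2ND AVE,1
40.30,-75.50,ELM ST,,Traffic: VEHICLE ACCIDENT,2015-12-10 18:00:00,TOWN C,ELM ST,1
"""

sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

sample.info()                # column names, non-null counts, dtypes
print(sample.head(2))        # first two rows
summary = sample.describe()  # statistics for the numeric columns only
print(summary)

# The empty zip in the last row is read as NaN, which makes the
# column float and lowers its count in the summary.
missing_zips = sample['zip'].isna().sum()
```

In this sketch, describe() reports only lat, lng, zip, and e, which is why the text columns do not appear in the summary table.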

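## 4.Basic analysis

1. Which values of a column occur most often? For example, list the 5 most frequent zip codes

   * value\_counts returns a Series sorted in descending order by default; pass ascending=True to sort in ascending order instead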

   ```
   df['zip'].value_counts().head(5)
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午7.59.06.png)
2. How many unique values does a column have? For example, how many distinct titles are there

```
df['title'].nunique()
```
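To make the sort order of value\_counts and the meaning of nunique concrete, here is a small sketch on a hand-made Series. The zip values and the variable names (`zips`, `top`, `rarest`, `n_unique`) are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical zip codes, invented for illustration
zips = pd.Series([19401, 19464, 19401, 19403, 19401, 19464], name='zip')

# Default order is descending, so head(2) gives the two most frequent values
top = zips.value_counts().head(2)

# ascending=True puts the rarest value first
rarest = zips.value_counts(ascending=True).head(1)

# Number of distinct values (missing values are not counted by default)
n_unique = zips.nunique()

print(top)
print(rarest)
print(n_unique)
```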

## 5.Create new features

1. Each title starts with a Reason/Department followed by a colon. For example, in EMS: BACK PAINS/INJURY the Reason is EMS. Use a [lambda expression and apply()](https://my.oschina.net/lionets/blog/187067) to create a new Reason column

   ```
   df['Reason'] = df['title'].apply(lambda title: title.split(':')[0])
   ```
2. What are the most common reasons for 911 calls?

   ```
   df['Reason'].value_counts()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午8.19.14.png)
3. Draw a countplot with [seaborn](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn.html)

   * [countplot](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2173categorical-plots.html): set x to the column whose values you want to count, and data to the source DataFrame

   ```
   sns.set_style('whitegrid')
   sns.countplot(x='Reason', data=df)
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午8.48.53.png)
4. Get the hour from timeStamp

   * The timeStamp column is stored as strings in the DataFrame, so convert it to datetime first

   ```
   df['timeStamp'] = pd.to_datetime(df['timeStamp'])
   df['timeStamp'].iloc[0].hour
   ```
5. Store the hour, month, and day of week from timeStamp in new columns

   * After the conversion, each element of df\['timeStamp'] is a Timestamp, so its hour, month, and dayofweek attributes give the hour, month, and day of the week directly

   ```
   df['Hour'] = df['timeStamp'].apply(lambda time: time.hour)
   df['Month'] = df['timeStamp'].apply(lambda time: time.month)
   df['Day of Week'] = df['timeStamp'].apply(lambda time: time.dayofweek)
   ```

   * Convert Day of Week from numbers to strings (dayofweek uses 0 for Monday through 6 for Sunday)

   ```
   dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
   df['Day of Week'] = df['Day of Week'].map(dmap)
   ```

   * Draw a countplot of Day of Week: set x to 'Day of Week' and hue to 'Reason'

   ```
   sns.countplot(x='Day of Week', hue='Reason', data=df)

   # Move the legend outside the plot area
   plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午9.21.59.png)

   * Use [groupby](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/214python-for-data-analysis-pandas/2145groupby.html) to count the non-null entries of every column for each month

   ```
   byMonth = df.groupby('Month').count()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午9.41.58.png)

   * Plot the number of calls per month as a line chart

     * [pandas' built-in plot can be used](https://jenhsuan.gitbooks.io/python/content/2323.html)

     ```
     byMonth['twp'].plot()
     ```

     ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午9.52.31.png)
6. Use seaborn to draw a linear regression plot (linear fit) of the number of calls per month with [sns.lmplot](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2176regression-plots.html)

   * Set x to Month, y to twp, and data to the result of [reset\_index()](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/214python-for-data-analysis-pandas/2143dataframes.html), which turns the Month index back into a regular column

   ```
   sns.lmplot(x='Month',y='twp',data=byMonth.reset_index())
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午10.02.14.png)
7. Store the date from timeStamp in a new column

   * For each element of df\['timeStamp'], call [date()](http://www.wklken.me/posts/2015/03/03/python-base-datetime.html#1-datetime) to get the date

   ```
   df['timeStamp'].iloc[0].date()
   df['Date'] = df['timeStamp'].apply(lambda t: t.date())
   ```

   * Plot the number of calls per day as a line chart

   ```
   byDate = df.groupby('Date').count()
   byDate['twp'].plot()
   plt.tight_layout()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午10.28.06.png)
8. For each call reason, plot the number of calls per day: [filter the original data by condition](https://jenhsuan.gitbooks.io/python/content/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/211jupyter-overview/214python-for-data-analysis-pandas/2143dataframes.html) first, then plot

   * Daily call counts for Traffic, with a title

   ```
   df[df['Reason']=='Traffic'].groupby('Date').count()['twp'].plot()
   plt.title('Traffic')
   plt.tight_layout()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午10.35.33.png)

   * Daily call counts for Fire, with a title

   ```
   df[df['Reason']=='Fire'].groupby('Date').count()['twp'].plot()
   plt.title('Fire')
   plt.tight_layout()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午10.38.38.png)

   * Daily call counts for EMS, with a title

   ```
   df[df['Reason']=='EMS'].groupby('Date').count()['twp'].plot()
   plt.title('EMS')
   plt.tight_layout()
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午10.39.54.png)
9. [Heatmap analysis](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2174matrix-plots.html)

   * First get **the count of each column for every hour of every day of the week**

     ```
     df.groupby(by=['Day of Week','Hour']).count()
     ```
   * Take the Reason column, unstack Hour into columns, and draw the heatmap

   ```
   dayHour = df.groupby(by=['Day of Week','Hour']).count()['Reason'].unstack()
   plt.figure(figsize=(12,6))
   sns.heatmap(dayHour,cmap='viridis')
   ```

   ![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午10.56.36.png)
10. [Clustermap analysis](https://jenhsuan.gitbooks.io/python/content/217python-for-data-visualization-seaborn/2174matrix-plots.html)

```
# clustermap creates its own figure, so the size is passed via figsize
sns.clustermap(dayHour,cmap='viridis',figsize=(12,6))
```

![](https://github.com/jenhsuan/python/tree/8fc9c0b8df4ccd709d3078c2d8842af0932de09d/assets/螢幕快照%202018-06-10%20上午10.59.58.png)
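The feature columns from steps 1, 4, 5, and 7 can also be built without apply(). The sketch below is an alternative, not the method used on this page: it assumes pandas' vectorized `.str` and `.dt` accessors and a hand-made `calls` DataFrame whose three rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows, invented for illustration
calls = pd.DataFrame({
    'title': ['EMS: BACK PAINS/INJURY',
              'Fire: GAS-ODOR/LEAK',
              'Traffic: VEHICLE ACCIDENT'],
    'timeStamp': ['2015-12-10 17:40:00',
                  '2015-12-11 08:05:00',
                  '2016-01-03 23:15:00'],
})

# Text before the first colon is the Reason
calls['Reason'] = calls['title'].str.split(':').str[0]

# Strings must become datetimes before the .dt accessor can be used
calls['timeStamp'] = pd.to_datetime(calls['timeStamp'])
calls['Hour'] = calls['timeStamp'].dt.hour
calls['Month'] = calls['timeStamp'].dt.month
calls['Date'] = calls['timeStamp'].dt.date

# dayofweek is 0 for Monday through 6 for Sunday
dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
calls['Day of Week'] = calls['timeStamp'].dt.dayofweek.map(dmap)

print(calls[['Reason', 'Hour', 'Month', 'Day of Week', 'Date']])
```

Both styles should give the same columns; the accessor form avoids calling a Python lambda once per row, which tends to matter more as the dataset grows.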

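The day-by-hour matrix behind the heatmap and clustermap is easier to follow on a tiny example. This sketch assumes a hand-made `calls` DataFrame with invented rows, and it adds two optional steps that are not part of the walkthrough above: `fill_value=0` for day/hour pairs with no calls, and a reindex so the rows follow calendar order instead of the alphabetical order that groupby produces.

```python
import pandas as pd

# Hypothetical rows, invented for illustration
calls = pd.DataFrame({
    'Day of Week': ['Tue', 'Mon', 'Mon', 'Mon'],
    'Hour': [8, 8, 8, 17],
    'Reason': ['Traffic', 'EMS', 'Fire', 'EMS'],
})

# One count per (day, hour) pair, indexed by both keys
counts = calls.groupby(['Day of Week', 'Hour']).count()['Reason']

# Move Hour from the index into the columns; pairs that never
# occurred would be NaN, so fill them with 0 instead
day_hour = counts.unstack(fill_value=0)

# Put the rows in calendar order; days with no calls become all-zero rows
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
day_hour = day_hour.reindex(day_order, fill_value=0)

print(day_hour)
```

A matrix without missing values is also the safer input for clustering, which generally cannot work with NaN entries.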
