# 2.1.4.3.DataFrames

## 2.1.4.3.DataFrames

### 使用library

```
import numpy as np
import pandas as pd
from numpy.random import randn
```

### 初始化亂數

```
np.random.seed(101)
```

### DataFrame可以接受三個參數:

* 第一個參數為數值的數列, 型別可以是python數列
* 第二個參數為label的數列, 型別可以是python數列
* 第三個參數為label的數列, 型別可以是python數列

```
df = pd.DataFrame(randn(5,4),['A', 'B', 'C', 'D', 'E'],['W', 'X', 'Y', 'Z'])
```

* 也可以直接傳入dictionary

```
d = {'A':[1,2,np.nan], 'B':[5,np.nan,np.nan], 'C': [1,2,3]}
df.pd.DataFrame(d)
```

### 選取DataFrame的Series或subset

* 選取Column的Series
  * DataFrame中的任一個column就是Series

    ```
    df['W']
    In: type(df['W'])
    Out: pandas.core.series.Series
    ```
  * 也可以選取多個column

    ```
    df[['W', 'Y']]
    ```
* 選取Row的Series
  * 指定label

    ```
    df.loc['A']
    ```
  * 指定index

    ```
    df.iloc['2']
    ```
* 選取特定(Row, Column)位置的subset
  * 取出單一值

    ```
    df.loc['B', 'Y']
    ```
  * 取出特定範圍的subset

    ```
    df.loc[['A', 'B'], ['X', 'Y']]
    ```
* 取出前五個值

```
df.head(5)
```

### 條件選擇

* 過濾DataFrame
  * 留下 > 0的值, 將會得到boolean的DataFrame

    ```
    df > 0
    ```
* 過濾Column
  * 留下 > 0的row, 將會得到boolean的Series

    ```
    df['W'] > 0
    ```
* 條件選擇
  * 去掉DataFrame所有column 'W'中小於0的row, 將會得到DataFrame

    ```
    df[df['W'] > 0]
    ```
* 多重條件選擇

```
df[(df['W']>0) | (df['Y']>1)]
df[(df['W']>0) & (df['Y']>1)]
```

### 新增/ 刪除DataFrame的column

* 'new'將會被新增到df中

```
newind = 'CA NY WY OR CO'.split()
df['new'] = newind
```

* 刪除欄位
  * 由於DataFrame有兩軸座標, 刪除時需指定label是在哪一軸上的
  * inplace參數預設值為Flase, 僅會返回刪除column後的一組DataFrame, 原來的DataFrame並不會改變; 如果設為True則原來的DataFrame會被覆蓋掉

    ```
    df.drop('new', axis = 1, inplace = True)
    ```

### 操作索引值

* Reset索引值

```
df.reset_index()
```

* 將某一個column設為索引值

```
newind = 'CA NY WY OR CO'.split()
df['new'] = newind
df.set_index('new')
```

### Multi-Index

* 建立Multi-Index的DataFrame

```
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
```

* 選取資料

```
df.loc['G1'].loc[1]
```

* 為Multi-Index的名稱賦值

```
df.index.name = ['G', 'Numbers']
```

* Cross section
  * 可以同時選取不同row的資料

    ```
    df.xs(['G1',1])
    ```

### DataFrame的一些基本操作

* 取出Series的唯一值 (將會回傳array)

```
df['col2'].unique()
```

* 取得Series的長度

```
len(df['col2'].unique())
df['col2'].nunique()
```

* 統計Dataframe中各Series的數量

```
df['col2'].value_counts()
```

* 取得Dataframe的column

```
df.column
```

* 排序Dataframe

```
df.sort_values('col2')
```

* 檢查Dataframe裡的值是不是null

```
df.isnull()
```

### 將function綁定到Dataframe上

```
def times2(x):
    return x*2

df['col1'].apply(times2)
```

* 或是

```
df['col1'].apply(lambda x: x*2)
```

## Pivot table

* 指定A, B為索引值時, 取D的值作為value, 並將C設定為列層次階段

```
data = {'A':['foo','foo','foo','bar','bar','bar'],
        'B':['one','one','two','two','one','one'],
        'C':['x','y','x','y','x','y'],
        'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)

OUT:

       A    B    C    D
0    foo    one    x    1
1    foo    one    y    3
2    foo    two    x    2
3    bar    two    y    5
4    bar    one    x    4
5    bar    one    y    1

df.pivot_table(values='D',index=['A', 'B'],columns=['C'])

OUT:

     C         x     y
  A     B        
bar    one    4.0    1.0
two    NaN    5.0
foo    one    1.0    3.0
two    2.0    NaN
```

* [參考資料](https://zhuanlan.zhihu.com/p/31952948)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://jen-hsuan-hsieh.gitbook.io/python/chapter-2courses/21python-for-data-science-and-machine-learning-bootcamp/214python-for-data-analysis-pandas/2143dataframes.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
