2.1.4.3.DataFrames


Import libraries

import numpy as np
import pandas as pd
from numpy.random import randn

Seed the random number generator

np.random.seed(101)

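DataFrame accepts three main arguments:

  • The first argument is the data (the values); it can be a Python list or a NumPy array

  • The second argument is the list of row labels (the index); it can be a Python list

  • The third argument is the list of column labels; it can be a Python list

df = pd.DataFrame(randn(5,4), index=['A', 'B', 'C', 'D', 'E'], columns=['W', 'X', 'Y', 'Z'])
  • A dictionary can also be passed directly; each key becomes a column label

d = {'A':[1,2,np.nan], 'B':[5,np.nan,np.nan], 'C': [1,2,3]}
# Stored under a separate name so that df above stays available for the later examples
df2 = pd.DataFrame(d)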
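
To make the roles of the three arguments concrete, the following is a minimal sketch that passes them by keyword and inspects the result. The data and the names `values`, `df_demo` and `df_dict` are illustrative assumptions, not part of the original notes.

```python
import numpy as np
import pandas as pd

# Illustrative data: a 2x3 grid of values (hypothetical, not from the notes).
values = np.arange(6).reshape(2, 3)

# data -> the values, index -> row labels, columns -> column labels.
df_demo = pd.DataFrame(data=values, index=['r1', 'r2'], columns=['a', 'b', 'c'])
print(df_demo.shape)          # (number of rows, number of columns)
print(list(df_demo.index))    # row labels
print(list(df_demo.columns))  # column labels

# With a dictionary, the keys become column labels and the rows
# receive a default integer index starting at 0.
df_dict = pd.DataFrame({'A': [1, 2, np.nan], 'C': [1, 2, 3]})
print(list(df_dict.index))
```

Passing `index` and `columns` by keyword is optional, but it makes the call easier to read than three positional arguments.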

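Selecting a Series or a subset of a DataFrame

  • Selecting a column as a Series

    • Every column of a DataFrame is a Series

      df['W']
      In: type(df['W'])
      Out: pandas.core.series.Series
    • Multiple columns can be selected by passing a list; the result is a DataFrame

      df[['W', 'Y']]
  • Selecting a row as a Series

    • By label

      df.loc['A']
    • By integer position (0-based)

      df.iloc[2]
  • Selecting a subset at specific (row, column) positions

    • A single value

      df.loc['B', 'Y']
    • A subset covering several rows and columns

      df.loc[['A', 'B'], ['X', 'Y']]
  • Get the first five rows

df.head(5)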
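
The difference between label-based and position-based selection is easy to mix up, so here is a small sketch that puts `loc` and `iloc` side by side. The frame `df_sel` and its values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Small illustrative frame with string row labels (hypothetical data).
df_sel = pd.DataFrame(
    {'W': [1, 2, 3], 'X': [4, 5, 6]},
    index=['A', 'B', 'C'],
)

by_label = df_sel.loc['C']      # the row whose label is 'C'
by_position = df_sel.iloc[2]    # the third row, counting from 0

# Both expressions pick out the same row here.
print(by_label.equals(by_position))

# iloc also accepts positional slices: first two rows of the first column.
first_two = df_sel.iloc[0:2, 0]
print(first_two)
```

`iloc` only accepts integers (or integer slices and lists), which is why a quoted value such as `'2'` does not work with it.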

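Conditional selection

  • Comparing the whole DataFrame

    • Testing every value against > 0 returns a boolean DataFrame of the same shape

      df > 0
  • Comparing a column

    • Testing a column against > 0 returns a boolean Series with one entry per row

      df['W'] > 0
  • Filtering rows

    • Keep only the rows whose value in column 'W' is greater than 0; the result is a DataFrame

      df[df['W'] > 0]
  • Combining multiple conditions: use | for "or" and & for "and", and wrap each condition in parentheses

df[(df['W']>0) | (df['Y']>1)]
df[(df['W']>0) & (df['Y']>1)]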
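
As a sketch of how the boolean masks combine, the example below builds two masks on a small hypothetical frame (`df_cond` is an assumed name) and also shows why Python's `and` keyword cannot replace `&` here.

```python
import pandas as pd

# Hypothetical data chosen so that each row falls into a different case.
df_cond = pd.DataFrame({'W': [1, -2, 3, -4], 'Y': [2, 2, 0, 0]})

mask_w = df_cond['W'] > 0   # boolean Series, one entry per row
mask_y = df_cond['Y'] > 1

both = df_cond[mask_w & mask_y]     # rows where both conditions hold
either = df_cond[mask_w | mask_y]   # rows where at least one holds

print(both)
print(either)

# Python's `and` asks each Series for a single truth value,
# which is ambiguous, so pandas raises a ValueError.
try:
    df_cond[mask_w and mask_y]
except ValueError as err:
    print('ValueError:', err)
```

The parentheses around each comparison matter because `&` and `|` bind more tightly than `>` in Python.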

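Adding/removing DataFrame columns

  • Assigning to a new column name adds the column 'new' to df

newind = 'CA NY WY OR CO'.split()
df['new'] = newind
  • Removing a column

    • A DataFrame has two axes, so specify which axis the label is on (axis=0 for rows, axis=1 for columns)

    • inplace defaults to False: drop only returns a new DataFrame without the column, and the original DataFrame is unchanged. With inplace=True, the original DataFrame itself is modified

      df.drop('new', axis = 1, inplace = True)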
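
The effect of `axis` and of leaving `inplace` at its default can be checked with a short sketch. The frame `df_drop` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical two-row frame with an extra 'new' column.
df_drop = pd.DataFrame({'W': [1, 2], 'new': ['CA', 'NY']}, index=['A', 'B'])

# Without inplace=True, drop returns a new DataFrame
# and leaves df_drop unchanged.
without_new = df_drop.drop('new', axis=1)
print(list(without_new.columns))
print(list(df_drop.columns))

# axis=0 (the default) looks the label up among the rows instead.
without_a = df_drop.drop('A', axis=0)
print(list(without_a.index))
```

Assigning the result back (`df_drop = df_drop.drop('new', axis=1)`) is a common alternative to `inplace=True`.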

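Working with the index

  • Reset the index to the default integer index; the old index becomes a column, and a new DataFrame is returned

df.reset_index()
  • Use a column as the index; this also returns a new DataFrame

newind = 'CA NY WY OR CO'.split()
df['new'] = newind
df.set_index('new')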
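
Both index operations return a new DataFrame rather than changing the original, which the following sketch demonstrates on a hypothetical frame named `df_idx`.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df_idx = pd.DataFrame({'W': [10, 20]}, index=['A', 'B'])

# reset_index moves the old labels into a column named 'index'
# (the name used when the index itself has no name).
flat = df_idx.reset_index()
print(list(flat.columns))

# set_index returns a new frame; assign it to keep the change.
df_idx['new'] = ['CA', 'NY']
by_state = df_idx.set_index('new')
print(list(by_state.index))
print(list(df_idx.index))   # the original index is untouched
```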

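Multi-Index

  • Create a DataFrame with a Multi-Index

# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
  • Select data: index the outer level first, then the inner level

df.loc['G1'].loc[1]
  • Name the levels of the Multi-Index (the attribute is names, plural)

df.index.names = ['G', 'Numbers']
  • Cross section

    • xs selects rows by a key that can span index levels; pass a multi-level key as a tuple, since newer pandas versions no longer accept a list here

      df.xs(('G1',1))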
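
A cross section is most useful when it reaches into an inner level across all outer groups. The sketch below assumes a small Multi-Index frame (`df_mi`, with deterministic values instead of random ones) and uses the `level` argument of `xs`.

```python
import numpy as np
import pandas as pd

# Hypothetical Multi-Index frame; level names mirror the ones used above.
index = pd.MultiIndex.from_tuples(
    [('G1', 1), ('G1', 2), ('G2', 1), ('G2', 2)],
    names=['G', 'Numbers'],
)
df_mi = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['A', 'B'])

# All rows whose inner level 'Numbers' equals 1, across every outer group.
num_one = df_mi.xs(1, level='Numbers')
print(num_one)

# A full (outer, inner) key given as a tuple selects a single row as a Series.
single = df_mi.xs(('G1', 2))
print(single)
```

The same rows could be reached with chained `loc` calls for one group at a time, but `xs` with `level` collects them from every group in one step.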

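Some basic DataFrame operations

The examples in this section assume df has columns named 'col1' and 'col2'.

  • Get the unique values of a Series (returns an array)

df['col2'].unique()
  • Count the unique values of a Series

len(df['col2'].unique())
df['col2'].nunique()
  • Count how many times each value occurs in a Series

df['col2'].value_counts()
  • Get the column labels of the DataFrame

df.columns
  • Sort the DataFrame by a column

df.sort_values('col2')
  • Check which values in the DataFrame are null

df.isnull()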
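
Since these operations refer to columns `col1` and `col2` that are not defined on this page, here is a self-contained sketch with a hypothetical frame (`df_ops`, with made-up values) on which they can be tried.

```python
import pandas as pd

# Hypothetical frame; 'col1' and 'col2' mirror the column names used above.
df_ops = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [444, 555, 666, 444]})

uniques = df_ops['col2'].unique()        # distinct values, in order of first appearance
n_unique = df_ops['col2'].nunique()      # how many distinct values there are
counts = df_ops['col2'].value_counts()   # occurrences of each value
by_col2 = df_ops.sort_values('col2')     # rows ordered by col2; index labels move with their rows

print(uniques)
print(n_unique)
print(counts)
print(by_col2)
print(df_ops.isnull())                   # boolean frame, True where a value is missing
```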

Applying a function to a DataFrame column

def times2(x):
    return x*2

df['col1'].apply(times2)
  • Or, with a lambda

df['col1'].apply(lambda x: x*2)
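
As a sketch of what `apply` does, the example below runs a function once per element and compares the result with the equivalent vectorized expression. The frame `df_app` and its `col3` column are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column.
df_app = pd.DataFrame({'col1': [1, 2, 3], 'col3': ['abc', 'de', 'f']})

def times2(x):
    return x * 2

doubled = df_app['col1'].apply(times2)   # times2 is called once per element
vectorized = df_app['col1'] * 2          # same result without a Python-level loop
print(doubled.equals(vectorized))

# Built-in functions can be passed as well, for example len on strings.
lengths = df_app['col3'].apply(len)
print(lengths.tolist())
```

For simple arithmetic the vectorized form is usually preferable; `apply` is most helpful when the logic has no direct vectorized counterpart.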

Pivot table

  • Use A and B as the index, take the values from D, and spread the values of C across the column level

data = {'A':['foo','foo','foo','bar','bar','bar'],
        'B':['one','one','two','two','one','one'],
        'C':['x','y','x','y','x','y'],
        'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)

OUT:

     A    B  C  D
0  foo  one  x  1
1  foo  one  y  3
2  foo  two  x  2
3  bar  two  y  5
4  bar  one  x  4
5  bar  one  y  1

df.pivot_table(values='D',index=['A', 'B'],columns=['C'])

OUT:

C          x    y
A   B
bar one  4.0  1.0
    two  NaN  5.0
foo one  1.0  3.0
    two  2.0  NaN
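
In the table above every (A, B, C) combination occurs at most once, so no aggregation is visible. The sketch below uses a hypothetical frame (`df_piv`) in which two rows share a combination, to show that `pivot_table` averages them by default and that `aggfunc` changes this.

```python
import pandas as pd

# Two rows share the same (A, C) combination, so they must be aggregated.
df_piv = pd.DataFrame({
    'A': ['foo', 'foo', 'bar'],
    'C': ['x', 'x', 'y'],
    'D': [1, 3, 5],
})

# Default aggregation is the mean of the matching D values.
mean_table = df_piv.pivot_table(values='D', index='A', columns='C')

# aggfunc selects a different aggregation, here the sum.
sum_table = df_piv.pivot_table(values='D', index='A', columns='C', aggfunc='sum')

print(mean_table)
print(sum_table)
```

Combinations that never occur in the data, such as bar with x here, appear as NaN in the result.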

Last updated 5 years ago
