```
In [1]: import pandas as pd
```

Data used for this tutorial:

Titanic data
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following columns:
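- PassengerId: ID of each passenger.
- Survived: Whether the passenger survived. 0 means no, 1 means yes.
- Pclass: One of three ticket classes: class 1, class 2, and class 3.
- Name: Name of the passenger.
- Sex: Sex of the passenger.
- Age: Age of the passenger in years.
- SibSp: Number of siblings or spouses aboard.
- Parch: Number of parents or children aboard.
- Ticket: Ticket number of the passenger.
- Fare: The ticket fare.
- Cabin: Cabin number of the passenger.
- Embarked: Port of embarkation.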

How do I read and write tabular data?

I want to analyze the Titanic passenger data, which is available as a CSV file.
```
In [2]: titanic = pd.read_csv("data/titanic.csv")
```
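pandas provides the read_csv() function to read data stored in a CSV file into a pandas DataFrame. pandas supports many different file formats and data sources (csv, excel, sql, json, parquet, and more), each through a function whose name starts with the prefix read_*.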
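As a minimal sketch of the same call without needing the file on disk, the example below feeds read_csv() a small in-memory CSV. The three-row sample text and the variable names are hypothetical stand-ins for data/titanic.csv, not the real data.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for data/titanic.csv.
csv_text = """PassengerId,Survived,Pclass,Age,Cabin
1,0,3,22.0,
2,1,1,38.0,C85
3,1,3,26.0,
"""

# read_csv() accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                  # (3, 5)
print(sample["Cabin"].isna().sum())  # 2: empty fields are read as NaN
```

Using io.StringIO keeps the sketch self-contained; with a real file, the path string is passed directly, as in the tutorial's own call.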

Always check the data after reading it in. When you display a DataFrame, the first and last 5 rows are shown by default:

```
In [3]: titanic
Out[3]: 
     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0              1         0       3  ...   7.2500   NaN         S
1              2         1       1  ...  71.2833   C85         C
2              3         1       3  ...   7.9250   NaN         S
3              4         1       1  ...  53.1000  C123         S
4              5         0       3  ...   8.0500   NaN         S
..           ...       ...     ...  ...      ...   ...       ...
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
888          889         0       3  ...  23.4500   NaN         S
889          890         1       1  ...  30.0000  C148         C
890          891         0       3  ...   7.7500   NaN         Q

[891 rows x 12 columns]
```
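The truncated display above appears to be governed by two display options, display.max_rows and display.min_rows. The sketch below pins them to 60 and 10, which are the documented defaults as far as I know (treat those numbers as an assumption), and renders a hypothetical 100-row frame.

```python
import pandas as pd

# Hypothetical 100-row frame, long enough to trigger truncation.
numbers = pd.DataFrame({"value": range(100)})

# Pin both options explicitly so the result does not depend on local settings.
# Assumption: 60 and 10 match the pandas defaults.
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    text = repr(numbers)

# Rows 0-4 and 95-99 are printed, with ".." marking the omitted middle.
print(text)
```

Raising display.min_rows inside the same option_context call should show more rows from each end.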

I want to see the first 8 rows of a pandas DataFrame.

```
In [4]: titanic.head(8)
Out[4]: 
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S
5            6         0       3  ...   8.4583   NaN         Q
6            7         0       1  ...  51.8625   E46         S
7            8         0       3  ...  21.0750   NaN         S

[8 rows x 12 columns]
```

To see the first N rows of a DataFrame, use the head() method with the required number of rows (here 8) as the argument.

Note

Interested in the last N rows instead? pandas also provides a tail() method. For example, titanic.tail(10) returns the last 10 rows of the DataFrame.
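To make the head()/tail() pairing concrete, here is a small sketch on a hypothetical ten-row frame (the frame and its column name are invented for illustration).

```python
import pandas as pd

# Hypothetical ten-row frame with the default 0..9 row index.
letters = pd.DataFrame({"letter": list("abcdefghij")})

first_three = letters.head(3)   # first 3 rows
last_two = letters.tail(2)      # last 2 rows
default_head = letters.head()   # with no argument, n defaults to 5

print(first_three["letter"].tolist())  # ['a', 'b', 'c']
print(last_two.index.tolist())         # [8, 9]
print(len(default_head))               # 5
```

Note that tail() keeps the original row labels (8 and 9 here) rather than renumbering from 0.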

To check how pandas interpreted the data type of each column, request the dtypes attribute:

```
In [5]: titanic.dtypes
Out[5]: 
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
```

The data type used is listed for each column. The data types in this DataFrame are integers (int64), floats (float64), and strings (object).

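Note

When asking for dtypes, do not use parentheses! dtypes is an attribute of DataFrame and Series, and attributes need no parentheses. An attribute represents a characteristic of a DataFrame/Series, whereas a method (which requires parentheses) does something with the DataFrame/Series, as introduced in the first tutorial.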
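The difference between an attribute and a method can be seen by trying both spellings. The sketch below uses a hypothetical two-column frame; the point is only that dtypes already holds a Series, so adding parentheses tries to call that Series and fails.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column.
frame = pd.DataFrame({"count": [1, 2], "ratio": [0.5, 1.5]})

# Attribute access: no parentheses, returns a Series of dtypes.
kinds = frame.dtypes
print(kinds)

# Adding parentheses attempts to call the returned Series, which raises TypeError.
message = None
try:
    frame.dtypes()
except TypeError as error:
    message = str(error)
    print(message)
```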

My colleague requested the Titanic data as a spreadsheet.
```
In [6]: titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
```
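Whereas read_* functions read data into pandas, to_* methods store data. The to_excel() method stores the data as an Excel file. In this example, the sheet is named passengers instead of the default Sheet1. Setting index=False keeps the row index labels out of the spreadsheet.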

The equivalent read function read_excel() reloads the data into a DataFrame:

```
In [7]: titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")

In [8]: titanic.head()
Out[8]: 
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]
```
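The write-then-read round trip and the effect of index=False can also be sketched without an Excel engine installed. The example below deliberately swaps in to_csv() and read_csv() on an in-memory string instead of to_excel() and read_excel(); the two-row frame is hypothetical.

```python
import io

import pandas as pd

# Hypothetical two-row frame.
people = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

# Without a path argument, to_csv() returns the CSV text as a string.
with_index = people.to_csv()
without_index = people.to_csv(index=False)

print(with_index.splitlines()[0])     # ,name,age  (leading unnamed index column)
print(without_index.splitlines()[0])  # name,age

# Reading the index-free text back gives the same columns and values.
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

Keeping the index in the written file is what produces an extra unnamed column on reload, which is why the tutorial passes index=False.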

I'm interested in a technical summary of a DataFrame.

```
In [9]: titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
```

The info() method provides technical information about a DataFrame, so let's explain the output in more detail:

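- It is indeed a DataFrame.
- There are 891 entries, i.e. 891 rows.
- Each row has a row label (aka the index) with values ranging from 0 to 890.
- The table has 12 columns. Most columns have a value for each row (all 891 values are non-null). Some columns have missing values and therefore fewer than 891 non-null values.
- The columns Name, Sex, Ticket, Cabin and Embarked contain textual data (strings, aka object). The other columns are numerical data, some of them whole numbers (aka integer) and others real numbers (aka float).
- The kinds of data (characters, integers, …) in the different columns are summarized by listing the dtypes.
- The approximate amount of RAM used to hold the DataFrame is provided as well.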
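The non-null counts that info() reports can be cross-checked with count() and isna(). The sketch below uses a hypothetical three-row frame with gaps in both columns, and captures the info() text through its buf parameter so it can be inspected as a string.

```python
import io

import pandas as pd

# Hypothetical frame with missing values in both columns.
cabins = pd.DataFrame(
    {"Age": [22.0, None, 26.0], "Cabin": [None, "C85", None]}
)

# info() prints to stdout by default; buf redirects the text to a buffer.
buffer = io.StringIO()
cabins.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# count() gives non-null values per column; isna().sum() gives the missing ones.
non_null = cabins.count()
missing = cabins.isna().sum()
print(non_null)
print(missing)
```

For each column, the non-null and missing counts add up to the number of rows, which is how the 714 non-null Age values above imply 177 missing ages out of 891.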

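Remember

- Getting data into pandas from many different file formats or data sources is supported by read_* functions.
- Exporting data out of pandas is provided by the different to_* methods.
- The head/tail/info methods and the dtypes attribute are convenient for a first check.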

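To the user guide

For a complete overview of the input and output possibilities from and to pandas, see the user guide section about reader and writer functions.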