與 SQL 的比較#
由於許多潛在的 pandas 使用者對 SQL 有些熟悉,因此此頁面旨在提供一些範例,說明如何使用 pandas 執行各種 SQL 作業。
如果您是 pandas 新手,您可能想先閱讀 10 分鐘學會 pandas,以熟悉此函式庫。
根據慣例,我們將 pandas 和 NumPy 匯入如下
In [1]: import pandas as pd
In [2]: import numpy as np
大多數範例將使用 pandas 測試中找到的 tips
資料集。我們會將資料讀入稱為 tips
的 DataFrame,並假設我們有一個名稱和結構相同的資料庫表格。
In [3]: url = (
...: "https://raw.githubusercontent.com/pandas-dev"
...: "/pandas/main/pandas/tests/io/data/csv/tips.csv"
...: )
...:
In [4]: tips = pd.read_csv(url)
In [5]: tips
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[244 rows x 7 columns]
複製與原地作業#
大多數 pandas 作業會傳回 Series
/DataFrame
的複製。若要讓變更「生效」,您需要指定給新變數
sorted_df = df.sort_values("col1")
或覆寫原始變數
df = df.sort_values("col1")
注意
您會看到 inplace=True
或 copy=False
關鍵字引數可供一些方法使用
df.replace(5, inplace=True)
目前正積極討論要捨棄並移除大多數方法(例如 dropna
)的 inplace
和 copy
,但一小部分方法(包括 replace
)例外。在寫入時複製的語境中,這兩個關鍵字將不再必要。提案可在此處找到 連結。
SELECT#
在 SQL 中,選取是使用逗號分隔的清單來選取您要選取的欄位(或使用 *
來選取所有欄位)
SELECT total_bill, tip, smoker, time
FROM tips;
使用 pandas 時,欄位選取是透過將欄位名稱清單傳遞給您的 DataFrame 來完成
In [6]: tips[["total_bill", "tip", "smoker", "time"]]
Out[6]:
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner
.. ... ... ... ...
239 29.03 5.92 No Dinner
240 27.18 2.00 Yes Dinner
241 22.67 2.00 Yes Dinner
242 17.82 1.75 No Dinner
243 18.78 3.00 No Dinner
[244 rows x 4 columns]
在沒有欄位名稱清單的情況下呼叫 DataFrame 會顯示所有欄位(類似於 SQL 的 *
)。
在 SQL 中,您可以新增一個計算欄位
SELECT *, tip/total_bill as tip_rate
FROM tips;
使用 pandas 時,您可以使用 DataFrame 的 DataFrame.assign()
方法來附加一個新欄位
In [7]: tips.assign(tip_rate=tips["tip"] / tips["total_bill"])
Out[7]:
total_bill tip sex smoker day time size tip_rate
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
.. ... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3 0.203927
240 27.18 2.00 Female Yes Sat Dinner 2 0.073584
241 22.67 2.00 Male Yes Sat Dinner 2 0.088222
242 17.82 1.75 Male No Sat Dinner 2 0.098204
243 18.78 3.00 Female No Thur Dinner 2 0.159744
[244 rows x 8 columns]
WHERE#
在 SQL 中,篩選是透過 WHERE 子句來完成。
SELECT *
FROM tips
WHERE time = 'Dinner';
DataFrame 可以透過多種方式來篩選;最直覺的方式是使用 布林索引。
In [8]: tips[tips["total_bill"] > 10]
Out[8]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[227 rows x 7 columns]
上述陳述只不過是將 Series
的 True
/False
物件傳遞給 DataFrame,傳回所有具有 True
的列。
In [9]: is_dinner = tips["time"] == "Dinner"
In [10]: is_dinner
Out[10]:
0 True
1 True
2 True
3 True
4 True
...
239 True
240 True
241 True
242 True
243 True
Name: time, Length: 244, dtype: bool
In [11]: is_dinner.value_counts()
Out[11]:
time
True 176
False 68
Name: count, dtype: int64
In [12]: tips[is_dinner]
Out[12]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[176 rows x 7 columns]
就像 SQL 的 OR
和 AND
,可以使用 |
(OR
) 和 &
(AND
) 將多個條件傳遞給 DataFrame。
晚餐用餐小費超過 5 美元
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
In [13]: tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]
Out[13]:
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4
116 29.93 5.07 Male No Sun Dinner 4
155 29.85 5.14 Female No Sun Dinner 5
170 50.81 10.00 Male Yes Sat Dinner 3
172 7.25 5.15 Male Yes Sun Dinner 2
181 23.33 5.65 Male Yes Sun Dinner 2
183 23.17 6.50 Male Yes Sun Dinner 4
211 25.89 5.16 Male Yes Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3
用餐人數至少 5 人的小費或帳單總額超過 45 美元
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
In [14]: tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)]
Out[14]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5
使用 notna()
和 isna()
方法進行 NULL 檢查。
In [15]: frame = pd.DataFrame(
....: {"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]}
....: )
....:
In [16]: frame
Out[16]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I
假設我們有一個與上面 DataFrame 結構相同的表格。我們可以使用以下查詢僅查看 col2
為 NULL 的記錄
SELECT *
FROM frame
WHERE col2 IS NULL;
In [17]: frame[frame["col2"].isna()]
Out[17]:
col1 col2
1 B NaN
可以使用 notna()
取得 col1
不為 NULL 的項目。
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [18]: frame[frame["col1"].notna()]
Out[18]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I
GROUP BY#
在 pandas 中,SQL 的 GROUP BY
作業使用類似名稱的 groupby()
方法執行。 groupby()
通常是指我們想要將資料集分割成群組,套用一些函數(通常是聚合),然後將群組合併在一起的程序。
常見的 SQL 作業會取得資料集內每個群組的記錄數。例如,查詢取得由性別留下的提示數
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female 87
Male 157
*/
Pandas 等效的程式碼為
In [19]: tips.groupby("sex").size()
Out[19]:
sex
Female 87
Male 157
dtype: int64
請注意,在 Pandas 程式碼中,我們使用 DataFrameGroupBy.size()
,而不是 DataFrameGroupBy.count()
。這是因為 DataFrameGroupBy.count()
會將函式套用至每一欄,傳回每個欄位內 NOT NULL
的記錄數。
In [20]: tips.groupby("sex").count()
Out[20]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157
或者,我們可以將 DataFrameGroupBy.count()
方法套用至個別欄位
In [21]: tips.groupby("sex")["total_bill"].count()
Out[21]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64
也可以同時套用多個函式。例如,假設我們想要查看小費金額如何根據星期幾而有所不同 - DataFrameGroupBy.agg()
允許您傳遞字典至已分組的 DataFrame,指出要套用至特定欄位的函式。
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thu 2.771452 62
*/
In [22]: tips.groupby("day").agg({"tip": "mean", "day": "size"})
Out[22]:
tip day
day
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
依據多個欄位分組的方法是傳遞欄位清單至 groupby()
方法。
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thu 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thu 17 3.030000
*/
In [23]: tips.groupby(["smoker", "day"]).agg({"tip": ["size", "mean"]})
Out[23]:
tip
size mean
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
JOIN#
JOIN
可以使用 join()
或 merge()
來執行。預設情況下,join()
會根據索引來合併資料框。每個方法都有參數,可讓您指定要執行的合併類型(LEFT
、RIGHT
、INNER
、FULL
)或要合併的欄位(欄位名稱或索引)。
警告
如果兩個金鑰欄都包含金鑰為 null 值的列,這些列會彼此配對。這與一般的 SQL JOIN 行為不同,可能會導致意外的結果。
In [24]: df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
In [25]: df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
假設我們有兩個資料庫表格,其名稱和結構與我們的資料框相同。
現在我們來探討各種 JOIN
類型。
INNER JOIN#
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
# merge performs an INNER JOIN by default
In [26]: pd.merge(df1, df2, on="key")
Out[26]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
merge()
也提供參數,供您將一個 DataFrame 的欄位與另一個 DataFrame 的索引合併時使用。
In [27]: indexed_df2 = df2.set_index("key")
In [28]: pd.merge(df1, indexed_df2, left_on="key", right_index=True)
Out[28]:
key value_x value_y
1 B -0.282863 1.212112
3 D -1.135632 -0.173215
3 D -1.135632 0.119209
左外合併#
顯示 df1
中的所有記錄。
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;
In [29]: pd.merge(df1, df2, on="key", how="left")
Out[29]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
右合併#
顯示 df2
中的所有記錄。
SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;
In [30]: pd.merge(df1, df2, on="key", how="right")
Out[30]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236
全合併#
pandas 也允許 FULL JOIN
,它會顯示資料集的兩側,無論合併的欄位是否找到相符項。在撰寫本文時,FULL JOIN
並非所有 RDBMS (MySQL) 都支援。
顯示兩個表格中的所有記錄。
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;
In [31]: pd.merge(df1, df2, on="key", how="outer")
Out[31]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236
聯集#
UNION ALL
可以使用 concat()
執行。
In [32]: df1 = pd.DataFrame(
....: {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)}
....: )
....:
In [33]: df2 = pd.DataFrame(
....: {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]}
....: )
....:
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/
In [34]: pd.concat([df1, df2])
Out[34]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
0 Chicago 1
1 Boston 4
2 Los Angeles 5
SQL 的 UNION
類似於 UNION ALL
,但 UNION
會移除重複列。
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Boston 4
Los Angeles 5
*/
在 pandas 中,你可以使用 concat()
搭配 drop_duplicates()
。
In [35]: pd.concat([df1, df2]).drop_duplicates()
Out[35]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
1 Boston 4
2 Los Angeles 5
LIMIT#
SELECT * FROM tips
LIMIT 10;
In [36]: tips.head(10)
Out[36]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
7 26.88 3.12 Male No Sun Dinner 4
8 15.04 1.96 Male No Sun Dinner 2
9 14.78 3.23 Male No Sun Dinner 2
pandas 等效於某些 SQL 分析和聚合函數#
頂端 n 列與偏移#
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
In [37]: tips.nlargest(10 + 5, columns="tip").tail(10)
Out[37]:
total_bill tip sex smoker day time size
183 23.17 6.50 Male Yes Sun Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
47 32.40 6.00 Male No Sun Dinner 4
239 29.03 5.92 Male No Sat Dinner 3
88 24.71 5.85 Male No Thur Lunch 2
181 23.33 5.65 Male Yes Sun Dinner 2
44 30.40 5.60 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
85 34.83 5.17 Female No Thur Lunch 4
211 25.89 5.16 Male Yes Sat Dinner 4
每組頂端 n 列#
-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
In [38]: (
....: tips.assign(
....: rn=tips.sort_values(["total_bill"], ascending=False)
....: .groupby(["day"])
....: .cumcount()
....: + 1
....: )
....: .query("rn < 3")
....: .sort_values(["day", "rn"])
....: )
....:
Out[38]:
total_bill tip sex smoker day time size rn
95 40.17 4.73 Male Yes Fri Dinner 4 1
90 28.97 3.00 Male Yes Fri Dinner 2 2
170 50.81 10.00 Male Yes Sat Dinner 3 1
212 48.33 9.00 Male No Sat Dinner 4 2
156 48.17 5.00 Male No Sun Dinner 6 1
182 45.35 3.50 Male Yes Sun Dinner 3 2
197 43.11 5.00 Female Yes Thur Lunch 4 1
142 41.19 5.00 Male No Thur Lunch 5 2
使用 rank(method='first')
函數執行相同的操作
In [39]: (
....: tips.assign(
....: rnk=tips.groupby(["day"])["total_bill"].rank(
....: method="first", ascending=False
....: )
....: )
....: .query("rnk < 3")
....: .sort_values(["day", "rnk"])
....: )
....:
Out[39]:
total_bill tip sex smoker day time size rnk
95 40.17 4.73 Male Yes Fri Dinner 4 1.0
90 28.97 3.00 Male Yes Fri Dinner 2 2.0
170 50.81 10.00 Male Yes Sat Dinner 3 1.0
212 48.33 9.00 Male No Sat Dinner 4 2.0
156 48.17 5.00 Male No Sun Dinner 6 1.0
182 45.35 3.50 Male Yes Sun Dinner 3 2.0
197 43.11 5.00 Female Yes Thur Lunch 4 1.0
142 41.19 5.00 Male No Thur Lunch 5 2.0
-- Oracle's RANK() analytic function
SELECT * FROM (
SELECT
t.*,
RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
FROM tips t
WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;
讓我們找出 (tips < 2) 的每性別組 (rank < 3) 的小費。請注意,當使用 rank(method='min')
函數時,rnk_min
對於相同的 tip
保持相同(如同 Oracle 的 RANK()
函數)
In [40]: (
....: tips[tips["tip"] < 2]
....: .assign(rnk_min=tips.groupby(["sex"])["tip"].rank(method="min"))
....: .query("rnk_min < 3")
....: .sort_values(["sex", "rnk_min"])
....: )
....:
Out[40]:
total_bill tip sex smoker day time size rnk_min
67 3.07 1.00 Female Yes Sat Dinner 1 1.0
92 5.75 1.00 Female Yes Fri Dinner 2 1.0
111 7.25 1.00 Female No Sat Dinner 1 1.0
236 12.60 1.00 Male Yes Sat Dinner 2 1.0
237 32.83 1.17 Male Yes Sat Dinner 2 2.0
UPDATE#
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
In [41]: tips.loc[tips["tip"] < 2, "tip"] *= 2
DELETE#
DELETE FROM tips
WHERE tip > 9;
在 pandas 中,我們選取應該保留的列,而不是刪除它們
In [42]: tips = tips.loc[tips["tip"] <= 9]