MultiIndex / advanced indexing#
This section covers indexing with a MultiIndex and other advanced indexing features. See Indexing and selecting data for general indexing documentation.
Warning
Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a view versus a copy.
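A minimal sketch of the distinction, assuming a hypothetical DataFrame df with a column "A" (not part of the examples below): chained assignment indexes twice and may write into an intermediate copy, while a single .loc call writes into df itself.
df["A"][0] = 1      # chained assignment: two indexing steps, may modify a copy
df.loc[0, "A"] = 1  # preferred: one .loc call assigns directly into df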
See the cookbook for some advanced strategies.
Hierarchical indexing (MultiIndex)#
Hierarchical / multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).
In this section, we will show what exactly we mean by "hierarchical" indexing and how it integrates with all of the pandas indexing features described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we'll show non-trivial applications to illustrate how it aids in structuring data for analysis.
See the cookbook for some advanced strategies.
Creating a MultiIndex (hierarchical index) object#
The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.
In [1]: arrays = [
...: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
...: ["one", "two", "one", "two", "one", "two", "one", "two"],
...: ]
...:
In [2]: tuples = list(zip(*arrays))
In [3]: tuples
Out[3]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
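The paragraph above also mentions MultiIndex.from_arrays(); as a minimal sketch, the same index could have been built directly from the list of arrays defined in In [1]:
pd.MultiIndex.from_arrays(arrays, names=["first", "second"])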
In [6]: s = pd.Series(np.random.randn(8), index=index)
In [7]: s
Out[7]:
first second
bar one 0.469112
two -0.282863
baz one -1.509059
two -1.135632
foo one 1.212112
two -0.173215
qux one 0.119209
two -1.044236
dtype: float64
When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:
In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().
In [10]: df = pd.DataFrame(
....: [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
....: columns=["first", "second"],
....: )
....:
In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('foo', 'one'),
('foo', 'two')],
names=['first', 'second'])
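As a small sketch of the complementary direction, the resulting MultiIndex can be turned back into a DataFrame of its level values with to_frame() (reusing the df defined above):
pd.MultiIndex.from_frame(df).to_frame(index=False)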
As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:
In [12]: arrays = [
....: np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
....: np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
....: ]
....:
In [13]: s = pd.Series(np.random.randn(8), index=arrays)
In [14]: s
Out[14]:
bar one -0.861849
two -2.104569
baz one -0.494929
two 1.071804
foo one 0.721555
two -0.706771
qux one -1.039575
two 0.271860
dtype: float64
In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
In [16]: df
Out[16]:
0 1 2 3
bar one -0.424972 0.567020 0.276232 -1.087401
two -0.673690 0.113648 -1.478427 0.524988
baz one 0.404705 0.577046 -1.715002 -1.039268
two -0.370647 -1.157892 -1.344312 0.844885
foo one 1.075770 -0.109050 1.643563 -1.469388
two 0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524 0.413738 0.276662 -0.472035
two -0.013960 -0.362543 -0.006154 -0.923061
All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:
In [17]: df.index.names
Out[17]: FrozenList([None, None])
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)
In [19]: df
Out[19]:
first bar baz ... foo qux
second one two one ... two one two
A 0.895717 0.805244 -1.206412 ... 1.340309 -1.170299 -0.226169
B 0.410835 0.813850 0.132003 ... -1.187678 1.130127 -1.436737
C -1.413681 1.607920 1.024180 ... -2.211372 0.974466 -2.006747
[3 rows x 8 columns]
In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]:
first bar baz foo
second one two one two one two
first second
bar one -0.410001 -0.078638 0.545952 -1.219217 -1.226825 0.769804
two -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734
baz one 0.959726 -1.110336 -0.619976 0.149748 -0.732339 0.687738
two 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
foo one -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232
two 0.690579 0.995761 2.396780 0.014871 3.357427 -0.317441
We have "sparsified" the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the multi_sparse option in pandas.set_option():
In [21]: with pd.option_context("display.multi_sparse", False):
....: df
....:
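The expression above does not render any output because it is evaluated inside the with block; to actually see the non-sparsified representation you can print it, as in this minimal sketch:
with pd.option_context("display.multi_sparse", False):
    print(df)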
It is worth keeping in mind that there is nothing preventing you from using tuples as atomic labels on an axis:
In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]:
(bar, one) -1.236269
(bar, two) 0.896171
(baz, one) -0.487602
(baz, two) -0.082240
(foo, one) -2.182937
(foo, two) 0.380396
(qux, one) 0.084844
(qux, two) 0.432390
dtype: float64
The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
Reconstructing the level labels#
The method get_level_values() will return a vector of the labels for each location at a particular level:
In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
Basic indexing on axis with MultiIndex#
One of the important features of hierarchical indexing is that you can select data by a "partial" label identifying a subgroup in the data. Partial selection "drops" levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:
In [25]: df["bar"]
Out[25]:
second one two
A 0.895717 0.805244
B 0.410835 0.813850
C -1.413681 1.607920
In [26]: df["bar", "one"]
Out[26]:
A 0.895717
B 0.410835
C -1.413681
Name: (bar, one), dtype: float64
In [27]: df["bar"]["one"]
Out[27]:
A 0.895717
B 0.410835
C -1.413681
Name: one, dtype: float64
In [28]: s["qux"]
Out[28]:
one -1.039575
two 0.271860
dtype: float64
See Cross-section with hierarchical index for how to select on a deeper level.
Defined levels#
The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:
In [29]: df.columns.levels # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
In [30]: df[["foo","qux"]].columns.levels # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.
In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]:
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
dtype=object)
# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.
In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()
In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
Data alignment and using reindex#
Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:
In [35]: s + s[:-2]
Out[35]:
bar one -1.723698
two -4.209138
baz one -0.989859
two 2.143608
foo one 1.443110
two -1.413542
qux one NaN
two NaN
dtype: float64
In [36]: s + s[::2]
Out[36]:
bar one -1.723698
two NaN
baz one -0.989859
two NaN
foo one 1.443110
two NaN
qux one -2.079150
two NaN
dtype: float64
The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:
In [37]: s.reindex(index[:3])
Out[37]:
first second
bar one -0.861849
two -2.104569
baz one -0.494929
dtype: float64
In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]:
foo two -0.706771
bar one -0.861849
qux one -1.039575
baz one -0.494929
dtype: float64
Advanced indexing with hierarchical index#
Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we've made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:
In [39]: df = df.T
In [40]: df
Out[40]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
two -0.226169 -1.436737 -2.006747
In [41]: df.loc[("bar", "two")]
Out[41]:
A 0.805244
B 0.813850
C 1.607920
Name: (bar, two), dtype: float64
Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.
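A small sketch of the unambiguous spellings, using the df defined above: a tuple is read as a single row key, while a separate second argument is read as a column label.
df.loc[("bar", "two")]  # the tuple is one MultiIndex row key
df.loc["bar", "A"]      # here "A" is a column label, not a second index level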
If you also want to index a specific column with .loc, you must use a tuple like this:
In [42]: df.loc[("bar", "two"), "A"]
Out[42]: 0.8052440253863785
You don't have to specify all levels of the MultiIndex by passing only the first elements of the tuple. For example, you can use "partial" indexing to get all elements with bar in the first level as follows:
In [43]: df.loc["bar"]
Out[43]:
A B C
second
one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).
"Partial" slicing also works quite nicely.
In [44]: df.loc["baz":"foo"]
Out[44]:
A B C
first second
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
You can slice with a "range" of values, by providing a slice of tuples.
In [45]: df.loc[("baz", "two"):("qux", "one")]
Out[45]:
A B C
first second
baz two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
In [46]: df.loc[("baz", "two"):"foo"]
Out[46]:
A B C
first second
baz two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
Passing a list of labels or tuples works similar to reindexing:
In [47]: df.loc[[("bar", "two"), ("qux", "one")]]
Out[47]:
A B C
first second
bar two 0.805244 0.813850 1.607920
qux one -1.170299 1.130127 0.974466
Note
It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. In other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).
Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:
In [48]: s = pd.Series(
....: [1, 2, 3, 4, 5, 6],
....: index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
....: )
....:
In [49]: s.loc[[("A", "c"), ("B", "d")]] # list of tuples
Out[49]:
A c 1
B d 5
dtype: int64
In [50]: s.loc[(["A", "B"], ["c", "d"])] # tuple of lists
Out[50]:
A c 1
d 2
B c 4
d 5
dtype: int64
Using slicers#
You can slice a MultiIndex by providing multiple indexers.
You can provide any of the selectors as if you are indexing by label, see Selection by label, including slices, lists of labels, labels, and boolean indexers.
You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).
As usual, both sides of the slicers are included as this is label indexing.
Warning
You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.
You should do this:
df.loc[(slice("A1", "A3"), ...), :] # noqa: E999
You should not do this:
df.loc[(slice("A1", "A3"), ...)] # noqa: E999
In [51]: def mklbl(prefix, n):
....: return ["%s%s" % (prefix, i) for i in range(n)]
....:
In [52]: miindex = pd.MultiIndex.from_product(
....: [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
....: )
....:
In [53]: micolumns = pd.MultiIndex.from_tuples(
....: [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
....: )
....:
In [54]: dfmi = (
....: pd.DataFrame(
....: np.arange(len(miindex) * len(micolumns)).reshape(
....: (len(miindex), len(micolumns))
....: ),
....: index=miindex,
....: columns=micolumns,
....: )
....: .sort_index()
....: .sort_index(axis=1)
....: )
....:
In [55]: dfmi
Out[55]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9 8 11 10
D1 13 12 15 14
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 237 236 239 238
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249 248 251 250
D1 253 252 255 254
[64 rows x 4 columns]
Basic MultiIndex slicing using slices, lists, and labels.
In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
Out[56]:
lvl0 a b
lvl1 bar foo bah foo
A1 B0 C1 D0 73 72 75 74
D1 77 76 79 78
C3 D0 89 88 91 90
D1 93 92 95 94
B1 C1 D0 105 104 107 106
... ... ... ... ...
A3 B0 C3 D1 221 220 223 222
B1 C1 D0 233 232 235 234
D1 237 236 239 238
C3 D0 249 248 251 250
D1 253 252 255 254
[24 rows x 4 columns]
You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).
In [57]: idx = pd.IndexSlice
In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[58]:
lvl0 a b
lvl1 foo foo
A0 B0 C1 D0 8 10
D1 12 14
C3 D0 24 26
D1 28 30
B1 C1 D0 40 42
... ... ...
A3 B0 C3 D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
[32 rows x 2 columns]
It is possible to perform quite complicated selections using this method on multiple axes at the same time.
In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]:
lvl0 a b
lvl1 foo foo
B0 C0 D0 64 66
D1 68 70
C1 D0 72 74
D1 76 78
C2 D0 80 82
... ... ...
B1 C1 D1 108 110
C2 D0 112 114
D1 116 118
C3 D0 120 122
D1 124 126
[16 rows x 2 columns]
In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[60]:
lvl0 a b
lvl1 foo foo
A0 B0 C1 D0 8 10
D1 12 14
C3 D0 24 26
D1 28 30
B1 C1 D0 40 42
... ... ...
A3 B0 C3 D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
[32 rows x 2 columns]
Using a boolean indexer you can provide selection related to the values.
In [61]: mask = dfmi[("a", "foo")] > 200
In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
Out[62]:
lvl0 a b
lvl1 foo foo
A3 B0 C1 D1 204 206
C3 D0 216 218
D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.
In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C1 D0 9 8 11 10
D1 13 12 15 14
C3 D0 25 24 27 26
D1 29 28 31 30
B1 C1 D0 41 40 43 42
... ... ... ... ...
A3 B0 C3 D1 221 220 223 222
B1 C1 D0 233 232 235 234
D1 237 236 239 238
C3 D0 249 248 251 250
D1 253 252 255 254
[32 rows x 4 columns]
Furthermore, you can set the values using the following methods.
In [64]: df2 = dfmi.copy()
In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10
In [66]: df2
Out[66]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 -10 -10 -10 -10
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
[64 rows x 4 columns]
You can use a right-hand-side of an alignable object as well.
In [67]: df2 = dfmi.copy()
In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000
In [69]: df2
Out[69]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9000 8000 11000 10000
D1 13000 12000 15000 14000
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 237000 236000 239000 238000
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249000 248000 251000 250000
D1 253000 252000 255000 254000
[64 rows x 4 columns]
Cross-section#
The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.
In [70]: df
Out[70]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
two -0.226169 -1.436737 -2.006747
In [71]: df.xs("one", level="second")
Out[71]:
A B C
first
bar 0.895717 0.410835 -1.413681
baz -1.206412 0.132003 1.024180
foo 1.431256 -0.076467 0.875906
qux -1.170299 1.130127 0.974466
# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
baz one -1.206412 0.132003 1.024180
foo one 1.431256 -0.076467 0.875906
qux one -1.170299 1.130127 0.974466
You can also select on the columns with xs, by providing the axis argument.
In [73]: df = df.T
In [74]: df.xs("one", level="second", axis=1)
Out[74]:
first bar baz foo qux
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]:
first bar baz foo qux
second one one one one
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
xs also allows selection with multiple keys.
In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]:
first bar
second one
A 0.895717
B 0.410835
C -1.413681
# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]:
A 0.895717
B 0.410835
C -1.413681
Name: (bar, one), dtype: float64
You can pass drop_level=False to xs to retain the level that was selected.
In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]:
first bar baz foo qux
second one one one one
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
Compare the above with the result using drop_level=True (the default value).
In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]:
first bar baz foo qux
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
Advanced reindexing and alignment#
Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:
In [80]: midx = pd.MultiIndex(
....: levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
....: )
....:
In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)
In [82]: df
Out[82]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [83]: df2 = df.groupby(level=0).mean()
In [84]: df2
Out[84]:
0 1
one 1.060074 -0.109716
zero 1.271532 0.713416
In [85]: df2.reindex(df.index, level=0)
Out[85]:
0 1
one y 1.060074 -0.109716
x 1.060074 -0.109716
zero y 1.271532 0.713416
x 1.271532 0.713416
# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)
In [87]: df_aligned
Out[87]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [88]: df2_aligned
Out[88]:
0 1
one y 1.060074 -0.109716
x 1.060074 -0.109716
zero y 1.271532 0.713416
x 1.271532 0.713416
Swapping levels with swaplevel#
The swaplevel() method can switch the order of two levels:
In [89]: df[:5]
Out[89]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]:
0 1
y one 1.519970 -0.493662
x one 0.600178 0.274230
y zero 0.132885 -0.023688
x zero 2.410179 1.450520
Reordering levels with reorder_levels#
The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:
In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]:
0 1
y one 1.519970 -0.493662
x one 0.600178 0.274230
y zero 0.132885 -0.023688
x zero 2.410179 1.450520
Renaming names of an Index or MultiIndex#
The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.
In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]:
col0 col1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
This method can also be used to rename specific labels of the main index of the DataFrame.
In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]:
0 1
two z 1.519970 -0.493662
x 0.600178 0.274230
zero z 0.132885 -0.023688
x 2.410179 1.450520
The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.
In [94]: df.rename_axis(index=["abc", "def"])
Out[94]:
0 1
abc def
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
Note that the columns of a DataFrame are an index, so using rename_axis with the columns argument will change the name of that index.
In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')
Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.
When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used to change the names.
In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])
In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])
In [98]: mi2 = mi.rename("new name", level=0)
In [99]: mi2
Out[99]:
MultiIndex([(1, 'a'),
(1, 'b'),
(2, 'a'),
(2, 'b')],
names=['new name', 'y'])
You cannot set the names of the MultiIndex via a level.
In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"
File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value)
1686 @name.setter
1687 def name(self, value: Hashable) -> None:
1688 if self._no_setting_name:
1689 # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690 raise RuntimeError(
1691 "Cannot set name on a level of a MultiIndex. Use "
1692 "'MultiIndex.set_names' instead."
1693 )
1694 maybe_extract_name(value, None, type(self))
1695 self._name = value
RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.
Use Index.set_names() instead.
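A minimal sketch of the supported route, reusing mi from above; set_names() takes a level argument and returns a new index rather than mutating in place:
mi.set_names("name via level", level=0)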
Sorting a MultiIndex#
For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().
In [101]: import random
In [102]: random.shuffle(tuples)
In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
In [104]: s
Out[104]:
foo two 0.206053
baz two -0.251905
qux two -2.213588
baz one 1.063327
foo one 1.266143
qux one 0.299368
bar two -0.863838
one 0.408204
dtype: float64
In [105]: s.sort_index()
Out[105]:
bar one 0.408204
two -0.863838
baz one 1.063327
two -0.251905
foo one 1.266143
two 0.206053
qux one 0.299368
two -2.213588
dtype: float64
In [106]: s.sort_index(level=0)
Out[106]:
bar one 0.408204
two -0.863838
baz one 1.063327
two -0.251905
foo one 1.266143
two 0.206053
qux one 0.299368
two -2.213588
dtype: float64
In [107]: s.sort_index(level=1)
Out[107]:
bar one 0.408204
baz one 1.063327
foo one 1.266143
qux one 0.299368
bar two -0.863838
baz two -0.251905
foo two 0.206053
qux two -2.213588
dtype: float64
You may also pass a level name to sort_index if the MultiIndex levels are named.
In [108]: s.index = s.index.set_names(["L1", "L2"])
In [109]: s.sort_index(level="L1")
Out[109]:
L1 L2
bar one 0.408204
two -0.863838
baz one 1.063327
two -0.251905
foo one 1.266143
two 0.206053
qux one 0.299368
two -2.213588
dtype: float64
In [110]: s.sort_index(level="L2")
Out[110]:
L1 L2
bar one 0.408204
baz one 1.063327
foo one 1.266143
qux one 0.299368
bar two -0.863838
baz two -0.251905
foo two 0.206053
qux two -2.213588
dtype: float64
On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:
In [111]: df.T.sort_index(level=1, axis=1)
Out[111]:
one zero one zero
x x y y
0 0.600178 2.410179 1.519970 0.132885
1 0.274230 1.450520 -0.493662 -0.023688
Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:
In [112]: dfm = pd.DataFrame(
.....: {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
.....: )
.....:
In [113]: dfm = dfm.set_index(["jim", "joe"])
In [114]: dfm
Out[114]:
jolie
jim joe
0 x 0.490671
x 0.120248
1 z 0.537020
y 0.110968
In [115]: dfm.loc[(1, 'z')]
Out[115]:
jolie
jim joe
1 z 0.53702
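Since dfm is only lexsorted to depth 1, the lookup above is the kind of operation that emits a PerformanceWarning; a minimal sketch that surfaces the warning explicitly (reusing dfm from above):
import warnings

from pandas.errors import PerformanceWarning

with warnings.catch_warnings():
    warnings.simplefilter("always", PerformanceWarning)
    dfm.loc[(1, "z")]  # may warn about indexing past lexsort depth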
Furthermore, if you try to index something that is not fully lexsorted, this can raise:
In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]
File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
1189 maybe_callable = com.apply_if_callable(key, self.obj)
1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
1409 if isinstance(key, slice):
1410 self._validate_key(key, axis)
-> 1411 return self._get_slice_axis(key, axis=axis)
1412 elif com.is_bool_indexer(key):
1413 return self._getbool_axis(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1440 return obj.copy(deep=False)
1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1445 if isinstance(indexer, slice):
1446 return self.obj._slice(indexer, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
6618 def slice_indexer(
6619 self,
6620 start: Hashable | None = None,
6621 end: Hashable | None = None,
6622 step: int | None = None,
6623 ) -> slice:
6624 """
6625 Compute the slice indexer for input labels and step.
6626
(...)
6660 slice(1, 3, None)
6661 """
-> 6662 start_slice, end_slice = self.slice_locs(start, end, step=step)
6664 # return a slice
6665 if not is_scalar(start_slice):
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2904, in MultiIndex.slice_locs(self, start, end, step)
2852 """
2853 For an ordered MultiIndex, compute the slice locations for input
2854 labels.
(...)
2900 sequence of such.
2901 """
2902 # This function adds nothing to its parent implementation (the magic
2903 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2904 return super().slice_locs(start, end, step)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
6877 start_slice = None
6878 if start is not None:
-> 6879 start_slice = self.get_slice_bound(start, "left")
6880 if start_slice is None:
6881 start_slice = 0
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2848, in MultiIndex.get_slice_bound(self, label, side)
2846 if not isinstance(label, tuple):
2847 label = (label,)
-> 2848 return self._partial_tup_index(label, side=side)
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2908, in MultiIndex._partial_tup_index(self, tup, side)
2906 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
2907 if len(tup) > self._lexsort_depth:
-> 2908 raise UnsortedIndexError(
2909 f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
2910 f"({self._lexsort_depth})"
2911 )
2913 n = len(tup)
2914 start, end = 0, len(self)
UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
The is_monotonic_increasing() method on a MultiIndex shows if the index is sorted:
In [117]: dfm.index.is_monotonic_increasing
Out[117]: False
In [118]: dfm = dfm.sort_index()
In [119]: dfm
Out[119]:
jolie
jim joe
0 x 0.490671
x 0.120248
1 y 0.110968
z 0.537020
In [120]: dfm.index.is_monotonic_increasing
Out[120]: True
And now selection works as expected.
In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]:
jolie
jim joe
1 y 0.110968
z 0.537020
Take methods#
Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method that retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.
In [122]: index = pd.Index(np.random.randint(0, 1000, 10))
In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')
In [124]: positions = [0, 9, 3]
In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')
In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')
In [127]: ser = pd.Series(np.random.randn(10))
In [128]: ser.iloc[positions]
Out[128]:
0 -0.179666
9 1.824375
3 0.392149
dtype: float64
In [129]: ser.take(positions)
Out[129]:
0 -0.179666
9 1.824375
3 0.392149
dtype: float64
For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.
In [130]: frm = pd.DataFrame(np.random.randn(5, 3))
In [131]: frm.take([1, 4, 3])
Out[131]:
0 1 2
1 -1.237881 0.106854 -1.276829
4 0.629675 -1.425966 1.857704
3 0.979542 -1.633678 0.615855
In [132]: frm.take([0, 2], axis=1)
Out[132]:
0 2
0 0.595974 0.601544
1 -1.237881 -1.276829
2 -0.767101 1.499591
3 0.979542 0.615855
4 0.629675 1.857704
It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.
In [133]: arr = np.random.randn(10)
In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935, 0.6775, 0.6775])
In [135]: arr[[0, 1]]
Out[135]: array([-1.1935, 0.6775])
In [136]: ser = pd.Series(np.random.randn(10))
In [137]: ser.take([False, False, True, True])
Out[137]:
0 0.233141
0 0.233141
1 -0.223540
1 -0.223540
dtype: float64
In [138]: ser.iloc[[0, 1]]
Out[138]:
0 0.233141
1 -0.223540
dtype: float64
Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.
In [139]: arr = np.random.randn(10000, 5)
In [140]: indexer = np.arange(10000)
In [141]: random.shuffle(indexer)
In [142]: %timeit arr[indexer]
.....: %timeit arr.take(indexer, axis=0)
.....:
246 us +- 7.83 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
73.5 us +- 2.52 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
In [143]: ser = pd.Series(arr[:, 0])
In [144]: %timeit ser.iloc[indexer]
.....: %timeit ser.take(indexer)
.....:
144 us +- 3.72 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
132 us +- 3.57 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
Index types#
We have discussed MultiIndex in the previous sections pretty extensively. Documentation about DatetimeIndex and PeriodIndex is shown here, and documentation about TimedeltaIndex is found here.
In the following sub-sections we will highlight some other index types.
CategoricalIndex#
CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.
In [145]: from pandas.api.types import CategoricalDtype
In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})
In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
In [148]: df
Out[148]:
A B
0 0 a
1 1 a
2 2 b
3 3 b
4 4 c
5 5 a
In [149]: df.dtypes
Out[149]:
A int64
B category
dtype: object
In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object')
Setting the index will create a CategoricalIndex.
In [151]: df2 = df.set_index("B")
In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the category, or the operation will raise a KeyError.
In [153]: df2.loc["a"]
Out[153]:
A
B
a 0
a 1
a 5
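By contrast, a lookup outside the defined categories raises instead of returning an empty selection; a minimal sketch using the df2 above ("z" is not among the categories 'c', 'a', 'b'):
try:
    df2.loc["z"]
except KeyError as err:
    print("not in categories:", err)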
The CategoricalIndex is preserved after indexing:
In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).
In [155]: df2.sort_index()
Out[155]:
A
B
c 4
a 0
a 1
a 5
b 2
b 3
Groupby operations on the index will preserve the index nature as well.
In [156]: df2.groupby(level=0, observed=True).sum()
Out[156]:
A
B
c 4
a 6
b 5
In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.
In [158]: df3 = pd.DataFrame(
.....: {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
.....: )
.....:
In [159]: df3 = df3.set_index("B")
In [160]: df3
Out[160]:
A
B
a 0
b 1
c 2
In [161]: df3.reindex(["a", "e"])
Out[161]:
A
B
a 0.0
e NaN
In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')
In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]:
A
B
a 0.0
e NaN
In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')
Warning
Reshaping and comparison operations on a CategoricalIndex must have the same categories, or a TypeError will be raised.
In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})
In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))
In [167]: df4 = df4.set_index("B")
In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')
In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})
In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))
In [171]: df5 = df5.set_index("B")
In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B')
In [173]: pd.concat([df4, df5])
Out[173]:
A
B
b 0
a 1
b 0
c 1
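The concatenation above succeeds, but an element-wise comparison of the two indexes is expected to raise because their categories differ; a small sketch reusing df4 and df5:
try:
    df4.index == df5.index
except TypeError as err:
    print(err)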
RangeIndex#
RangeIndex is a sub-class of Index that provides the default index for all DataFrame and Series objects. RangeIndex is an optimized version of Index that can represent a monotonic ordered set. It is analogous to the Python range type. A RangeIndex will always have an int64 dtype.
In [174]: idx = pd.RangeIndex(5)
In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1)
RangeIndex is the default index for all DataFrame and Series objects:
In [176]: ser = pd.Series([1, 2, 3])
In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)
In [178]: df = pd.DataFrame([[1, 2], [3, 4]])
In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)
In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1)
A RangeIndex will behave similarly to an Index with an int64 dtype, and operations on a RangeIndex whose result cannot be represented by a RangeIndex, but should have an integer dtype, will be converted to an Index with int64 dtype. For example:
In [181]: idx[[0, 2]]
Out[181]: Index([0, 2], dtype='int64')
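By contrast, operations whose result can still be described by a start/stop/step triple, such as slicing, typically keep the RangeIndex type; a minimal sketch reusing idx from above:
idx[:3]   # remains a RangeIndex
idx[::2]  # slicing with a step also remains a RangeIndex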
IntervalIndex#
IntervalIndex, together with its own dtype IntervalDtype as well as the Interval scalar type, allows first-class support in pandas for interval notation.
The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().
Indexing with an IntervalIndex#
An IntervalIndex can be used in Series and in DataFrame as the index.
In [182]: df = pd.DataFrame(
.....: {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
.....: )
.....:
In [183]: df
Out[183]:
A
(0, 1] 1
(1, 2] 2
(2, 3] 3
(3, 4] 4
Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.
In [184]: df.loc[2]
Out[184]:
A 2
Name: (1, 2], dtype: int64
In [185]: df.loc[[2, 3]]
Out[185]:
A
(1, 2] 2
(2, 3] 3
If you select a label contained within an interval, this will also select the interval.
In [186]: df.loc[2.5]
Out[186]:
A 3
Name: (2, 3], dtype: int64
In [187]: df.loc[[2.5, 3.5]]
Out[187]:
A
(2, 3] 3
(3, 4] 4
Selecting using an Interval will only return exact matches.
In [188]: df.loc[pd.Interval(1, 2)]
Out[188]:
A 2
Name: (1, 2], dtype: int64
Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.
In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]
File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
1189 maybe_callable = com.apply_if_callable(key, self.obj)
1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1431, in _LocIndexer._getitem_axis(self, key, axis)
1429 # fall thru to straight lookup
1430 self._validate_key(key, axis)
-> 1431 return self._get_label(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1381, in _LocIndexer._get_label(self, label, axis)
1379 def _get_label(self, label, axis: AxisInt):
1380 # GH#5567 this will fail if the label is not present in the axis.
-> 1381 return self.obj.xs(label, axis=axis)
File ~/work/pandas/pandas/pandas/core/generic.py:4298, in NDFrame.xs(self, key, axis, level, drop_level)
4296 new_index = index[loc]
4297 else:
-> 4298 loc = index.get_loc(key)
4300 if isinstance(loc, np.ndarray):
4301 if loc.dtype == np.bool_:
File ~/work/pandas/pandas/pandas/core/indexes/interval.py:678, in IntervalIndex.get_loc(self, key)
676 matches = mask.sum()
677 if matches == 0:
--> 678 raise KeyError(key)
679 if matches == 1:
680 return mask.argmax()
KeyError: Interval(0.5, 2.5, closed='right')
Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.
In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))
In [191]: idxr
Out[191]: array([ True, True, True, False])
In [192]: df[idxr]
Out[192]:
A
(0, 1] 1
(1, 2] 2
(2, 3] 3
Binning data with cut and qcut#
cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.
In [193]: c = pd.cut(range(4), bins=2)
In [194]: c
Out[194]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]
In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')
cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.
In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]:
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]
Any value which falls outside all bins will be assigned a NaN value.
Generating ranges of intervals#
If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and calendar day for datetime-like intervals:
In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')
In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]:
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
(2017-01-02 00:00:00, 2017-01-03 00:00:00],
(2017-01-03 00:00:00, 2017-01-04 00:00:00],
(2017-01-04 00:00:00, 2017-01-05 00:00:00]],
dtype='interval[datetime64[ns], right]')
In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]:
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
(1 days 00:00:00, 2 days 00:00:00],
(2 days 00:00:00, 3 days 00:00:00]],
dtype='interval[timedelta64[ns], right]')
The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:
In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')
In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]:
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
(2017-01-08 00:00:00, 2017-01-15 00:00:00],
(2017-01-15 00:00:00, 2017-01-22 00:00:00],
(2017-01-22 00:00:00, 2017-01-29 00:00:00]],
dtype='interval[datetime64[ns], right]')
In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Out[202]:
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
(0 days 09:00:00, 0 days 18:00:00],
(0 days 18:00:00, 1 days 03:00:00]],
dtype='interval[timedelta64[ns], right]')
Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.
In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')
In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')
Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:
In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')
In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]:
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
(2018-01-20 08:00:00, 2018-02-08 16:00:00],
(2018-02-08 16:00:00, 2018-02-28 00:00:00]],
dtype='interval[datetime64[ns], right]')
Miscellaneous indexing FAQ#
Integer indexing#
Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .loc. The following code will generate exceptions:
In [207]: s = pd.Series(range(5))
In [208]: s[-1]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key)
412 try:
--> 413 return self._range.index(new_key)
414 except ValueError as err:
ValueError: -1 is not in range
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]
File ~/work/pandas/pandas/pandas/core/series.py:1112, in Series.__getitem__(self, key)
1109 return self._values[key]
1111 elif key_is_scalar:
-> 1112 return self._get_value(key)
1114 # Convert generator to list before going through hashable part
1115 # (We will iterate through the generator there to check for slices)
1116 if is_iterator(key):
File ~/work/pandas/pandas/pandas/core/series.py:1228, in Series._get_value(self, label, takeable)
1225 return self._values[label]
1227 # Similar to Index.get_value, but we do not fall back to positional
-> 1228 loc = self.index.get_loc(label)
1230 if is_integer(loc):
1231 return self._values[loc]
File ~/work/pandas/pandas/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key)
413 return self._range.index(new_key)
414 except ValueError as err:
--> 415 raise KeyError(key) from err
416 if isinstance(key, Hashable):
417 raise KeyError(key)
KeyError: -1
In [209]: df = pd.DataFrame(np.random.randn(5, 4))
In [210]: df
Out[210]:
0 1 2 3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703 1.347213 -0.243487 0.514704
2 1.162969 -0.287725 -0.179734 0.993962
3 -0.212673 0.909872 -0.733333 -0.349893
4 0.456434 -0.306735 0.553396 0.166221
In [211]: df.loc[-2:]
Out[211]:
0 1 2 3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703 1.347213 -0.243487 0.514704
2 1.162969 -0.287725 -0.179734 0.993962
3 -0.212673 0.909872 -0.733333 -0.349893
4 0.456434 -0.306735 0.553396 0.166221
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop "falling back" on position-based indexing).
Non-monotonic indexes require exact matches#
If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing() and is_monotonic_decreasing() attributes.
In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))
In [213]: df.index.is_monotonic_increasing
Out[213]: True
# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
Out[214]:
data
2 0
3 1
3 2
4 3
# slice bounds are outside the index, so an empty DataFrame is returned
In [215]: df.loc[13:15, :]
Out[215]:
Empty DataFrame
Columns: [data]
Index: []
On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.
In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))
In [217]: df.index.is_monotonic_increasing
Out[217]: False
# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
Out[218]:
data
2 0
3 1
1 2
4 3
# 0 is not in the index
In [219]: df.loc[0:4, :]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:191, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:234, in pandas._libs.index.IndexEngine._get_loc_duplicates()
File index.pyx:242, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()
File index.pyx:134, in pandas._libs.index._unpack_bool_indexer()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]
File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
1182 if self._is_scalar_access(key):
1183 return self.obj._get_value(*key, takeable=self._takeable)
-> 1184 return self._getitem_tuple(key)
1185 else:
1186 # we by definition only have the 0th axis
1187 axis = self.axis or 0
File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
1374 if self._multi_take_opportunity(tup):
1375 return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)
File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
1017 if com.is_null_slice(key):
1018 continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
1021 # We should never have retval.ndim < self.ndim, as that should
1022 # be handled by the _getitem_lowerdim call above.
1023 assert retval.ndim == self.ndim
File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
1409 if isinstance(key, slice):
1410 self._validate_key(key, axis)
-> 1411 return self._get_slice_axis(key, axis=axis)
1412 elif com.is_bool_indexer(key):
1413 return self._getbool_axis(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1440 return obj.copy(deep=False)
1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1445 if isinstance(indexer, slice):
1446 return self.obj._slice(indexer, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
6618 def slice_indexer(
6619 self,
6620 start: Hashable | None = None,
6621 end: Hashable | None = None,
6622 step: int | None = None,
6623 ) -> slice:
6624 """
6625 Compute the slice indexer for input labels and step.
6626
(...)
6660 slice(1, 3, None)
6661 """
-> 6662 start_slice, end_slice = self.slice_locs(start, end, step=step)
6664 # return a slice
6665 if not is_scalar(start_slice):
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
6877 start_slice = None
6878 if start is not None:
-> 6879 start_slice = self.get_slice_bound(start, "left")
6880 if start_slice is None:
6881 start_slice = 0
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6804, in Index.get_slice_bound(self, label, side)
6801 return self._searchsorted_monotonic(label, side)
6802 except ValueError:
6803 # raise the original KeyError
-> 6804 raise err
6806 if isinstance(slc, np.ndarray):
6807 # get_loc may return a boolean array, which
6808 # is OK as long as they are representable by a slice.
6809 assert is_bool_dtype(slc.dtype)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6798, in Index.get_slice_bound(self, label, side)
6796 # we need to look up the label
6797 try:
-> 6798 slc = self.get_loc(label)
6799 except KeyError as err:
6800 try:
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
3811 raise InvalidIndexError(key)
-> 3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
3817 self._check_indexing_error(key)
KeyError: 0
# 3 is not a unique label
In [220]: df.loc[2:3, :]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]
File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
1182 if self._is_scalar_access(key):
1183 return self.obj._get_value(*key, takeable=self._takeable)
-> 1184 return self._getitem_tuple(key)
1185 else:
1186 # we by definition only have the 0th axis
1187 axis = self.axis or 0
File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
1374 if self._multi_take_opportunity(tup):
1375 return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)
File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
1017 if com.is_null_slice(key):
1018 continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
1021 # We should never have retval.ndim < self.ndim, as that should
1022 # be handled by the _getitem_lowerdim call above.
1023 assert retval.ndim == self.ndim
File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
1409 if isinstance(key, slice):
1410 self._validate_key(key, axis)
-> 1411 return self._get_slice_axis(key, axis=axis)
1412 elif com.is_bool_indexer(key):
1413 return self._getbool_axis(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1440 return obj.copy(deep=False)
1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1445 if isinstance(indexer, slice):
1446 return self.obj._slice(indexer, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
6618 def slice_indexer(
6619 self,
6620 start: Hashable | None = None,
6621 end: Hashable | None = None,
6622 step: int | None = None,
6623 ) -> slice:
6624 """
6625 Compute the slice indexer for input labels and step.
6626
(...)
6660 slice(1, 3, None)
6661 """
-> 6662 start_slice, end_slice = self.slice_locs(start, end, step=step)
6664 # return a slice
6665 if not is_scalar(start_slice):
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6885, in Index.slice_locs(self, start, end, step)
6883 end_slice = None
6884 if end is not None:
-> 6885 end_slice = self.get_slice_bound(end, "right")
6886 if end_slice is None:
6887 end_slice = len(self)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6812, in Index.get_slice_bound(self, label, side)
6810 slc = lib.maybe_booleans_to_slice(slc.view("u1"))
6811 if isinstance(slc, np.ndarray):
-> 6812 raise KeyError(
6813 f"Cannot get {side} slice bound for non-unique "
6814 f"label: {repr(original_label)}"
6815 )
6817 if isinstance(slc, slice):
6818 if side == "left":
KeyError: 'Cannot get right slice bound for non-unique label: 3'
Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with the is_unique() attribute.
In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])
In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')
In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True
In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False
Endpoints are inclusive#
Compared with standard Python sequence slicing, in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the "successor" or next element after a particular label in an index. For example, consider the following Series:
In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))
In [226]: s
Out[226]:
a -0.101684
b -0.734907
c -0.130121
d -0.476046
e 0.759104
f 0.213379
dtype: float64
Suppose we wished to slice from c to e; using integers this would be accomplished as such:
In [227]: s[2:5]
Out[227]:
c -0.130121
d -0.476046
e 0.759104
dtype: float64
However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:
In [228]: s.loc['c':'e' + 1]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]
TypeError: can only concatenate str (not "int") to str
A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:
In [229]: s.loc["c":"e"]
Out[229]:
c -0.130121
d -0.476046
e 0.759104
dtype: float64
This is most definitely a "practicality beats purity" sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly the way that standard Python integer slicing works.
Indexing potentially changes underlying Series dtype#
The different indexing operations can potentially change the dtype of a Series.
In [230]: series1 = pd.Series([1, 2, 3])
In [231]: series1.dtype
Out[231]: dtype('int64')
In [232]: res = series1.reindex([0, 4])
In [233]: res.dtype
Out[233]: dtype('float64')
In [234]: res
Out[234]:
0 1.0
4 NaN
dtype: float64
In [235]: series2 = pd.Series([True])
In [236]: series2.dtype
Out[236]: dtype('bool')
In [237]: res = series2.reindex_like(series1)
In [238]: res.dtype
Out[238]: dtype('O')
In [239]: res
Out[239]:
0 True
1 NaN
2 NaN
dtype: object
This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.
See GH 2388 for a more detailed discussion.