MultiIndex / 進階索引#

本節涵蓋 使用 MultiIndex 索引其他進階索引功能

請參閱 索引和選擇資料 以取得一般索引文件。

警告

設定作業傳回副本或參考,可能會取決於內容。這有時稱為 連鎖 指派,應避免使用。請參閱 傳回檢視與副本

請參閱 食譜 以取得一些進階策略。

階層式索引 (MultiIndex)#

階層式/多層級索引非常令人興奮,因為它開啟了通往一些相當精密的資料分析和處理的門戶,特別是處理高維度資料時。基本上,它讓您能夠在較低維度資料結構中儲存和處理具有任意維度的資料,例如 Series (1d) 和 DataFrame (2d)。

在本節中,我們將說明「階層式」索引的具體含義,以及它如何與上面和之前各節所述的所有 pandas 索引功能整合。稍後,在討論 群組依據樞紐分析和資料重塑 時,我們將展示非平凡的應用,說明它如何協助建構資料以進行分析。

請參閱 食譜 以取得一些進階策略。

建立 MultiIndex (階層式索引) 物件#

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

In [1]: arrays = [
   ...:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ...:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ...: ]
   ...: 

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]: 
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [5]: index
Out[5]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]: 
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64

當您想要兩個可迭代元素中的每個配對時,可以使用 MultiIndex.from_product() 方法會比較容易

In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

您也可以使用 MultiIndex.from_frame() 方法,直接從 DataFrame 建立 MultiIndex。這是 MultiIndex.to_frame() 的互補方法。

In [10]: df = pd.DataFrame(
   ....:     [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
   ....:     columns=["first", "second"],
   ....: )
   ....: 

In [11]: pd.MultiIndex.from_frame(df)
Out[11]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])

為了方便起見,您可以將陣列清單直接傳遞到 SeriesDataFrame 中,以自動建立 MultiIndex

In [12]: arrays = [
   ....:     np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
   ....:     np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
   ....: ]
   ....: 

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]: 
bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]: 
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061

所有 MultiIndex 建構函式都接受 names 參數,用於儲存層級本身的字串名稱。如果未提供名稱,將會指定 None

In [17]: df.index.names
Out[17]: FrozenList([None, None])

此索引可以支援 pandas 物件的任何軸,而索引的層級數量由您決定

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]: df
Out[19]: 
first        bar                 baz  ...       foo       qux          
second       one       two       one  ...       two       one       two
A       0.895717  0.805244 -1.206412  ...  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003  ... -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  ... -2.211372  0.974466 -2.006747

[3 rows x 8 columns]

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]: 
first              bar                 baz                 foo          
second             one       two       one       two       one       two
first second                                                            
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441

我們已經「稀疏化」索引的較高層級,以讓主控台輸出對眼睛來說比較容易閱讀。請注意,索引的顯示方式可以使用 pandas.set_options() 中的 multi_sparse 選項來控制

In [21]: with pd.option_context("display.multi_sparse", False):
   ....:     df
   ....: 

值得注意的是,沒有任何因素會阻止您在軸上使用元組作為原子標籤

In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]: 
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64

之所以 MultiIndex 很重要,是因為它可以讓您執行分組、選取和變形操作,如下文和文件後續部分所述。正如您將在後續部分中看到的,您可能會發現自己在處理階層索引資料,而沒有自己明確建立 MultiIndex。但是,當從檔案載入資料時,您可能希望在準備資料集時產生自己的 MultiIndex

重建層級標籤#

方法 get_level_values() 會傳回特定層級中每個位置的標籤向量

In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

使用 MultiIndex 在軸上進行基本索引#

階層索引的重要功能之一是,您可以透過識別資料中子群組的「部分」標籤來選取資料。部分選取會「捨棄」階層索引的層級,其結果與在一般 DataFrame 中選取欄位的方式完全類似

In [25]: df["bar"]
Out[25]: 
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df["bar", "one"]
Out[26]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df["bar"]["one"]
Out[27]: 
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s["qux"]
Out[28]: 
one   -1.039575
two    0.271860
dtype: float64

請參閱 具有階層索引的橫切面,了解如何在更深層級上進行選取。

已定義層級#

即使未實際使用,MultiIndex 會保留索引的所有已定義層級。在切片索引時,您可能會注意到這一點。例如

In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[["foo","qux"]].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

這樣做是為了避免重新計算層級,以使切片效能極佳。如果您只想查看已使用的層級,可以使用 get_level_values() 方法。

In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]: 
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

若要僅使用已使用的層級重建 MultiIndex,可以使用 remove_unused_levels() 方法。

In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])

資料對齊和使用 reindex#

在軸上具有 MultiIndex 的不同索引物件之間的操作會按照您的預期運作;資料對齊會與元組索引相同

In [35]: s + s[:-2]
Out[35]: 
bar  one   -1.723698
     two   -4.209138
baz  one   -0.989859
     two    2.143608
foo  one    1.443110
     two   -1.413542
qux  one         NaN
     two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]: 
bar  one   -1.723698
     two         NaN
baz  one   -0.989859
     two         NaN
foo  one    1.443110
     two         NaN
qux  one   -2.079150
     two         NaN
dtype: float64

Series/DataFramesreindex() 方法可以呼叫另一個 MultiIndex,甚至可以呼叫元組的清單或陣列

In [37]: s.reindex(index[:3])
Out[37]: 
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]: 
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64

進階索引與階層索引#

在進階索引中以語法方式整合 MultiIndex.loc 有點困難,但我們已盡一切努力這麼做。一般而言,MultiIndex 鍵會採用元組形式。例如,以下範例會按照您的預期運作

In [39]: df = df.T

In [40]: df
Out[40]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[("bar", "two")]
Out[41]: 
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64

請注意,df.loc['bar', 'two'] 在此範例中也能運作,但此簡寫符號通常會造成歧義。

如果您也想使用 .loc 編制特定欄位的索引,您必須使用類似這樣的元組

In [42]: df.loc[("bar", "two"), "A"]
Out[42]: 0.8052440253863785

您不必透過只傳遞元組的第一個元素來指定 MultiIndex 的所有層級。例如,您可以使用「部分」編制索引,以取得第一層級中包含 bar 的所有元素,如下所示

In [43]: df.loc["bar"]
Out[43]: 
               A         B         C
second                              
one     0.895717  0.410835 -1.413681
two     0.805244  0.813850  1.607920

這是稍嫌冗長的符號 df.loc[('bar',),] 的捷徑(在此範例中等同於 df.loc['bar',])。

「部分」切片也運作得相當良好。

In [44]: df.loc["baz":"foo"]
Out[44]: 
                     A         B         C
first second                              
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

您可以透過提供元組切片,使用「範圍」值進行切片。

In [45]: df.loc[("baz", "two"):("qux", "one")]
Out[45]: 
                     A         B         C
first second                              
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [46]: df.loc[("baz", "two"):"foo"]
Out[46]: 
                     A         B         C
first second                              
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

傳遞標籤或元組清單的運作方式類似於重新編制索引

In [47]: df.loc[[("bar", "two"), ("qux", "one")]]
Out[47]: 
                     A         B         C
first second                              
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466

注意

請務必注意,在編制索引時,pandas 並不會將元組和清單視為相同。元組會被詮釋為一個多層級金鑰,而清單則用於指定多個金鑰。換句話說,元組是橫向的(橫跨層級),而清單則是縱向的(掃描層級)。

重要的是,元組清單會編制多個完整的 MultiIndex 金鑰的索引,而清單元組則會參照層級內的幾個值

In [48]: s = pd.Series(
   ....:     [1, 2, 3, 4, 5, 6],
   ....:     index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
   ....: )
   ....: 

In [49]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
Out[49]: 
A  c    1
B  d    5
dtype: int64

In [50]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
Out[50]: 
A  c    1
   d    2
B  c    4
   d    5
dtype: int64

使用切片器#

您可以透過提供多個索引器來切片 MultiIndex

您可以提供任何選取器,就像使用標籤索引一樣,請參閱 依標籤選取,包括切片、標籤清單、標籤和布林索引器。

您可以使用 slice(None) 選取層級的所有內容。您不需要指定所有更深層的層級,它們會被暗示為 slice(None)

與往常一樣,切片器的兩側都包含在內,因為這是標籤索引。

警告

您應該在 .loc 規格說明中指定所有軸,表示索引的索引器。有一些模稜兩可的情況,傳遞的索引器可能會被誤解為索引兩個軸,而不是說成是列的 MultiIndex

您應該這樣做

df.loc[(slice("A1", "A3"), ...), :]  # noqa: E999

應該這樣做

df.loc[(slice("A1", "A3"), ...)]  # noqa: E999
In [51]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....: 

In [52]: miindex = pd.MultiIndex.from_product(
   ....:     [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
   ....: )
   ....: 

In [53]: micolumns = pd.MultiIndex.from_tuples(
   ....:     [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
   ....: )
   ....: 

In [54]: dfmi = (
   ....:     pd.DataFrame(
   ....:         np.arange(len(miindex) * len(micolumns)).reshape(
   ....:             (len(miindex), len(micolumns))
   ....:         ),
   ....:         index=miindex,
   ....:         columns=micolumns,
   ....:     )
   ....:     .sort_index()
   ....:     .sort_index(axis=1)
   ....: )
   ....: 

In [55]: dfmi
Out[55]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]

使用切片、清單和標籤的基本 MultiIndex 切片。

In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
Out[56]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

您可以使用 pandas.IndexSlice 使用 : 而不是 slice(None) 來簡化語法。

In [57]: idx = pd.IndexSlice

In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[58]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

可以使用此方法在多個軸上同時執行相當複雜的選取。

In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]: 
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[60]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

使用布林索引器,您可以提供與相關的選取。

In [61]: mask = dfmi[("a", "foo")] > 200

In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
Out[62]: 
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

您也可以指定 axis 參數給 .loc 來詮釋在單一軸上傳遞的切片器。

In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]

此外,您可以使用下列方法設定值。

In [64]: df2 = dfmi.copy()

In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10

In [66]: df2
Out[66]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]

您也可以使用可對齊物件的右側。

In [67]: df2 = dfmi.copy()

In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000

In [69]: df2
Out[69]: 
lvl0              a               b        
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]

橫切面#

DataFramexs() 方法另外採用 level 參數,讓在 MultiIndex 的特定層級選取資料變得更容易。

In [70]: df
Out[70]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [71]: df.xs("one", level="second")
Out[71]: 
              A         B         C
first                              
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466
# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466

您也可以透過提供 axis 參數,使用 xs 選取欄位。

In [73]: df = df.T

In [74]: df.xs("one", level="second", axis=1)
Out[74]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466
# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

xs 也允許多個鍵選取。

In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]: 
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681
# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

您可以將 drop_level=False 傳遞給 xs,以保留已選取的層級。

In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

將上述內容與使用 drop_level=True (預設值) 的結果進行比較。

In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

進階重新索引和對齊#

在 pandas 物件的 reindex()align() 方法中使用參數 level 可用於在層級間廣播值。例如

In [80]: midx = pd.MultiIndex(
   ....:     levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
   ....: )
   ....: 

In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [82]: df
Out[82]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [83]: df2 = df.groupby(level=0).mean()

In [84]: df2
Out[84]: 
             0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [85]: df2.reindex(df.index, level=0)
Out[85]: 
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)

In [87]: df_aligned
Out[87]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [88]: df2_aligned
Out[88]: 
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

使用 swaplevel 交換層級#

swaplevel() 方法可以切換兩個層級的順序

In [89]: df[:5]
Out[89]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]: 
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

使用 reorder_levels 重新排序層級 #

reorder_levels() 方法概括了 swaplevel 方法,允許您一步驟排列階層索引層級

In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]: 
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

重新命名 IndexMultiIndex 的名稱 #

rename() 方法用於重新命名 MultiIndex 的標籤,通常用於重新命名 DataFrame 的欄位。renamecolumns 參數允許指定僅包含您想要重新命名的欄位的字典。

In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]: 
            col0      col1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

此方法也可用于重新命名 DataFrame 的主索引的特定標籤。

In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]: 
               0         1
two  z  1.519970 -0.493662
     x  0.600178  0.274230
zero z  0.132885 -0.023688
     x  2.410179  1.450520

使用 rename_axis() 方法來重新命名 IndexMultiIndex 的名稱。特別是,可以指定 MultiIndex 層級的名稱,這在稍後使用 reset_index() 將值從 MultiIndex 移至欄位時很有用。

In [94]: df.rename_axis(index=["abc", "def"])
Out[94]: 
                 0         1
abc  def                    
one  y    1.519970 -0.493662
     x    0.600178  0.274230
zero y    0.132885 -0.023688
     x    2.410179  1.450520

請注意,DataFrame 的欄位是一個索引,因此使用帶有 columns 參數的 rename_axis 會變更該索引的名稱。

In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')

renamerename_axis 都支援指定字典、Series 或對應函數,以將標籤/名稱對應至新值。

直接使用 Index 物件(而非透過 DataFrame)時,可以使用 Index.set_names() 來變更名稱。

In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])

In [98]: mi2 = mi.rename("new name", level=0)

In [99]: mi2
Out[99]: 
MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['new name', 'y'])

無法透過層級設定 MultiIndex 的名稱。

In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"

File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value)
   1686 @name.setter
   1687 def name(self, value: Hashable) -> None:
   1688     if self._no_setting_name:
   1689         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690         raise RuntimeError(
   1691             "Cannot set name on a level of a MultiIndex. Use "
   1692             "'MultiIndex.set_names' instead."
   1693         )
   1694     maybe_extract_name(value, None, type(self))
   1695     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.

請改用 Index.set_names()

排序 MultiIndex#

對於要有效率地索引和切片 MultiIndex 物件,它們需要經過排序。與任何索引一樣,您可以使用 sort_index()

In [101]: import random

In [102]: random.shuffle(tuples)

In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [104]: s
Out[104]: 
foo  two    0.206053
baz  two   -0.251905
qux  two   -2.213588
baz  one    1.063327
foo  one    1.266143
qux  one    0.299368
bar  two   -0.863838
     one    0.408204
dtype: float64

In [105]: s.sort_index()
Out[105]: 
bar  one    0.408204
     two   -0.863838
baz  one    1.063327
     two   -0.251905
foo  one    1.266143
     two    0.206053
qux  one    0.299368
     two   -2.213588
dtype: float64

In [106]: s.sort_index(level=0)
Out[106]: 
bar  one    0.408204
     two   -0.863838
baz  one    1.063327
     two   -0.251905
foo  one    1.266143
     two    0.206053
qux  one    0.299368
     two   -2.213588
dtype: float64

In [107]: s.sort_index(level=1)
Out[107]: 
bar  one    0.408204
baz  one    1.063327
foo  one    1.266143
qux  one    0.299368
bar  two   -0.863838
baz  two   -0.251905
foo  two    0.206053
qux  two   -2.213588
dtype: float64

如果 MultiIndex 層級有命名,您也可以傳遞層級名稱給 sort_index

In [108]: s.index = s.index.set_names(["L1", "L2"])

In [109]: s.sort_index(level="L1")
Out[109]: 
L1   L2 
bar  one    0.408204
     two   -0.863838
baz  one    1.063327
     two   -0.251905
foo  one    1.266143
     two    0.206053
qux  one    0.299368
     two   -2.213588
dtype: float64

In [110]: s.sort_index(level="L2")
Out[110]: 
L1   L2 
bar  one    0.408204
baz  one    1.063327
foo  one    1.266143
qux  one    0.299368
bar  two   -0.863838
baz  two   -0.251905
foo  two    0.206053
qux  two   -2.213588
dtype: float64

在較高維度的物件上,如果它們有 MultiIndex,您可以依據層級來對任何其他軸進行排序。

In [111]: df.T.sort_index(level=1, axis=1)
Out[111]: 
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688

即使資料未經排序,索引也會運作,但效率會很低(並會顯示 PerformanceWarning)。它也會傳回資料的副本,而非檢視。

In [112]: dfm = pd.DataFrame(
   .....:     {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
   .....: )
   .....: 

In [113]: dfm = dfm.set_index(["jim", "joe"])

In [114]: dfm
Out[114]: 
            jolie
jim joe          
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968

In [115]: dfm.loc[(1, 'z')]
Out[115]: 
           jolie
jim joe         
1   z    0.53702

此外,如果您嘗試索引未完全經由詞彙排序的項目,這可能會引發

In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError                        Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6626 
   (...)
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2904, in MultiIndex.slice_locs(self, start, end, step)
   2852 """
   2853 For an ordered MultiIndex, compute the slice locations for input
   2854 labels.
   (...)
   2900                       sequence of such.
   2901 """
   2902 # This function adds nothing to its parent implementation (the magic
   2903 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2904 return super().slice_locs(start, end, step)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
   6877 start_slice = None
   6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
   6880 if start_slice is None:
   6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2848, in MultiIndex.get_slice_bound(self, label, side)
   2846 if not isinstance(label, tuple):
   2847     label = (label,)
-> 2848 return self._partial_tup_index(label, side=side)

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2908, in MultiIndex._partial_tup_index(self, tup, side)
   2906 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
   2907     if len(tup) > self._lexsort_depth:
-> 2908         raise UnsortedIndexError(
   2909             f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
   2910             f"({self._lexsort_depth})"
   2911         )
   2913     n = len(tup)
   2914     start, end = 0, len(self)

UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

MultiIndex 上的 is_monotonic_increasing() 方法顯示索引是否已排序

In [117]: dfm.index.is_monotonic_increasing
Out[117]: False
In [118]: dfm = dfm.sort_index()

In [119]: dfm
Out[119]: 
            jolie
jim joe          
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [120]: dfm.index.is_monotonic_increasing
Out[120]: True

現在,選取運作如預期般。

In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]: 
            jolie
jim joe          
1   y    0.110968
    z    0.537020

取用方法#

類似 NumPy ndarrays,pandas IndexSeriesDataFrame 也提供 take() 方法,用於在給定軸上擷取給定索引的元素。給定的索引必須是整數索引位置的清單或 ndarray。take 也會接受負整數作為物件末端的相對位置。

In [122]: index = pd.Index(np.random.randint(0, 1000, 10))

In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [124]: positions = [0, 9, 3]

In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')

In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')

In [127]: ser = pd.Series(np.random.randn(10))

In [128]: ser.iloc[positions]
Out[128]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [129]: ser.take(positions)
Out[129]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

對於 DataFrames,給定的索引應為指定列或行位置的 1d 清單或 ndarray。

In [130]: frm = pd.DataFrame(np.random.randn(5, 3))

In [131]: frm.take([1, 4, 3])
Out[131]: 
          0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [132]: frm.take([0, 2], axis=1)
Out[132]: 
          0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704

請務必注意,pandas 物件上的 take 方法並非用於處理布林索引,且可能會傳回意外的結果。

In [133]: arr = np.random.randn(10)

In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [135]: arr[[0, 1]]
Out[135]: array([-1.1935,  0.6775])

In [136]: ser = pd.Series(np.random.randn(10))

In [137]: ser.take([False, False, True, True])
Out[137]: 
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [138]: ser.iloc[[0, 1]]
Out[138]: 
0    0.233141
1   -0.223540
dtype: float64

最後,關於效能的簡要說明,由於 take 方法處理的輸入範圍較窄,因此其提供的效能比花式索引快很多。

In [139]: arr = np.random.randn(10000, 5)

In [140]: indexer = np.arange(10000)

In [141]: random.shuffle(indexer)

In [142]: %timeit arr[indexer]
   .....: %timeit arr.take(indexer, axis=0)
   .....: 
246 us +- 7.83 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
73.5 us +- 2.52 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
In [143]: ser = pd.Series(arr[:, 0])

In [144]: %timeit ser.iloc[indexer]
   .....: %timeit ser.take(indexer)
   .....: 
144 us +- 3.72 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
132 us +- 3.57 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

索引類型#

我們已在先前的章節中相當廣泛地討論 MultiIndex。關於 DatetimeIndexPeriodIndex 的文件顯示於 此處,而關於 TimedeltaIndex 的文件則可在 此處 找到。

在以下小節中,我們將重點介紹其他一些索引類型。

分類索引#

CategoricalIndex 是一種索引類型,可用於支援重複索引。這是 Categorical 的容器,並允許有效率地索引和儲存具有大量重複元素的索引。

In [145]: from pandas.api.types import CategoricalDtype

In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})

In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

In [148]: df
Out[148]: 
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [149]: df.dtypes
Out[149]: 
A       int64
B    category
dtype: object

In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object')

設定索引將建立 CategoricalIndex

In [151]: df2 = df.set_index("B")

In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

使用 __getitem__/.iloc/.loc 進行索引的方式類似於具有重複項目的 Index。索引器必須在類別中,否則操作會引發 KeyError

In [153]: df2.loc["a"]
Out[153]: 
   A
B   
a  0
a  1
a  5

索引後會保留 CategoricalIndex

In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

對索引進行排序會依類別順序排序(請回想我們使用 CategoricalDtype(list('cab')) 建立索引,因此排序順序為 cab)。

In [155]: df2.sort_index()
Out[155]: 
   A
B   
c  4
a  0
a  1
a  5
b  2
b  3

索引上的群組操作也會保留索引性質。

In [156]: df2.groupby(level=0, observed=True).sum()
Out[156]: 
   A
B   
c  4
a  6
b  5

In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

重新編製索引的運算會根據傳遞索引器的類型傳回結果索引。傳遞清單會傳回純舊 Index;使用 Categorical 編製索引會傳回 CategoricalIndex,根據傳遞 Categorical 資料類型的類別編製索引。這允許您任意編製這些索引,即使值不在類別中,類似於您可以重新編製任何 pandas 索引的方式。

In [158]: df3 = pd.DataFrame(
   .....:     {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
   .....: )
   .....: 

In [159]: df3 = df3.set_index("B")

In [160]: df3
Out[160]: 
   A
B   
a  0
b  1
c  2
In [161]: df3.reindex(["a", "e"])
Out[161]: 
     A
B     
a  0.0
e  NaN

In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')

In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]: 
     A
B     
a  0.0
e  NaN

In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')

警告

CategoricalIndex 上重新調整大小和比較運算必須有相同的類別,否則會引發 TypeError

In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})

In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))

In [167]: df4 = df4.set_index("B")

In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')

In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})

In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))

In [171]: df5 = df5.set_index("B")

In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B')
In [173]: pd.concat([df4, df5])
Out[173]: 
   A
B   
b  0
a  1
b  0
c  1

RangeIndex#

RangeIndexIndex 的子類別,它提供所有 DataFrameSeries 物件的預設索引。 RangeIndexIndex 的最佳化版本,它可以表示單調的有序集合。這類比於 Python 範圍類型RangeIndex 永遠會有 int64 dtype。

In [174]: idx = pd.RangeIndex(5)

In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1)

RangeIndex 是所有 DataFrameSeries 物件的預設索引

In [176]: ser = pd.Series([1, 2, 3])

In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)

In [178]: df = pd.DataFrame([[1, 2], [3, 4]])

In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)

In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1)

RangeIndex 的行為會類似於 int64 dtype 的 Index,且在 RangeIndex 上的運算,其結果無法由 RangeIndex 表示,但應具有整數 dtype,會轉換為具有 int64Index。例如

In [181]: idx[[0, 2]]
Out[181]: Index([0, 2], dtype='int64')

IntervalIndex#

IntervalIndex 連同它自己的資料型態 IntervalDtype 以及 Interval 純量型態,允許在 pandas 中以一級支援的方式使用區間表示法。

IntervalIndex 允許一些獨特的索引,並且也用作 cut()qcut() 中類別的回傳型態。

使用 IntervalIndex 進行索引#

IntervalIndex 可以用在 SeriesDataFrame 中作為索引。

In [182]: df = pd.DataFrame(
   .....:     {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
   .....: )
   .....: 

In [183]: df
Out[183]: 
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4

透過 .loc 沿著區間邊緣進行基於標籤的索引,會像您預期的那樣運作,選取那個特定的區間。

In [184]: df.loc[2]
Out[184]: 
A    2
Name: (1, 2], dtype: int64

In [185]: df.loc[[2, 3]]
Out[185]: 
        A
(1, 2]  2
(2, 3]  3

如果您選取一個包含在區間內的標籤,這也會選取該區間。

In [186]: df.loc[2.5]
Out[186]: 
A    3
Name: (2, 3], dtype: int64

In [187]: df.loc[[2.5, 3.5]]
Out[187]: 
        A
(2, 3]  3
(3, 4]  4

使用 Interval 進行選取只會回傳完全相符的結果。

In [188]: df.loc[pd.Interval(1, 2)]
Out[188]: 
A    2
Name: (1, 2], dtype: int64

嘗試選取一個沒有完全包含在 IntervalIndex 中的 Interval 會引發 KeyError

In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1431, in _LocIndexer._getitem_axis(self, key, axis)
   1429 # fall thru to straight lookup
   1430 self._validate_key(key, axis)
-> 1431 return self._get_label(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1381, in _LocIndexer._get_label(self, label, axis)
   1379 def _get_label(self, label, axis: AxisInt):
   1380     # GH#5567 this will fail if the label is not present in the axis.
-> 1381     return self.obj.xs(label, axis=axis)

File ~/work/pandas/pandas/pandas/core/generic.py:4298, in NDFrame.xs(self, key, axis, level, drop_level)
   4296             new_index = index[loc]
   4297 else:
-> 4298     loc = index.get_loc(key)
   4300     if isinstance(loc, np.ndarray):
   4301         if loc.dtype == np.bool_:

File ~/work/pandas/pandas/pandas/core/indexes/interval.py:678, in IntervalIndex.get_loc(self, key)
    676 matches = mask.sum()
    677 if matches == 0:
--> 678     raise KeyError(key)
    679 if matches == 1:
    680     return mask.argmax()

KeyError: Interval(0.5, 2.5, closed='right')

選取與給定 區間 重疊的所有 區間 可以使用 overlaps() 方法來建立布林索引。

In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [191]: idxr
Out[191]: array([ True,  True,  True, False])

In [192]: df[idxr]
Out[192]: 
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3

使用 cutqcut 對資料進行分組#

cut()qcut() 都會傳回一個 Categorical 物件,而它們建立的分組會儲存在 IntervalIndex 中,作為其 .categories 屬性。

In [193]: c = pd.cut(range(4), bins=2)

In [194]: c
Out[194]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')

cut() 也接受 IntervalIndex 作為其 bins 參數,這啟用了實用的 pandas 慣用語。首先,我們呼叫 cut(),並將一些資料和 bins 設定為固定數字,以產生區間。然後,我們將 .categories 的值傳遞為後續呼叫 cut() 中的 bins 參數,提供將會分組到相同區間的新資料。

In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

落在所有區間之外的任何值都將會指定為 NaN 值。

產生區間範圍#

如果我們需要定期頻率的區間,我們可以使用 interval_range() 函數使用 startendperiods 的各種組合來建立 IntervalIndexinterval_range 的預設頻率為數字區間的 1,以及日期時間類型區間的日曆日

In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
               (2017-01-02 00:00:00, 2017-01-03 00:00:00],
               (2017-01-03 00:00:00, 2017-01-04 00:00:00],
               (2017-01-04 00:00:00, 2017-01-05 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]: 
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
               (1 days 00:00:00, 2 days 00:00:00],
               (2 days 00:00:00, 3 days 00:00:00]],
              dtype='interval[timedelta64[ns], right]')

freq 參數可用於指定非預設頻率,並可使用各種 頻率別名 與日期時間類型區間

In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')

In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
               (2017-01-08 00:00:00, 2017-01-15 00:00:00],
               (2017-01-15 00:00:00, 2017-01-22 00:00:00],
               (2017-01-22 00:00:00, 2017-01-29 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Out[202]: 
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
               (0 days 09:00:00, 0 days 18:00:00],
               (0 days 18:00:00, 1 days 03:00:00]],
              dtype='interval[timedelta64[ns], right]')

此外,closed 參數可用於指定區間封閉在哪一側。區間在右側預設為封閉。

In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')

In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')

指定 startendperiods 會產生一個從 startend(包含兩者)的均勻間隔區間範圍,其中 periods 是產生的 IntervalIndex 中元素的數量

In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')

In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]: 
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
               (2018-01-20 08:00:00, 2018-02-08 16:00:00],
               (2018-02-08 16:00:00, 2018-02-28 00:00:00]],
              dtype='interval[datetime64[ns], right]')

雜項索引常見問題#

整數索引#

使用整數軸標籤的標籤式索引是一個棘手的議題。它已在郵寄清單和科學 Python 社群的各個成員間廣泛討論。在 pandas 中,我們的觀點通常是標籤比整數位置更重要。因此,對於整數軸索引,僅能使用標準工具(例如 .loc)進行標籤式索引。以下程式碼會產生例外狀況

In [207]: s = pd.Series(range(5))

In [208]: s[-1]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key)
    412 try:
--> 413     return self._range.index(new_key)
    414 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]

File ~/work/pandas/pandas/pandas/core/series.py:1112, in Series.__getitem__(self, key)
   1109     return self._values[key]
   1111 elif key_is_scalar:
-> 1112     return self._get_value(key)
   1114 # Convert generator to list before going through hashable part
   1115 # (We will iterate through the generator there to check for slices)
   1116 if is_iterator(key):

File ~/work/pandas/pandas/pandas/core/series.py:1228, in Series._get_value(self, label, takeable)
   1225     return self._values[label]
   1227 # Similar to Index.get_value, but we do not fall back to positional
-> 1228 loc = self.index.get_loc(label)
   1230 if is_integer(loc):
   1231     return self._values[loc]

File ~/work/pandas/pandas/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key)
    413         return self._range.index(new_key)
    414     except ValueError as err:
--> 415         raise KeyError(key) from err
    416 if isinstance(key, Hashable):
    417     raise KeyError(key)

KeyError: -1

In [209]: df = pd.DataFrame(np.random.randn(5, 4))

In [210]: df
Out[210]: 
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

In [211]: df.loc[-2:]
Out[211]: 
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

這個深思熟慮的決定旨在防止歧義和微妙的錯誤(許多使用者回報在 API 變更為停止「回退」至基於位置的索引時發現錯誤)。

非單調索引需要完全符合#

如果 SeriesDataFrame 的索引單調遞增或遞減,則基於標籤的切片邊界可以超出索引範圍,就像切片索引一個普通的 Python list 一樣。可以使用 is_monotonic_increasing()is_monotonic_decreasing() 屬性來測試索引的單調性。

In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))

In [213]: df.index.is_monotonic_increasing
Out[213]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
Out[214]: 
   data
2     0
3     1
3     2
4     3

# slice is are outside the index, so empty DataFrame is returned
In [215]: df.loc[13:15, :]
Out[215]: 
Empty DataFrame
Columns: [data]
Index: []

另一方面,如果索引不是單調的,則兩個切片邊界都必須是索引的唯一成員。

In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

In [217]: df.index.is_monotonic_increasing
Out[217]: False

# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
Out[218]: 
   data
2     0
3     1
1     2
4     3
 # 0 is not in the index
In [219]: df.loc[0:4, :]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:191, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:234, in pandas._libs.index.IndexEngine._get_loc_duplicates()

File index.pyx:242, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

File index.pyx:134, in pandas._libs.index._unpack_bool_indexer()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]

File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
   1374 if self._multi_take_opportunity(tup):
   1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
   1017 if com.is_null_slice(key):
   1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
   1021 # We should never have retval.ndim < self.ndim, as that should
   1022 #  be handled by the _getitem_lowerdim call above.
   1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6626 
   (...)
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
   6877 start_slice = None
   6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
   6880 if start_slice is None:
   6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6804, in Index.get_slice_bound(self, label, side)
   6801         return self._searchsorted_monotonic(label, side)
   6802     except ValueError:
   6803         # raise the original KeyError
-> 6804         raise err
   6806 if isinstance(slc, np.ndarray):
   6807     # get_loc may return a boolean array, which
   6808     # is OK as long as they are representable by a slice.
   6809     assert is_bool_dtype(slc.dtype)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6798, in Index.get_slice_bound(self, label, side)
   6796 # we need to look up the label
   6797 try:
-> 6798     slc = self.get_loc(label)
   6799 except KeyError as err:
   6800     try:

File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 0

 # 3 is not a unique label
In [220]: df.loc[2:3, :]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]

File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
   1374 if self._multi_take_opportunity(tup):
   1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
   1017 if com.is_null_slice(key):
   1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
   1021 # We should never have retval.ndim < self.ndim, as that should
   1022 #  be handled by the _getitem_lowerdim call above.
   1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6626 
   (...)
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6885, in Index.slice_locs(self, start, end, step)
   6883 end_slice = None
   6884 if end is not None:
-> 6885     end_slice = self.get_slice_bound(end, "right")
   6886 if end_slice is None:
   6887     end_slice = len(self)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6812, in Index.get_slice_bound(self, label, side)
   6810     slc = lib.maybe_booleans_to_slice(slc.view("u1"))
   6811     if isinstance(slc, np.ndarray):
-> 6812         raise KeyError(
   6813             f"Cannot get {side} slice bound for non-unique "
   6814             f"label: {repr(original_label)}"
   6815         )
   6817 if isinstance(slc, slice):
   6818     if side == "left":

KeyError: 'Cannot get right slice bound for non-unique label: 3'

Index.is_monotonic_increasingIndex.is_monotonic_decreasing 只檢查索引是否弱單調。若要檢查嚴格單調性,可以將其中一個與 is_unique() 屬性結合使用。

In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])

In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True

In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False

端點包含#

與切片端點不包含的標準 Python 序列切片相比,pandas 中的基於標籤的切片包含。這樣做的主要原因是通常無法輕易確定索引中特定標籤之後的「後繼」或下一個元素。例如,考慮以下 Series

In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))

In [226]: s
Out[226]: 
a   -0.101684
b   -0.734907
c   -0.130121
d   -0.476046
e    0.759104
f    0.213379
dtype: float64

假設我們希望從 c 切片到 e,使用整數可以這樣完成

In [227]: s[2:5]
Out[227]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64

然而,如果你只有 ce,確定索引中的下一個元素可能會有點複雜。例如,下列方法無法運作

In [228]: s.loc['c':'e' + 1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]

TypeError: can only concatenate str (not "int") to str

一個非常常見的用例是將時間序列限制為從兩個特定日期開始和結束。為了啟用此功能,我們做出設計選擇,讓基於標籤的切片包含兩個端點

In [229]: s.loc["c":"e"]
Out[229]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64

這絕對是一種「實用性勝過純粹性」的事情,但如果你預期基於標籤的切片會完全按照標準 Python 整數切片運作的方式運作,那麼這是一件需要留意的問題。

索引可能會變更底層 Series dtype#

不同的索引操作可能會變更 Series 的 dtype。

In [230]: series1 = pd.Series([1, 2, 3])

In [231]: series1.dtype
Out[231]: dtype('int64')

In [232]: res = series1.reindex([0, 4])

In [233]: res.dtype
Out[233]: dtype('float64')

In [234]: res
Out[234]: 
0    1.0
4    NaN
dtype: float64
In [235]: series2 = pd.Series([True])

In [236]: series2.dtype
Out[236]: dtype('bool')

In [237]: res = series2.reindex_like(series1)

In [238]: res.dtype
Out[238]: dtype('O')

In [239]: res
Out[239]: 
0    True
1     NaN
2     NaN
dtype: object

這是因為上述(重新)索引操作會靜默插入 NaNs,而 dtype 會相應變更。在使用 numpy ufuncs(例如 numpy.logical_and)時,這可能會造成一些問題。

請參閱 GH 2388 以取得更詳細的討論。