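Working with text data#

Text data types#

There are two ways to store text data in pandas:

object-dtype NumPy array.
StringDtype extension type.

We recommend using StringDtype to store text data.

Prior to pandas 1.0, object dtype was the only option. This was unfortunate for several reasons:

You can accidentally store a mixture of strings and non-strings in an object dtype array. It is better to have a dedicated dtype.
object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There is no clear way to select just text while excluding non-text but still object-dtype columns.
When reading code, the contents of an object dtype array are less clear than 'string'.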
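The select_dtypes() point can be illustrated with a small sketch. The frame and its column names (name, mixed, n) are hypothetical, and the sketch assumes that select_dtypes accepts "string" as a selector for StringDtype columns:

```python
import pandas as pd

# Hypothetical frame: one dedicated text column, one object column holding
# a mix of strings and non-strings, and one integer column.
df = pd.DataFrame(
    {
        "name": pd.Series(["ann", "bob"], dtype="string"),
        "mixed": pd.Series(["x", 1], dtype="object"),
        "n": [1, 2],
    }
)

# Selecting by the dedicated dtype picks up only the text column;
# the mixed object column is left out.
text_columns = list(df.select_dtypes(include="string").columns)

print(text_columns)
```

With object dtype alone, the name and mixed columns would share a dtype and could not be separated this way.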

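Currently, the performance of object dtype arrays of strings and arrays.StringArray is about the same. We expect future enhancements to significantly increase the performance and lower the memory overhead of StringArray.

Warning

StringArray is currently considered experimental. The implementation and parts of the API may change without warning.

For backwards compatibility, object dtype remains the default type that a list of strings is inferred as: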

In [1]: pd.Series(["a", "b", "c"])
Out[1]: 
0    a
1    b
2    c
dtype: object

To explicitly request string dtype, specify the dtype:

In [2]: pd.Series(["a", "b", "c"], dtype="string")
Out[2]: 
0    a
1    b
2    c
dtype: string

In [3]: pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
Out[3]: 
0    a
1    b
2    c
dtype: string

Or use astype after the Series or DataFrame is created:

In [4]: s = pd.Series(["a", "b", "c"])

In [5]: s
Out[5]: 
0    a
1    b
2    c
dtype: object

In [6]: s.astype("string")
Out[6]: 
0    a
1    b
2    c
dtype: string

You can also use StringDtype/"string" as the dtype on non-string data, and it will be converted to string dtype:

In [7]: s = pd.Series(["a", 2, np.nan], dtype="string")

In [8]: s
Out[8]: 
0       a
1       2
2    <NA>
dtype: string

In [9]: type(s[1])
Out[9]: str

Or convert from existing pandas data:

In [10]: s1 = pd.Series([1, 2, np.nan], dtype="Int64")

In [11]: s1
Out[11]: 
0       1
1       2
2    <NA>
dtype: Int64

In [12]: s2 = s1.astype("string")

In [13]: s2
Out[13]: 
0       1
1       2
2    <NA>
dtype: string

In [14]: type(s2[0])
Out[14]: str

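Behavior differences#

These are the places where the behavior of StringDtype objects differs from object dtype:

For StringDtype, string accessor methods that return numeric output will always return a nullable integer dtype, rather than either int or float dtype depending on the presence of NA values. Methods returning boolean output will return a nullable boolean dtype.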

In [15]: s = pd.Series(["a", None, "b"], dtype="string")

In [16]: s
Out[16]: 
0       a
1    <NA>
2       b
dtype: string

In [17]: s.str.count("a")
Out[17]: 
0       1
1    <NA>
2       0
dtype: Int64

In [18]: s.dropna().str.count("a")
Out[18]: 
0    1
2    0
dtype: Int64

Both outputs are Int64 dtype. Compare that with object dtype:

In [19]: s2 = pd.Series(["a", None, "b"], dtype="object")

In [20]: s2.str.count("a")
Out[20]: 
0    1.0
1    NaN
2    0.0
dtype: float64

In [21]: s2.dropna().str.count("a")
Out[21]: 
0    1
2    0
dtype: int64

When NA values are present, the output dtype is float64. The same applies to methods returning boolean values:

In [22]: s.str.isdigit()
Out[22]: 
0    False
1     <NA>
2    False
dtype: boolean

In [23]: s.str.match("a")
Out[23]: 
0     True
1     <NA>
2    False
dtype: boolean

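Some string methods, like Series.str.decode(), are not available on StringArray because StringArray only holds strings, not bytes.
In comparison operations, arrays.StringArray and Series backed by a StringArray will return an object with BooleanDtype, rather than a bool dtype object. Missing values in a StringArray will propagate in comparison operations, rather than always comparing unequal like numpy.nan.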
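As a minimal sketch of the comparison behavior, the following compares the same values stored with each dtype (the variable names are illustrative only):

```python
import pandas as pd

s_string = pd.Series(["a", None, "b"], dtype="string")
s_object = pd.Series(["a", None, "b"], dtype="object")

# StringDtype: the result is a nullable boolean Series and the missing
# value propagates into the result.
eq_string = s_string == "a"

# object dtype: the result is a plain bool Series and the missing value
# simply compares as not equal.
eq_object = s_object == "a"

print(eq_string)
print(eq_object)
```

The missing entry stays missing in the first result, so it has to be handled explicitly (for example with fillna) before using the result as a boolean mask.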

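Everything else that follows in the rest of this document applies equally to string and object dtype.

String methods#

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods: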

In [24]: s = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   ....: )
   ....: 

In [25]: s.str.lower()
Out[25]: 
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

In [26]: s.str.upper()
Out[26]: 
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

In [27]: s.str.len()
Out[27]: 
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64

In [28]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')

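The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace: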

In [32]: df = pd.DataFrame(
   ....:     np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3)
   ....: )
   ....: 

In [33]: df
Out[33]: 
   Column A   Column B 
0   0.469112  -0.282863
1  -1.509059  -1.135632
2   1.212112  -0.173215

Since df.columns is an Index object, we can use the .str accessor:

In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')

In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')

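These string methods can then be used to clean up the columns as needed. Here we remove leading and trailing whitespace, lowercase all names, and replace any remaining whitespace with underscores: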

In [36]: df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

In [37]: df
Out[37]: 
   column_a  column_b
0  0.469112 -0.282863
1 -1.509059 -1.135632
2  1.212112 -0.173215

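Note

If you have a Series where many elements are repeated (i.e. the number of unique elements in the Series is much smaller than the length of the Series), it can be faster to convert the original Series to one of type category and then use .str.<method> or .dt.<property> on that. The performance difference comes from the fact that, for a Series of type category, the string operations are done on the .categories and not on each element of the Series.

Please note that a Series of type category with string .categories has some limitations in comparison to a Series of type string (e.g. you cannot add strings to each other: s + " " + s will not work if s is a Series of type category). Also, .str methods which operate on elements of type list are not available on such a Series.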
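A minimal sketch of the category approach, using made-up data with three distinct values repeated many times. It only checks that both routes give the same values and makes no timing claim:

```python
import pandas as pd

# Hypothetical data: 3 unique values, 3000 elements.
s = pd.Series(["apple", "banana", "cherry"] * 1000)
s_cat = s.astype("category")

# On the category Series, the string operation is applied to the
# 3 categories rather than to each of the 3000 elements.
upper_from_cat = s_cat.str.upper()
upper_direct = s.str.upper()

print(len(s_cat.cat.categories))
print(upper_from_cat.head(3).tolist())
```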

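Warning

The type of the Series is inferred and must be one of the allowed types (i.e. strings).

Generally speaking, the .str accessor is intended to work only on strings. With very few exceptions, other uses are not supported and may be disabled at a later point.

Splitting and replacing strings#

Methods like split return a Series of lists: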

In [38]: s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")

In [39]: s2.str.split("_")
Out[39]: 
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

Elements in the split lists can be accessed using get or [] notation:

In [40]: s2.str.split("_").str.get(1)
Out[40]: 
0       b
1       d
2    <NA>
3       g
dtype: object

In [41]: s2.str.split("_").str[1]
Out[41]: 
0       b
1       d
2    <NA>
3       g
dtype: object

It is easy to expand this to return a DataFrame using expand:

In [42]: s2.str.split("_", expand=True)
Out[42]: 
      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h

When the original Series has StringDtype, the output columns will all be StringDtype as well.
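One way to check this, sketched here with illustrative variable names, is to inspect the dtypes of the expanded result:

```python
import pandas as pd

s = pd.Series(["a_b_c", None, "f_g_h"], dtype="string")
expanded = s.str.split("_", expand=True)

# Each resulting column is expected to carry the string extension dtype.
all_string = all(isinstance(dtype, pd.StringDtype) for dtype in expanded.dtypes)

print(expanded.dtypes)
print(all_string)
```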

It is also possible to limit the number of splits:

In [43]: s2.str.split("_", expand=True, n=1)
Out[43]: 
      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h

rsplit is similar to split except that it works in the reverse direction, i.e. from the end of the string to the beginning:

In [44]: s2.str.rsplit("_", expand=True, n=1)
Out[44]: 
      0     1
0   a_b     c
1   c_d     e
2  <NA>  <NA>
3   f_g     h

replace optionally uses regular expressions:

In [45]: s3 = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"],
   ....:     dtype="string",
   ....: )
   ....: 

In [46]: s3
Out[46]: 
0       A
1       B
2       C
3    Aaba
4    Baca
5        
6    <NA>
7    CABA
8     dog
9     cat
dtype: string

In [47]: s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True)
Out[47]: 
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6        <NA>
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: string

Changed in version 2.0.

A single-character pattern with regex=True will also be treated as a regular expression:

In [48]: s4 = pd.Series(["a.b", ".", "b", np.nan, ""], dtype="string")

In [49]: s4
Out[49]: 
0     a.b
1       .
2       b
3    <NA>
4        
dtype: string

In [50]: s4.str.replace(".", "a", regex=True)
Out[50]: 
0     aaa
1       a
2       a
3    <NA>
4        
dtype: string

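If you want literal replacement of a string (equivalent to str.replace()), you can set the optional regex parameter to False rather than escaping each character. In this case both pat and repl must be strings: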

In [51]: dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")

# These lines are equivalent
In [52]: dollars.str.replace(r"-\$", "-", regex=True)
Out[52]: 
0         12
1        -10
2    $10,000
dtype: string

In [53]: dollars.str.replace("-$", "-", regex=False)
Out[53]: 
0         12
1        -10
2    $10,000
dtype: string

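The replace method can also take a callable as replacement. It is called on every match of pat using re.sub(). The callable should expect one positional argument (a regex match object) and return a string.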

# Reverse every lowercase alphabetic word
In [54]: pat = r"[a-z]+"

In [55]: def repl(m):
   ....:     return m.group(0)[::-1]
   ....: 

In [56]: pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace(
   ....:     pat, repl, regex=True
   ....: )
   ....: 
Out[56]: 
0    oof 123
1    rab zab
2       <NA>
dtype: string

# Using regex groups
In [57]: pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"

In [58]: def repl(m):
   ....:     return m.group("two").swapcase()
   ....: 

In [59]: pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace(
   ....:     pat, repl, regex=True
   ....: )
   ....: 
Out[59]: 
0     bAR
1    <NA>
dtype: string

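The replace method also accepts a compiled regular expression object from re.compile() as a pattern. All flags should be included in the compiled regular expression object.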

In [60]: import re

In [61]: regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)

In [62]: s3.str.replace(regex_pat, "XX-XX ", regex=True)
Out[62]: 
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6        <NA>
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: string

Including a flags argument when calling replace with a compiled regular expression object will raise a ValueError.

In [63]: s3.str.replace(regex_pat, 'XX-XX ', flags=re.IGNORECASE)
---------------------------------------------------------------------------
ValueError: case and flags cannot be set when pat is a compiled regex

removeprefix and removesuffix have the same effect as str.removeprefix and str.removesuffix, which were added in Python 3.9 (https://docs.python.org/3/library/stdtypes.html#str.removeprefix):

New in version 1.4.0.

In [64]: s = pd.Series(["str_foo", "str_bar", "no_prefix"])

In [65]: s.str.removeprefix("str_")
Out[65]: 
0          foo
1          bar
2    no_prefix
dtype: object

In [66]: s = pd.Series(["foo_str", "bar_str", "no_suffix"])

In [67]: s.str.removesuffix("_str")
Out[67]: 
0          foo
1          bar
2    no_suffix
dtype: object

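Concatenation#

There are several ways to concatenate a Series or Index, either with itself or with others, all based on Series.str.cat() or, for an Index, Index.str.cat().

Concatenating a single Series into a string#

The content of a Series (or Index) can be concatenated: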

In [68]: s = pd.Series(["a", "b", "c", "d"], dtype="string")

In [69]: s.str.cat(sep=",")
Out[69]: 'a,b,c,d'

If not specified, the keyword sep for the separator defaults to the empty string, sep='':

In [70]: s.str.cat()
Out[70]: 'abcd'

By default, missing values are ignored. Using na_rep, they can be given a representation:

In [71]: t = pd.Series(["a", "b", np.nan, "d"], dtype="string")

In [72]: t.str.cat(sep=",")
Out[72]: 'a,b,d'

In [73]: t.str.cat(sep=",", na_rep="-")
Out[73]: 'a,b,-,d'

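Concatenating a Series and something list-like into a Series#

The first argument to cat() can be a list-like object, provided that it matches the length of the calling Series (or Index).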

In [74]: s.str.cat(["A", "B", "C", "D"])
Out[74]: 
0    aA
1    bB
2    cC
3    dD
dtype: string

Missing values on either side will result in missing values in the result as well, unless na_rep is specified:

In [75]: s.str.cat(t)
Out[75]: 
0      aa
1      bb
2    <NA>
3      dd
dtype: string

In [76]: s.str.cat(t, na_rep="-")
Out[76]: 
0    aa
1    bb
2    c-
3    dd
dtype: string

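Concatenating a Series and something array-like into a Series#

The parameter others can also be two-dimensional. In this case, the number of rows must match the length of the calling Series (or Index).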

In [77]: d = pd.concat([t, s], axis=1)

In [78]: s
Out[78]: 
0    a
1    b
2    c
3    d
dtype: string

In [79]: d
Out[79]: 
      0  1
0     a  a
1     b  b
2  <NA>  c
3     d  d

In [80]: s.str.cat(d, na_rep="-")
Out[80]: 
0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

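Concatenating a Series and an indexed object into a Series, with alignment#

For concatenation with a Series or DataFrame, it is possible to align the indexes before concatenation by setting the join keyword.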

In [81]: u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="string")

In [82]: s
Out[82]: 
0    a
1    b
2    c
3    d
dtype: string

In [83]: u
Out[83]: 
1    b
3    d
0    a
2    c
dtype: string

In [84]: s.str.cat(u)
Out[84]: 
0    aa
1    bb
2    cc
3    dd
dtype: string

In [85]: s.str.cat(u, join="left")
Out[85]: 
0    aa
1    bb
2    cc
3    dd
dtype: string

The usual options are available for join (one of 'left', 'outer', 'inner', 'right'). In particular, alignment also means that the lengths no longer need to coincide.

In [86]: v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")

In [87]: s
Out[87]: 
0    a
1    b
2    c
3    d
dtype: string

In [88]: v
Out[88]: 
-1    z
 0    a
 1    b
 3    d
 4    e
dtype: string

In [89]: s.str.cat(v, join="left", na_rep="-")
Out[89]: 
0    aa
1    bb
2    c-
3    dd
dtype: string

In [90]: s.str.cat(v, join="outer", na_rep="-")
Out[90]: 
-1    -z
 0    aa
 1    bb
 2    c-
 3    dd
 4    -e
dtype: string

The same alignment can be used when others is a DataFrame:

In [91]: f = d.loc[[3, 2, 1, 0], :]

In [92]: s
Out[92]: 
0    a
1    b
2    c
3    d
dtype: string

In [93]: f
Out[93]: 
      0  1
3     d  d
2  <NA>  c
1     b  b
0     a  a

In [94]: s.str.cat(f, join="left", na_rep="-")
Out[94]: 
0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

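Concatenating a Series and many objects into a Series#

Several array-like items (specifically: Series, Index, and 1-dimensional variants of np.ndarray) can be combined in a list-like container (including iterators, dict views, etc.).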

In [95]: s
Out[95]: 
0    a
1    b
2    c
3    d
dtype: string

In [96]: u
Out[96]: 
1    b
3    d
0    a
2    c
dtype: string

In [97]: s.str.cat([u, u.to_numpy()], join="left")
Out[97]: 
0    aab
1    bbd
2    cca
3    ddc
dtype: string

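All elements without an index (e.g. np.ndarray) within the passed list-like must match the length of the calling Series (or Index), but Series and Index may have arbitrary length (as long as alignment is not disabled with join=None):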

In [98]: v
Out[98]: 
-1    z
 0    a
 1    b
 3    d
 4    e
dtype: string

In [99]: s.str.cat([v, u, u.to_numpy()], join="outer", na_rep="-")
Out[99]: 
-1    -z--
0     aaab
1     bbbd
2     c-ca
3     dddc
4     -e--
dtype: string

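If join='right' is used on a list-like of others that contains different indexes, the union of these indexes will be used as the basis for the final concatenation: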

In [100]: u.loc[[3]]
Out[100]: 
3    d
dtype: string

In [101]: v.loc[[-1, 0]]
Out[101]: 
-1    z
 0    a
dtype: string

In [102]: s.str.cat([u.loc[[3]], v.loc[[-1, 0]]], join="right", na_rep="-")
Out[102]: 
 3    dd-
-1    --z
 0    a-a
dtype: string

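Indexing with `.str`#

You can use [] notation to index directly by position. If you index past the end of the string, the result will be a missing value (NaN for object dtype, <NA> for string dtype).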

In [103]: s = pd.Series(
   .....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   .....: )
   .....: 

In [104]: s.str[0]
Out[104]: 
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string

In [105]: s.str[1]
Out[105]: 
0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string

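Extracting substrings#

Extract first match in each subject (extract)#

The extract method accepts a regular expression with at least one capture group.

Extracting a regular expression with more than one group returns a DataFrame with one column per group.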

In [106]: pd.Series(
   .....:     ["a1", "b2", "c3"],
   .....:     dtype="string",
   .....: ).str.extract(r"([ab])(\d)", expand=False)
   .....: 
Out[106]: 
      0     1
0     a     1
1     b     2
2  <NA>  <NA>

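Elements that do not match return a row filled with missing values. Thus, a Series of messy strings can be "converted" into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without needing get() to access tuples or re.match objects. For object-dtype input, the dtype of the result is always object, even if no match is found and the result only contains NaN. For StringDtype input, as in the examples here, the result keeps the string dtype and missing entries are shown as <NA>.

Named groups can also be used, for example: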

In [107]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(
   .....:     r"(?P<letter>[ab])(?P<digit>\d)", expand=False
   .....: )
   .....: 
Out[107]: 
  letter digit
0      a     1
1      b     2
2   <NA>  <NA>

as can optional groups, for example:

In [108]: pd.Series(
   .....:     ["a1", "b2", "3"],
   .....:     dtype="string",
   .....: ).str.extract(r"([ab])?(\d)", expand=False)
   .....: 
Out[108]: 
      0  1
0     a  1
1     b  2
2  <NA>  3

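Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.

Extracting a regular expression with one group returns a DataFrame with one column if expand=True.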

In [109]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=True)
Out[109]: 
      0
0     1
1     2
2  <NA>

It returns a Series if expand=False.

In [110]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=False)
Out[110]: 
0       1
1       2
2    <NA>
dtype: string

Calling extract on an Index with a regex that has exactly one capture group returns a DataFrame with one column if expand=True.

In [111]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string")

In [112]: s
Out[112]: 
A11    a1
B22    b2
C33    c3
dtype: string

In [113]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Out[113]: 
  letter
0      A
1      B
2      C

It returns an Index if expand=False.

In [114]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
Out[114]: Index(['A', 'B', 'C'], dtype='object', name='letter')

Calling extract on an Index with a regex that has more than one capture group returns a DataFrame if expand=True.

In [115]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
Out[115]: 
  letter   1
0      A  11
1      B  22
2      C  33

It raises ValueError if expand=False.

In [116]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[116], line 1
----> 1 s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)

File ~/work/pandas/pandas/pandas/core/strings/accessor.py:137, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    132     msg = (
    133         f"Cannot use .str.{func_name} with values of "
    134         f"inferred dtype '{self._inferred_dtype}'."
    135     )
    136     raise TypeError(msg)
--> 137 return func(self, *args, **kwargs)

File ~/work/pandas/pandas/pandas/core/strings/accessor.py:2743, in StringMethods.extract(self, pat, flags, expand)
   2740     raise ValueError("pattern contains no capture groups")
   2742 if not expand and regex.groups > 1 and isinstance(self._data, ABCIndex):
-> 2743     raise ValueError("only one regex group is supported with Index")
   2745 obj = self._data
   2746 result_dtype = _result_dtype(obj)

ValueError: only one regex group is supported with Index

The table below summarizes the behavior of extract(expand=False) (input subject in the first column, number of groups in the regex in the first row):

	1 group	>1 group
Index	Index	ValueError
Series	Series	DataFrame
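The four cells of this table can be reproduced with a short sketch. The sample values and the patterns one_group and two_groups are made up for illustration:

```python
import pandas as pd

s = pd.Series(["a1", "b2"], dtype="string")
idx = pd.Index(["a1", "b2"])

one_group = r"([ab])"
two_groups = r"([ab])(\d)"

# Record the type name of each result, keyed by (input type, group count).
summary = {
    ("Series", 1): type(s.str.extract(one_group, expand=False)).__name__,
    ("Series", 2): type(s.str.extract(two_groups, expand=False)).__name__,
    ("Index", 1): type(idx.str.extract(one_group, expand=False)).__name__,
}

# An Index with more than one group cannot be returned as an Index,
# so extract raises instead.
try:
    idx.str.extract(two_groups, expand=False)
except ValueError:
    summary[("Index", 2)] = "ValueError"

for key, value in summary.items():
    print(key, "->", value)
```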

Extract all matches in each subject (extractall)#

Unlike extract (which returns only the first match),

In [117]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")

In [118]: s
Out[118]: 
A    a1a2
B      b1
C      c1
dtype: string

In [119]: two_groups = "(?P<letter>[a-z])(?P<digit>[0-9])"

In [120]: s.str.extract(two_groups, expand=True)
Out[120]: 
  letter digit
A      a     1
B      b     1
C      c     1

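the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.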

In [121]: s.str.extractall(two_groups)
Out[121]: 
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1

When each subject string in the Series has exactly one match,

In [122]: s = pd.Series(["a3", "b3", "c2"], dtype="string")

In [123]: s
Out[123]: 
0    a3
1    b3
2    c2
dtype: string

then extractall(pat).xs(0, level='match') gives the same result as extract(pat).

In [124]: extract_result = s.str.extract(two_groups, expand=True)

In [125]: extract_result
Out[125]: 
  letter digit
0      a     3
1      b     3
2      c     2

In [126]: extractall_result = s.str.extractall(two_groups)

In [127]: extractall_result
Out[127]: 
        letter digit
  match             
0 0          a     3
1 0          b     3
2 0          c     2

In [128]: extractall_result.xs(0, level="match")
Out[128]: 
  letter digit
0      a     3
1      b     3
2      c     2

Index also supports .str.extractall. It returns a DataFrame with the same result as Series.str.extractall on a Series with a default index (starting from 0).

In [129]: pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups)
Out[129]: 
        letter digit
  match             
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1

In [130]: pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups)
Out[130]: 
        letter digit
  match             
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1

Testing for strings that match or contain a pattern#

You can check whether elements contain a pattern:

In [131]: pattern = r"[0-9][a-z]"

In [132]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.contains(pattern)
   .....: 
Out[132]: 
0    False
1    False
2     True
3     True
4     True
5     True
dtype: boolean

Or whether elements match a pattern:

In [133]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.match(pattern)
   .....: 
Out[133]: 
0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean

In [134]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.fullmatch(pattern)
   .....: 
Out[134]: 
0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean

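Note

The distinction between match, fullmatch, and contains is strictness: fullmatch tests whether the entire string matches the regular expression; match tests whether there is a match of the regular expression that begins at the first character of the string; and contains tests whether there is a match of the regular expression at any position within the string.

The corresponding functions in the re package for these three match modes are re.fullmatch, re.match, and re.search, respectively.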
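The correspondence with the re functions can be sketched as follows, applying each re function element by element and comparing with the accessor methods. The sample values are illustrative and contain no missing values:

```python
import re

import pandas as pd

pattern = r"[0-9][a-z]"
values = ["1", "3a", "03c", "4dx"]
s = pd.Series(values, dtype="string")

# Element-wise results using the standard library directly.
via_re = {
    "contains": [re.search(pattern, v) is not None for v in values],
    "match": [re.match(pattern, v) is not None for v in values],
    "fullmatch": [re.fullmatch(pattern, v) is not None for v in values],
}

# The same three tests through the .str accessor.
via_pandas = {
    "contains": s.str.contains(pattern).tolist(),
    "match": s.str.match(pattern).tolist(),
    "fullmatch": s.str.fullmatch(pattern).tolist(),
}

print(via_re == via_pandas)
```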

Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so that missing values can be treated as True or False:

In [135]: s4 = pd.Series(
   .....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   .....: )
   .....: 

In [136]: s4.str.contains("A", na=False)
Out[136]: 
0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
dtype: boolean

Creating indicator variables#

You can extract dummy variables from string columns. For example, if they are separated by a '|':

In [137]: s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="string")

In [138]: s.str.get_dummies(sep="|")
Out[138]: 
   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1

A string Index also supports get_dummies, which returns a MultiIndex.

In [139]: idx = pd.Index(["a", "a|b", np.nan, "a|c"])

In [140]: idx.str.get_dummies(sep="|")
Out[140]: 
MultiIndex([(1, 0, 0),
            (1, 1, 0),
            (0, 0, 0),
            (1, 0, 1)],
           names=['a', 'b', 'c'])

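See also get_dummies().

Method summary#

Method	Description
`cat()`	Concatenate strings
`split()`	Split strings on delimiter
`rsplit()`	Split strings on delimiter working from the end of the string
`get()`	Index into each element (retrieve i-th element)
`join()`	Join strings in each element of the Series with passed separator
`get_dummies()`	Split strings on the delimiter, returning a DataFrame of dummy variables
`contains()`	Return a boolean array indicating whether each string contains pattern/regex
`replace()`	Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
`removeprefix()`	Remove prefix from string, i.e. only remove if string starts with prefix.
`removesuffix()`	Remove suffix from string, i.e. only remove if string ends with suffix.
`repeat()`	Duplicate values (`s.str.repeat(3)` is equivalent to `x * 3` for each element `x`)
`pad()`	Add whitespace to the left, right, or both sides of strings
`center()`	Equivalent to `str.center`
`ljust()`	Equivalent to `str.ljust`
`rjust()`	Equivalent to `str.rjust`
`zfill()`	Equivalent to `str.zfill`
`wrap()`	Split long strings into lines with length less than a given width
`slice()`	Slice each string in the Series
`slice_replace()`	Replace slice in each string with passed value
`count()`	Count occurrences of pattern
`startswith()`	Equivalent to `str.startswith(pat)` for each element
`endswith()`	Equivalent to `str.endswith(pat)` for each element
`findall()`	Compute list of all occurrences of pattern/regex for each string
`match()`	Call `re.match` on each element, returning a boolean indicating whether it matches
`extract()`	Call `re.search` on each element, returning a DataFrame with one row for each element and one column for each regex capture group
`extractall()`	Call `re.findall` on each element, returning a DataFrame with one row for each match and one column for each regex capture group
`len()`	Compute string lengths
`strip()`	Equivalent to `str.strip`
`rstrip()`	Equivalent to `str.rstrip`
`lstrip()`	Equivalent to `str.lstrip`
`partition()`	Equivalent to `str.partition`
`rpartition()`	Equivalent to `str.rpartition`
`lower()`	Equivalent to `str.lower`
`casefold()`	Equivalent to `str.casefold`
`upper()`	Equivalent to `str.upper`
`find()`	Equivalent to `str.find`
`rfind()`	Equivalent to `str.rfind`
`index()`	Equivalent to `str.index`
`rindex()`	Equivalent to `str.rindex`
`capitalize()`	Equivalent to `str.capitalize`
`swapcase()`	Equivalent to `str.swapcase`
`normalize()`	Return Unicode normal form. Equivalent to `unicodedata.normalize`
`translate()`	Equivalent to `str.translate`
`isalnum()`	Equivalent to `str.isalnum`
`isalpha()`	Equivalent to `str.isalpha`
`isdigit()`	Equivalent to `str.isdigit`
`isspace()`	Equivalent to `str.isspace`
`islower()`	Equivalent to `str.islower`
`isupper()`	Equivalent to `str.isupper`
`istitle()`	Equivalent to `str.istitle`
`isnumeric()`	Equivalent to `str.isnumeric`
`isdecimal()`	Equivalent to `str.isdecimal`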
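As a brief sketch of a few table entries that have no example earlier on this page (zfill, pad, repeat, and slice), applied to made-up values:

```python
import pandas as pd

s = pd.Series(["7", "42", "123"], dtype="string")

# Left-pad with zeros to a width of 5, like str.zfill.
zero_filled = s.str.zfill(5).tolist()

# Left-pad with a custom fill character to a width of 5.
padded = s.str.pad(5, side="left", fillchar="*").tolist()

# Repeat each string twice, like x * 2 for each element x.
repeated = s.str.repeat(2).tolist()

# Keep the first two characters of each string.
first_two = s.str.slice(0, 2).tolist()

print(zero_filled)
print(padded)
print(repeated)
print(first_two)
```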
