Apply a regex to a df and add the values in new columns

This is my dataset:

BlaBla 128 MB EE 
ADTD 6 gb DTS 
EEEDC 2GB RS 
STA 12MB DFA 
BBNB 32 mb YED

From this dataset, I want to extract the MB/GB number and the MB/GB unit. So I created the following regex:

(\d*)\s?(MB|GB)

The code I wrote to apply this regex to my DF is:

pattern = re.compile(r'(\d*)\s?(MB|GB)') 
invoice_df['mbs'] = invoice_df['Rate Plan'].apply(lambda x: pattern.search(x).group(1)) 
invoice_df['unit'] = invoice_df['Rate Plan'].apply(lambda x: pattern.search(x).group(2))

But when I apply the regex to my DF, it gives the following error message:

AttributeError: 'NoneType' object has no attribute 'group'

What can I do to solve this?

Source

2017-02-15 Joe_ft

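What if you make the pattern case-insensitive? '(?i)(\d+)\s*(MB|GB)'? I would also use '+' with '\d' and '*' with '\s'.

@WiktorStribiżew Still the same error message.

So some entries simply do not contain a match, and you access 'group(1)' and 'group(2)' without checking whether a match occurred.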

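Since some entries may not match, re.search comes back empty for them (it returns None instead of a match object). You need to handle that case inside the lambda: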

apply(lambda x: pattern.search(x).group(1) if pattern.search(x) else "")
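To show the guard in context, here is a minimal sketch that assumes a small frame built from the question's rows plus one row with no size in it. The helper name `group_or_empty` is made up for illustration; it does the same check as the one-liner above but searches only once per call.

```python
import re

import pandas as pd

# Sample frame: the question's rows plus one row that contains no size
invoice_df = pd.DataFrame({
    "Rate Plan": [
        "BlaBla 128 MB EE",
        "ADTD 6 gb DTS",
        "EEEDC 2GB RS",
        "STA 12MB DFA",
        "BBNB 32 mb YED",
        "Nothing to find here",
    ]
})

pattern = re.compile(r"(\d+)\s*(MB|GB)", re.IGNORECASE)


def group_or_empty(text, index):
    """Hypothetical helper: capture group `index`, or "" when nothing matches."""
    match = pattern.search(text)
    return match.group(index) if match else ""


invoice_df["mbs"] = invoice_df["Rate Plan"].apply(lambda x: group_or_empty(x, 1))
invoice_df["unit"] = invoice_df["Rate Plan"].apply(lambda x: group_or_empty(x, 2).upper())

print(invoice_df)
```

Rows without a match end up with empty strings in both new columns instead of raising AttributeError.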

I also suggest using

(?i)(\d+)\s*([MGK]B)

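It finds one or more digits (\d+, group 1), then zero or more whitespace characters (\s*), and captures KB, GB, or MB in group 2 (([MGK]B)), matching case-insensitively.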
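A quick way to see the difference is to run the question's pattern and the suggested one side by side. This is only a sketch: the last two sample strings and the helper `groups_or_none` are made up for the comparison and are not part of the question's data.

```python
import re

original = re.compile(r"(\d*)\s?(MB|GB)")          # pattern from the question
suggested = re.compile(r"(?i)(\d+)\s*([MGK]B)")    # pattern suggested above

samples = [
    "BlaBla 128 MB EE",   # upper-case unit: both patterns match
    "ADTD 6 gb DTS",      # lower-case unit: only the case-insensitive one matches
    "STA 512 kb DFA",     # made-up row with a KB unit
    "XMB 7",              # made-up row: "MB" with no number in front of it
]


def groups_or_none(pattern, text):
    """Hypothetical helper: the captured groups, or None when nothing matches."""
    match = pattern.search(text)
    return match.groups() if match else None


for text in samples:
    print(repr(text), groups_or_none(original, text), groups_or_none(suggested, text))
```

The lower-case rows are the ones that produce None with the question's pattern, which is where the AttributeError comes from. The "XMB 7" row shows why \d+ is safer than \d*: with \d* the number group may be empty, so a bare "MB" still counts as a match.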

Source

2017-02-15 10:38:38

You just need to check that something was found before requesting the groups:

import re

inputs = ["BlaBla 128 MB EE",
          "ADTD 6 gb DTS",
          "EEEDC 2GB RS",
          "STA 12MB DFA",
          "BBNB 32 mb YED",
          "Nothing to find here"]

# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r"(\d+)\s*([MG]B)", re.IGNORECASE)

for text in inputs:
    match = pattern.search(text)
    if match:
        # Only ask for the groups once we know there is a match
        mbs = match.group(1)
        unit = match.group(2)
        print(mbs, unit.upper())
    else:
        print("Nothing found for : %r" % text)

# 128 MB
# 6 GB
# 2 GB
# 12 MB
# 32 MB
# Nothing found for : 'Nothing to find here'

With your code:

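pattern = re.compile(r"(\d+)\s*([MG]B)", re.IGNORECASE)

def extract(text):
    # re.search works on a single string, so run it once per row
    match = pattern.search(text)
    # Empty strings for rows where nothing was found
    return (match.group(1), match.group(2)) if match else ("", "")

invoice_df['mbs'], invoice_df['unit'] = zip(*invoice_df['Rate Plan'].map(extract))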

IMHO this is more readable than the lambda, and it does not run the search twice.
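As an alternative that neither answer mentions, pandas has `Series.str.extract`, which applies the regex to the whole column and leaves NaN where a row does not match. The sketch below assumes a sample frame built from the question's rows; the group names `mbs` and `unit` are chosen here so that the extracted columns come back already labelled.

```python
import re

import pandas as pd

# Sample frame: the question's rows plus one row that contains no size
invoice_df = pd.DataFrame({
    "Rate Plan": [
        "BlaBla 128 MB EE",
        "ADTD 6 gb DTS",
        "EEEDC 2GB RS",
        "STA 12MB DFA",
        "BBNB 32 mb YED",
        "Nothing to find here",
    ]
})

# Named groups become the column names of the extracted frame
extracted = invoice_df["Rate Plan"].str.extract(
    r"(?P<mbs>\d+)\s*(?P<unit>[MG]B)", flags=re.IGNORECASE
)

invoice_df["mbs"] = extracted["mbs"]
invoice_df["unit"] = extracted["unit"].str.upper()  # normalise "gb"/"mb" to upper case

print(invoice_df)
```

Unmatched rows hold NaN rather than an empty string, so follow up with `fillna("")` if empty strings are preferred.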

Source

2017-02-15 10:30:20
