2017-03-16 90 views
0

我想從正則表達式產生的結果中創建一個熊貓數據框中的新列。pandas函數中的正則表達式

我期待的結果是:

In[1]: df 
Out[1]: 

    valueProduct valueService  totValue 
0  $465580.99  $322532.34 $788113.33 

我的數據框dtypes是:

df.dtypes 

Contracting Office Name    object 
Contracting Office Region    object 
PIID         object 
PIID Agency ID      object 
Major Program       object 
Description of Requirement   object 
Referenced IDV PIID     object 
Completion Date    datetime64[ns] 
Prepared By       object 
Funding Office Name     object 
Funding Agency ID      object 
Funding Agency Name     object 
Funding Office ID      object 
Effective Date    datetime64[ns] 
Fiscal Year       int64 
Ultimate Contract Value    float64 
Count         int64 

1行中題爲「要求的說明」一欄有如下的長字符串值(在這一列中的相似字符串值通過數據集):

管理員添加額外的體積和道路工作變化銀滑道監護項目 - ALLEGHENY國家產品的森林VALUE =服務$ 465580.99 VALUE =合同的$ 322532.34總額= $ 788113.33

我想成功地寫一個正則表達式從這個字符串中提取3項,但僅產生新列的美元價值:

VALUE OF PRODUCT = $465580.99 
VALUE OF SERVICE = $322532.34 
TOTAL VALUE OF CONTRACT = $788113.33 

下面的代碼做這個假設在數據幀的字符串進行一個簡單的字符串值數據框之外:

text = "STEWARDSHIP ADD ADDITIONAL VOLUME AND ROAD WORK CHANGES SILVER SLIDE STEWARDSHIP PROJECT - ALLEGHENY NATIONAL FOREST VALUE OF PRODUCT = $465580.99 VALUE OF SERVICE = $322532.34 TOTAL VALUE OF CONTRACT = $788113.33" 


pattern = re.compile('(VALUE OF PRODUCT).{1,3}\$\d*\.\d*', re.IGNORECASE) 
getPattern = re.search(pattern, text) 
print (getPattern.group()) 

將產生:

VALUE OF PRODUCT = $465580.99 

我可以爲其他兩個項目重複此操作。

現在,感覺我在一個數據幀的工作我試圖做類似如下:

def valProduct(row): 
    pattern = re.compile('(VALUE OF PRODUCT).{1,3}\$\d*\.\d*', re.IGNORECASE) 
    findPattern = re.search(pattern, row['Description of Requirement']) 
    return findPatter 

df['valueProduct'] = df.apply(lambda row: valProduct(row), axis=1) 

In[2]: sf[['valueProduct']][:1] 
Out[2]: None 

這將產生一個新的列,但其空,但應該至少是表明:

VALUE OF PRODUCT = $465580.99 

任何幫助,非常感謝!

回答

1
import re  

text = "STEWARDSHIP ADD ADDITIONAL VOLUME AND ROAD WORK CHANGES SILVER SLIDE STEWARDSHIP PROJECT - ALLEGHENY NATIONAL FOREST VALUE OF PRODUCT = $465580.99 VALUE OF SERVICE = $322532.34 TOTAL VALUE OF CONTRACT = $788113.33" 

re.findall(r'value.+?\d\b',text, re.I) 

輸出

['VALUE OF PRODUCT = $465580', 'VALUE OF SERVICE = $322532', 'VALUE OF CONTRACT = $788113']