2017-04-15 55 views
0

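I have a pandas DataFrame with a column that holds several links in each cell. Given such a list containing many junk links, how can I extract all the links that end in .pdf, as shown below?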

Name|COL 
San Diego|'https://foo.com/energy_docs/tyv/2004/019787_S30_gasTOC.cfm https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf https://foo.com/energy_docs/tyv/99/293-_9302SDFS 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf https://foo.com/energy_docs/tyv/98/019787-S16_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/97/019787-S15_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/97/019787-S14_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf https://foo.com/energy_docs/tyv/96/019787-S12_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/96/019787-S11_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/96/019787-S10_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S9_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S8_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/96/19-787s007_Amlodipine.cfm https://foo.com/energy_docs/tyv/pre96/019787-S6_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S5_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S4_gas GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S3_gas_toc.cfm https://foo.com/energy_docs/tyv/pre96/019787-S2_gas GAS_TPC.cfm' 
Washington|'https://foo.com/energy_docs/a32/2007/022136.cfm' 
Texas|'https://foo.com/energy/29380/no_ant/USA/2/2007.pdf' 

How can I extract all the links ending in .pdf so that the result looks like this:

Name|COL 
San Diego|https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf 
San Diego|https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf 
San Diego|https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf 
San Diego|https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf 
San Diego|https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf 
San Diego|https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf 
Washington|NaN 
Texas|https://foo.com/energy/29380/no_ant/USA/2/2007.pdf 

I tried:

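import re 

def url_extractor(row): 
    # Treat the cell as text and collect every URL that ends in .pdf
    url = str(row) 
    r = re.compile(r'(http[^\s]+\.pdf)') 
    urls = r.findall(url) 
    if len(urls) == 0: 
        # No PDF link in this cell
        return 'NaN' 
    else: 
        # All PDF links, joined into one space-separated string
        return ' '.join(urls) 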

In:

df4['COL'] = df4['COL'].apply(url_extractor) 
df4 

Out:

Name COL 
0 San Diego https://foo.com/energy_docs/tyv/99/19787s022_g... 
1 Washington NaN 
2 Texas https://foo.com/energy/29380/no_ant/USA/2/2007... 

However, I don't know how to stack/split the rows so that each row holds a single link/URL. For example, let's inspect the first row:

In:

df4['COL'][0] 

Out:

'https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf'

Each link should be "mapped" to its name, San Diego.

+1

Is this your actual data? If so, why is '(?<=href=").*?(?=")' the regex you tried? It is miles away from working. – Vallentin

+0

Oops, sorry... I was trying several things... I've updated it... @Vallentin –

Answers

1

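If the data is already loaded into a pandas DataFrame, you can use the built-in string methods to split each COL string into a list, keep only the elements you want from each list, reshape the column of lists into long format, and then merge it back with the original DataFrame: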

import pandas as pd

# break COL into lists of strings, keeping only those that end in '.pdf' 
COL_series = df.COL.str.split().apply(lambda x: [y for y in x if y.endswith('.pdf')]) 
# create a long format series from the lists 
COL_series = COL_series.apply(pd.Series).stack().reset_index(level=1, drop=True) 
COL_series.name = 'COL' 

# merge with df 
pd.merge(df.Name.reset_index(), 
     COL_series.reset_index(), 
     how='outer', 
     on='index').drop('index', axis=1) 

# returns: 
   Name        COL 
0  San Diego   https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf 
1  San Diego   https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf 
2  San Diego   https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf 
3  San Diego   https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf 
4  San Diego   https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf 
5  San Diego   https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf 
6  Washington  NaN 
7  Texas       https://foo.com/energy/29380/no_ant/USA/2/2007.pdf 
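
For a frame with more columns than Name and COL, one possible variation is sketched below. It assumes pandas 0.25 or later, where `explode` is available. The small sample frame and the `Other` column are made up for illustration and stand in for whatever extra columns the real data has.

```python
import pandas as pd

# Hypothetical sample frame; "Other" stands in for any extra columns.
df = pd.DataFrame({
    "Name": ["San Diego", "Washington", "Texas"],
    "Other": [1, 2, 3],
    "COL": [
        "https://foo.com/a/toc.cfm https://foo.com/a/one.pdf https://foo.com/a/two.pdf",
        "https://foo.com/b/022136.cfm",
        "https://foo.com/energy/29380/no_ant/USA/2/2007.pdf",
    ],
})

# One list of .pdf URLs per row; rows without a match get an empty list.
pdf_lists = df["COL"].str.findall(r"http\S+\.pdf")

# explode() gives each list element its own row, repeats the other
# columns, and turns an empty list into NaN.
result = df.assign(COL=pdf_lists).explode("COL").reset_index(drop=True)
print(result)
```

Because the reshaping happens on the frame itself, no merge is needed and every other column is carried along unchanged.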
+0

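Thanks for the help. However, my DataFrame has other columns (3 more), and when I applied this it dropped them... How can I do this without dropping the other 3 columns? I only showed two columns in the question to save space and keep it readable. I tried removing Name from pd.merge(), but that added COL_x and COL_y –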

2

Instead of [^<] you should use [^\s] or, shorter, \S. Then append \.pdf:

(http\S+\.pdf) 

Live Demo

Edit:

Yes, you can also use word boundaries if you want.

(\bhttp.*?\.pdf\b) 

Live Demo
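
The two patterns can behave differently when a non-PDF link comes before a PDF link in the same string. The sketch below is a quick check with Python's `re` module on a made-up three-URL string (the URLs are illustrative, not taken from the question's data).

```python
import re

# Hypothetical input: one junk link followed by two PDF links.
text = ("https://foo.com/docs/toc.cfm "
        "https://foo.com/docs/a.pdf "
        "https://foo.com/docs/b.pdf")

# \S+ cannot cross whitespace, so each match stays within one URL.
strict = re.findall(r"http\S+\.pdf", text)

# .*? may cross whitespace, so a match can start at an earlier non-PDF URL.
loose = re.findall(r"\bhttp.*?\.pdf\b", text)

print(strict)
print(loose)
```

Here the `\S+` version returns just the two PDF links, while the first match of the `.*?` version begins at the `.cfm` link and runs through the first `.pdf`, so the `\S` form appears to be the safer choice for space-separated lists like the one in the question.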

+0

Thanks, I thought I could do it with '\b'... Any idea how to split/stack the rows so that only one link is left in each row?... –

+1

Yes, you can use '\b' too (I updated the answer and added an example). Split how? If there are several, how would you decide which one to keep? – Vallentin

+0

OK, thanks!... Check my update! –
