2017-02-17 80 views
0

Regex with Beautiful Soup: extract all the letters after ':'

I can't seem to get regular expressions to work the way I want.

When I run this code:

for paragraph in soup.find_all('p'):
    print(paragraph.find_all(text=re.compile(":*\w*")))

I get the full text back:

Continuing our series of surfacing 2016 stinkers, here are the 25 Russell 2000 stocks that imploded in 2016. Further down, you'll find the 25 worst stocks excluding pharma. Ophthotech (NASDAQ:OPHT) -94% Galena Biopharma (NASDAQ:GALE) -93% Cempra (NASDAQ:CEMP) -91% Toaki Pharma (NASDAQ:TKAI) -89% Anthera Pharma (NASDAQ:ANTH) -86% Adeptus Health (NYSE:ADPT) -86% CytRx (NASDAQ:CYTR) -86% Novavax (NASDAQ:NVAX) -85%

I only want to extract the stock tickers, so the ideal output is:

OPHT 
GALE 
CEMP 
TKAI 

and so on.

I tried these variations of the code:

for paragraph in soup.find_all('p'):
    print(paragraph.find_all(text=re.compile('(:\w+)')))
for paragraph in soup.find_all('p'):
    print(paragraph.find_all(text=re.compile("(:*\w*)")))
for paragraph in soup.find_all('p'):
    print(paragraph.find_all(text=re.compile('(:)?\w+')))

but most of the time I end up with:

['Continuing our ', 'series', " of surfacing 2016 stinkers, here are the 25 Russell 2000 stocks that imploded in 2016. Further down, you'll find the 25 worst stocks excluding pharma."]
['Ophthotech (NASDAQ:', 'OPHT', ') -94%']
['Galena Biopharma (NASDAQ:', 'GALE', ') -93%']
['Cempra (NASDAQ:', 'CEMP', ') -91%']
['Toaki Pharma (NASDAQ:', 'TKAI', ') -89%']
['Anthera Pharma (NASDAQ:', 'ANTH', ') -86%']
['Adeptus Health (NYSE:', 'ADPT', ') -86%']
['CytRx (NASDAQ:', 'CYTR', ') -86%']
['Novavax (NASDAQ:', 'NVAX', ') -85%']

I'm not sure what I'm doing wrong to get this output.

Thanks.

+0

What does the original text you are trying to parse look like? – serk

Answers

2

Put an r prefix in front of the pattern string. You can try this:

import re 

text = """Continuing our series of surfacing 2016 stinkers, here are the 25 Russell 2000 stocks that imploded in 2016. Further down, you'll find the 25 worst stocks excluding pharma. 
Ophthotech (NASDAQ:OPHT) -94% 
Galena Biopharma (NASDAQ:GALE) -93% 
Cempra (NASDAQ:CEMP) -91% 
Toaki Pharma (NASDAQ:TKAI) -89% 
Anthera Pharma (NASDAQ:ANTH) -86% 
Adeptus Health (NYSE:ADPT) -86% 
CytRx (NASDAQ:CYTR) -86% 
Novavax (NASDAQ:NVAX) -85%""" 

# It's better to compile a regex once, outside the loop
pattern = re.compile(r':(\w+)\)') 

results = pattern.findall(text) 

for items in results: 
    print(items) 
+0

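I still have to use a loop to get the text from soup.find_all('p'), but this code seems to work: 'for paragraph in soup.find_all('p'): print(pattern.findall(paragraph.text))'. I still don't understand why the 'r' needs to go in front of the regex. Thanks. – Moondra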
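A note on the loop in the comment above: `find_all(text=...)` uses the regex only as a filter and returns each matching text node whole, which is why the attempts in the question printed entire strings. Calling `findall` on the paragraph text returns just the captured group. The following is a minimal sketch of that loop, not the poster's actual code; the list `paragraph_texts` is a hypothetical stand-in for the `paragraph.text` values from `soup.find_all('p')`, so the snippet runs without Beautiful Soup.

```python
import re

# Hypothetical stand-in for [p.text for p in soup.find_all('p')].
paragraph_texts = [
    "Continuing our series of surfacing 2016 stinkers.",
    "Ophthotech (NASDAQ:OPHT) -94%",
    "Galena Biopharma (NASDAQ:GALE) -93%",
    "Adeptus Health (NYSE:ADPT) -86%",
]

# The question's first pattern can match the empty string, so as a
# filter it accepts every text node instead of narrowing anything down.
loose = re.compile(r":*\w*")
accepts_everything = all(loose.search(t) is not None for t in paragraph_texts)

# The answer's pattern: a colon, the captured ticker, then a literal ')'.
pattern = re.compile(r":(\w+)\)")

tickers = []
for text in paragraph_texts:
    # findall returns only the captured group for each match;
    # paragraphs without a ticker contribute an empty list.
    tickers.extend(pattern.findall(text))

for ticker in tickers:
    print(ticker)
```

Collecting into one list with `extend` avoids printing an empty list for paragraphs that contain no ticker.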

+0

The "r" prefix means "raw string". Without it, you have to double the backslash that escapes a metacharacter. For example, r"\d+" is equivalent to "\\d+" –
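To make the raw-string point concrete, here is a small sketch using only the standard library. The variable names are illustrative, and the `\b` case is an extra example chosen here (it is not from the thread) because it is one where leaving out the prefix silently changes the pattern.

```python
import re

raw = r"\d+"        # raw string: the backslash is kept as typed
escaped = "\\d+"    # regular string: the backslash must be doubled

# Both literals produce the same three characters: \ d +
same = raw == escaped
print(same, len(raw))

# Where the prefix really matters: "\b" in a regular string is a
# backspace character, while r"\b" is the regex word boundary.
backspace_len = len("\b")
boundary_len = len(r"\b")
print(backspace_len, boundary_len)

# With the raw string, \b anchors the match to whole four-letter words.
found = re.findall(r"\b[A-Z]{4}\b", "(NASDAQ:OPHT) (NASDAQ:GALE)")
print(found)
```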

1

Try this pattern; it may point you in the right direction:

re.search(r':(\w+)\)', paragraph.text).group(1)

+0

Thanks. Using this suggestion, (r':(\w+)\)', paragraph.text), I came up with some code. I get the following output (which seems correct), but I don't know how to print it: '<_sre.SRE_Match object; span=(18, 24), match=':OPHT)'> <_sre.SRE_Match object; span=(24, 30), match=':GALE)'>' – Moondra

+0

Yes, you're right, see the update. In short, .group(1) will give you the string you want – josifoski
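One caveat with `.group(1)`: `re.search` returns `None` when a paragraph has no match (the introductory paragraph in the question has no ticker), and calling `.group(1)` on `None` raises `AttributeError`. A minimal sketch of a guarded version follows; the helper name `first_ticker` is hypothetical, and plain strings stand in for `paragraph.text`.

```python
import re

pattern = re.compile(r":(\w+)\)")

def first_ticker(text):
    """Return the first ticker in text, or None when there is no match."""
    match = pattern.search(text)
    # search() returns None when nothing matches, so guard before .group(1).
    return match.group(1) if match else None

print(first_ticker("Ophthotech (NASDAQ:OPHT) -94%"))
print(first_ticker("Continuing our series of surfacing 2016 stinkers"))
```

Note that `search` stops at the first match, so a paragraph holding several tickers still needs `findall` or `finditer`.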