2014-05-20 28 views
4

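Find specific link text with bs4

I am trying to scrape a website and find the titles of all the feeds, but I cannot get the text of the a tags I need. Here is a sample of the HTML.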

<td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a> 
<td class="m" id="b2"><a href="/zXHNvp" id="c2" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=0vjcGwZGBYI', 'zXHNvp', this.id); this.blur(); return false;">Awesome Game Boy Facts</a> <a href="#" onClick="return lkP('2', 'zXHNvp');" id="x2"><font class="bp">(0)</font></a> 

I am trying to get the text of every a tag whose id starts with c and print each one on a new line.

My output should look like this.

TF4 - Oreos 
Awesome Game Boy Facts 

So far I have tried this.

soup = bs4.BeautifulSoup(html) 
links = soup.find_all('a',{'id' : 'c'}) 
for link in links: 
    print link.text 

But it does not find or print anything.

+0

I would accept all of these answers if I could, since they all work. – user3077033

Answers

3

You can pass a regular expression in place of the attribute value:

links = soup.find_all('a', {'id': re.compile(r'^c\d+')}) 

^ matches the start of the string, and \d+ matches one or more digits.
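The effect of that pattern can be checked without Beautiful Soup at all. The sketch below runs it over the id values that appear in the question's sample HTML; it assumes, as the Beautiful Soup documentation describes, that a compiled pattern is applied with its search() method and that a plain string is compared for an exact match.

```python
import re

# id values taken from the sample HTML in the question
ids = ["b1", "c1", "x1", "b2", "c2", "x2"]

pattern = re.compile(r"^c\d+")

# search() scans the whole string, so the ^ anchor is what restricts
# matches to ids that begin with "c".
matched = [i for i in ids if pattern.search(i)]
print(matched)

# A plain string filter is an exact comparison, so "c" matches none of them.
exact = [i for i in ids if i == "c"]
print(exact)
```

This is why the original attempt with {'id': 'c'} returned an empty list: no tag has an id equal to "c".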

Demo:

>>> import re 
>>> from bs4 import BeautifulSoup 
>>> 
>>> html = """ 
... <tr> 
...  <td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a></td> 
...  <td class="m" id="b2"><a href="/zXHNvp" id="c2" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=0vjcGwZGBYI', 'zXHNvp', this.id); this.blur(); return false;">Awesome Game Boy Facts</a> <a href="#" onClick="return lkP('2', 'zXHNvp');" id="x2"><font class="bp">(0)</font></a></td> 
... </tr> 
... """ 
>>> soup = BeautifulSoup(html, 'html.parser') 
>>> links = soup.find_all('a', {'id': re.compile(r'^c\d+')}) 
>>> for link in links: 
...  print(link.text) 
... 
TF4 - Oreos 
Awesome Game Boy Facts 
2

There is no a tag with the id c, only c1 and c2:

links = soup.find_all('a',{'id' : 'c1'}) 

If you want to find all a tags whose id starts with c, you need to pass a regular expression:

import re 

links = soup.find_all('a', {'id': re.compile('^c')}) 
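Note that '^c' is looser than '^c\d+': it accepts any id that merely begins with the letter c. A small sketch of the difference, using only the re module; "c10" and "content" are hypothetical ids added for illustration and do not appear in the question's HTML.

```python
import re

# "c10" and "content" are hypothetical ids, added only for illustration
ids = ["c1", "c2", "c10", "content", "x1"]

loose = re.compile(r"^c")       # any id starting with "c"
strict = re.compile(r"^c\d+")   # "c" followed by at least one digit

loose_hits = [i for i in ids if loose.search(i)]
strict_hits = [i for i in ids if strict.search(i)]

print(loose_hits)
print(strict_hits)
```

On a page where other elements have ids such as "content", the stricter pattern is probably the safer choice.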
2

You can pass a regular expression object in the call to find_all():

import re 
import bs4 

html = ''' 
<td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a> 
<td class="m" id="b2"><a href="/zXHNvp" id="c2" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=0vjcGwZGBYI', 'zXHNvp', this.id); this.blur(); return false;">Awesome Game Boy Facts</a> <a href="#" onClick="return lkP('2', 'zXHNvp');" id="x2"><font class="bp">(0)</font></a> 
''' 

soup = bs4.BeautifulSoup(html, 'html.parser') 
for links in soup.find_all('a', {'id' : re.compile('^c') }): 
    print(''.join(links.find_all(text=True))) 

Output

TF4 - Oreos 
Awesome Game Boy Facts
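
For readers on Python 3, the approach from the answers can be wrapped in a small function. This is a minimal sketch, not a definitive version: link_titles is a hypothetical helper name, the sample HTML is trimmed (the onClick and target attributes are dropped), and the built-in 'html.parser' backend is assumed.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the question's sample HTML (onClick/target attributes removed)
html = '''
<td class="m" id="b1"><a href="/QSYcfT" id="c1">TF4 - Oreos</a> <a href="#" id="x1"><font class="bp">(0)</font></a>
<td class="m" id="b2"><a href="/zXHNvp" id="c2">Awesome Game Boy Facts</a> <a href="#" id="x2"><font class="bp">(0)</font></a>
'''

def link_titles(markup):
    """Return the text of every <a> whose id is "c" followed by digits."""
    soup = BeautifulSoup(markup, "html.parser")
    links = soup.find_all("a", id=re.compile(r"^c\d+"))
    return [link.get_text(strip=True) for link in links]

for title in link_titles(html):
    print(title)
```

Returning a list instead of printing inside the loop keeps the function easy to reuse, for example when the titles need to be written to a file.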