HTML scraping with BeautifulSoup in Python

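I have the following HTML file and I am trying to scrape the complete sentence with BeautifulSoup, but I cannot get it. At the moment I only get the highlighted words. My desired output is:

Antenna booster has stopped sending signal files, possible user network issue or BOOSTER issue.

Any solutions?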

</table> 
    <!--Record Header End--> 
    <span style="BACKGROUND-COLOR: #0000ff; color: #ffffff"> 
    Antenna 
    </span> 
    <span style="BACKGROUND-COLOR: #0000ff; color: #ffffff"> 
    booster 
    </span> 
    has stopped 
    <span style="BACKGROUND-COLOR: #0000ff; color: #ffffff"> 
    sending 
    </span> 
    signal files ,possible user 
    <span style="BACKGROUND-COLOR: #0000ff; color: #ffffff"> 
    network 
    </span> 
    <span style="BACKGROUND-COLOR: #ff0000"> 
    issue 
    </span> 
    or BOOSTER 
    <span style="BACKGROUND-COLOR: #ff0000"> 
    issue 
    </span> 
    . 
    <br> 
    <br> 
    <br>

Here is my attempt:

issue_field = soup.find_all('span', {'style':'BACKGROUND-COLOR: #0000ff; color: #ffffff'}) 
issue_str = str(issue_field) 
Issue_corpora = [word.lower() for word in BeautifulSoup(issue_str,'html.parser').get_text().strip().split(',')] 
print(Issue_corpora)

Source

2017-03-28 Nikhil Mangire

Show the 'bs' code you have tried.

issue_field = soup.find_all('span', {'style':'BACKGROUND-COLOR: #0000ff; color: #ffffff'}) issue_str = str(issue_field) Issue_corpora = [word.lower() for word in BeautifulSoup(issue_str,'html.parser').get_text().strip().split(',')] print(Issue_corpora)

Perhaps a regular expression ('re') is enough for what you need here, for example: 're.sub('', '', t).replace('\n', '')'
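A rough sketch of what a regex-only approach might look like. The pattern in the comment above did not survive, so the tag-matching pattern `<[^>]+>` used here is an assumption, and so is the short sample string standing in for `t`. Whitespace is collapsed to single spaces rather than deleting newlines outright, so that words on separate lines do not run together. Regular expressions are fragile on real-world HTML, so treat this as an illustration only.

```python
import re

# Hypothetical stand-in for the variable t (the HTML text)
t = '<span style="BACKGROUND-COLOR: #0000ff">\nAntenna\n</span>\nhas stopped\n<br>'

# Assumed pattern: anything between "<" and ">" is treated as a tag and removed
no_tags = re.sub(r'<[^>]+>', '', t)

# Collapse newlines and repeated whitespace into single spaces
sentence = re.sub(r'\s+', ' ', no_tags).strip()

print(sentence)  # Antenna has stopped
```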

The problem is that some of the text sits outside the elements. There is a duplicate question on SO: Get text outside known element beautifulsoup
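To make the problem concrete, here is a small sketch on a shortened copy of the fragment from the question (the variable names are only illustrative, and it assumes the fragment is parsed on its own rather than as part of the full page). Selecting the span elements returns only the highlighted words, while the remaining words are plain text nodes that sit between the spans.

```python
from bs4 import BeautifulSoup

# Shortened copy of the fragment from the question (assumption: parsed in isolation)
html = """
<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
Antenna
</span>
<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
booster
</span>
has stopped
<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
sending
</span>
signal files
<br>
"""

soup = BeautifulSoup(html, "html.parser")

# Only the words wrapped in <span> tags
inside = [span.get_text(strip=True) for span in soup.find_all("span")]

# Every text node in the fragment, whether inside or outside a span
everything = soup.get_text(" ", strip=True)

print(inside)      # ['Antenna', 'booster', 'sending']
print(everything)  # Antenna booster has stopped sending signal files
```

Calling get_text() on the whole fragment only helps when the sentence can be isolated from the rest of the page; otherwise the text of every other element is included as well.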

So here is a solution, which may need a little polishing. (Note that the variable t contains the HTML text.)

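from bs4 import BeautifulSoup as bs 
soup = bs(t, 'html.parser')  # t holds the HTML text; name the parser explicitly 
text = '' 
for span in soup.find_all('span'): 
    # Highlighted word inside the span 
    text += span.get_text(strip=True) + ' ' 
    # Node right after the span; a plain text node holds the words outside any tag 
    sibling = span.next_sibling 
    if isinstance(sibling, str): 
        text += sibling.strip() + ' '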

The result of this approach is:

>>> In : text 
>>> Out: u'Antenna booster has stopped sending signal files ,possible user network issue or BOOSTER issue . '

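You can replace the double spaces with a single space, remove the space before the comma, or define rules that handle this while looping over the span elements.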
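The clean-up described above could be sketched as follows. The helper name tidy and the exact punctuation rules are assumptions chosen for illustration, and the sample string is what the loop produces for the HTML in the question, with a doubled space wherever a span is followed only by whitespace.

```python
import re

def tidy(text):
    """Normalise spacing in the concatenated span/sibling text."""
    cleaned = re.sub(r'\s+', ' ', text).strip()      # collapse runs of whitespace
    cleaned = re.sub(r'\s+([,.])', r'\1', cleaned)    # drop the space before "," or "."
    cleaned = re.sub(r',(?=\S)', ', ', cleaned)       # add a space after a comma if missing
    return cleaned

raw = ('Antenna  booster has stopped sending signal files ,possible user '
       'network  issue or BOOSTER issue . ')
print(tidy(raw))
# Antenna booster has stopped sending signal files, possible user network issue or BOOSTER issue.
```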

Source

2017-03-28 10:40:34

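from bs4 import BeautifulSoup 
import re 

example = """your example"""  # placeholder: paste the HTML from the question here 

soup = BeautifulSoup(example, "html.parser") 

_text = "" 
# Match every highlighted span, whatever its colour 
for span in soup.find_all('span', style=re.compile('BACKGROUND-COLOR:')): 
    # Word inside the span, then the text that follows it with newlines removed 
    _text += "%s %s" % (span.get_text(strip=True), (span.next_sibling or "").replace("\n", "")) 

# Collapse runs of spaces into a single space 
print(re.sub(" +", " ", _text))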

Finally, re is used to trim the extra spaces.

Output:

Antenna booster has stopped sending signal files ,possible user network issue or BOOSTER issue .

Source

2017-03-28 10:49:42 Zroq
