BeautifulSoup: grabbing visible webpage text

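Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want to get the body text (the article), and maybe even a few tab names here and there. I have tried the suggestion in this SO question, but it returns lots of <script> tags and HTML comments that I don't want. I can't figure out which arguments findAll() needs in order to return only the visible text on a webpage.

So, how should I find all the visible text, excluding scripts, comments, CSS and so on?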

Source

2009-12-20 user233864

142

Try this:

from bs4 import BeautifulSoup 
from bs4.element import Comment 
import urllib.request 


def tag_visible(element): 
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']: 
        return False 
    if isinstance(element, Comment): 
        return False 
    return True 


def text_from_html(body): 
    soup = BeautifulSoup(body, 'html.parser') 
    texts = soup.findAll(text=True) 
    visible_texts = filter(tag_visible, texts) 
    return u" ".join(t.strip() for t in visible_texts) 

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read() 
print(text_from_html(html))
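
To try the same filter without fetching a live page, here is a minimal offline sketch. The sample markup and the names SAMPLE_HTML, HIDDEN_PARENTS and visible_strings are made up for illustration, and it additionally drops strings that are empty after stripping, which the answer above does not do.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page; not taken from the test URL in the question.
SAMPLE_HTML = """<html><head><title>Demo page</title>
<style>p { color: red; }</style></head>
<body><h1>Storm hits coast</h1>
<!-- editorial note -->
<p>Heavy snow fell overnight.</p>
<script>var tracking = 1;</script>
</body></html>"""

# Parents whose text a browser does not render as page content.
HIDDEN_PARENTS = {'style', 'script', 'head', 'title', 'meta', '[document]'}


def visible_strings(html):
    """Return the non-empty text nodes that pass the visibility filter."""
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for node in soup.find_all(string=True):
        # Skip comments and anything sitting directly under a hidden parent.
        if isinstance(node, Comment) or node.parent.name in HIDDEN_PARENTS:
            continue
        text = node.strip()
        if text:
            result.append(text)
    return result


demo_strings = visible_strings(SAMPLE_HTML)
print(demo_strings)
```

The title, stylesheet, comment and script text are all filtered out, leaving only the heading and the paragraph.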

Source

2009-12-31 00:06:12 jbochi

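@jbochi I've replaced line 3 of visible() with re.match('.* .*', string, re.DOTALL). Yours seems to work only if the *whole* text content is a comment; if there is an initial space or newline, the 'invisible' HTML is still returned. My solution is overly aggressive in that it marks the entire element as invisible, but for my purposes that's fine. – Trindaz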

+25

+1 for 'soup.findAll(text=True)', never knew about that feature

With recent BS4 (at least), you can identify comments with 'isinstance(element, Comment)' instead of matching a regular expression. – tripleee
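
To illustrate the last comment: a small sketch, assuming bs4 with the built-in 'html.parser', showing that a comment comes back as its own Comment node even when ordinary text surrounds it, so no regex on the string content is needed. The markup is made up.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Made-up fragment: a comment surrounded by ordinary text and whitespace.
soup = BeautifulSoup("<p>before <!-- hidden note --> after</p>", "html.parser")

# Every text node inside <p>, in document order.
nodes = soup.p.find_all(string=True)

# Comment is a subclass of NavigableString, so isinstance() tells them apart.
comments = [n for n in nodes if isinstance(n, Comment)]
plain = [str(n) for n in nodes if not isinstance(n, Comment)]

print(comments)
print(plain)
```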

The title is inside a <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with the id "article".

soup.findAll('nyt_headline', limit=1)

should work.

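The article body is inside a <nyt_text> tag, which is nested inside a <div> tag with the id "articleBody". Inside the <nyt_text> element, the text itself is contained within <p> tags. Images are not inside those <p> tags. It is hard for me to experiment with the syntax, but I would expect working code to look something like this.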

text = soup.findAll('nyt_text', limit=1)[0] 
text.findAll('p')
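
For anyone who wants to try this without the live page, here is a hedged sketch against markup that only mimics the structure described above. The ARTICLE_HTML string, its headline and its paragraphs are invented, and the real page may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the described nesting; not the real article.
ARTICLE_HTML = (
    '<div id="article"><h1><nyt_headline>Example headline</nyt_headline></h1>'
    '<div id="articleBody"><nyt_text>'
    '<p>First paragraph.</p><img src="x.jpg"><p>Second paragraph.</p>'
    '</nyt_text></div></div>'
)

soup = BeautifulSoup(ARTICLE_HTML, 'html.parser')

# find() returns None when the tag is missing, so guard before using it.
headline = soup.find('nyt_headline')
body = soup.find('nyt_text')

headline_text = headline.get_text(strip=True) if headline else ''
paragraphs = [p.get_text(strip=True) for p in body.find_all('p')] if body else []

print(headline_text)
print(paragraphs)
```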

Source

2009-12-20 18:40:54

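I'm sure this works for this test case; however, I'm looking for a more generic answer that can be applied to various other websites... So far I have tried using regular expressions to find the tags and comments and replace them with "", but even that is proving somewhat difficult for some reason. – user233864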

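The approved answer from @jbochi does not work for me. The str() function call raises an exception because it cannot encode the non-ASCII characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page down to visible text.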

from bs4 import BeautifulSoup 

html = open('21storm.html').read() 
soup = BeautifulSoup(html, 'html.parser') 
# remove the elements whose text is never rendered 
for s in soup(['style', 'script', '[document]', 'head', 'title']): 
    s.extract() 
visible_text = soup.getText()

Source

2013-11-04 00:35:55 nmgeek

If 'str(element)' fails because of encoding problems, you should try 'unicode(element)' instead if you are using Python 2. – mknaf

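I completely respect using Beautiful Soup to get rendered content, but it may not be the ideal package for acquiring the content rendered on a page.

I had a similar problem getting the rendered content, that is, the content visible in a typical browser. In particular, I had many perhaps atypical cases to work with, such as the simple example below. In this case the non-displayable tag is nested inside a style tag and is not visible in the many browsers I have checked. Other variations exist, such as defining a class whose display is set to none and then using that class for the div.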

<html> 
    <title> Title here</title> 

    <body> 

    lots of text here <p> <br> 
    <h1> even headings </h1> 

    <style type="text/css"> 
     <div > this will not be visible </div> 
    </style> 


    </body> 

</html>

One solution posted above is:

html = Utilities.ReadFile('simple.html') 
soup = BeautifulSoup.BeautifulSoup(html) 
texts = soup.findAll(text=True) 
visible_texts = filter(visible, texts) 
print(visible_texts) 


[u'\n', u'\n', u'\n\n  lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']

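This solution certainly has applications in many cases and generally does the job quite well, but for the HTML posted above it retains the text that is not rendered. After searching SO, a couple of solutions came up, here: "BeautifulSoup get_text does not strip all tags and JavaScript", and here: "Rendered HTML to plain text using Python".

I tried both of these solutions, html2text and nltk.clean_html, and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speed depends heavily on the contents of the data...

One answer from @Helge was about using nltk, of all things.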

import nltk 

%timeit nltk.clean_html(html) 
was returning 153 us per loop

It worked really well, returning a string with the rendered HTML. This nltk module was even faster than html2text, though perhaps html2text is more robust.

betterHTML = html.decode(errors='ignore') 
%timeit html2text.html2text(betterHTML) 
3.09 ms per loop
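
Neither timing covers the display: none variation mentioned earlier. As a rough sketch only: the code below handles just the cases visible in the markup itself, namely an inline style attribute containing display: none and the hidden attribute. It cannot see rules that come from a class or an external stylesheet, which would need a real CSS engine or a browser. The names HIDDEN_STYLE, is_inline_hidden and text_without_inline_hidden are hypothetical, and the sample markup is made up.

```python
import re
from bs4 import BeautifulSoup

# Matches an inline "display: none" declaration (assumption: inline styles only).
HIDDEN_STYLE = re.compile(r'display\s*:\s*none', re.IGNORECASE)


def is_inline_hidden(tag):
    """True for tags hidden via the `hidden` attribute or an inline style."""
    return tag.has_attr('hidden') or bool(HIDDEN_STYLE.search(tag.get('style', '')))


def text_without_inline_hidden(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Script and style contents are never rendered as page text.
    for tag in soup(['script', 'style']):
        tag.extract()
    # Drop elements that the markup itself marks as hidden.
    for tag in soup.find_all(is_inline_hidden):
        tag.extract()
    return ' '.join(soup.stripped_strings)


SAMPLE = ('<body><p>Shown text</p>'
          '<div style="display: none">Hidden by style</div>'
          '<p hidden>Hidden by attribute</p>'
          '<script>var x = 1;</script></body>')
shown = text_without_inline_hidden(SAMPLE)
print(shown)
```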

Source

2013-11-05 19:37:08 Paul

import urllib.request 
from bs4 import BeautifulSoup 

url = "https://www.yahoo.com" 
html = urllib.request.urlopen(url).read() 
soup = BeautifulSoup(html, 'html.parser') 

# kill all script and style elements 
for script in soup(["script", "style"]): 
    script.extract() # rip it out 

# get text 
text = soup.get_text() 

# break into lines and remove leading and trailing space on each 
lines = (line.strip() for line in text.splitlines()) 
# break multi-headlines into a line each (split on runs of two spaces) 
chunks = (phrase.strip() for line in lines for phrase in line.split("  ")) 
# drop blank lines 
text = '\n'.join(chunk for chunk in chunks if chunk) 

print(text)
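
The cleanup step at the end can be tried on its own, without any network access. This is a small sketch, assuming the split is meant to happen on runs of two spaces (which is what the "multi-headlines" comment suggests); tidy_text and the sample string are made up for illustration.

```python
def tidy_text(text):
    """Strip each line, split multi-headline lines, and drop blank lines."""
    # Remove leading and trailing whitespace on every line.
    lines = (line.strip() for line in text.splitlines())
    # Two consecutive spaces are treated as a separator between headlines.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only the chunks that still contain text.
    return "\n".join(chunk for chunk in chunks if chunk)


raw = "  Top story  Weather  \n\n\n   Heavy snow fell overnight. \n"
tidied = tidy_text(raw)
print(tidied)
```

Single spaces inside a phrase are left alone, so "Top story" stays together while "Weather" moves to its own line.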

Source

2014-07-26 06:54:26 bumpkin

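The previous answers did not work for me, but this did :) – rjurney

If I try this on the URL imfuna.com it only returns 6 words (Imfuna Property Inventory and Inspection Apps), despite the fact that there is much more text on the page... any idea why this answer doesn't work for that URL? @bumpkin

While I would definitely suggest using Beautiful Soup in general, if anyone needs to display the visible parts of malformed HTML (for example, when you only have a segment or a line of a web page) for whatever reason, the following removes the content between the < and > characters: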

import re  # only use with malformed HTML; this is not efficient 

def display_visible_html_using_re(text): 
    # delete everything between "<" and ">", matching non-greedily 
    return re.sub(r"<.*?>", "", text)
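
A quick usage sketch of that regex on a made-up fragment, written with a raw string for the pattern. It also shows a limitation worth keeping in mind: only the tags themselves are removed, so the text inside a <script> element survives. The names TAG_PATTERN and strip_tags are illustrative.

```python
import re

# Non-greedy match of anything between "<" and ">".
TAG_PATTERN = re.compile(r"<.*?>")


def strip_tags(fragment):
    """Remove tag markup from a (possibly malformed) HTML fragment."""
    return TAG_PATTERN.sub("", fragment)


# Unbalanced fragment: a stray </p> and an <a> that is never closed.
fragment = 'Snow <b>fell</b> overnight</p> <a href="/more">read more'
stripped = strip_tags(fragment)
print(stripped)

# Limitation: the script body is plain text as far as the regex can tell.
leaky = strip_tags('<script>var x = 1;</script>ok')
print(leaky)
```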

Source

2015-05-03 20:39:31 kyrenia

The easiest way with BeautifulSoup, and with less code, to get just the strings, without empty lines and junk.

tag = <Parent_Tag_that_contains_the_data> 
soup = BeautifulSoup(tag, 'html.parser') 

for i in soup.stripped_strings: 
    print(repr(i))
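
As a concrete illustration with a made-up fragment in place of the placeholder above: stripped_strings yields each text node with its surrounding whitespace removed and skips the nodes that are whitespace only.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the parent tag that contains the data.
fragment = "<div>\n  <h2> Forecast </h2>\n\n  <p>Snow <b>tonight</b>.</p>\n</div>"
soup = BeautifulSoup(fragment, 'html.parser')

# One entry per non-empty text node, in document order.
strings = list(soup.stripped_strings)
for s in strings:
    print(repr(s))
```

Note that the text is split at tag boundaries, so "Snow", "tonight" and "." come back as separate strings.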

Source

2017-05-01 03:44:42

If you care about performance, here is another, more efficient way:

import re 

INVISIBLE_ELEMS = ('style', 'script', 'head', 'title') 
RE_SPACES = re.compile(r'\s{3,}') 

def visible_texts(soup): 
    """ get visible text from a document """ 
    text = ' '.join([ 
        s for s in soup.strings 
        if s.parent.name not in INVISIBLE_ELEMS 
    ]) 
    # collapse runs of three or more whitespace characters to two spaces. 
    return RE_SPACES.sub('  ', text)

soup.strings is an iterator that returns NavigableString objects, so you can check the parent's tag name directly without going through multiple loops.

Source

2017-06-18 03:26:18
