2015-11-03 86 views

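I am confused about how exactly to use a BeautifulSoup ResultSet object, i.e. bs4.element.ResultSet, to extract the strings inside HTML tags.

After using find_all(), how does one extract the text?

Example:

In the bs4 documentation, the HTML document html_doc looks like: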

<p class="story"> 
    Once upon a time there were three little sisters; and their names were 
    <a class="sister" href="http://example.com/elsie" id="link1"> 
    Elsie 
    </a> 
    , 
    <a class="sister" href="http://example.com/lacie" id="link2"> 
    Lacie 
    </a> 
    and 
    <a class="sister" href="http://example.com/tillie" id="link3"> 
    Tillie 
    </a> 
    ; and they lived at the bottom of a well. 
    </p> 

One starts by creating the soup and finding all the <a> tags:

from bs4 import BeautifulSoup 
soup = BeautifulSoup(html_doc, 'html.parser') 
soup.find_all('a') 

which outputs

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] 

We can also do

for link in soup.find_all('a'): 
    print(link.get('href')) 

which outputs

http://example.com/elsie 
http://example.com/lacie 
http://example.com/tillie 

I would like to get only the text of the tags with class_="sister", i.e.

Elsie 
Lacie 
Tillie 

One could try

for link in soup.find_all('a'): 
    print(link.get_text()) 

but this results in an error:

AttributeError: 'ResultSet' object has no attribute 'get_text' 

Answer


Call find_all() with the filter class_='sister'.

Note: notice the underscore after class. This is a special case, because class is a reserved word in Python.

It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
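To illustrate the quoted passage, here is a minimal sketch that puts the class_ keyword next to two other spellings of the same filter: the attrs dictionary and a CSS selector passed to select(). The compact html_doc and the ids helper are assumptions made for this example and are not part of the original post.

```python
from bs4 import BeautifulSoup

# Compact version of the sample document (an assumption for this sketch).
html_doc = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Keyword argument with the trailing underscore, since class is reserved.
by_keyword = soup.find_all('a', class_='sister')

# The same filter written as an attribute dictionary.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

# The same filter written as a CSS selector.
by_selector = soup.select('a.sister')


def ids(tags):
    """Hypothetical helper: collect the id attribute of each tag."""
    return [tag['id'] for tag in tags]


print(ids(by_keyword))
print(ids(by_attrs))
print(ids(by_selector))
```

All three calls should select the same three <a> tags; which spelling to use is mostly a matter of taste.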

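Once you have all the tags with the class sister, call .text on each of them to get its text. Be sure to strip the surrounding whitespace from the text.

For example: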

from bs4 import BeautifulSoup 

html_doc = '''<p class="story"> 
    Once upon a time there were three little sisters; and their names were 
    <a class="sister" href="http://example.com/elsie" id="link1"> 
    Elsie 
    </a> 
    , 
    <a class="sister" href="http://example.com/lacie" id="link2"> 
    Lacie 
    </a> 
    and 
    <a class="sister" href="http://example.com/tillie" id="link3"> 
    Tillie 
    </a> 
    ; and they lived at the bottom of a well. 
    </p>''' 

soup = BeautifulSoup(html_doc, 'html.parser') 
sistertags = soup.find_all(class_='sister') 
for tag in sistertags: 
    print(tag.text.strip()) 

Output:

(bs4)macbook:bs4 joeyoung$ python bs4demo.py 
Elsie 
Lacie 
Tillie 
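
As a hedged sketch of where the AttributeError in the question most likely comes from: find_all() returns a ResultSet, which behaves like a list of tags, so get_text() and .text exist on the individual elements but not on the ResultSet itself. The shortened html_doc below is an assumption for the example, and get_text(strip=True) is used here as an alternative to .text.strip().

```python
from bs4 import BeautifulSoup

# Shortened sample document (an assumption for this sketch).
html_doc = '''<p class="story">
    <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
    </a>
    <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
    </a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')
links = soup.find_all('a', class_='sister')

# Calling get_text() on the ResultSet itself fails, because the
# method belongs to each Tag and not to the list-like container.
try:
    links.get_text()
    raised = False
except AttributeError:
    raised = True

# Calling it on each element works; strip=True trims the whitespace.
names = [link.get_text(strip=True) for link in links]

print(raised)  # True
print(names)   # ['Elsie', 'Lacie', 'Tillie']
```

If only the first match is needed, find() returns a single tag, so get_text() can be called on its result directly.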

Works perfectly, thank you. I was confused because "sistertags.text" was throwing an error – ShanZhengYang
