2013-11-25 58 views

BeautifulSoup does not return the full contents of an h1

My code:

from BeautifulSoup import BeautifulSoup 

htmls = ''' 
<div class="main-content"> 
<h1 class="student"> 
    <p>Name: <br /> 
    Alex</p> 
    <p>&nbsp;</p> 
</h1> 
</div> 
<div class="department"> 
... more text 
</div> 
''' 
soup = BeautifulSoup(htmls) 
h1 = soup.find("h1", {"class": "student"}) 
print h1 

Expected result:

<h1 class="student"> 
    <p>Name: <br /> 
    Alex</p> 
    <p>&nbsp;</p> 
</h1> 

But unfortunately it returns:

<h1 class="student"> 
</h1> 

My question is: why does it swallow everything between the p tags? Is it calling renderContents()? Or is the parse failing?

Answers

1

This happens because you are using p tags inside the h1 tag. For example, if you do this:

>>> htmls 
'\n<div class="main-content">\n<h1 class="student">\n <p>Name: <br />\n Alex</p>\n <p>&nbsp;</p>\n</h1>\n</div>\n<div class="department">\n... more text\n</div>\n' 
>>> soup = BeautifulSoup(htmls) 
>>> soup 

<div class="main-content"> 
<h1 class="student"> 
</h1><p>Name: <br /> 
    Alex</p> 
<p>&nbsp;</p> 

</div> 
<div class="department"> 
... more text 
</div> 

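As you can see, Beautiful Soup has parsed it a little differently: the h1 is closed before the first p opens, so both p tags end up as siblings of the h1 rather than its children.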

However, with span instead of p:

>>> htmls = ''' 
... <div class="main-content"> 
... <h1 class="student"> 
...  <span>Name: <br /> 
...  Alex</span> 
...  <span>&nbsp;</span> 
... </h1> 
... </div> 
... <div class="department"> 
... ... more text 
... </div> 
... ''' 
>>> 
>>> htmls.contents 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
AttributeError: 'str' object has no attribute 'contents' 
>>> soup = BeautifulSoup(htmls) 
>>> h1 = soup.find("h1", {"class": "student"}) 
>>> 
>>> h1 
<h1 class="student"> 
<span>Name: <br /> 
    Alex</span> 
<span>&nbsp;</span> 
</h1> 

This time the children are still inside the h1.

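This is simply how the HTML p tag behaves: p is a block-level element, and this parser closes the h1 as soon as it meets one. Hence the problem. (Read up on block-level elements for more detail.)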
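If you want to spot this kind of markup before handing it to a parser, something along these lines might help. It is only a rough sketch in Python 3 using the standard library's `html.parser`, not anything BeautifulSoup provides. `BLOCK_TAGS`, `HeadingNestingChecker` and `find_block_in_heading` are names made up for this illustration, and the list of block-level tags is deliberately incomplete.

```python
from html.parser import HTMLParser

# Assumption: an illustrative subset of block-level tags, not the full list.
BLOCK_TAGS = {"p", "div", "ul", "ol", "table", "blockquote", "pre", "form"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingNestingChecker(HTMLParser):
    """Records block-level tags that open while a heading is still open."""

    def __init__(self):
        super().__init__()
        self.open_headings = []  # stack of heading tags currently open
        self.problems = []       # (heading, block_tag) pairs that were found

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.open_headings.append(tag)
        elif tag in BLOCK_TAGS and self.open_headings:
            self.problems.append((self.open_headings[-1], tag))

    def handle_endtag(self, tag):
        # Only pop when the innermost open heading is the one being closed.
        if self.open_headings and tag == self.open_headings[-1]:
            self.open_headings.pop()


def find_block_in_heading(html):
    """Return a list of (heading, block_tag) pairs found in the markup."""
    checker = HeadingNestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


sample = '<h1 class="student"><p>Name: <br /> Alex</p><p>&nbsp;</p></h1>'
print(find_block_in_heading(sample))  # [('h1', 'p'), ('h1', 'p')]
```

An empty list means no block-level tag from the subset was opened inside a heading, which is the case for the span version of the markup.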

1

Try passing a different parser to BeautifulSoup:

pip install html5lib 

>>> htmls = ''' 
... <div class="main-content"> 
... <h1 class="student"> 
...  <p>Name: <br /> 
...  Alex</p> 
...  <p>&nbsp;</p> 
... </h1> 
... </div> 
... <div class="department"> 
... ... more text 
... </div> 
... ''' 

>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(htmls, 'html5lib') 
>>> h1 = soup.find('h1', 'student') 
>>> print h1 
<h1 class="student"> 
    <p>Name: <br/> 
    Alex</p> 
    <p> </p> 
</h1> 

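That does what you want, I think. Otherwise, for standards compliance, you shouldn't be putting block elements inside an h1 at all.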

See http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for how to plug in a parser.