BeautifulSoup does not return the full contents of an h1

My code using BeautifulSoup does not return the full contents of the h1:

from BeautifulSoup import BeautifulSoup 

htmls = ''' 
<div class="main-content"> 
<h1 class="student"> 
    <p>Name: <br /> 
    Alex</p> 
    <p>&nbsp;</p> 
</h1> 
</div> 
<div class="department"> 
... more text 
</div> 
''' 
soup = BeautifulSoup(htmls) 
h1 = soup.find("h1", {"class": "student"}) 
print h1

Expected result:

<h1 class="student"> 
    <p>Name: <br /> 
    Alex</p> 
    <p>&nbsp;</p> 
</h1>

But unfortunately it returns:

<h1 class="student"> 
</h1>

My question is: why does it eat everything between the p tags? Is it running renderContents()? Or is it failing to parse?

2013-11-25 Sabuj Hassan

This happens because you are using p tags inside the h1 tag. For example, if you print the whole parsed document:

>>> htmls 
'\n<div class="main-content">\n<h1 class="student">\n <p>Name: <br />\n Alex</p>\n <p>&nbsp;</p>\n</h1>\n</div>\n<div class="department">\n... more text\n</div>\n' 
>>> soup = BeautifulSoup(htmls) 
>>> soup 

<div class="main-content"> 
<h1 class="student"> 
</h1><p>Name: <br /> 
    Alex</p> 
<p>&nbsp;</p> 

</div> 
<div class="department"> 
... more text 
</div>

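As you can see, Beautiful Soup has parsed it a bit differently: the h1 is closed first, and the p tags end up after it.

However, if you use span instead of p: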

>>> htmls = ''' 
... <div class="main-content"> 
... <h1 class="student"> 
...  <span>Name: <br /> 
...  Alex</span> 
...  <span>&nbsp;</span> 
... </h1> 
... </div> 
... <div class="department"> 
... ... more text 
... </div> 
... ''' 
>>> 
>>> htmls.contents 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
AttributeError: 'str' object has no attribute 'contents' 
>>> soup = BeautifulSoup(htmls) 
>>> h1 = soup.find("h1", {"class": "student"}) 
>>> 
>>> h1 
<h1 class="student"> 
<span>Name: <br /> 
    Alex</span> 
<span>&nbsp;</span> 
</h1>

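Now you can see the children.

This is how the HTML p tag behaves: p is a block-level element, which is not valid inside a heading such as h1, so the parser closes the h1 before the first p. Hence the problem. (Read more about block-level elements.)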
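
To check markup for this kind of nesting before handing it to a parser, something like the following might help. It is only a rough sketch using Python 3's standard-library html.parser, not anything Beautiful Soup does internally; the class name HeadingNestingChecker, the helper find_blocks_in_headings, and the short BLOCK_ELEMENTS list are assumptions made up for illustration, and the list is far from complete.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: a small, non-exhaustive sample of block-level elements.
BLOCK_ELEMENTS = {"p", "div", "ul", "ol", "table", "blockquote", "pre"}
# Void elements never get a closing tag, so they are not tracked as open.
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}


class HeadingNestingChecker(HTMLParser):
    """Hypothetical helper: records block elements opened inside a heading."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.problems = []    # (heading, block element) pairs

    def handle_starttag(self, tag, attrs):
        # Find the innermost heading that is still open, if any.
        heading = next(
            (t for t in reversed(self.open_tags) if t in HEADINGS), None
        )
        if tag in BLOCK_ELEMENTS and heading is not None:
            self.problems.append((heading, tag))
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def find_blocks_in_headings(markup):
    """Return a list of (heading, block element) pairs found in markup."""
    checker = HeadingNestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


with_p = '<h1 class="student"><p>Name: <br /> Alex</p><p>&nbsp;</p></h1>'
with_span = '<h1 class="student"><span>Name: <br /> Alex</span></h1>'

print(find_blocks_in_headings(with_p))
print(find_blocks_in_headings(with_span))
```

The first call reports one entry per p inside the h1, while the span version reports nothing, which matches the difference between the two sessions above.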

2013-11-25 18:08:05 karthikr

Try passing a different parser to BeautifulSoup:

pip install html5lib 

>>> from bs4 import BeautifulSoup 
>>> htmls = ''' 
... <div class="main-content"> 
... <h1 class="student"> 
...     <p>Name: <br /> 
...     Alex</p> 
...     <p>&nbsp;</p> 
... </h1> 
... </div> 
... <div class="department"> 
... ... more text 
... </div> 
... ''' 
>>> soup = BeautifulSoup(htmls, 'html5lib') 
>>> h1 = soup.find('h1', 'student') 
>>> print h1 
<h1 class="student"> 
    <p>Name: <br/> 
    Alex</p> 
    <p> </p> 
</h1>

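That does what you want, I think. Otherwise, if you want compliant HTML, you should not put block elements inside a heading.

See http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for how to plug in a parser.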
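
If installing html5lib is not an option, it may be worth checking what another parser does with the same markup. The sketch below assumes Python 3 with the bs4 package (Beautiful Soup 4) installed, and uses the "html.parser" builder that ships with Python; the helper name paragraphs_in_heading is made up for illustration. Different parsers can build different trees from invalid markup, so treat the result as something to verify on your own setup.

```python
from bs4 import BeautifulSoup  # Beautiful Soup 4, not the old BeautifulSoup 3 module

HTML = """
<div class="main-content">
<h1 class="student">
    <p>Name: <br />
    Alex</p>
    <p>&nbsp;</p>
</h1>
</div>
"""


def paragraphs_in_heading(markup, parser="html.parser"):
    """Return the <p> tags that the given parser leaves inside h1.student."""
    soup = BeautifulSoup(markup, parser)
    h1 = soup.find("h1", class_="student")
    return h1.find_all("p")


paragraphs = paragraphs_in_heading(HTML)
print(len(paragraphs))
print(paragraphs[0].get_text(" ", strip=True))
```

Passing parser="html5lib" to the same helper should exercise the parser used in the answer above, provided html5lib is installed.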

2013-11-25 18:22:16 hyleaus