BeautifulSoup parser does not split correctly on tags

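I am scraping a website and then trying to split the result into paragraphs. Looking at the scraped text, I can clearly see that some paragraph breaks are not being split correctly. See the code below to reproduce the problem!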

from bs4 import BeautifulSoup
import requests

link = "http://www.presidency.ucsb.edu/ws/index.php?pid=111395"
response = requests.get(link)
soup = BeautifulSoup(response.content, 'html.parser')
paras = soup.find_all('p')
# Note that in printing the below, there are still a lot of "<p>" in that paragraph :(
print(paras[614])

I have tried other parsers, with similar results.

2016-07-23 Craig

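This is by design. It happens because the page contains nested paragraphs, that is, <p> tags that open before the previous one has been closed, for example: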

<p>Neurosurgeon Ben Carson. [<i>applause</i>] <p>New Jersey
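
One way to check whether a page really has this problem is to count the opening and closing <p> tags in the raw markup. The following is a minimal sketch using only Python's standard-library html.parser module. The names ParagraphTagCounter and count_paragraph_tags are hypothetical, made up for this illustration, and the sample string is the fragment quoted above.

```python
from html.parser import HTMLParser


class ParagraphTagCounter(HTMLParser):
    """Count opening and closing <p> tags as they appear in the raw markup."""

    def __init__(self):
        super().__init__()
        self.opened = 0
        self.closed = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "p":
            self.opened += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self.closed += 1


def count_paragraph_tags(html):
    """Return (number of <p> tags, number of </p> tags) found in html."""
    counter = ParagraphTagCounter()
    counter.feed(html)
    counter.close()
    return counter.opened, counter.closed


# Hypothetical sample: the fragment from the page quoted above.
sample = "<p>Neurosurgeon Ben Carson. [<i>applause</i>] <p>New Jersey"
print(count_paragraph_tags(sample))  # (2, 0)
```

If the first number is much larger than the second, the page most likely relies on unclosed <p> tags, and a parser that does not close them implicitly will report them as nested.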

I would work around it with this small hack:

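# response.text is a str; in Python 3, response.content is bytes and would need b'<p>' arguments
html = response.text.replace('<p>', '</p><p>')  # so there will be no nested <p> tags in your soup

# then your code, parsing html instead of response.content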
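
The plain string replacement above only matches the exact text <p>, so it would miss tags written as <P> or tags that carry attributes. A slightly more tolerant variant, sketched here with a regular expression, might look like the following. The names close_paragraphs and P_OPEN are hypothetical and are not part of BeautifulSoup or requests.

```python
import re

# Matches the start of an opening <p> tag, with or without attributes,
# in any letter case. The lookahead keeps tags such as <pre> or <param>
# from matching.
P_OPEN = re.compile(r"<p(?=[\s>])", re.IGNORECASE)


def close_paragraphs(html):
    """Insert a closing </p> before every opening <p> tag."""
    return P_OPEN.sub("</p><p", html)


# Hypothetical sample: the fragment from the page quoted earlier.
sample = "<p>Neurosurgeon Ben Carson. [<i>applause</i>] <p>New Jersey"
print(close_paragraphs(sample))
# </p><p>Neurosurgeon Ben Carson. [<i>applause</i>] </p><p>New Jersey
```

The result could then be handed to the parser, for example BeautifulSoup(close_paragraphs(response.text), 'html.parser'). Note that this leaves one stray </p> at the very start of the markup, which different parsers may handle differently.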

2016-07-24 01:57:05 Bob

Have you tried the lxml parser? I had a similar problem, and lxml solved it for me.

import lxml 
... 
soup = BeautifulSoup(response.text, "lxml")

Also, instead of response.content, try response.text to get a Unicode string.

2016-07-24 01:50:28

Unfortunately, that does not work (neither lxml nor response.text). Thanks for the suggestion! – Craig
