BeautifulSoup missing/skipping tags

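I would appreciate it if you could point me in the right direction. Is there a better way to do this and capture all of the data inside the element with the HTML class "DocumentText"?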

If I do it like this, I miss some tags. The original HTML string is about 20K in size (so it is a lot of data).

soup = BeautifulSoup(r.content, 'html5lib')
c.case_html = str(soup.find('div', class_='DocumentText'))
print(self.case_html)

Below is the code I use for scraping. It was working fine, but it broke as soon as a second new tag was added.

soup = BeautifulSoup(r.content, 'html5lib') 
c.case_html = str(soup.find('div', class_='DocumentText').find_all(['p','center','small'])) 
print(self.case_html)
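
A minimal sketch of why the find_all() version can drop or repeat content. The HTML string here is a hypothetical, shortened stand-in for the real page, and 'html.parser' is assumed so that the nesting stays as written. find_all() with a list of names returns only those tags, skips anything not in the list, and returns a nested match both inside its parent and on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page.
html = (
    '<div class="DocumentText">'
    '<p>PTag</p>'
    '<p><center>First center</center></p>'
    '<b>other tag</b>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="DocumentText")

# Only tags named in the list are returned; <b> is skipped entirely.
found = div.find_all(["p", "center", "small"])
names = [tag.name for tag in found]

# str() of the result is the string form of a Python list, and the
# nested <center> shows up twice: inside its <p> and again by itself.
as_string = str(found)
print(names)
print(as_string)
```

Any tag that is later added to the page but not to the list would be missed in the same way, which may be what happened here.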

Sample HTML is below. The original is around 20K characters long.

<form name="form1" id="form1"> 
<div id="theDocument" class="DocumentText" style="position: relative; float: left; overflow: scroll; height: 739px;"> 
<p>PTag</p> 
<p> <center> First center </center> </p> 
<small> this is small</small> 
<p>...</p> 
<p> <center> Second Center </center> </p> 
<p>....</p> 
</div> 
</form>

The expected output is this:

<div id="theDocument" class="DocumentText" style="position: relative; float: left; overflow: scroll; height: 739px;"> 
<p>PTag</p> 
<p> <center> First center </center> </p> 
<small> this is small</small> 
<p>...</p> 
<p> <center> Second Center </center> </p> 
<p>....</p> 
</div>
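
One way to get output like the above is to convert the whole div to a string instead of collecting selected child tags. This is only a sketch: SAMPLE_HTML and extract_document_html are hypothetical names, and 'html.parser' is an assumption. html5lib follows the HTML5 parsing rules, under which a <center> start tag closes an open <p>, so its tree may not match the markup as written.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.content, shortened from the sample above.
SAMPLE_HTML = """
<form name="form1" id="form1">
<div id="theDocument" class="DocumentText">
<p>PTag</p>
<p> <center> First center </center> </p>
<small> this is small</small>
<p> <center> Second Center </center> </p>
</div>
</form>
"""


def extract_document_html(html, parser="html.parser"):
    """Return the DocumentText div as one string, or None if it is absent."""
    soup = BeautifulSoup(html, parser)
    div = soup.find("div", class_="DocumentText")
    # str() on a Tag serializes the element together with all of its
    # descendants, so no child tag has to be listed by name.
    return str(div) if div is not None else None


case_html = extract_document_html(SAMPLE_HTML)
print(case_html)
```

The None check matters on real pages, where find() returns None when the div is missing and calling methods on the result would raise an AttributeError.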

Source

2017-09-06 Pbch

'c.case_html = str(soup.find('div', class_='DocumentText')' Why are you converting it to a 'string'? –

Which text do you want to parse from the elements pasted above? – SIM

What is your expected output? – chad

You can try this. My answer is based only on the HTML you posted. If you need any clarification, please let me know. Thanks!

soup = BeautifulSoup(r.content, 'html5lib')
# select() returns a list of matches; select_one() returns a single element
case_html = soup.select_one('div.DocumentText')
print(case_html.get_text())
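
A short sketch of how the CSS selector calls behave, using a hypothetical one-line HTML string and assuming 'html.parser'. select() returns a list of every match, select_one() returns the first match or None, get_text() returns only the text, and str() keeps the tags.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page.
html = '<div class="DocumentText"><p>PTag</p><small> this is small</small></div>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.select("div.DocumentText")    # list of every matching element
first = soup.select_one("div.DocumentText")  # first match, or None if absent

text_only = first.get_text()  # text content only, tags stripped
with_tags = str(first)        # markup preserved

print(text_only)
print(with_tags)
```

Since the expected output in the question keeps the tags, str() on the element is probably closer to what is wanted than get_text().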

Source

2017-09-06 10:49:29 chad
