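Getting data from nested inner tags with BeautifulSoup

I want to get the information inside the inner tags, but it always comes back empty. Here is my code: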

import requests
from bs4 import BeautifulSoup

url = "http://www.krak.dk/cafe/s%C3%B8g.cs?consumer=suggest?search_word=cafe"
r = requests.get(url)

soup = BeautifulSoup(r.content, 'html.parser')

# Find every <ol class="hit-list"> and print its text
genData = soup.find_all("ol", {"class": "hit-list"})
print(genData)
for infoX in genData:
    print(infoX.text)

What am I missing?

Source

2016-09-22 Prayson Daniel

The HTML is broken, so you need a different parser. You can use lxml if you have it installed:

soup = BeautifulSoup(r.content, 'lxml')

Or use html5lib:

soup = BeautifulSoup(r.content, 'html5lib')

lxml has dependencies such as libxml; html5lib can be installed with pip.
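If it is unclear which parsers are available on a given machine, the choice can be made at run time. The following is a minimal sketch, not part of the original answer: `pick_parser` and its `is_installed` parameter are hypothetical names, and the preference order (lxml, then html5lib, then the built-in `html.parser`) is an assumption based on the suggestions above.

```python
import importlib.util


def pick_parser(is_installed=None):
    """Return the name of the first parser whose module can be found.

    `is_installed` is a hypothetical hook (a callable taking a module name)
    so the choice can be exercised without installing anything.
    """
    if is_installed is None:
        # Default check: ask the import system whether the module exists.
        def is_installed(name):
            return importlib.util.find_spec(name) is not None

    # Assumed preference order: lxml first, then html5lib.
    for module_name in ("lxml", "html5lib"):
        if is_installed(module_name):
            return module_name

    # The parser bundled with Python is always available as a fallback.
    return "html.parser"


parser = pick_parser()
print(parser)
# Usage with the code from the question:
# soup = BeautifulSoup(r.content, parser)
```

Note that the fallback is the parser that returned nothing for this page, so falling back to it only avoids an import error; it does not fix the broken markup.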

In [9]: url = "http://www.krak.dk/cafe/s%C3%B8g.cs?consumer=suggest?search_word=cafe"

In [10]: r = requests.get(url)

In [11]: soup = BeautifulSoup(r.content, 'html.parser')

In [12]: len(soup.find_all("ol", {"class": "hit-list"}))
Out[12]: 0

In [13]: soup = BeautifulSoup(r.content, 'lxml')

In [14]: len(soup.find_all("ol", {"class": "hit-list"}))
Out[14]: 1

In [15]: soup = BeautifulSoup(r.content, 'html5lib')

In [16]: len(soup.find_all("ol", {"class": "hit-list"}))
Out[16]: 1

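There is also only one hit-list, so you can use find in place of find_all. You can also look it up by id: soup.find(id="hit-list"). If you run the HTML through the W3C HTML validator, you will see that it has a lot of problems.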
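To make the find versus find_all point concrete, here is a small sketch that runs offline. The HTML string is a hypothetical, well-formed stand-in for the real page (the `li`, `h2` and `span` structure is assumed, not taken from krak.dk), which is why the built-in `html.parser` is enough here.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real page structure may differ.
html = """
<ol id="hit-list" class="hit-list">
  <li><h2>Cafe A</h2><span class="address">Street 1</span></li>
  <li><h2>Cafe B</h2><span class="address">Street 2</span></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first match (or None) instead of a list.
hit_list = soup.find("ol", class_="hit-list")

# The same element can be looked up by its id.
same_list = soup.find(id="hit-list")

# Walk the nested tags inside the list to collect the inner text.
names = []
if hit_list is not None:
    for item in hit_list.find_all("li"):
        heading = item.find("h2")
        if heading is not None:
            names.append(heading.get_text(strip=True))

print(names)
```

Checking for None before descending matters with find: unlike find_all, which returns an empty list, find returns None when nothing matches, and calling a method on None raises an AttributeError.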

Source

2016-09-22 11:57:50

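The problem was the character encoding, UTF-8, because the web page contains the special Danish characters Åå, Øø and Ææ. Thanks, Padraic, I would not have noticed the broken address.

Adding # -*- coding: utf-8 -*- as the first line solved the problem.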

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

# The URL contains the non-ASCII character "ø", hence the coding line above
url = "http://www.krak.dk/cafe/søg.cs?consumer=suggest?search_word=cafe"
r = requests.get(url).content
soup = BeautifulSoup(r, 'html5lib')
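For reference, the two URLs in this thread differ only in how the "ø" is written: `s%C3%B8g` is the percent-encoded UTF-8 form of `søg`. A short sketch, assuming Python 3 (where source files are UTF-8 by default, so the coding line mainly matters on Python 2), shows the round trip using only the standard library:

```python
# -*- coding: utf-8 -*-
from urllib.parse import quote, unquote

# Path portion of the URL from the question, written with the literal "ø".
path = "/cafe/søg.cs"

# quote() percent-encodes non-ASCII characters as UTF-8 bytes.
encoded = quote(path)
print(encoded)  # /cafe/s%C3%B8g.cs

# unquote() reverses the encoding and restores the original text.
decoded = unquote(encoded)
print(decoded == path)  # True
```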

Source

2016-10-03 10:18:04
