2017-08-30 44 views

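Web scraping using BeautifulSoup and urllib

I am using Python 3.6 and I am able to scrape text with BeautifulSoup. For practice, I am trying to scrape text from the Walmart website. Here is my code.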

from bs4 import BeautifulSoup 
from urllib.request import urlopen 
main_page=urlopen('http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159') 
soup = BeautifulSoup(main_page,"lxml") 
title=soup.select_one("h1.prod-ProductTitle.no-margin.heading-a").get_text() 
price=soup.select_one("span.Price-group").get_text() 
highLights=soup.select_one("div.ProductPage-short-description-body").get_text() 
description=soup.select_one("div.about-desc").get_text() 
print(title,"\n",highLights,"\n",description,"\n",price) 

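In the code above I extract the product title, price, highlights, and description, but I am unable to extract the description (the "About This Item" section). Instead of the description, I get some other text.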

Please help me fix this.

Answer


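The page contains two div elements with class="about-desc". select_one returns only the first match, but the description you want is in the second one. A workaround is to select all matches and take the second: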

description=soup.select("div.about-desc")[1].get_text() 
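
To illustrate the difference between select_one and select without fetching the live page, here is a minimal sketch. The HTML snippet is a made-up stand-in for the product page (the real markup may differ), and the built-in "html.parser" is used only so the example does not depend on lxml. The length check is an added precaution, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs share the same class, as on the product page.
html = """
<div class="about-desc">Shipping and returns</div>
<div class="about-desc">About this item: 32-inch LED TV</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns only the first element matching the selector.
first = soup.select_one("div.about-desc").get_text()

# select returns a list of every match, so the second div is at index 1.
matches = soup.select("div.about-desc")
second = matches[1].get_text() if len(matches) > 1 else None

print(first)
print(second)
```

Indexing by position is fragile if the page layout changes, so checking the number of matches before indexing avoids an IndexError.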

Update: the site actually blocks urllib's default user agent, so you should send a browser-like User-Agent header instead.

from bs4 import BeautifulSoup 
import urllib.request
user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'} 
req = urllib.request.Request(url="http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159", headers=user_agent) 
main_page = urllib.request.urlopen(req) 
soup = BeautifulSoup(main_page,"lxml") 
title=soup.select_one("h1.prod-ProductTitle.no-margin.heading-a").get_text() 
price=soup.select_one("span.Price-group").get_text() 
highLights=soup.select_one("div.ProductPage-short-description-body").get_text() 
description=soup.select("div.about-desc")[1].get_text() 
print(title,"\n",highLights,"\n",description,"\n",price) 
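
As a quick check that the custom header is attached before any request is sent, the Request object can be inspected offline. This is only a sketch: it builds the same request as above and makes no network call. Note that urllib stores header names in capitalized form, so the key to look up is "User-agent".

```python
import urllib.request

url = "http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159"
user_agent = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) "
                  "Gecko/20100101 Firefox/12.0"
}

# Build the request object; nothing is sent until urlopen(req) is called.
req = urllib.request.Request(url=url, headers=user_agent)

# Request normalizes header names with str.capitalize(), hence "User-agent".
print(req.has_header("User-agent"))
print(req.get_header("User-agent"))
```

Without the custom header, urllib identifies itself as Python-urllib followed by the Python version, which is the value some sites reject.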