2014-12-30

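I am trying to retrieve ratings and reviews from a website. Below is the source code I am using. You can open the sample page and inspect its page source.

However, I have a big problem: the review section is generated by JavaScript code. Reviews are sorted by most relevant (the default), but I want the newest ones rather than the most relevant. How can I solve this?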

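import urllib2  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

url = 'http://www.homedepot.com/p/Husky-41-in-16-Drawer-Tool-Chest-and-Cabinet-Set-HOTC4016B1QES/205080371#customer_reviews'
content = urllib2.urlopen(url).read()
content = preprocess_yelp_page(content)  # the asker's own helper, not shown in the question
soup = BeautifulSoup(content)

# Collect the text of every rating value on the page
indexlist = []
for item in soup.findAll('span', {'itemprop': 'ratingValue'}):
    indexlist.append(item.contents[0].encode('utf-8'))

# Collect the text of every review description on the page
indexlist2 = []
for line in soup.findAll('span', {'itemprop': 'description'}):
    indexlist2.append(line.contents[0].encode('utf-8'))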

Answer


All the data needed to sort the items is already present in the page. Just extract the data and sort by the datePublished item property:

<meta itemprop="datePublished" content="2014-11-27" /> 

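The site even uses the Schema.org microdata Review type to mark up the reviews, which makes it extra easy to parse the items here.

Either use BeautifulSoup to look up the itemprop attributes, or use a tool such as rdflib-microdata to convert the information in the page to RDF and then process it further with RDF tools.

Parsing the Schema.org data with BeautifulSoup: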

reviews = []
for rating in soup.find_all(itemprop='review'):
    data = {}
    for item in rating.find_all(itemprop=True):
        data[item['itemprop']] = item.attrs.get('content') or item.get_text()
    reviews.append(data)
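To see what this loop produces without fetching the live page, here is a minimal sketch that runs the same logic on an inline HTML fragment. The fragment, the `SAMPLE_HTML` name and the `extract_reviews` helper are assumptions made for illustration; the markup is only loosely modelled on Schema.org Review microdata, and the real page may be structured differently. It is written for Python 3 with bs4 and, unlike the snippet above, strips surrounding whitespace from the text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not copied from the real page.
SAMPLE_HTML = """
<div itemprop="review" itemscope itemtype="http://schema.org/Review">
  <span itemprop="author">Alice</span>
  <meta itemprop="datePublished" content="2014-11-27" />
  <span itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
    <span itemprop="ratingValue">4</span>
  </span>
  <span itemprop="description">Solid box.</span>
</div>
<div itemprop="review" itemscope itemtype="http://schema.org/Review">
  <span itemprop="author">Bob</span>
  <meta itemprop="datePublished" content="2014-12-15" />
  <span itemprop="description">Great value.</span>
</div>
"""


def extract_reviews(html):
    """Return one dict per itemprop="review" element, keyed by itemprop name."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for review in soup.find_all(itemprop="review"):
        data = {}
        # Nested properties (e.g. ratingValue inside reviewRating) end up
        # flattened into the same dict, just as in the loop above.
        for item in review.find_all(itemprop=True):
            value = item.attrs.get("content") or item.get_text().strip()
            data[item["itemprop"]] = value
        reviews.append(data)
    return reviews


reviews = extract_reviews(SAMPLE_HTML)
for review in reviews:
    print(review)
```

Note that a review without a rating simply lacks the `ratingValue` key, so code that reads the dicts later should not assume every key is present.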

After that you can sort them by date (newest first):

reviews.sort(key=lambda i: i['datePublished'], reverse=True) 
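Sorting on the raw string works because `YYYY-MM-DD` dates order correctly as text, but `i['datePublished']` raises a `KeyError` if a review happens to lack that property. The sketch below is one possible way to guard against that, assuming such reviews should sort last; the `published_date` helper and the sample records are hypothetical and not part of the original answer.

```python
from datetime import date, datetime


def published_date(review):
    """Return datePublished as a date, or date.min if it is missing or malformed."""
    raw = review.get("datePublished")
    try:
        return datetime.strptime(raw, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        # TypeError: the key is absent (raw is None); ValueError: unexpected format.
        return date.min


# Made-up records standing in for the scraped review dicts.
reviews = [
    {"author": "A", "datePublished": "2014-11-27"},
    {"author": "B"},
    {"author": "C", "datePublished": "2014-12-15"},
]

# Newest first; reviews without a usable date fall to the end.
newest_first = sorted(reviews, key=published_date, reverse=True)
print([r["author"] for r in newest_first])
```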

Demo:

>>> from bs4 import BeautifulSoup 
>>> from pprint import pprint 
>>> import requests 
>>> response = requests.get('http://www.homedepot.com/p/Husky-41-in-16-Drawer-Tool-Chest-and-Cabinet-Set-HOTC4016B1QES/205080371') 
>>> soup = BeautifulSoup(response.content) 
>>> reviews = [] 
>>> for rating in soup.find_all(itemprop='review'): 
...  data = {} 
...  for item in rating.find_all(itemprop=True): 
...   data[item['itemprop']] = item.attrs.get('content') or item.get_text() 
...  reviews.append(data) 
... 
>>> reviews.sort(key=lambda i: i['datePublished'], reverse=True) 
>>> pprint(reviews) 
[{u'author': u'RHS381', 
    u'bestRating': u'5', 
    u'datePublished': u'2014-12-15', 
    u'description': u' I\'ve shopped all the home improvement centers evaluating the tool box brands they carry. Including specialty stores like Sears, Harbor Freight and Northern tool. Without a doubt, HomeDepots HUSKY tool boxes and tool storage systems were a land slide winner. These box\'s are SOLID, desent gauge steel, bullet proof hardware and casters you\'d expect to find on military grade equipment. The clean finish and over all appearance was so impressed with these boxes & cabinites, that I was able to convince my wife to let me use these as our 2 year old sons bedroom furniture set. I bought 2- 46" 9 draw mobile workbenches, 2- 27" 8-draw chest and cabinet sets with 2 draw intermediate chest and 1- 41" 16 drawer tool chest and cabinet set. I don\'t know about the rest of those boxs out there that cost twice as much and are half as husky, or the men they buy them, but as for me, my son & my house, WE WILL SERVE THE HUSKY!!!!! (Pictures of completed bedroom to follow)\n', 
    u'itemReviewed': u'HUSKY 41 in. 16-Drawer Tool Chest and Cabinet Set', 
    u'name': u" Husky Box's as bedroom furniture\n", 
    u'ratingValue': u'5', 
    u'reviewRating': u'Rated 5 out of 5'}, 
{u'author': u'Echristen68', 
    u'bestRating': u'5', 
    u'datePublished': u'2014-11-27', 
    u'description': u" I've had this box for a year and it's been great for a home mechanic. I've had Craftsman cheats in the last and this is comparable as far as the steel goes. It's got some nicer features like the extra space on top with the big hinged lid. I like to organize new parts on top. This isn't a heavy duty box but it's plenty strong enough for your tools. If you organize the drawers like you should you'll never exceed the strength. The slides work smoothly but the latches are a little flimsy and can be tricky.\nThe varity of drawer sizes in nice but I would have liked to have at least one more deeper drawer for my 3/8 sockets. The large top drawer in the bottom chest is just big enough for all my 1/2 sockets.\nAll-in-all I love the look and feel. It's an OUTSTANDING value for the price vs Snap-On or even Harbor Freight boxes!\n", 
    u'itemReviewed': u'HUSKY 41 in. 16-Drawer Tool Chest and Cabinet Set', 
    u'name': u' I really like this box!\n', 
    u'ratingValue': u'4', 
    u'reviewRating': u'Rated 4 out of 5'}, 
{u'author': u'Razor', 
    u'bestRating': u'5', 
    u'datePublished': u'2014-11-20', 
    u'description': u' I spent the last month checking and looking at all tool boxes that I could find. Online and at available stores. In comparison to all, this is by far the best deal for the money. Quality, workmanship and construction of this is by far the best for the money. Some I looked at are twice as much money for the same quality... I have had this approx. a month and filled with tools and shop stuff and with the ball bearing drawers loaded, does not make any difference on drawer operation. Granted we still need the test of time..\n', 
    u'itemReviewed': u'HUSKY 41 in. 16-Drawer Tool Chest and Cabinet Set', 
    u'name': u' solid construction\n', 
    u'ratingValue': u'5', 
    u'reviewRating': u'Rated 5 out of 5'}, 
{u'author': u'Skip', 
    u'bestRating': u'5', 
    u'datePublished': u'2014-10-28', 
    u'description': u' Love the tall Top space. Drills and other rechargeable tools fit nicely up there. Drawers slide better with the more weight you put in them. Wheels rolls really well. Overall, very pleased with this box for what it cost.\n', 
    u'itemReviewed': u'HUSKY 41 in. 16-Drawer Tool Chest and Cabinet Set', 
    u'name': u' Nice Box for the price.\n', 
    u'ratingValue': u'4', 
    u'reviewRating': u'Rated 4 out of 5'}, 
{u'author': u'48Kilo', 
    u'bestRating': u'5', 
    u'datePublished': u'2014-10-10', 
    u'description': u' This unit is solid and heavy duty. Drawers and bins are strong and provide various sizes for all our tools and accessories. Very competitive pricing. Easy to assemble. Would recommend this chest and cabinet to anyone needing to get tools organized.\n', 
    u'itemReviewed': u'HUSKY 41 in. 16-Drawer Tool Chest and Cabinet Set', 
    u'name': u' Exactly what we needed...\n', 
    u'ratingValue': u'5', 
    u'reviewRating': u'Rated 5 out of 5'}] 

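Thanks for your reply! But if you click the button to sort by newest, the most recent review is from December 30. This solution only sorts the reviews that are already in the page by newest.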


OK, but can you tell me how to retrieve datePublished? I want to save the dates to a list.


@plainvanilla: `[tag['content'] for tag in soup.find_all(itemprop='datePublished')]` gives you a list of all those dates.
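As a small self-contained illustration of that list comprehension, the following sketch runs it on an inline fragment. The HTML is made up for the example, and the `has_attr` filter is an added assumption to skip any datePublished element that carries no `content` attribute, which would otherwise raise a `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the third element deliberately has no content attribute.
html = """
<meta itemprop="datePublished" content="2014-11-27" />
<meta itemprop="datePublished" content="2014-12-15" />
<span itemprop="datePublished">no content attribute</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the tags that actually expose the date in a content attribute.
dates = [
    tag["content"]
    for tag in soup.find_all(itemprop="datePublished")
    if tag.has_attr("content")
]
print(dates)
```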