2014-01-11 84 views
11

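Parsing JavaScript returned from BeautifulSoup

I would like to parse the web page http://dcsd.nutrislice.com/menu/meadow-view/lunch/ to grab today's lunch menu. (I've built an Adafruit #IoT thermal printer, and I'd like to print the menu automatically each day.)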

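I initially approached this with BeautifulSoup, but it turns out that most of the data is loaded by JavaScript, and I'm not sure BeautifulSoup can handle that. If you view the page source, you'll see the relevant data stored in bootstrapData['menuMonthWeeks'].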

import urllib2 
from BeautifulSoup import BeautifulSoup 

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/" 
soup = BeautifulSoup(urllib2.urlopen(url).read()) 

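The snippet above is an easy way to fetch the source for review.

My question is: what is the easiest way to extract this data so that I can do something with it? Literally, all I want is a string like:

Southwest Cheese Omelet, Potato Wedges, The Harvest Bar (THB), THB - Cheesy Pesto Bread, Ham Deli Sandwich, Red Pepper Sticks, Strawberries

I've thought about using WebKit to process the page and get the rendered HTML (i.e. what a browser does), but that seems unnecessarily complex. I'd rather find something that can simply parse the data.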

Answers

10

Something like PhantomJS would probably be more robust, but here is some basic Python code that extracts the full menu:

import json 
import re 
import urllib2 

text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read() 
menu = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1)) 

print menu 

After that, you'll want to search the menu for the date you're interested in.

EDIT: Some overkill on my part:

import itertools 
import json 
import re 
import urllib2 

text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read() 
menus = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1)) 

days = itertools.chain.from_iterable(menu['days'] for menu in menus) 

day = next(itertools.dropwhile(lambda day: day['date'] != '2014-01-13', days), None) 

if day: 
    print '\n'.join(item['food']['description'] for item in day['menu_items']) 
else: 
    print 'Day not found.' 
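
The question asks for a single comma-separated string rather than one item per line. A minimal Python 3 sketch of that variation follows, assuming the parsed data has the shape the code above relies on: a list of weeks, each with 'days', and each day with 'date' and 'menu_items'. The helper name menu_line_for and the sample data are hypothetical and only stand in for the real page.

```python
from itertools import chain


def menu_line_for(menus, date_str):
    """Return one comma-separated line for date_str, or None if absent."""
    # Flatten the per-week lists of days into a single stream of days.
    days = chain.from_iterable(week['days'] for week in menus)
    # Pick the first day whose date matches; None when nothing matches.
    day = next((d for d in days if d['date'] == date_str), None)
    if day is None:
        return None
    return ', '.join(item['food']['description'] for item in day['menu_items'])


# Hypothetical sample shaped like the parsed bootstrapData['menuMonthWeeks'].
sample = [
    {'days': [
        {'date': '2014-01-13', 'menu_items': [
            {'food': {'description': 'Southwest Cheese Omelet'}},
            {'food': {'description': 'Potato Wedges'}},
        ]},
    ]},
]

print(menu_line_for(sample, '2014-01-13'))
```

For a daily print job, datetime.date.today().isoformat() produces a date string in the same YYYY-MM-DD form used above.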

4

All you need is a little string slicing:

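import json
import urllib2

from BeautifulSoup import BeautifulSoup

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
soup = BeautifulSoup(urllib2.urlopen(url).read())
script = soup.findAll('script')[1].string
# Keep everything after the assignment and before the last ';'.
data = script.split("bootstrapData['menuMonthWeeks'] = ", 1)[-1].rsplit(';', 1)[0]
data = json.loads(data)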

JSON is, after all, a subset of JavaScript.
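
Because the right-hand side of the assignment is valid JSON, the standard json decoder can also read it straight out of the script text and stop where the value ends. The Python 3 sketch below illustrates that idea with json.JSONDecoder.raw_decode; it is a variation on the slicing above rather than the method used in this answer, and the helper name extract_js_json and the sample page text are hypothetical.

```python
import json


def extract_js_json(text, marker):
    """Parse the JSON value assigned right after `marker` in `text`."""
    start = text.index(marker) + len(marker)
    # Skip whitespace and the '=' sign; raw_decode must start on the value.
    while text[start] in ' \t\r\n=':
        start += 1
    # raw_decode parses one JSON value and ignores whatever follows it.
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value


# Hypothetical stand-in for the page's inline script.
page = """
var bootstrapData = {};
bootstrapData['menuMonthWeeks'] = [{"name": "a; b = c", "days": []}];
bootstrapData['other'] = 1;
"""

weeks = extract_js_json(page, "bootstrapData['menuMonthWeeks']")
print(weeks[0]['name'])
```

This does not depend on which script element holds the data, nor on where the last semicolon falls, so semicolons or equals signs inside string values do not disturb it.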

+0

Very helpful! It needed a few more imports and the URL definition, but in the end this also worked well for getting the value. – Wade

0

Without BeautifulSoup, one simple way would be:

import urllib2
import json

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
for line in urllib2.urlopen(url):
    if "bootstrapData['menuMonthWeeks']" in line:
        # Split only on the first '=' so any '=' inside the JSON is kept.
        data = json.loads(line.split("=", 1)[1].strip('\n;'))
        print data[0]["last_updated"]

Output:

2013-11-11T11:18:13.636 

For a more general approach, see "JavaScript parser in Python".

0

If you'd rather not mess with json (which is not recommended), you can try the following:

import urllib2
import re

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
# Brittle: assumes the data sits on line 61 of the page source.
data = urllib2.urlopen(url).readlines()[60].partition('=')[2].strip()

foodlist = []

prev = 'name'
# Walk every double-quoted string and keep the ones that follow a "name" key.
for i in re.findall('"([^"]*)"', data):
    if "The Harvest Bar (THB)" in i or i == "description" or i == "start_date":
        prev = i
        continue
    if prev == 'name':
        if i.startswith("THB - "):
            i = i[6:]
        foodlist.append(i)
    prev = i

print '\n'.join(foodlist)

I guess this is what you'll ultimately need:

Orange Chicken Bowl 
Roasted Veggie Pesto Pizza 
Cheese Sandwich & Yogurt Tube 
Steamed Peas 
Peaches 
Southwest Cheese Omelet 
Potato Wedges 
Cheesy Pesto Bread 
Ham Deli Sandwich 
Red Pepper Sticks 
Strawberries 
Hamburger 
Cheeseburger 
Potato Wedges 
Chicken Minestrone Soup 
Veggie Deli Sandwich 
Baked Beans 
Green Beans 
Fruit Cocktail 
Cheese Pizza 
Pepperoni Pizza 
Diced Chicken w/ Cornbread 
Turkey Deli Sandwich 
Celery Sticks 
Blueberries 
Cowboy Mac 
BYO Asian Salad 
Sunbutter Sandwich 
Stir Fry Vegetables 
Pineapple Tidbits 
Enchilada Blanco 
Sausage & Black Olive Pizza 
Cheese Sandwich & Yogurt Tube 
Southwest Black Beans 
Red Pepper Sticks 
Applesauce 
BBQ Roasted Chicken. 
Hummus Cup w/ Pita bread 
Ham Deli Sandwich 
Mashed potatoes w/ gravy 
Celery Sticks 
Kiwi 
Popcorn Chicken Bowl 
Tuna Salad w/ Pita Bread 
Veggie Deli Sandwich 
Corn Niblets 
Blueberries 
Cheese Pizza 
Pepperoni Pizza 
BYO Chef Salad 
BYO Vegetarian Chef Salad 
Turkey Deli Sandwich 
Steamed Cauliflower 
Banana, Whole 
Bosco Sticks 
Chicken Egg Roll & Chow Mein Noodles 
Sunbutter Sandwich 
California Blend Vegetables 
Fresh Pears 
Baked Mac & Cheese 
Italian Dunker 
Ham Deli Sandwich 
Red Pepper Sticks 
Pineapple Tidbits 
Hamburger 
Cheeseburger 
Baked Fries 
BYO Taco Salad 
Veggie Deli Sandwich 
Baked Beans 
Coleslaw 
Fresh Grapes 
Cheese Pizza 
Pepperoni Pizza 
Diced Chicken w/ Cornbread 
Turkey Deli Sandwich 
Steamed Cauliflower 
Fruit Cocktail 
French Dip w/ Au Jus 
Baked Fries 
Turkey Noodle Soup 
Sunbutter Sandwich 
Green Beans 
Warm Cinnamon Apples 
Rotisserie Chicken 
Mashed potatoes w/ gravy 
Bacon Cheeseburger Pizza 
Cheese Sandwich & Yogurt Tube 
Steamed Peas 
Apple Wedges 
Turkey Chili 
Cornbread Muffins 
BYO Chef Salad 
BYO Vegetarian Chef Salad 
Ham Deli Sandwich 
Celery Sticks 
Fresh Pears 
Beef, Bean & Red Chili Burrito 
Popcorn Chicken & Breadstick 
Veggie Deli Sandwich 
California Blend Vegetables 
Strawberries 
Cheese Pizza 
Pepperoni Pizza 
Hummus Cup w/ Pita bread 
Turkey Deli Sandwich 
Green Beans 
Orange Wedges 
Bosco Sticks 
Cheesy Bean Soft Taco Roll Up 
Sunbutter Sandwich 
Pinto Bean Cup 
Baby Carrots 
Blueberries 

With json:

import urllib2
import json

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
for line in urllib2.urlopen(url):
    if "bootstrapData['menuMonthWeeks']" in line:
        data = json.loads(line.split("=", 1)[1].strip('\n;'))
        print data[0]["name"]
        # Stop only after the data line has been handled.
        break
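
The snippets in this thread use Python 2 (urllib2 and the print statement). As a rough Python 3 sketch of the same line-scanning idea, the function below takes any iterable of lines so it can be tried without network access. The names parse_menu_from_lines and fake_response are hypothetical, and the sample lines only imitate the page.

```python
import json

MARKER = "bootstrapData['menuMonthWeeks']"


def parse_menu_from_lines(lines):
    """Return the parsed menu data from the first line containing MARKER."""
    for raw in lines:
        # urlopen() yields bytes in Python 3; decode before searching.
        line = raw.decode('utf-8') if isinstance(raw, bytes) else raw
        if MARKER in line:
            # Split only on the first '=' so '=' inside the JSON survives.
            payload = line.split('=', 1)[1].strip().rstrip(';')
            return json.loads(payload)
    return None


# Hypothetical lines standing in for the HTTP response body.
fake_response = [
    b"<script>\n",
    b"bootstrapData['menuMonthWeeks'] = [{\"name\": \"Lunch\"}];\n",
    b"</script>\n",
]

data = parse_menu_from_lines(fake_response)
print(data[0]['name'])
```

Against the live page, the object returned by urllib.request.urlopen(url) can be passed in directly, since it iterates over the response line by line. This assumes the page still serves the data on a single line, as it did when the answers were written.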