Beautiful Soup table parsing

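We are working on a university project in which we want to extract data from the university timetable and use it in our own project. We have a Python script that extracts the data. It runs fine on a local machine, but when we run the same script on Amazon EC2 it fails with an error.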

from bs4 import BeautifulSoup 
import requests 

# url from timetable.ucc.ie showing 3rd Year semester 1 timetable 
url = 'http://timetable.ucc.ie/showtimetable2.asp?filter=%28None%29&identifier=BSCS3&days=1-5&periods=1-20&weeks=5-16&objectclass=programme%2Bof%2Bstudy&style=individual' 

# Retrieve the web page at url and convert the data into a soup object 
r = requests.get(url) 
data = r.text 
soup = BeautifulSoup(data) 

# Retrieve the table containing the timetable from the soup object for parsing 
timetable_to_parse = soup.find('table', {'class' : 'grid-border-args'}) 

i = 0 # i is an index into pre_format_day 
pre_format_day = [[],[],[],[],[],[]] # holds un-formatted day information 
day = [[],[],[],[],[],[]] # hold formatted day information 
day[0] = pre_format_day[0] 

# look at each td within the table 
for slot in timetable_to_parse.findAll('td'):
    # if slot content is a day of the week, move pointer to next day
    # indicated all td's relating to a day have been looked at
    if slot.get_text() in ('Mon', 'Tue', 'Wed', 'Thu', 'Fri'):
        i += 1
    else:  # otherwise the td related to a time slot in a day
        try:
            if slot['colspan'] is "4":  # test if colspan of td is 4
                # if it is, append to list twice to represent 2 hours
                pre_format_day[i].append(slot.get_text().replace('\n', ''))
                pre_format_day[i].append(slot.get_text().replace('\n', ''))
        except:
            pass
        # if length of text of td is 1, > 11 or contains ":00"
        if (len(slot.get_text()) == 1 or len(slot.get_text()) > 11
                or ":00" in slot.get_text()):
            # add to pre_format_day
            pre_format_day[i].append(slot.get_text().replace('\n', ''))

# go through each day in pre_format_day and insert formatted version in day[] 
for i in range(1, 6):
    j = 0
    while j < 20:
        if len(pre_format_day[i][j]) > 10:  # if there is an event store in day
            day[i].append(pre_format_day[i][j])
        else:  # insert space holder into slots with no events
            day[i].append('----- ')
        j += 2

# creates a string containing a html table for output 
timetable = '<table><tr>' 
timetable += '<th></th>' 
for i in range(0, 10): 
    timetable += '<th>' + day[0][i] + '</th> ' 

days = ['', 'Mon', 'Tue' , 'Wed' , 'Thu' , 'Fri'] 

for i in range(1, 6):
    timetable += '</tr><tr><th>' + days[i] + '</th>'
    for j in range(0, 10):
        if len(day[i][j]) > 10:
            timetable += '<td class="lecture">' + day[i][j] + '</td>'
        else:
            timetable += '<td></td>'

timetable += '</tr></table>' 

# output timetable string 
print timetable

The output on the local machine is a table containing the required data.

The output on the EC2 instance is:

Traceback (most recent call last):
  File "parse2.py", line 21, in <module>
    for slot in timetable_to_parse.findAll('td'):
AttributeError: 'NoneType' object has no attribute 'findAll'

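Both machines run Ubuntu 14.10 and Python 2.7, but for some reason I can't figure out, the script on EC2 doesn't seem to be getting the required page from the URL and extracting the table from it. Beyond that, I am lost.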

Any help is much appreciated.

2015-03-02 sir_t

Have you checked the HTML data that is returned when running on the EC2 instance? Is it being served a different page? – 2015-03-02 12:11:03

The data is as expected. – 2015-03-04 11:17:35

The problem was that the EC2 instance was using a different parser from the one on the local machine. Fixed with:

apt-get install python-lxml
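
To compare the two machines, it may help to check which parser BeautifulSoup is likely to fall back to when none is named. This is only a sketch: it assumes BeautifulSoup's documented preference order (lxml, then html5lib, then Python's built-in html.parser), the function name `likely_default_parser` is made up for illustration, and it only tests whether each module can be imported.

```python
import importlib


def likely_default_parser():
    """Guess which parser BeautifulSoup(data) would use on this machine.

    Assumption: BeautifulSoup prefers lxml, then html5lib, and otherwise
    falls back to the built-in html.parser.
    """
    for name in ("lxml", "html5lib"):
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # not installed here, try the next candidate
        return name
    return "html.parser"


# Run this on both the local machine and the EC2 instance and compare.
print(likely_default_parser())
```

If the two machines print different names, passing the parser explicitly, for example `BeautifulSoup(data, 'lxml')`, should make them behave the same, provided that parser is installed on both.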

2015-03-04 11:41:48

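Log in to the EC2 instance and step through the script line by line in the Python CLI until you find the problem. For some reason, BeautifulSoup parses slightly differently on different systems. I have had the same problem and I don't know the reason behind it. Without knowing the HTML content, it is hard for us to give you specific help.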
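
One check that could be done while stepping through is to list the `class` value of every table in the fetched HTML, without going through BeautifulSoup at all. The sketch below uses only the standard library parser as a stand-in; the names `TableClassCollector` and `table_classes` and the sample HTML are hypothetical, not taken from the real timetable page.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as used in the question


class TableClassCollector(HTMLParser):
    """Records the class attribute of every <table> start tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.table_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # attrs is a list of (name, value) pairs; class may be missing
            self.table_classes.append(dict(attrs).get("class"))


def table_classes(html):
    collector = TableClassCollector()
    collector.feed(html)
    collector.close()
    return collector.table_classes


# Made-up sample; in the real script, pass r.text instead.
sample = ('<html><body><table class="header"><tr><td>x</td></tr></table>'
          '<table class="grid-border-args"><tr><td>Mon</td></tr></table>'
          '</body></html>')
found = table_classes(sample)
print(found)
```

If `grid-border-args` shows up in this list on EC2 but `soup.find(...)` still returns None, the page itself is fine and the difference is most likely in how the parser built the tree.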

2015-03-02 12:39:15 Stewart

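So I tried printing each variable, and they all print, up to and including soup. If I print table_to_parse I get None, with no error. The output of soup is at http://cs1.ucc.ie/~jmt2/soup.txt – 2015-03-04 10:27:43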
