Parsing an HTML table with BeautifulSoup

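I want to extract the data for a single day from this timetable: click here

Using Beautiful Soup, I have been able to collect the entire row for any given day (in this case Monday, or 'Mon') into a list with this code: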

from BeautifulSoup import BeautifulSoup 

day ='Mon' 

with open('timetable.txt', 'rt') as input_file: 
    html = input_file.read() 
    soup = BeautifulSoup(html) 
    #finds correct day tag 
    starttag = soup.find(text=day).parent.parent 
    print starttag 
    nexttag = starttag 
    row=[] 
    x = 0 
    #puts all td tags for that day in a list 
    while x < 18: 
        nexttag = nexttag.nextSibling.nextSibling 
        row.append(nexttag) 
        x += 1 
print row

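As you can see, this returns a list of the td tags that make up the 'Mon' row of the timetable.

My problem is that I don't know how to parse or search the returned list any further to find the relevant information (COMP1740, etc.).

If I can work out how to search each element of the list for the module code, I can join the codes with another list of times to give a timetable for the day.

All help is welcome! (Including completely different approaches.)

Source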

2011-12-03 Ben Hirsh

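You can use a regular expression to find things that look like course codes, i.e. pattern matching.

I don't know how much experience you have, but Python includes the 're' module. Look for the pattern "the four letters C-O-M-P followed by one or more digits". That gives the regex COMP\d+, where \d is a digit and the + that follows means "one or more, as many as possible" (in this case, 4).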

from BeautifulSoup import BeautifulSoup 
import re 

day ='Mon' 
codePat = re.compile(r'COMP\d+') 

with open('timetable.txt', 'rt') as input_file: 
    html = input_file.read() 
    soup = BeautifulSoup(html) 
    #finds correct day tag 
    starttag = soup.find(text=day).parent.parent 
    # print starttag 
    nexttag = starttag 
    row=[] 
    x = 0 
    #puts all td tags for that day in a list 
    while x < 18: 
        nexttag = nexttag.nextSibling.nextSibling 
        found = codePat.search(repr(nexttag)) 
        if found: 
            row.append(found.group(0)) 
        x += 1 
print row

This gives me the output:

['COMP1940', 'COMP1550', 'COMP1740']

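As I said, I don't know how familiar you are with regular expressions, so if you can describe the patterns, I can try writing them for you. Here's a good resource if you decide to do it yourself.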
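As a rough illustration of turning a described pattern into a regex, here is a minimal standalone sketch (Python 3). The cell markup and the MATH1010 code are made up, and the more general pattern assumes module codes are four capital letters followed by four digits, which nothing in this thread confirms.

```python
import re

# Made-up cell markup standing in for repr(nexttag); not the real timetable.
sample_cells = [
    '<td rowspan="1" colspan="2"><font>COMP1740</font></td>',
    '<td>&nbsp;</td>',
    '<td rowspan="1" colspan="4"><font>MATH1010</font></td>',
]

comp_only = re.compile(r'COMP\d+')             # the pattern from the answer
any_module = re.compile(r'\b[A-Z]{4}\d{4}\b')  # assumed general code format


def find_codes(pattern, cells):
    """Return the first match of pattern in each cell that contains one."""
    codes = []
    for cell in cells:
        found = pattern.search(cell)
        if found:
            codes.append(found.group(0))
    return codes


print(find_codes(comp_only, sample_cells))
print(find_codes(any_module, sample_cells))
```

The first call only picks up codes beginning with COMP; the second would also pick up other departments, provided the assumed format holds.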

Source

2011-12-04 03:02:37 FakeRainBrigand

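Thank you very much for your help. It turns out that only my module codes start with 'COMP', so I just changed the search pattern to 'rowspan="1"', because that is the only other thing in the code that indicates a module at that position in the table. I'll post the new code as an answer. –

@Ben, regarding your new answer: once you move past the last sibling, nexttag will be None, so you can write '''if not nexttag: break'''. It's cleaner than try/catch. – FakeRainBrigand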
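The "stop when the sibling is None" idea from the comment can be sketched without the real timetable. This example (Python 3) uses the standard library's xml.dom.minidom as a stand-in for BeautifulSoup, only because its nodes also expose nextSibling and also have whitespace text nodes between cells; the row markup and the function name cells_after_first are made up for illustration.

```python
from xml.dom import minidom

# Made-up, well-formed row standing in for one day of the timetable.
SAMPLE_ROW = """<tr>
<td>Mon</td>
<td rowspan="1" colspan="2">COMP1740</td>
<td></td>
<td rowspan="1" colspan="3">COMP1550</td>
</tr>"""


def cells_after_first(row_markup):
    """Collect every td after the first one, stopping at the end of the row."""
    doc = minidom.parseString(row_markup)
    node = doc.getElementsByTagName("td")[0]  # the cell holding the day name
    cells = []
    while True:
        node = node.nextSibling
        if node is None:  # walked past the last sibling: end of the <tr>
            break
        if node.nodeType != node.ELEMENT_NODE:
            continue  # skip the whitespace text nodes between cells
        cells.append(node)
    return cells


cells = cells_after_first(SAMPLE_ROW)
print(len(cells))
```

Stepping one sibling at a time and testing each result avoids calling nextSibling on None, which is what a doubled nextSibling.nextSibling would do at the end of a row.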

from BeautifulSoup import BeautifulSoup 
import re 

#day input 
day ='Thu' 
#searches for a module (where html has rowspan="1") 
module = re.compile(r'rowspan=\"1\"') 
#lengths of module search (depending on html colspan attribute) 
#1.5 hour 
perlen15 = re.compile(r'colspan=\"3\"') 
#2 hour 
perlen2 = re.compile(r'colspan=\"4\"') 
#2.5 hour etc. 
perlen25 = re.compile(r'colspan=\"5\"') 
perlen3 = re.compile(r'colspan=\"6\"') 
perlen35 = re.compile(r'colspan=\"7\"') 
perlen4 = re.compile(r'colspan=\"8\"') 
#times correspond to first row of timetable. 
times = ['8:00', '8:30', '9:00', '9:30', '10:00', '10:30', '11:00', '11:30', '12:00', '12:30', '13:00', '13:30', '14:00', '14:30', '15:00', '15:30'] 

#opens full timetable html 
with open('timetable.txt', 'rt') as input_file: 
    html = input_file.read() 
    soup = BeautifulSoup(html) 
    #finds correct day tag 
    starttag = soup.find(text=day).parent.parent 
    nexttag = starttag 
    row=[] 
    #movement of cursor iterating over times list 
    curmv = 0 
    #puts following td tags for that day in a list 
    for time in times: 
        nexttag = nexttag.nextSibling.nextSibling 
        #detect if a module is found 
        found = module.search(repr(nexttag)) 
        #detect length of that module 
        hour15 = perlen15.search(repr(nexttag)) 
        hour2 = perlen2.search(repr(nexttag)) 
        hour25 = perlen25.search(repr(nexttag)) 
        hour3 = perlen3.search(repr(nexttag)) 
        hour35 = perlen35.search(repr(nexttag)) 
        hour4 = perlen4.search(repr(nexttag)) 
        if found: 
            row.append(times[curmv]) 
            row.append(nexttag) 
            if hour15: 
                curmv += 3 
            elif hour2: 
                curmv += 4 
            elif hour25: 
                curmv += 5 
            elif hour3: 
                curmv += 6 
            elif hour35: 
                curmv += 7 
            elif hour4: 
                curmv += 8 
            else: 
                curmv += 2 
        else: 
            curmv += 1 
#write day to html file 
with open('output.html', 'wt') as output_file: 
    for e in row: 
        output_file.write(str(e))

As you can see, the code can tell 1-hour and 2-hour lectures apart, as well as longer ones of 1.5 hours, 2.5 hours, and so on.
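The six separate colspan patterns could probably be collapsed into one capturing pattern. The following is a minimal sketch (Python 3), assuming the colspan value always equals the number of half-hour columns a module covers and that a module cell without a colspan attribute covers two columns, as in the else branch above. The sample cells and the names slots_for and start_times are made up.

```python
import re

# Start time of each half-hour column, copied from the code above.
TIMES = ['8:00', '8:30', '9:00', '9:30', '10:00', '10:30', '11:00', '11:30',
         '12:00', '12:30', '13:00', '13:30', '14:00', '14:30', '15:00', '15:30']

MODULE = re.compile(r'rowspan="1"')
COLSPAN = re.compile(r'colspan="(\d+)"')


def slots_for(cell_html):
    """Return how many half-hour columns a cell covers (assumed from colspan)."""
    if not MODULE.search(cell_html):
        return 1  # empty cell: one column
    found = COLSPAN.search(cell_html)
    return int(found.group(1)) if found else 2  # default: 1-hour module


def start_times(cells):
    """Pair each module cell with the start time of the column it begins in."""
    cur = 0
    result = []
    for cell in cells:
        if cur >= len(TIMES):
            break  # ran off the end of the known time columns
        if MODULE.search(cell):
            result.append((TIMES[cur], cell))
        cur += slots_for(cell)
    return result


# Made-up cells standing in for repr(nexttag) values.
sample = [
    '<td></td>',
    '<td></td>',
    '<td rowspan="1" colspan="3">COMP1550</td>',
    '<td></td>',
    '<td rowspan="1" colspan="2">COMP1740</td>',
]
for time, cell in start_times(sample):
    print(time, cell)
```

The bounds check on the column index also keeps the lookup into the time list from running past its end.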

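My only problem now is line 32 (the for time in times: loop): I need a better way to tell the code to stop moving horizontally across the table, i.e. to know when to stop the for loop. (In the earlier code I had while x < 18: because there are 18 td tags in the row, so it only worked for Monday.) Can the loop stop when it hits the parent </tr> tag?

Thanks!

Edit: I'm going to try using try and except blocks to catch the error I get if I set 'times' to run all the way to 18:00.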

Edit 2: It works! :D
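As an alternative to try/except for the stopping problem, one different approach is an event-based parser that simply stops collecting when it sees the closing </tr>. This is only a sketch (Python 3, standard library html.parser rather than BeautifulSoup); the class name DayRowParser and the sample table are made up, and it assumes each day occupies a single row with no nested tables.

```python
from html.parser import HTMLParser


class DayRowParser(HTMLParser):
    """Collect (colspan, text) for each td that follows the day cell in its row."""

    def __init__(self, day):
        super().__init__()
        self.day = day
        self.in_td = False
        self.in_day_row = False
        self.done = False
        self.colspan = 1
        self.text = []
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and not self.done:
            self.in_td = True
            self.colspan = int(dict(attrs).get("colspan", 1))
            self.text = []

    def handle_data(self, data):
        if self.in_td:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_td:
            self.in_td = False
            content = "".join(self.text).strip()
            if self.in_day_row:
                self.cells.append((self.colspan, content))
            elif content == self.day:
                self.in_day_row = True  # following cells belong to this day
        elif tag == "tr" and self.in_day_row:
            self.in_day_row = False  # closing </tr>: stop collecting
            self.done = True


# Made-up table standing in for timetable.txt.
SAMPLE = """<table>
<tr><td>Mon</td><td></td><td rowspan="1" colspan="2">COMP1740</td></tr>
<tr><td>Tue</td><td rowspan="1" colspan="3">COMP1550</td><td></td></tr>
</table>"""

parser = DayRowParser("Tue")
parser.feed(SAMPLE)
print(parser.cells)
```

Because the row ends itself, no fixed cell count such as 18 is needed, and the same code would apply to any day whose name appears in the first cell of a row.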

Source

2011-12-04 17:11:35
