2016-02-16 20 views
0

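How to parse multiple tr tags in Python

I am having trouble parsing all of the tr tags that occur in a table. I can parse the first tr tag, but I can't work out how to parse all of the subsequent ones. I thought about using a for loop, but it didn't work. I have included only the part of the code that deals with the tr tags I want to store in a JSON file.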

Here is my attempt:

def parseFacultyPage(br, facultyID): 
    if br is None: 
        return None

    br.open('https://academics.vit.ac.in/student/stud_home.asp') 
    response = br.open('https://academics.vit.ac.in/student/class_message_view.asp?sem=' + facultyID) 
    html = response.read() 
    soup = BeautifulSoup(html) 
    tables = soup.findAll('table') 

    # Extracting basic information of the faculty 
    infoTable = tables[0].findAll('tr') 
    name = infoTable[2].findAll('td')[0].text 
    if len(name) == 0:
        return None
    subject = infoTable[2].findAll('td')[1].text 
    msg = infoTable[2].findAll('td')[2].text 
    sent = infoTable[2].findAll('td')[3].text 
    emailmsg = 'Subject: New VIT Email' + msg 

Here is a sample of the HTML when more than one tr tag is present.

<table width="79%" border="0" cellpadding="0" cellspacing="0" height="350"> 
    <tr> 
    <td valign="top" width="1%" bgcolor=#FFFFFF> 
     &nbsp; 
    </td> 
    <td valign="top" width="78%" bgcolor=#FFFFFF> 



    <center><b><u>VIEW CLASS MESSAGE - Winter Semester 2015~16</u></b></center> 
    <br><br> 


     <br> 
     <table cellpadding=4 cellspacing=2 border=0 bordercolor='black' width="100%"> 

     <tr bgcolor=#5A768D> 
      <td width="25%"><font color=#FFFFFF>From</font></td> 
      <td width="25%"><font color=#FFFFFF>Course</font></td> 
      <td><font color=#FFFFFF>Message</font></td> 
      <td width="10%"><font color=#FFFFFF>Posted On</font></td> 
     </tr> 

      <tr bgcolor="#EDEADE" onMouseOut="this.bgColor='#EDEADE'" onMouseOver="this.bgColor='#FFF9EA'"> 
       <td valign="top">RAGHAVAN R (SITE)</td> 
       <td valign="top">ITE308 - Distributed Systems - TH</td> 
       <td valign="top">Dear students, 

As informed in the class, this is to remind you Today special class from 6 to 6.50 pm at same venue SJT 126. 

regards 

R. Raghavan 
SITE</td> 
       <td valign="top">11/02/2016 11:42:57</td> 
      </tr> 

      <tr bgcolor="#EDEADE" onMouseOut="this.bgColor='#EDEADE'" onMouseOver="this.bgColor='#FFF9EA'"> 
       <td valign="top">SMART (APT) (ACAD)</td> 
       <td valign="top">STS302 - Soft Skills - SS</td> 
       <td valign="top">Dear Students, 

As 04 Feb 16 to 08 Feb 16 were announced as "No Instruction days", the first assessment that was supposed to happen from 08 Feb 16 to 12 Feb 16 is being postponed to 7th week (15 Feb 16 to 19 Feb 16) 
</td> 
       <td valign="top">10/02/2016 21:48:14</td> 
      </tr> 

     <tr bgcolor=#5A768D> 
      <td>&nbsp;</td> 
      <td>&nbsp;</td> 
      <td>&nbsp;</td> 
      <td>&nbsp;</td> 
     </tr> 

     </table> 


    <br><br> 
    </td> 
    </tr> 
</table> 

Answers

3

You should iterate over the rows as shown below, fetching each row's cells into the columns variable at the start of the loop body:

for index, row in enumerate(tables[1].findAll('tr')):
    if index == 0:
        continue  # skip the header row

    columns = row.findAll('td')
    name = columns[0].text
    if not name:
        return None
    subject = columns[1].text
    msg = columns[2].text
    sent = columns[3].text

Edit: It looks like your HTML has a two-table structure, and you need the inner one. So use index 1 instead of 0, i.e. tables[1].
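To see why the index matters, here is a minimal sketch using a reduced, hypothetical version of the page (the SAMPLE string is invented for illustration, and it assumes the bs4 package, where find_all is the current spelling of findAll). Searching for tr inside the outer table is recursive, so it also returns the inner table's rows, whereas tables[1] yields only the message rows.

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down page: an outer layout table wrapping the message table.
SAMPLE = """
<table>
  <tr><td>
    <table>
      <tr><td>From</td><td>Course</td></tr>
      <tr><td>RAGHAVAN R (SITE)</td><td>ITE308 - Distributed Systems - TH</td></tr>
    </table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
tables = soup.find_all("table")          # document order: outer table, then inner table

outer_rows = tables[0].find_all("tr")    # recursive: the outer row plus both inner rows
inner_rows = tables[1].find_all("tr")    # only the rows of the message table

print(len(tables), len(outer_rows), len(inner_rows))
```

This would also explain why the original code had to read infoTable[2] to reach the first message: indexes 0 and 1 were the outer layout row and the header row.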

I also wrapped the iterator in enumerate, so you have the row index as well. With it, you can skip the header row when index==0.
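Since the goal is to store every message in a JSON file, the loop probably needs to collect each row rather than overwrite the same variables. The following is only a sketch under some assumptions: the function name parse_messages, the dictionary keys and the SAMPLE markup are made up for illustration, and it uses the bs4 package with the built-in html.parser backend. It skips the header row by index and skips the trailing row whose cells contain only &nbsp;.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, cut-down version of the page from the question.
SAMPLE = """
<table><tr><td>
  <table>
    <tr><td>From</td><td>Course</td><td>Message</td><td>Posted On</td></tr>
    <tr><td>RAGHAVAN R (SITE)</td><td>ITE308 - Distributed Systems - TH</td>
        <td>Special class today.</td><td>11/02/2016 11:42:57</td></tr>
    <tr><td>SMART (APT) (ACAD)</td><td>STS302 - Soft Skills - SS</td>
        <td>Assessment postponed.</td><td>10/02/2016 21:48:14</td></tr>
    <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
  </table>
</td></tr></table>
"""


def parse_messages(html):
    """Return one dict per message row of the inner table (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")
    messages = []
    for index, row in enumerate(tables[1].find_all("tr")):
        if index == 0:
            continue  # skip the header row
        columns = row.find_all("td")
        if len(columns) < 4:
            continue  # not a message row
        name = columns[0].get_text(strip=True)
        if not name:
            continue  # blank footer row: its cells hold only &nbsp;
        messages.append({
            "from": name,
            "course": columns[1].get_text(strip=True),
            "message": columns[2].get_text(strip=True),
            "posted_on": columns[3].get_text(strip=True),
        })
    return messages


messages = parse_messages(SAMPLE)
print(json.dumps(messages, indent=2))
```

To write the result to disk instead of printing it, json.dump(messages, f) with a file opened for writing should work in the same way.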

+0

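Your answer is correct, but I can't get it to work on the real HTML page. I had only included part of the HTML code, so your answer doesn't work on it; only the message is stored correctly. Could you please look at the HTML code and tell me how to target it correctly? –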

+0

Please check whether 'tables[1]' gets you the inner table. I have updated the answer with some explanation. – Obsidian

+0

Thank you very much! –