2012-05-01 59 views

How can I parse this HTML table with BeautifulSoup and regex?

My HTML:

  <table cellspacing="0" cellpadding="2" rules="all" border="1" id="branchTable" width="100%"> 
      <tr class="TitleTable"> 
       <th scope="col" width="250"><b>Branch Name</b></th><th scope="col" width="35%"><b>Branch Date</b></th><th scope="col" width="35%"><b>Branch Origin</b></th> 
      </tr><tr class="RowSet"> 
       <td><a class="blue" href="javascript: OpenWindow(&#39;/home/data/files/fetchRecord.php?fileID=342&#39;)">SFO Branch</a></td><td class="red">03/16/2012</td><td class="red">&nbsp;</td> 
      </tr><tr class="RowSet"> 
       <td><a class="blue" href="javascript: OpenWindow(&#39;/home/data/files/fetchRecord.php?fileID=884&#39;)">LAX Branch</a></td><td class="red">03/16/2012</td><td class="red">06/16/1985</td> 
      </tr><tr class="RowSet"> 
       <td><a class="blue" href="javascript: OpenWindow(&#39;/home/data/files/fetchRecord.php?fileID=83&#39;)">DC Branch</a></td><td class="red">03/16/2012</td><td class="red">&nbsp;</td> 
      </tr> 
      </table> 

My code so far:

from BeautifulSoup import BeautifulSoup 

soup = BeautifulSoup(pageSource) 
table = soup.find("table", id = "branchTable") 
rows = table.findAll("tr", {"class":"RowSet"}) 

data = [[td.findChildren(text=True) for td in tr.findAll("td")] for tr in rows] 
print data 

Output:

SFO Branch 03/16/2012 &nbsp; 
LAX Branch 03/16/2012 06/16/1985 
DC Branch 03/16/2012 &nbsp; 

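Expected:

I want to grab the data enclosed in the tags as well as the ID (fetchRecord.php?fileID=). I don't know how to get that value. BeautifulSoup or regex, either is fine. Please help. Thanks!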

What exactly are you trying to do? –

Parse the HTML and get the data shown in the output. I also want to grab the fileID, and I don't know how to do that. – ThinkCode

Answers

1

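You could use a regular expression to parse the href, but I'm too lazy to write one. See href_parse below for the proper way to parse the query string once the URI has been retrieved: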

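from urlparse import urlparse 
from urlparse import parse_qs 

def href_parse(value): 
    # Assumes the parser leaves the &#39; entities unconverted,
    # as the &nbsp; in the output above suggests.
    prefix = 'javascript: OpenWindow(&#39;'
    suffix = '&#39;)'
    if value.startswith(prefix) and value.endswith(suffix):
        file_location = value[len(prefix):-len(suffix)]

        query_string = urlparse(file_location).query
        query_dict = parse_qs(query_string)
        # Query keys are case-sensitive: the parameter is 'fileID'.
        # parse_qs maps each key to a list of values, so take the first.
        return query_dict.get('fileID', [None])[0]
    return None


# Only the first cell of each row contains a link, so search the row itself
# rather than every <td> (td.find() returns None for cells without a link).
href_data = [href_parse(tr.find('a', attrs={'class': 'blue'})['href'])
             for tr in rows]
print href_data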
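
For readers on Python 3, here is a rough equivalent of href_parse above, offered as a sketch rather than a definitive implementation. It uses only the standard library (`urllib.parse` replaces the Python 2 `urlparse` module) and calls `html.unescape` first, on the assumption that the href may arrive either with the raw `&#39;` entities or with them already decoded to `'`. The sample href strings are copied from the table in the question.

```python
from html import unescape
from urllib.parse import urlparse, parse_qs


def href_parse(value):
    """Return the fileID from a 'javascript: OpenWindow(...)' href, or None."""
    # Decode &#39; to ' so the same code works whether or not the HTML
    # parser has already converted entities.
    value = unescape(value)
    prefix, suffix = "javascript: OpenWindow('", "')"
    if not (value.startswith(prefix) and value.endswith(suffix)):
        return None
    file_location = value[len(prefix):-len(suffix)]
    query_dict = parse_qs(urlparse(file_location).query)
    # parse_qs maps each key to a list of values; take the first one.
    return query_dict.get("fileID", [None])[0]


# Sample href values taken from the table in the question:
# one with raw entities, one with the quotes already decoded.
hrefs = [
    "javascript: OpenWindow(&#39;/home/data/files/fetchRecord.php?fileID=342&#39;)",
    "javascript: OpenWindow('/home/data/files/fetchRecord.php?fileID=884')",
]
file_ids = [href_parse(h) for h in hrefs]
print(file_ids)
```

In a real script the strings in `hrefs` would come from the `href` attribute of each row's link instead of being hard-coded.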
0

How about this:

import re 
# Capture everything between the two &#39; entities; the second one needs its leading '&' too.
urlRE = re.compile(r'javascript: OpenWindow\(&#39;(.*)&#39;\)') 
... 
urlMat = urlRE.match(value) 
if urlMat: 
    url = urlMat.group(1)
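

To go from the matched URL to the fileID itself, the regex approach can be taken one step further. The following is a minimal Python 3 sketch, not the answerer's code: the helper name `extract_file_id` and the second pattern are assumptions made for illustration, and the quote pattern accepts both the raw `&#39;` entity and a decoded `'` since which one appears depends on the HTML parser.

```python
import re

# Accept either the raw entity (&#39;) or an already decoded quote (').
QUOTE = r"(?:&#39;|')"
url_re = re.compile(r"javascript: OpenWindow\(" + QUOTE + r"(.*?)" + QUOTE + r"\)")
# The fileID parameter may be the first (?) or a later (&) query parameter.
file_id_re = re.compile(r"[?&]fileID=(\d+)")


def extract_file_id(href):
    """Hypothetical helper: return the numeric fileID as a string, or None."""
    url_match = url_re.match(href)
    if not url_match:
        return None
    id_match = file_id_re.search(url_match.group(1))
    return id_match.group(1) if id_match else None


# Sample href copied from the table in the question.
sample = "javascript: OpenWindow(&#39;/home/data/files/fetchRecord.php?fileID=83&#39;)"
print(extract_file_id(sample))
```

The regex is shorter than parsing the query string, but it only recognizes purely numeric IDs; parsing with `parse_qs` is the more general option.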