2016-08-10 54 views

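Scraping a single column from a poorly formatted table using BeautifulSoup and Python

I am iterating through a .csv file of contracts and trying to extract a single column from a web page for each one.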

Here is an example of the page: https://www.austintexas.gov/financeonline/contract_catalog/OCCViewMA.cfm?cd=CT&dd=6100&id=13060600641

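I want to scrape the "Commodity Description" column from the table at the bottom of the page. However, I can't figure out how to grab a column, only rows.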

This is the code I am currently using:

import urllib2  # Python 2 standard library
from bs4 import BeautifulSoup

def scraper(first, second, third):
    url = "https://www.austintexas.gov/financeonline/contract_catalog/OCCViewMA.cfm?cd=%s&dd=%d&id=%s" % (first, second, third)
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    foundtext = soup.find('td', text="Commodity Description")
    table = foundtext.findPrevious('table')
    rows = table.findAll('tr')
    second_column = []
    for row in rows:
        print row.contents  # prints every cell of the row, not just one column

I would like the final output to be the text of every row in that column, with a line break between rows.

Any ideas?

Answer


For each row found, find all of its td elements and pick the one you want by index:

table = soup.find('td', text="Commodity Description").find_parent("table") 
for row in table.select("tr")[2:]: # skipping the header rows 
    cell = row.find_all("td")[1] 
    print(cell.get_text()) 
    print("----") 

Prints:

WATERLINE REPLACEMENTCONSTRUCTION, PIPELINEPER YUEJIAO LIU, ADD THE REMAINING FUNDS BACK INTO THIS FUNDING LINE // PEMBERTON HEIGHTS PHASE III PROJECT ++ ENC. $53,209.97 
---- 
WATERLINE REPLACEMENTCONSTRUCTION, PIPELINEPEMBERTON HEIGHTS PHASE III PROJECT 
---- 
WATERLINE REPLACEMENTCONSTRUCTION, PIPELINEPEMBERTON HEIGHTS PHASE III PROJECT 
---- 
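
The same idea can be sketched as a self-contained function that returns the column as one newline-separated string, which is what the question asks for. This is only a sketch under stated assumptions: `SAMPLE_HTML` is an invented stand-in for the contract page (the real markup may differ), and `column_text` is a hypothetical helper name. Instead of hard-coding the index `1` and the two header rows, it looks up the position of the header cell in its own row. It also passes a separator to `get_text()`, which should avoid fused words such as `REPLACEMENTCONSTRUCTION` in the output above; those probably come from line breaks inside the cell.

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for the real page (an assumption).
SAMPLE_HTML = """
<table>
  <tr><td colspan="3">Contract Lines</td></tr>
  <tr><td>Line</td><td>Commodity Description</td><td>Amount</td></tr>
  <tr><td>1</td><td>WATERLINE REPLACEMENT<br>CONSTRUCTION, PIPELINE</td><td>$10</td></tr>
  <tr><td>2</td><td>PAVING<br>ASPHALT</td><td>$20</td></tr>
</table>
"""


def column_text(html, header="Commodity Description"):
    """Return the text of the column under `header`, one row per line."""
    soup = BeautifulSoup(html, "html.parser")
    header_cell = soup.find("td", string=header)
    if header_cell is None:
        return ""  # header not present on this page

    # The header cell's position within its own row is the column index.
    header_row = header_cell.find_parent("tr")
    header_cells = header_row.find_all("td", recursive=False)
    col = next(i for i, c in enumerate(header_cells) if c is header_cell)

    values = []
    # Only rows after the header row hold data.
    for row in header_row.find_next_siblings("tr"):
        cells = row.find_all("td", recursive=False)
        if len(cells) > col:  # skip short or spacer rows
            # A space separator keeps text around <br> tags from fusing.
            values.append(cells[col].get_text(" ", strip=True))
    return "\n".join(values)


if __name__ == "__main__":
    print(column_text(SAMPLE_HTML))
```

Inside `scraper`, the downloaded page could be passed to `column_text` in place of `SAMPLE_HTML`. If the real table uses merged cells (`rowspan` or `colspan`) in its data rows, the index lookup would need adjusting.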

Brilliant! Thanks a bunch – Parseltongue
