2013-11-23 23 views

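Error when parsing table data with BeautifulSoup

I need to read the values from a table that can be located in the HTML page with soup.findAll('table',{'id':'taxHistoryTable'}). I now need to create a similar pointer into that soup so that I can reach the values inside it.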

<table id="taxHistoryTable" class="view-history responsive-table yui3-toggle-content-minimized ceilingless"><thead> 
<tr><th class="year">Year</th> 
<th class="numeric property-taxes">Property taxes</th> 
<th class="numeric">Change</th><th class="numeric tax-assessment">Tax assessment</th> 
<th class="numeric">Change</th></tr></thead><tfoot> 
<tr><td colspan="5"><span class="yui3-toggle-content-link-block"><a href="#" class="yui3-toggle-content-link"> 
<span class="maximize">More</span><span class="minimize">Fewer</span></a></span></td>    </tr></tfoot><tbody> 
<tr class="alt"><td>2011</td><td class="numeric">$489</td><td class="numeric"><span class="delta-value"><span class="inc">-81.8%</span></span></td> 
<td class="numeric">$34,730</td> 
<td class="numeric"><span class="delta-value"><span class="inc">-6.9%</span></span> </td></tr><tr> 
<td>2010</td><td class="numeric">$2,683</td><td class="numeric"><span class="delta-value"><span class="dec">177%</span></span></td><td class="numeric">$37,300</td><td class="numeric"><span class="delta-value"><span class="dec">98.7%</span></span></td></tr><tr class="alt"><td>2009</td><td class="numeric">$969</td><td class="numeric"><span class="delta-value">--</span></td><td class="numeric">$18,770</td><td class="numeric"><span class="delta-value">--</span></td></tr><tr class="minimize"><td>2008</td><td class="numeric">$0</td><td class="numeric"><span class="delta-value">--</span></td><td class="numeric">$18,770</td><td class="numeric"><span class="delta-value">--</span></td></tr></tbody></table> 

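These are the values held in the table with id taxHistoryTable. I wrote two loops to pin down the exact cells, then tried to assign each one to a variable name and write it to a CSV file.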

    page = urllib2.urlopen(houselink).read()  # fetch the listing page
    soup = BeautifulSoup(page)  # parse the HTML
    address = soup.find('h1', {'class': 'prop-addr'})  # element holding the house address
    price = soup.find('h2', {'class': 'prop-value-price'})  # find() returns only the first price element
    # The price markup is not unique at the granular level because the page
    # also shows a Zestimate price, so go through the parent element first.
    price1 = price.find('span', {'class': 'value'})
    bedroom = soup.findAll('span', {'class': 'prop-facts-value'})[0]
    bathroom = soup.findAll('span', {'class': 'prop-facts-value'})[1]
    # zestimate
    zestimate = soup.findAll('td', {'class': 'zestimate'})[1]
    # tax
    loop1 = soup.findAll('table', {'id': 'taxHistoryTable'})
    for form1 in loop1:
        loop2 = form1.findAll('tr', {'class': 'alt'})
        for form2 in loop2:
            #year1 = form2.find('td')[0]
            tax1 = form2.find('td', {'class': 'numeric'})[0]
            percent1 = form2.find('span', {'class': 'inc'})[0]
            asses1 = form2.find('td', {'class': 'numeric'})[1]
            percent2 = form2.find('span', {'class': 'inc'})[1]
    try:
        q_cleaned = unicode(u' '.join(zestimate.stripped_strings)).encode('utf8').strip()
    except AttributeError:
        q_cleaned = ""
    try:
        r_cleaned = unicode(u' '.join(tax1.stripped_strings)).encode('utf8').strip()
    except AttributeError:
        r_cleaned = ""
    try:
        s_cleaned = unicode(u' '.join(percent1.stripped_strings)).encode('utf8').strip()
    except AttributeError:
        s_cleaned = ""
    try:
        t_cleaned = unicode(u' '.join(asses1.stripped_strings)).encode('utf8').strip()
    except AttributeError:
        t_cleaned = ""
    try:
        u_cleaned = unicode(u' '.join(percent2.stripped_strings)).encode('utf8').strip()
    except AttributeError:
        u_cleaned = ""

    # write one row for this address/price combination
    spamwriter.writerow([a_cleaned,b_cleaned,d_cleaned,e_cleaned,f_cleaned,g_cleaned,h_cleaned,i_cleaned,j_cleaned,k_cleaned,l_cleaned,m_cleaned,n_cleaned,o_cleaned,p_cleaned,coordinates,q_cleaned,r_cleaned,s_cleaned,t_cleaned,u_cleaned])

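The actual code I am working on is very long, so I have included only the fragments related to the specific error I am getting: "UnboundLocalError: local variable 'tax1' referenced before assignment".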

Can someone help me understand how to assign these variables so that their values are still available after the loop has finished?


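Does "tax1" appear earlier in your code? If so, can you post those lines? Also, when you get this error message, are you sure your code has reached the line where you define "tax1"? – duhaime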

Answer


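The elements you are trying to find after zestimate (the tax values and so on) are empty tags in the urllib2 response. In short, loop1 = soup.findAll('table',{'id':'taxHistoryTable'}) will not find anything, because the table's parent is an empty div tag when the request is made with urllib2 or mechanize. The loop body therefore never runs, tax1 is never assigned, and the first later use of tax1 raises the UnboundLocalError. So the code after that point cannot work.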
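To illustrate why the error appears and one way to avoid it, here is a minimal sketch in plain Python with no scraping involved. The function names last_tax_value and last_tax_value_unguarded are hypothetical and only mimic the shape of the loop in the question: a name assigned only inside a loop does not exist if the loop runs zero times, so giving it a default before the loop keeps it defined.

```python
def last_tax_value(rows):
    """Return the last value seen in rows, or "" when rows is empty."""
    tax1 = None  # default, so the name exists even if the loop body never runs
    for row in rows:
        tax1 = row
    return "" if tax1 is None else tax1


def last_tax_value_unguarded(rows):
    """Same loop without a default: fails when rows is empty."""
    for row in rows:
        tax1 = row
    return tax1  # raises UnboundLocalError if the loop never ran


if __name__ == "__main__":
    print(repr(last_tax_value([])))
    try:
        last_tax_value_unguarded([])
    except UnboundLocalError as exc:
        print("unguarded version failed:", exc)
```

Note that a default only hides the symptom here: the CSV would get empty tax columns, because the table is still missing from the downloaded HTML.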

To collect the complete HTML source that you see in the browser, you need a tool that can handle JavaScript and behaves like a real browser, for example Selenium, ghost.py, PhantomJS, etc.
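Once the rendered HTML is available, the table can be read row by row instead of through shared variables. The following is only a sketch, assuming the bs4 package is installed and that the markup matches the table shown in the question. SAMPLE is a shortened copy of that table, and parse_tax_history is a hypothetical helper name. find_all is the bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (two of its rows).
SAMPLE = """
<table id="taxHistoryTable"><thead>
<tr><th class="year">Year</th><th class="numeric property-taxes">Property taxes</th>
<th class="numeric">Change</th><th class="numeric tax-assessment">Tax assessment</th>
<th class="numeric">Change</th></tr></thead><tbody>
<tr class="alt"><td>2011</td><td class="numeric">$489</td>
<td class="numeric"><span class="delta-value"><span class="inc">-81.8%</span></span></td>
<td class="numeric">$34,730</td>
<td class="numeric"><span class="delta-value"><span class="inc">-6.9%</span></span> </td></tr>
<tr class="alt"><td>2009</td><td class="numeric">$969</td>
<td class="numeric"><span class="delta-value">--</span></td>
<td class="numeric">$18,770</td>
<td class="numeric"><span class="delta-value">--</span></td></tr>
</tbody></table>
"""


def parse_tax_history(html):
    """Return one list of five cell texts per body row; [] if the table is absent."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "taxHistoryTable"})
    if table is None:
        return []  # e.g. the JavaScript-filled table is missing from a raw download
    body = table.find("tbody")
    if body is None:
        return []
    rows = []
    for tr in body.find_all("tr"):
        # Join the text fragments of each cell, ignoring surrounding whitespace.
        cells = [" ".join(td.stripped_strings) for td in tr.find_all("td")]
        if len(cells) == 5:  # year, taxes, change, assessment, change
            rows.append(cells)
    return rows


if __name__ == "__main__":
    for row in parse_tax_history(SAMPLE):
        print(row)
```

Reading cells by position within each row avoids relying on the inc and dec span classes, which are absent in rows where the change is shown as "--".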

By the way, since you are trying to scrape Zillow, you had better check out their API before launching your bot. Good luck.