2013-06-06 81 views

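BeautifulSoup: find tags, excluding those with attributes

On this page, which I am trying to scrape, I want to exclude the <td> tags that have attributes.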

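The tag in question is this one:

<td colspan="6" id="more"><a href="http://www.cnc.gov.ar/infotecnica/numeracion/indicativosinter.asp" target="_blank">Click here</a> for a comprehensive area code list for Argentina</td>

I would like to know which function(s) to use to exclude this tag, which has attributes.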

My code to get all the cities and area codes:

from bs4 import BeautifulSoup 
import urllib2 
import re 

url = "http://www.howtocallabroad.com/argentina" 
html_page = urllib2.urlopen(url) 
soup = BeautifulSoup(html_page) 

areatable = soup.find('table',{'id':'codes'}) 
if areatable is None: 
    print "areatable is None" 
else: 
    d = {} 

    def chunks(l, n): 
      return [l[i : i + n] for i in range(0, len(l), n)] 

    all_td = areatable.findAll('td') 
    print all_td 

    li = dict(chunks([i.text for i in all_td], 2)) 
    print li 

However, when I try to build and print li, it raises an exception:

Traceback (most recent call last): 
    File "extract_table.py", line 21, in <module> 
    li = dict(chunks([i.text for i in all_td], 2)) 
ValueError: dictionary update sequence element #30 has length 1; 2 is required 

This is what I get from my call to areatable.findAll('td'):

[ 
<td>Buenos Aires</td>, 
<td>11</td>, 
<td>La Rioja</td>, 
<td>380</td>, 
<td>Salta</td>, 
<td>387</td>, 
<td>Bahia Blanca</td>, 
<td>291</td>, 
<td>Mar del Plata</td>, 
<td>223</td>, 
<td>San Juan</td>, 
<td>264</td>, 
<td>Catamarca<br/></td>, 
<td>383</td>, 
<td>Mendoza</td>, 
<td>261</td>, 
<td>San Luis</td>, 
<td>266</td>, 
<td>Comodoro Rivadavia</td>, 
<td>297</td>, 
<td>Mercedes/Prov. B.A.</td>, 
<td>2324</td>, 
<td>San Nicolas</td>, 
<td>336</td>, 
<td>Concordia</td>, 
<td>345</td>, 
<td>Neuquen</td>, 
<td>299</td>, 
<td>San Rafael</td>, 
<td>260</td>, 
<td>Cordoba</td>, 
<td>351</td>, 
<td>Parana</td>, 
<td>343</td>, 
<td>Santa Fe</td>, 
<td>342</td>, 
<td>Corrientes</td>, 
<td>379</td>, 
<td>Posadas</td>, 
<td>376</td>, 
<td>Santiago del Estero</td>, 
<td>385</td>, 
<td>Formosa</td>, 
<td>370</td>, 
<td>Resistencia</td>, 
<td>362</td>, 
<td>Santo Tome</td>, 
<td>3756</td>, 
<td>Jesus Maria</td>, 
<td>3525</td>, 
<td>Rio Cuarto</td>, 
<td>358</td>, 
<td>Tandil</td>, 
<td>249</td>, 
<td>La Plata</td>, 
<td>221</td>, 
<td>Rosario</td>, 
<td>341</td>, 
<td>Trelew</td>, 
<td>280</td>, 
<td colspan="6" id="more"><a href="http://www.cnc.gov.ar/infotecnica/numeracion/indicativosinter.asp" target="_blank">Click here</a> for a comprehensive area code list for Argentina</td> 
] 

Answer


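The problem is that all_td has an odd length (61 elements), so chunks leaves a final chunk containing a single element, and dict() requires every element to be a pair. That lone chunk is the "element #30" in the traceback. Here is a simple lambda function that checks whether a tag has no attributes; you can use it to keep only the plain <td>stuff</td> tags: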

>>> all_td = filter(lambda x: x.attrs == {}, all_td) 
# all_td now contains [<td>Buenos Aires</td>, <td>11</td>, <td>La Rioja</td>, <td>380</td>, <td>Salta</td>, <td>387</td>, <td>Bahia Blanca</td>, <td>291</td>, <td>Mar del Plata</td>, <td>223</td>, <td>San Juan</td>, <td>264</td>, <td>Catamarca<br/></td>, <td>383</td>, <td>Mendoza</td>, <td>261</td>, <td>San Luis</td>, <td>266</td>, <td>Comodoro Rivadavia</td>, <td>297</td>, <td>Mercedes/Prov. B.A.</td>, <td>2324</td>, <td>San Nicolas</td>, <td>336</td>, <td>Concordia</td>, <td>345</td>, <td>Neuquen</td>, <td>299</td>, <td>San Rafael</td>, <td>260</td>, <td>Cordoba</td>, <td>351</td>, <td>Parana</td>, <td>343</td>, <td>Santa Fe</td>, <td>342</td>, <td>Corrientes</td>, <td>379</td>, <td>Posadas</td>, <td>376</td>, <td>Santiago del Estero</td>, <td>385</td>, <td>Formosa</td>, <td>370</td>, <td>Resistencia</td>, <td>362</td>, <td>Santo Tome</td>, <td>3756</td>, <td>Jesus Maria</td>, <td>3525</td>, <td>Rio Cuarto</td>, <td>358</td>, <td>Tandil</td>, <td>249</td>, <td>La Plata</td>, <td>221</td>, <td>Rosario</td>, <td>341</td>, <td>Trelew</td>, <td>280</td>] 

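Simply put, the lambda function returns True if the tag has no attributes. filter() iterates over every element in all_td and runs the lambda function on each one. If the lambda returns False for a given tag, that tag is left out of the result. A new list is returned, which is why the result is assigned back to all_td.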
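As a small self-contained illustration of that filter() call, here is a sketch that runs against an inline HTML fragment instead of the live page. The fragment is a made-up, shortened stand-in for the real table, and the names HTML, plain_td and same_td are only illustrative. It is written for Python 3, where filter() returns a lazy iterator rather than a list, so the result is wrapped in list(); the question's own code is Python 2.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the table on the real page.
HTML = """
<table id="codes">
  <tr><td>Buenos Aires</td><td>11</td></tr>
  <tr><td>La Rioja</td><td>380</td></tr>
  <tr><td colspan="6" id="more"><a href="#">Click here</a> for a full list</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
all_td = soup.find("table", {"id": "codes"}).find_all("td")

# Keep only the <td> tags whose attribute dict is empty.
# In Python 3, filter() is lazy, so materialise it with list().
plain_td = list(filter(lambda tag: tag.attrs == {}, all_td))

# An equivalent list comprehension; an empty dict is falsy.
same_td = [tag for tag in all_td if not tag.attrs]

print([tag.get_text(strip=True) for tag in plain_td])
# ['Buenos Aires', '11', 'La Rioja', '380']
```

The list comprehension is arguably the more common style in current Python, but both forms select the same tags.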

Now, when chunks is called, the list will have an even number of elements, so no error occurs.
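To see why the even length matters, here is a minimal sketch that uses plain strings in place of the tags. chunks is copied from the question; the sample list texts is a shortened, hypothetical version of the real cell texts, with the stray "Click here" cell at the end. It is written for Python 3.

```python
def chunks(l, n):
    # Split l into consecutive slices of length n (the last one may be shorter).
    return [l[i:i + n] for i in range(0, len(l), n)]

# Hypothetical sample: two city/code pairs plus the one unwanted cell.
texts = ["Buenos Aires", "11", "La Rioja", "380", "Click here for a full list"]

# Odd length: the last chunk has a single element, so dict() rejects it.
try:
    dict(chunks(texts, 2))
    error = None
except ValueError as exc:
    error = str(exc)

# Even length (unwanted cell dropped): every chunk is a (city, code) pair.
codes = dict(chunks(texts[:-1], 2))

print(error)
print(codes)
```

With the unwanted cell removed, every chunk has exactly two elements, which is the shape dict() expects.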
