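I am a **very** new Python programmer, working on a web crawler using urllib and BeautifulSoup. Please ignore the while loop at the top and the increment of `i`; I am only running this test version for a single page, but it will eventually cover a whole set. My problem is that this gets the soup but then raises an error. I am not sure I am collecting the table data correctly, but I hope this code can ignore the links and write just the text to a .csv file. For now I am focused on getting the text to print to the screen correctly.

Beautiful Soup error: list index out of range
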
    line 17, in <module>
        uspc = col[0].string
    IndexError: list index out of range
Suggested fix: change

    for row in table.findAll('tr')[1:]:

to:

    for row in table.findAll('tr')[2:]:

Here is the code:

    import urllib
    from bs4 import BeautifulSoup

    i=125
    while i==125:
        url = "http://www.uspto.gov/web/patents/classification/cpc/html/us" + str(i) + "tocpc.html"
        print url + '\n'
        i += 1
        data = urllib.urlopen(url).read()
        print data
        #get the table data from dump
        #append to csv file
        soup = BeautifulSoup(data)
        table = soup.find("table", width='80%')
        for row in table.findAll('tr')[1:]:
            col = row.findAll('td')
            uspc = col[0].string
            cpc1 = col[1].string
            cpc2 = col[2].string
            cpc3 = col[3].string
            record = (uspc, cpc1, cpc2, cpc3)
            print "|".join(record)
[BeautifulSoup for row loop only runs once?](http://stackoverflow.com/questions/15908604/beautifulsoup-for-row-loop-only-runs-once) – gauden 2013-04-09 18:10:16
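One likely cause of the `IndexError`, assuming some rows in the table hold fewer than four `td` cells (for example a header row built from `th` cells, or a spacer row), is that `col[0]` is read from a short or empty list. Rather than slicing off a fixed number of rows, a minimal sketch could skip any row that is too short and then write the rest to CSV. This is an illustration under stated assumptions, not a tested fix for the USPTO page: it uses Python 3 syntax, whereas the question's code is Python 2. `SAMPLE_HTML` is made-up markup standing in for the real page, and `extract_records`, `write_csv`, and `expected_cells` are hypothetical names.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample standing in for the real page: a header row of <th>
# cells, a spacer row with a single cell, then two data rows.
SAMPLE_HTML = """
<table width="80%">
  <tr><th>USPC</th><th>CPC 1</th><th>CPC 2</th><th>CPC 3</th></tr>
  <tr><td colspan="4">Class 125</td></tr>
  <tr><td><a href="#">125/1</a></td><td>B28D 1/00</td><td>B28D 1/02</td><td>B28D 1/04</td></tr>
  <tr><td>125/2</td><td>B28D 1/06</td><td></td><td>B28D 1/08</td></tr>
</table>
"""


def extract_records(html, expected_cells=4):
    """Return one tuple of cell texts per row that has enough <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", width="80%")
    if table is None:
        return []
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Rows with too few <td> cells (headers, spacers) are skipped;
        # indexing into them is what would raise IndexError.
        if len(cells) < expected_cells:
            continue
        # get_text() yields "" for an empty cell and keeps only link text,
        # whereas .string can be None, which would break "|".join().
        records.append(
            tuple(cell.get_text(strip=True) for cell in cells[:expected_cells])
        )
    return records


def write_csv(records, handle):
    """Write the records to an already open text file handle as CSV."""
    csv.writer(handle).writerows(records)


records = extract_records(SAMPLE_HTML)
for record in records:
    print("|".join(record))

# An in-memory buffer stands in for a real .csv file here.
buffer = io.StringIO()
write_csv(records, buffer)
```

With a real page, the HTML passed to `extract_records` would come from the download step instead of `SAMPLE_HTML`, and the handle would be a file opened with `newline=""`, as the `csv` module documentation recommends.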