Python, BeautifulSoup: extracting text from table cells

I am trying to extract a table from Wikipedia using the code below:
import urllib2
from bs4 import BeautifulSoup

file = open('belarus_wiki.txt', 'w')
url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

country = ""
visa = ""
notes = ""

table = soup.find("table", "sortable wikitable")
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        country = cells[0].findAll(text=True)
        visa = cells[1].findAll(text=True)
        notes = cells[2].find(text=True)
        print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8")
        file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + '\n')
file.close()
But I get this error message:
Traceback (most recent call last):
  File "...\belarus_wiki.py", line 27, in <module>
    print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8")
IndexError: list index out of range
Could you tell me how to extract all of the text from these cells?
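Always include the *full traceback* of any error you see in Python. That way we don't have to guess where your error is.
You should link to the page you are parsing and include the full stack trace.
Thanks for the comments. I have added a link to the page and the full text of the traceback. – Anton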
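One possible approach, offered as a rough sketch rather than a definitive fix: the IndexError probably comes from indexing into the lists returned by findAll(text=True). A cell with only one text node has no element [1], and an empty cell has no element [0]. Calling get_text() on each cell avoids the indexing altogether, because it joins every text node in the cell and returns an empty string for an empty cell. The sketch below assumes Python 3 and bs4. SAMPLE_HTML and extract_rows are hypothetical names, and the inline table is a made-up stand-in for the Wikipedia page so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (assumption, not the real page).
SAMPLE_HTML = """
<table class="sortable wikitable">
  <tr><th>Country</th><th>Visa requirement</th><th>Notes</th></tr>
  <tr>
    <td><span class="flagicon"><img alt=""/></span> <a href="#">Albania</a></td>
    <td>Visa required<sup>[1]</sup></td>
    <td></td>
  </tr>
  <tr>
    <td><a href="#">Armenia</a></td>
    <td>Visa not required</td>
    <td>90 days <i>within</i> 180 days</td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return one [country, visa, notes] list per three-cell table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 3:
            # get_text() joins all text nodes in the cell, so nested tags
            # and empty cells are handled without any list indexing.
            rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


if __name__ == "__main__":
    for country, visa, notes in extract_rows(SAMPLE_HTML):
        print(country, visa, notes, sep=",")
```

To use this on the real page, the downloaded HTML would be passed to extract_rows in place of SAMPLE_HTML. If the fields are written to a file, the csv module may be a safer choice than joining with ',' by hand, since a note can itself contain commas.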