
Python, BeautifulSoup: extract text from table cells

I am trying to extract a table from Wikipedia using the code below:

import urllib2 

from bs4 import BeautifulSoup 

file = open('belarus_wiki.txt', 'w') 

url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens" 
page = urllib2.urlopen(url) 

soup = BeautifulSoup(page) 

country = "" 
visa = "" 
notes = "" 

table = soup.find("table", "sortable wikitable") 
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        country = cells[0].findAll(text=True)
        visa = cells[1].findAll(text=True)
        notes = cells[2].find(text=True)

        print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8")

        file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + '\n')

file.close() 

But I get the following error message:

Traceback (most recent call last): 
File "...\belarus_wiki.py", line 27, in <module> 
print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8") 
IndexError: list index out of range 

Could you please tell me how to extract all of the text from these cells?


Always include the *full traceback* of any error you see in Python; that way we don't have to guess where your error is. –


You should link the page being parsed, plus the full stack trace. –


Thanks for the comments. I've added a link to the page as well as the full text of the traceback. – Anton

Answers

3

You can use this:

for line in table.findAll('tr'):
    for l in line.findAll('td'):
        # strip the <sup> reference marker before reading the cell text
        if l.find('sup'):
            l.find('sup').extract()
        print l.getText(), '|',
    print

Here is an excerpt of what it prints:

Romania | Visa required | | 
 Russia | Freedom of movement | | 
 Rwanda | Visa required | Visa is obtained online. | 
 Saint Kitts and Nevis | Visa required | Visa obtainable online. | 
 Saint Lucia | Visa required | | 
 Saint Vincent and the Grenadines | Visa not required | 1 month | 
 Samoa | Visa on arrival !Entry Permit on arrival | 60 days | 
 San Marino | Visa required | | 
 São Tomé and Príncipe | Visa required | Visa is obtained online. | 
 Saudi Arabia | Visa required | | 
 Senegal | Visa required | | 
 Serbia | Visa not required | 30 days | 
 Seychelles | Visa on arrival !Visitor's Permit on arrival | 1 month | 
 Sierra Leone | Visa required | | 
 Singapore | Visa required | May obtain online. | 
 Slovakia | Visa required | | 
 Slovenia | Visa required | | 
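
If only the main text of a single cell is needed (for example, just the "Visa requirement" column), the same sup-stripping idea can be applied to one cell at a time. A small sketch, assuming cells = row.findAll("td") has already been taken from the question's loop; the visa_cell name and the strip() call are my additions:

# Sketch: main text of the visa column, without the [1]-style reference markers.
visa_cell = cells[1]
for sup in visa_cell.findAll('sup'):
    sup.extract()
visa = visa_cell.getText().strip()
print visa.encode("utf-8")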

It works, thank you, but it is not quite what I need. This code extracts from the "Visa requirement" column not only the main information but also a reference number that I don't need. In my first attempt I extracted only the first value, using the construct visa = cells[1].findAll(text=True). Could you tell me how I should adapt your code so that it extracts only the parts of the table cells that I need? – Anton


I've edited my answer; let me know whether it works. – DavidK


It works just the way I need, thank you very much! – Anton

0

Wrong:

print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8") 

Correct:

if notes is None: 
    print country[1].encode("utf-8"), visa[0].encode("utf-8") 
else: 
    print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes.encode("utf-8") 

Full code:

import urllib2

from bs4 import BeautifulSoup

file = open('belarus_wiki.txt', 'w')

url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
page = urllib2.urlopen(url)

soup = BeautifulSoup(page)

country = ""
visa = ""
notes = ""

table = soup.find("table", "sortable wikitable")
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:  # data rows have exactly three cells
        country = cells[0].findAll(text=True)
        visa = cells[1].findAll(text=True)
        notes = cells[2].find(text=True)  # None when the notes cell is empty
        if notes is None:
            print country[1].encode("utf-8"), visa[0].encode("utf-8")
            file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + '\n')
        else:
            print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes.encode("utf-8")
            file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + ',' + notes.encode("utf-8") + '\n')

file.close()

My environment:
OS X 10.10.1
Python 2.7.8
BeautifulSoup 4.1.3
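
As a side note, the None branch could also be folded into the string handling by falling back to an empty string. A minimal sketch; the u'' fallback and the notes_text name are my additions, not part of the answer above:

notes = cells[2].find(text=True)             # None when the notes cell is empty
notes_text = (notes or u'').encode("utf-8")  # empty string instead of a branch
file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + ',' + notes_text + '\n')

Unlike the branch above, this always writes three comma-separated fields, even when the notes cell is empty.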


After correcting the code according to this suggestion, I see the following error message: File "...\belarus_wiki.py", line 30, in <module>: print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes.encode("utf-8") AttributeError: 'ResultSet' object has no attribute 'encode' – Anton


You also need to fix the file.write() lines, the same way as the print lines. – tomute


When I corrected the "print" and "write" lines I saw the same error message again. But if I use "find" instead of "findAll" in the line notes = cells[2].findAll(text=True), the code works; however, for cells that contain both text and a reference it extracts only the first part of the list. Please tell me how I can extract the full text from the table cells. – Anton
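
Following up on that last comment: one way to get the full text of every cell is to call get_text() on the cell itself, after removing the <sup> reference markers as in the first answer, rather than indexing into the lists returned by findAll(text=True) or taking only the first text node from find(text=True); those two habits are what produce the IndexError and the truncated notes, respectively. Below is a minimal sketch along those lines, keeping the Python 2 / urllib2 setup used in the thread; the comma-joined output format and the variable names are my assumptions:

import urllib2
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
soup = BeautifulSoup(urllib2.urlopen(url))

out = open('belarus_wiki.txt', 'w')
table = soup.find("table", "sortable wikitable")
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) != 3:
        continue  # skip header and irregular rows
    for cell in cells:
        for sup in cell.findAll("sup"):
            sup.extract()  # drop the [1]-style reference markers
    # get_text() returns the full visible text of a cell as one unicode string
    country, visa, notes = [cell.get_text().strip() for cell in cells]
    out.write(u','.join([country, visa, notes]).encode("utf-8") + '\n')
out.close()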
