
Python, BeautifulSoup: extract text from table cells

I am trying to extract a table from Wikipedia using the code below:

import urllib2 

from bs4 import BeautifulSoup 

file = open('belarus_wiki.txt', 'w') 

url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens" 
page = urllib2.urlopen(url) 

soup = BeautifulSoup(page) 

country = "" 
visa = "" 
notes = "" 

table = soup.find("table", "sortable wikitable") 
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        country = cells[0].findAll(text=True)
        visa = cells[1].findAll(text=True)
        notes = cells[2].find(text=True)

        print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8")

        file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + '\n')

file.close() 

But I get the following error message:

Traceback (most recent call last): 
File "...\belarus_wiki.py", line 27, in <module> 
print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8") 
IndexError: list index out of range 

Could you please tell me how to extract all of the text from these cells?


Always include the *full traceback* of any error you see in Python; that way we don't have to guess where your error is. –


You should link the page being parsed, plus the full stack trace. –


Thanks for the comments. I've added a link to the page as well as the full text of the traceback. – Anton

Answers

3

You can use this:

for line in table.findAll('tr'):
    for l in line.findAll('td'):
        # strip the <sup> reference marker before reading the cell text
        if l.find('sup'):
            l.find('sup').extract()
        print l.getText(), '|',
    print

Here is an excerpt of what it prints:

Romania | Visa required | | 
 Russia | Freedom of movement | | 
 Rwanda | Visa required | Visa is obtained online. | 
 Saint Kitts and Nevis | Visa required | Visa obtainable online. | 
 Saint Lucia | Visa required | | 
 Saint Vincent and the Grenadines | Visa not required | 1 month | 
 Samoa | Visa on arrival !Entry Permit on arrival | 60 days | 
 San Marino | Visa required | | 
 São Tomé and Príncipe | Visa required | Visa is obtained online. | 
 Saudi Arabia | Visa required | | 
 Senegal | Visa required | | 
 Serbia | Visa not required | 30 days | 
 Seychelles | Visa on arrival !Visitor's Permit on arrival | 1 month | 
 Sierra Leone | Visa required | | 
 Singapore | Visa required | May obtain online. | 
 Slovakia | Visa required | | 
 Slovenia | Visa required | | 
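
If only the main text of a single cell is needed (for example, just the "Visa requirement" column), the same sup-stripping idea can be applied to one cell at a time. A small sketch, assuming cells = row.findAll("td") has already been taken from the question's loop; the visa_cell name and the strip() call are my additions:

# Sketch: main text of the visa column, without the [1]-style reference markers.
visa_cell = cells[1]
for sup in visa_cell.findAll('sup'):
    sup.extract()
visa = visa_cell.getText().strip()
print visa.encode("utf-8")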

It works, thank you, but it is not quite what I need. This code extracts from the "Visa requirement" column not only the main information but also a reference number that I don't need. In my first attempt I extracted only the first value, using the construct visa = cells[1].findAll(text=True). Could you tell me how I should adapt your code so that it extracts only the parts of the table cells that I need? – Anton


I've edited my answer; let me know whether it works. – DavidK


It works just the way I need, thank you very much! – Anton

0

Wrong:

print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8") 

Correct:

if notes is None: 
    print country[1].encode("utf-8"), visa[0].encode("utf-8") 
else: 
    print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes.encode("utf-8") 

Full code:

import urllib2

from bs4 import BeautifulSoup

file = open('belarus_wiki.txt', 'w')

url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
page = urllib2.urlopen(url)

soup = BeautifulSoup(page)

country = ""
visa = ""
notes = ""

table = soup.find("table", "sortable wikitable")
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:  # data rows have exactly three cells
        country = cells[0].findAll(text=True)
        visa = cells[1].findAll(text=True)
        notes = cells[2].find(text=True)  # None when the notes cell is empty
        if notes is None:
            print country[1].encode("utf-8"), visa[0].encode("utf-8")
            file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + '\n')
        else:
            print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes.encode("utf-8")
            file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + ',' + notes.encode("utf-8") + '\n')

file.close()

My environment:
OS X 10.10.1
Python 2.7.8
BeautifulSoup 4.1.3
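
As a side note, the None branch could also be folded into the string handling by falling back to an empty string. A minimal sketch; the u'' fallback and the notes_text name are my additions, not part of the answer above:

notes = cells[2].find(text=True)             # None when the notes cell is empty
notes_text = (notes or u'').encode("utf-8")  # empty string instead of a branch
file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + ',' + notes_text + '\n')

Unlike the branch above, this always writes three comma-separated fields, even when the notes cell is empty.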


After correcting the code according to this suggestion, I see the following error message: File "...\belarus_wiki.py", line 30, in <module>: print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes.encode("utf-8") AttributeError: 'ResultSet' object has no attribute 'encode' – Anton


You also need to fix the file.write() lines, the same way as the print lines. – tomute


When I corrected the "print" and "write" lines I saw the same error message again. But if I use "find" instead of "findAll" in the line notes = cells[2].findAll(text=True), the code works; however, for cells that contain both text and a reference it extracts only the first part of the list. Please tell me how I can extract the full text from the table cells. – Anton
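
Following up on that last comment: one way to get the full text of every cell is to call get_text() on the cell itself, after removing the <sup> reference markers as in the first answer, rather than indexing into the lists returned by findAll(text=True) or taking only the first text node from find(text=True); those two habits are what produce the IndexError and the truncated notes, respectively. Below is a minimal sketch along those lines, keeping the Python 2 / urllib2 setup used in the thread; the comma-joined output format and the variable names are my assumptions:

import urllib2
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
soup = BeautifulSoup(urllib2.urlopen(url))

out = open('belarus_wiki.txt', 'w')
table = soup.find("table", "sortable wikitable")
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) != 3:
        continue  # skip header and irregular rows
    for cell in cells:
        for sup in cell.findAll("sup"):
            sup.extract()  # drop the [1]-style reference markers
    # get_text() returns the full visible text of a cell as one unicode string
    country, visa, notes = [cell.get_text().strip() for cell in cells]
    out.write(u','.join([country, visa, notes]).encode("utf-8") + '\n')
out.close()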
