Writing text from BeautifulSoup to a file

I want to parse the currency table at http://en.wikipedia.org/wiki/List_of_circulating_currencies. The problem is that I am not getting the output in the right format. The output I want has the form:

country currency

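When a country has several currencies, each extra currency should go either on the next line or after the previous currency, separated by a space. This is how far I have got: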

from bs4 import BeautifulSoup
import urllib2

url = "http://en.wikipedia.org/wiki/List_of_circulating_currencies"
soup = BeautifulSoup(urllib2.urlopen(url).read())
i = 1
fr = open("out.txt", "w")
for row in soup.findAll('table')[0].findAll('tr'):
    # Skip the header row
    if i == 1:
        i += 1
        continue

    temp_row = row.findAll('td')
    print len(temp_row)
    # Handling the case for multiple currencies
    if len(temp_row) == 5:
        ans = row.findAll('td')[0].findAll('a')
        if len(ans) == 0:
            ans = row.findAll('td')[0].contents
        else:
            ans = row.findAll('td')[0].findAll('a')[0].contents
        fr.write("  " + str(ans) + "\n")
    else:
        first = row.findAll('td')[0].findAll('a')[0].contents

        ans = row.findAll('td')[1].findAll('a')
        if len(ans) == 0:
            ans = row.findAll('td')[1].contents
        else:
            ans = row.findAll('td')[1].findAll('a')[0].contents
        # print first
        fr.write(str(first) + " " + str(ans) + "\n")

The problem is that when I build the string from contents[0] instead of contents, I get:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 15: ordinal not in range(128)

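Besides that error, I am also not getting the output in the exact format. The file out.txt has to be read by another program written in VB, so I want the file format to be as close as possible to the one specified. Please also help me clean up the code.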
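The UnicodeEncodeError quoted above is what the ASCII codec raises when it meets a character outside the 0-127 range. In Python 2, str() on a unicode value performs that ASCII encoding implicitly. The sketch below reproduces the same failure explicitly in Python 3 (an assumption, since the question uses Python 2). The currency name is only an illustrative guess at the kind of string involved, not necessarily the one that failed.

```python
# Illustrative name only: U+00F3 (o with acute accent) sits at index 15.
name = "Costa Rican col\u00f3n"

try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    # Index of the first character the ASCII codec cannot represent.
    failing_index = exc.start

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_bytes = name.encode("utf-8")
```

Encoding explicitly with a codec that covers the data avoids relying on the implicit ASCII default.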

Update:

When I use encode I get the following error:

File "D:/scrap.py", line 33, in <module>
    first = first.encode('ascii', 'ignore')
File "C:\Python27\lib\site-packages\bs4\element.py", line 992, in encode
    u = self.decode(indent_level, encoding, formatter)
File "C:\Python27\lib\site-packages\bs4\element.py", line 1056, in decode
    indent_space = (' ' * (indent_level - 1))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
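The traceback shows encode running inside bs4/element.py, which suggests that first was a BeautifulSoup element rather than a plain string, and that the second positional argument ('ignore') was treated as indent_level. One possible way around this, assuming only the visible text of each cell is needed, is to call get_text() first and encode the resulting string. The inline HTML is made up for illustration, and the sketch assumes Python 3 with bs4 installed.

```python
from bs4 import BeautifulSoup

# Made-up, single-row stand-in for the Wikipedia table.
html = (
    "<table><tr>"
    "<td><a href='/wiki/Costa_Rica'>Costa Rica</a></td>"
    "<td><a href='/wiki/Colon'>Costa Rican col\u00f3n</a></td>"
    "</tr></table>"
)

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# get_text() returns a plain string, so str methods behave as expected.
country = cells[0].get_text(strip=True)
currency = cells[1].get_text(strip=True)

line = country + " " + currency
encoded = line.encode("utf-8")  # str.encode, not the bs4 element's encode
```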

Update: adding the following lines at the beginning made it work:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
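Changing the interpreter-wide default this way affects every implicit conversion in the program, and sys.setdefaultencoding does not exist in Python 3. A narrower alternative, sketched here under the assumption that only out.txt needs to be UTF-8, is to open the file with an explicit encoding. The rows and the temporary path are placeholders, and the code is written for Python 3.

```python
import io
import os
import tempfile

# Placeholder rows; in the real script these would come from the parsed table.
rows = [
    ("Costa Rica", "Costa Rican col\u00f3n"),
    ("Russia", "Russian ruble"),
]

# Temporary location used only so the sketch does not overwrite a real file.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")

# The encoding is stated once, where the file is opened.
with io.open(out_path, "w", encoding="utf-8") as fr:
    for country, currency in rows:
        fr.write(country + " " + currency + "\n")

# Read the file back to confirm the text survived the round trip.
with io.open(out_path, "r", encoding="utf-8") as fr:
    lines = fr.read().splitlines()
```

The program that reads the file then has to decode it as UTF-8 as well.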

Source

2014-01-14 user2179293

I think we are missing something: somewhere in your code, it seems to be doing arithmetic on your string.

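How do you handle the case where a currency name consists of two words, such as "Russian ruble"? For Russia, the output string would be (according to your spec) "Russia Russian ruble". How will the reader know whether that is one currency or two? Wouldn't a CSV file be a better choice?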

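@SteinarLima Good point. It looks like I will have to add the currency code to the list. But first I need to parse the page correctly. Thanks for the help. Do you have a solution for this problem? – user2179293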
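The CSV suggestion in the comments can be sketched with the standard csv module: each country/currency pair becomes one row, so a multi-word name stays inside a single field. This is only an illustration in Python 3. The "Exampleland" rows are hypothetical placeholders for a country with two currencies, and an in-memory buffer stands in for out.txt.

```python
import csv
import io

rows = [
    ("Russia", "Russian ruble"),
    # Hypothetical country with two currencies: one row per currency.
    ("Exampleland", "Example dollar"),
    ("Exampleland", "Example franc"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["country", "currency"])  # header row
writer.writerows(rows)
csv_text = buffer.getvalue()

# Reading it back shows that field boundaries are unambiguous.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

With one row per pair, the reader no longer has to guess where a currency name ends.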

If you are fine with having UTF-8 characters in your file, you can use encode('utf8') to convert the Unicode object to a UTF-8 encoded string.
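A small sketch of that approach, assuming Python 3 and a file opened in binary mode so the encoded bytes are written unchanged. The sample text and the temporary path are placeholders.

```python
import os
import tempfile

# Placeholder line; the real text would come from the parsed table.
text = "Costa Rica Costa Rican col\u00f3n\n"

# encode() turns the Unicode string into UTF-8 bytes.
data = text.encode("utf8")

out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(out_path, "wb") as fr:  # binary mode: bytes go to disk as-is
    fr.write(data)

with open(out_path, "rb") as fr:
    round_trip = fr.read().decode("utf8")
```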

Source

2014-01-14 16:06:34

To check whether you are getting reliable results, try ignoring Unicode for a moment and see what the output looks like.

first = first.encode('ascii', 'ignore') 
ans = ans.encode('ascii', 'ignore') 

print first + " " + ans
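Note that the 'ignore' error handler silently drops every character ASCII cannot represent, so the output can differ from the source text. A quick Python 3 illustration, using a made-up sample name:

```python
# Sample name chosen only because it contains a non-ASCII character.
name = "Costa Rican col\u00f3n"

# 'ignore' removes the accented letter entirely.
ascii_ignored = name.encode("ascii", "ignore")

# 'replace' substitutes a question mark, which at least marks the loss.
ascii_replaced = name.encode("ascii", "replace")
```

This is useful as a diagnostic step, but a file written this way loses information.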

Source

2014-01-14 14:12:57

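See my updated question – user2179293

This is the full code – user2179293

I changed first to str(first) and ans to str(ans). The unsupported operand type error went away, but the earlier error is still there – user2179293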
