2017-09-08
-1

Multiple lines in web scraping

I am trying to download a table in which a single cell contains multiple items. I have three problems:

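  1. The tracklist column is not generated correctly (it ends up on the row below instead of on the same row as the other fields, as written in the headers);
  2. The songs column (tracklist) is not embedded in a single cell, and I cannot find a way to get rid of the multi-line formatting;
  3. The download stops at the year 1990 with the error

    "UnicodeEncodeError: 'charmap' codec can't encode character '\x91' in position 2886: character maps to <undefined>"

    I found some answers, but I still do not understand how to solve the problem for good. I had the same problem yesterday and, from what I read online, it seems to involve unusual characters that the system does not recognise. Isn't there a definite way to solve it? (I will use the CSV for export to Excel.)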
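The 'charmap' codec in the error message suggests that the file was opened with a legacy Windows code page, which is what `open()` falls back to when no encoding is given on many Windows setups. The following is a minimal sketch, assuming Python 3, of how an explicit encoding avoids the error. The sample text and the file name `ALBUM_demo.csv` are made up for illustration, and `utf-8-sig` is chosen on the assumption that the CSV will be opened in Excel, which uses the byte order mark to detect UTF-8.

```python
import os
import tempfile

# Sample text containing the character named in the traceback (illustrative only).
text = "Album \x91title\x92"

# A legacy code page such as cp1252 has no mapping for U+0091,
# so encoding fails in the same way as in the question.
try:
    text.encode("cp1252")
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False

# UTF-8 can encode every Unicode character; "utf-8-sig" also writes a BOM.
path = os.path.join(tempfile.mkdtemp(), "ALBUM_demo.csv")  # hypothetical file name
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    f.write(text + "\n")

# Reading back with the same encoding returns the original text.
with open(path, encoding="utf-8-sig", newline="") as f:
    round_trip = f.read()
```

In the question's script this would amount to passing `encoding=` to the `open(filename, "w")` call.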

This is the code (after trying what @Anurag suggested):

import codecs
import urllib
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
import unicodecsv as csv

years = list(range(1965, 2016))

for year in years:
    my_urls = ('http://www.hitparadeitalia.it/hp_yenda/lpe' + str(year) + '.htm',)
    my_url = my_urls[0]
    for my_url in my_urls:
        uClient = uReq(my_url)
        html_input = uClient.read()
        uClient.close()
        page_soup = BeautifulSoup(html_input, "html.parser")
        [s.extract() for s in page_soup('script')]
        filename = "ALBUM" + str(year) + ".csv"
        f = open(filename, "w")
        headers = "NN, album, interprete, etichetta, mass, tracklist" + "\n"
        f.write(headers)
        containers = page_soup.findAll("table", {"class": "piccolo"})
        containerr = containers[0].findAll("tr")
        container = containerr[0]
        for container in containerr:
            items = container.findAll("td")
            NN = items[0].text
            album = items[1].text
            interprete = items[2].text
            etichetta = items[3].text
            mass = items[4].text
            tracklist = items[5].text.strip()

            print("NN: " + NN)
            print("album: " + album)
            print("interprete: " + interprete)
            print("etichetta: " + etichetta)
            print("mass: " + mass)
            print("tracklist: " + tracklist)

            f.write(NN + "," + album + "," + interprete + "," + etichetta + "," + mass + "," + tracklist + "\n")
        f.close()

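From the output of the print function I see that:

  1. The first row is generated correctly, with the data attached to each column header;
  2. From the second row onwards, it runs correctly up to the tracklist column, but instead of filling that column it dumps all the text onto the following rows, and then the iteration restarts from the next row with the same error.

The best way to understand the problem is to run the code and look at the output (scrolling down should make the iteration problem clear).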
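One possible way to get rid of the multi-line formatting is to collapse the cell text onto a single line before writing it. The sketch below assumes that the tracks are separated by line breaks in the text returned by `.text`. The helper name `flatten_cell` and the `" / "` separator are illustrative choices, not part of the original script.

```python
def flatten_cell(text, sep=" / "):
    """Join the non-empty lines of a cell into a single line.

    Runs of whitespace inside each line are collapsed to one space.
    """
    parts = [" ".join(line.split()) for line in text.splitlines()]
    return sep.join(part for part in parts if part)


# Made-up cell content standing in for items[5].text.
sample = "  Track one\n\n  Track   two \r\n Track three\n"
flat = flatten_cell(sample)
```

In the loop this would replace `items[5].text.strip()` with `flatten_cell(items[5].text)`.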

Answers

0
...  
for container in containerr: 
    items = container("td") 
    NN = items[0].text.encode('utf-8','ignore') 
    album = items[1].text.encode('utf-8','ignore') 
    interprete = items[2].text.encode('utf-8','ignore') 
    etichetta = items[3].text.encode('utf-8','ignore') 
    mass = items[4].text.encode('utf-8','ignore') 
    tracklist = items[5].text.encode('utf-8','ignore') 

    print("NN: " + NN) 
    print("album: " + album) 
    print("interprete: " + interprete) 
    print("etichetta: " + etichetta) 
    print("mass: " + mass) 
    print("tracklist: " + tracklist) 
... 

You can encode your output to utf-8 or ascii.

+0

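It doesn't work, because it apparently needs something to tell the csv to use utf-8. I tried f = open(filename, "w", 'utf-8'), as suggested in another post, but that doesn't work either.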
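In Python 3 the third positional argument of `open()` is `buffering`, not the encoding, which is a likely reason the call in the comment above fails. The encoding has to be passed by keyword. The sketch below, which assumes the standard-library `csv` module rather than `unicodecsv`, also shows that `csv.writer` quotes cells containing commas or line breaks so that each stays in one cell. The rows and the file name are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up rows; the second one has a comma and a line break inside cells.
rows = [
    ["NN", "album", "interprete", "etichetta", "mass", "tracklist"],
    ["1", "Album, deluxe", "Artist", "Label", "1", "Track one\nTrack two"],
]

path = os.path.join(tempfile.mkdtemp(), "ALBUM_demo.csv")  # hypothetical file name

# Passing the encoding positionally lands in the `buffering` parameter.
try:
    open(path, "w", "utf-8")
    positional_ok = True
except TypeError:
    positional_ok = False

# The keyword form works; newline="" lets the csv module control line endings.
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading the file back yields the same cells, line break included.
with open(path, encoding="utf-8", newline="") as f:
    parsed = list(csv.reader(f))
```

Using the writer instead of concatenating strings with "," also protects album titles that contain commas.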

+0

If you are using 'python2.+', I would use the 'ascii' encoding style.

+0

I am using Python 3.
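Under Python 3, `.encode()` returns a `bytes` object, so concatenating it with a `str` as in the answer's `print` calls raises a `TypeError`. A minimal sketch of the difference, using a made-up value, is shown below. It assumes that the text is kept as `str` throughout and that encoding is left to the file object.

```python
value = "Perché no"  # made-up cell text with a non-ASCII character

# encode() produces bytes in Python 3.
encoded = value.encode("utf-8", "ignore")

# str + bytes is not allowed, so fall back to the unencoded text.
try:
    line = "album: " + encoded
except TypeError:
    line = "album: " + value

# Encoding to ASCII with "ignore" silently drops the accented letter.
ascii_only = value.encode("ascii", "ignore")
```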