Understanding an "invalid literal" error when web scraping

I want to scrape the Billboard Year-End Hot 100 singles from Wikipedia for 1992 through 2014, then clean the data. At the end I get an "invalid literal" error:

import requests
from bs4 import BeautifulSoup

years = range(1992,2015)
yearstext = dict()
# Download and parse one Wikipedia page per year
for year in years:
    t_1992=requests.get('http://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_%(year)s' % {"year":year})
    soup = BeautifulSoup(t_1992.text, "html.parser")
    yearstext[year]=soup

def parse_year(year, ytextdixt): 
    rows = soup.find("table", attrs={"class": "wikitable"}).find_all("tr")[1:] 
    cleaner = lambda r: [r[0].get_text(), int(r[1].get_text()), r[2].get_text(), r[2].find("a").get("href"), r[3].get_text(),r[3].find("a").get("href")] 
    fields = ["band_singer", "ranking", "song", "songurl","titletext","url"] 
    songs = [dict(zip(fields, cleaner(row.find_all("td")))) for row in rows] 

ValueError: invalid literal for int() with base 10: 'Pharrell Williams'

Does anyone know why this happens?

Source

2015-09-22 meow234

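Column 1 of the data doesn't contain the ranking; it contains the band/singer. Looking at the page, though, that doesn't seem to be the case. Maybe there are several tables on the page and you're picking up the wrong one?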

'r[1].get_text()' returns 'Pharrell Williams' in some cases.

'int(r[1].get_text())' then raises this exception.

Re-check the details you are getting back from the URL.
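
To illustrate the point: int() raises ValueError for any string that is not a whole number, so a cell holding an artist name cannot be converted. Here is a minimal sketch of a guard, assuming you would rather get None than an exception for such cells. The helper name to_rank is hypothetical and not part of the original code.

```python
def to_rank(text):
    """Return the cell text as an int, or None if it is not a whole number."""
    try:
        # int() tolerates surrounding whitespace but not letters
        return int(text.strip())
    except ValueError:
        return None


# Sample cell texts: a real ranking, an artist name, and a padded ranking
cells = ["1", "Pharrell Williams", " 42 \n"]
print([to_rank(c) for c in cells])  # [1, None, 42]
```

Printing the rows for which to_rank returns None is a quick way to see which pages or rows have a different cell layout.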

Source

2015-09-22 05:56:10 beviniy

Doing a little experiment, I found that:

from bs4 import BeautifulSoup 
import requests 

year = 1992 
t_1992=requests.get('http://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_%(year)s' % {"year":year}) 
soup = BeautifulSoup(t_1992.content, "html.parser")  # built-in parser; "lxml" also works if lxml is installed
rows = soup.find("table", attrs={"class": "wikitable"}).find_all("tr")[1:] 
rows[0].get_text()

gives:

u'\n1\n"End of the Road"\nBoyz II Men\n'

So using:

rows[0].get_text().strip().split('\n')

gives:

[u'1', u'"End of the Road"', u'Boyz II Men']

That should put you on the right track.
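
Building on the split above, here is a rough sketch of turning one row's text into a record. It assumes every row has exactly three cells in the order ranking, song, artist, as in the 1992 sample; other years may be laid out differently. The function name parse_row_text is hypothetical, and the field names are borrowed from the question.

```python
def parse_row_text(row_text):
    """Split the text of one table row into ranking, song and band_singer."""
    # Drop empty pieces so stray blank lines do not shift the columns
    parts = [p.strip() for p in row_text.strip().split("\n") if p.strip()]
    if len(parts) != 3:
        # Fail loudly on an unexpected layout instead of mislabeling cells
        raise ValueError("expected 3 cells, got %d: %r" % (len(parts), parts))
    ranking, song, band_singer = parts
    return {
        "ranking": int(ranking),
        "song": song.strip('"'),  # remove the surrounding double quotes
        "band_singer": band_singer,
    }


# The row text shown above for 1992
sample = '\n1\n"End of the Road"\nBoyz II Men\n'
print(parse_row_text(sample))
# {'ranking': 1, 'song': 'End of the Road', 'band_singer': 'Boyz II Men'}
```

Unlike the original cleaner, this works on plain text only, so it does not recover the link targets (the "href" values).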

Source

2015-09-22 06:30:30

I didn't notice the pun until I re-read the answer, but I like it so much that I'm leaving it in!
