Scraping data with Python

I want to scrape the content of a website with Python. The text on the page looks like this:

Apple’s stock continued to dominate the news over the weekend, with Barron’s placing it on the top of its favorite 2013 stock list.

But the printed result is wrong:

Apple âs stock continued to dominate the news over the weekend, with Barronâs placing it on the top of its favorite 2013 stock list.

The symbol "'" cannot be displayed. Here is my code:

    # -*- coding: utf-8 -*-

    import sys 
    reload(sys) 
    sys.setdefaultencoding('utf-8') 
    import urllib 
    from lxml import * 
    import urllib 
    import lxml.html as HTML 

    url = "http://www.forbes.com/sites/panosmourdoukoutas/2012/12/09/apple-tops-barrons-10-favorite-stocks-for-2013/?partner=yahootix"
    sock = urllib.urlopen(url) 
    htmlSource = sock.read() 
    sock.close() 

    root = HTML.document_fromstring(htmlSource) 
    contents = ' '.join([x.strip() for x in root.xpath("//div[@class='body']/descendant::text()")]) 

    print contents 

    f = open('C:/Users/yinyao/Desktop/Python Code/data.txt','w') 
    f.write(contents) 
    f.close()

However, after I set the default encoding, printf no longer works. Why? What should I do? I am on Windows, and the default encoding is gbk.

2012-12-18 yinyao

Can you post the code that does the scraping? –

How are you printing this statement? Please post the exact command you run to print it. There is no printf function in Python, is there? – stackoverflowery

Try [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) –

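First, make sure you have read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Second, always use Unicode internally. Decode early, encode late: when you fetch the page, decode it to unicode and handle it as unicode everywhere inside your script. Otherwise your code will crash at random, for example when it meets an unexpected character in the comments of some Chinese web page. Only when you pass the data on to something else (for example, a writable stream) should you encode it, preferably as UTF-8.

Third, use BeautifulSoup 4.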
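The "decode early, encode late" advice can be sketched roughly as follows. This is a minimal illustration in Python 3 syntax (the question's code is Python 2), and it assumes the page is served as UTF-8. The sample bytes and the output path are made up for the example, and no real request is made.

```python
import os
import tempfile

# Bytes as they might arrive from the network: the page text encoded as UTF-8.
# (Sample bytes built locally for illustration.)
raw = "Apple’s stock and Barron’s list".encode("utf-8")

# Decoding with the wrong codec reproduces the garbling from the question:
# each curly apostrophe becomes "â" followed by two invisible control characters.
garbled = raw.decode("latin-1")

# Decode early: turn bytes into text once, with the correct encoding.
text = raw.decode("utf-8")

# Work on text only inside the script.
words = text.split()

# Encode late: name the encoding explicitly when writing, instead of relying
# on the platform default (gbk on the asker's Windows machine).
path = os.path.join(tempfile.mkdtemp(), "data.txt")  # hypothetical output path
with open(path, "w", encoding="utf-8") as f:
    f.write(" ".join(words))

# Read the file back as bytes to confirm it holds UTF-8.
with open(path, "rb") as f:
    written = f.read()
```

With an explicit encoding at both boundaries, the `reload(sys)` / `sys.setdefaultencoding` workaround from the question should not be needed.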

2012-12-18 08:56:19

Thanks! But I don't know when or how to decode the website data to unicode. I have re-edited my question to show my code. Could you give me more advice on it? – yinyao

First, format your question *properly* (http://meta.stackexchange.com/questions/22186/how-do-i-format-my-code-blocks) so the code is readable. Second, BeautifulSoup will handle unicode for you –

Thanks! BeautifulSoup is useful, but I have already fixed it by decoding htmlSource to unicode. – yinyao
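The thread does not show the fix itself, so the following is only a guess at what "decoding htmlSource to unicode" might look like, written in Python 3 with the standard library. The helper name `decode_html`, the sample header value, and the UTF-8 fallback are all assumptions made for this sketch.

```python
from email.message import Message


def decode_html(raw_bytes, content_type, fallback="utf-8"):
    """Decode a response body using the charset named in a Content-Type header.

    Hypothetical helper: falls back to `fallback` when no charset is declared.
    """
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset() or fallback
    # errors="replace" keeps the script running if a stray byte is invalid.
    return raw_bytes.decode(charset, errors="replace")


# Sample body and header, built locally for illustration.
html_source = "<p>Barron’s</p>".encode("utf-8")
decoded = decode_html(html_source, "text/html; charset=UTF-8")
```

The decoded string, rather than the raw bytes, would then be handed to the HTML parser, so the parser does not have to guess the encoding.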
