2012-03-07 149 views
3

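lxml error "IOError: Error reading file" in a Python scraper script

I get this error when parsing Facebook mobile pages with a modified version of the script from "Logging into facebook with python":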

#!/usr/bin/python2 -u 
# -*- coding: utf8 -*- 

facebook_email = "[email protected]" 
facebook_passwd = "YOUR_PASSWORD" 


import cookielib, urllib2, urllib, time, sys 
from lxml import etree 

jar = cookielib.CookieJar() 
cookie = urllib2.HTTPCookieProcessor(jar)  
opener = urllib2.build_opener(cookie) 

headers = { 
    "User-Agent" : "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7", 
    "Accept" : "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,text/png,*/*;q=0.5", 
    "Accept-Language" : "en-us,en;q=0.5", 
    "Accept-Charset" : "utf-8", 
    "Content-type": "application/x-www-form-urlencoded", 
    "Host": "m.facebook.com" 
} 

try: 
    params = urllib.urlencode({'email':facebook_email,'pass':facebook_passwd,'login':'Log+In'}) 
    req = urllib2.Request('http://m.facebook.com/login.php?m=m&refsrc=m.facebook.com%2F', params, headers) 
    res = opener.open(req) 
    html = res.read() 

except urllib2.HTTPError, e: 
    print e.msg 
except urllib2.URLError, e: 
    print e.reason[1] 

def fetch(url): 
    req = urllib2.Request(url,None,headers) 
    res = opener.open(req) 
    return res.read() 

body = unicode(fetch("http://www.facebook.com/photo.php?fbid=404284859586659&set=a.355112834503862.104278.354259211255891&type=1"), errors='ignore') 
tree = etree.parse(body) 
r = tree.xpath('/see_prev') 
print r.text 

When I run the code, I get this error:

$ ./facebook_fetch_coms.py 
Traceback (most recent call last): 
    File "./facebook_fetch_coms_classic_test.py", line 42, in <module> 
    tree = etree.parse(body) 
    File "lxml.etree.pyx", line 2957, in lxml.etree.parse (src/lxml/lxml.etree.c:56230) 
    File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82313) 
    File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82606) 
    File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81645) 
    File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78554) 
    File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74498) 
    File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75389) 
    File "parser.pxi", line 588, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74691) 
IOError: Error reading file '<?xml version="1.0" encoding="utf-8"?> 
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Facebook</title><meta name="description" content="Facebook helps you connect and share with the people in your life." 

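The goal is first to get the link with id=see_prev using lxml, then to use a while loop to open all the comments, and finally to save all the messages to a file. Any help would be greatly appreciated!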

Edit: I'm using Python 2.7.2 on Arch Linux x86_64 with lxml 2.3.3.

Answers

12

Here is your problem:

tree = etree.parse(body) 

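The documentation says: "source is a filename or file object containing XML data." You are passing a string, so lxml treats the text of your HTTP response body as the name of a file to open. No such file exists, so you get an IOError.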

The error message even says "Error reading file" and then shows your XML string as the name of the file it tried to read, which is a big hint about what is going on.

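You probably want etree.XML(), which takes its input from a string. Alternatively, you could just do tree = etree.parse(res) so that lxml reads directly from the HTTP response (the result of opener.open() is a file-like object, and etree.parse() should be perfectly happy to consume it).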
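
A minimal sketch of the difference between the three calls, not the asker's actual script. The `body` bytes are a made-up stand-in for the HTTP response (not real Facebook markup), and `io.BytesIO` stands in for the file-like object returned by opener.open(). The import falls back to the standard library's ElementTree, which offers the same `parse()`, `fromstring()` and `find()` calls, so the sketch also runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # third-party library used in the question
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with the same calls

# Hypothetical stand-in for the response body; an assumption for illustration.
body = b'<html><body><a id="see_prev" href="/more">See previous</a></body></html>'

# 1. parse() treats a plain string as a file name, so passing markup fails.
try:
    etree.parse(body.decode("utf-8"))
    parse_failed = False
except IOError:
    parse_failed = True

# 2. fromstring() parses markup held in memory and returns the root element
#    (XML() serves the same purpose here).
root = etree.fromstring(body)

# 3. parse() accepts a file-like object and returns an ElementTree.
tree = etree.parse(io.BytesIO(body))

# Look the link up by its id attribute rather than by tag name.
link = root.find(".//a[@id='see_prev']")
print("%s %s %s" % (parse_failed, link.get("href"), tree.getroot().tag))
```

Real pages are often not well-formed XML, which may be why the lenient etree.HTML() parser mentioned in the comment below worked better for the asker.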

+0

I dropped parse() in favor of HTML(), and it works much better, thanks. – 2012-03-07 02:33:50