2015-09-08 109 views
0

Using this guide as a starting point: http://docs.python-guide.org/en/latest/scenarios/scrape/ (scraping a website with Python, requests, and lxml).

from lxml import html 
import requests 
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html') 
tree = html.fromstring(page.text) 

This works as expected. But this...

from lxml import html 
import requests 

page = requests.get('http://www.streetinsider.com/ipo_history.php?type=upcoming') 
tree = html.fromstring(page.text) 

...gives this error:

File "<string>", line unknown 
XMLSyntaxError: line 1: Document is empty 

Using pyquery instead:

from pyquery import PyQuery as pq 
from lxml import etree,html 
import requests 


response = pq(url='http://www.streetinsider.com/ipo_history.php?type=upcoming') 

doc = pq(response.content) 

...throws this error:

File "<string>", line unknown 
XMLSyntaxError: line 1504: Unexpected end tag : h2 

Any help getting the table from the web page would be appreciated.

Answers

2

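Some websites detect and block certain user agents (web robots, for example). The web application behind www.streetinsider.com appears to detect the user agent of python-requests and (passively) block its HTTP requests.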
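
To see what the server is probably reacting to, here is a small sketch that prints the User-Agent requests sends when you don't set one. It only uses requests.utils.default_user_agent() and the default headers of a Session; whether streetinsider.com really keys on this exact string is an assumption.

```python
import requests

# The User-Agent that requests sends when none is specified,
# of the form "python-requests/<version>".
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A new Session starts out with the same default header.
session = requests.Session()
print(session.headers['User-Agent'])
```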

You can set the user agent by passing the headers={'User-Agent': ''} argument to the requests.get call:

page = requests.get('http://www.streetinsider.com/ipo_history.php',
        headers={'User-Agent': 'tester'},
        params={'type': 'upcoming'})
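
If the server does answer with an empty body, lxml fails with the "Document is empty" error from the question. A minimal sketch of guarding against that before parsing follows; the helper names parse_body and fetch_tree, the 10 second timeout, and the 'tester' default are my own choices for illustration, not anything the site or the libraries require.

```python
import requests
from lxml import html


def parse_body(body):
    """Parse an HTML body, failing with a clear message if it is empty."""
    if not body or not body.strip():
        raise ValueError('empty response body; the request may have been blocked')
    return html.fromstring(body)


def fetch_tree(url, params=None, user_agent='tester'):
    """Fetch url with a custom User-Agent and return the parsed lxml tree."""
    page = requests.get(url, headers={'User-Agent': user_agent},
                        params=params, timeout=10)
    page.raise_for_status()  # surface HTTP errors (4xx/5xx) early
    return parse_body(page.text)
```

It would be called as fetch_tree('http://www.streetinsider.com/ipo_history.php', params={'type': 'upcoming'}).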
+0

Were you able to get the table, or does this only show that 'page' is not blank? – Merlin

+0

The code above receives a non-empty HTTP body from the server. – rein
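
As for getting the table out of a non-empty body, here is a rough sketch with lxml. I haven't checked the actual markup of the streetinsider.com page, so the sample HTML below is made up and the XPath simply collects every table row in the document; on the real page you would likely need a narrower expression.

```python
from lxml import html


def table_rows(html_text):
    """Return every <tr> in the document as a list of stripped cell texts."""
    tree = html.fromstring(html_text)
    rows = []
    for tr in tree.xpath('//tr'):
        # Header and data cells are both kept, in document order.
        cells = [cell.text_content().strip() for cell in tr.xpath('./th|./td')]
        if cells:
            rows.append(cells)
    return rows


# Made-up markup standing in for page.text.
sample = """
<html><body>
<table>
  <tr><th>Company</th><th>Date</th></tr>
  <tr><td> ACME </td><td>9/8/2015</td></tr>
</table>
</body></html>
"""

for row in table_rows(sample):
    print(row)
```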