2011-05-08

Zombie error: error while fetching an HTTP request

I am using Node.js and Zombie.js to fetch URLs in a virtual browser environment.

I am using the following code:

var zombie = require('zombie'), 
jsdom = require('jsdom'), 
my_sandbox = require('sandbox'), 
url = require('url'), 
http = require('http'), 
request = require('request'), 
httpProxy = require('./lib/node-http-proxy'), 
dest = '', 
util = require('util'), 
colors = require('colors'), 
is_host = true; 

var s = new my_sandbox(); 
var browser = new zombie.Browser; 

httpProxy.createServer(9000, 'localhost').listen(8000); 

function zombieFetching(page) {
    browser.visit(page, { debug: false },
        function(err, browser, status) {
            if(err) {
                console.log('There is an error. Fix it');
                throw(err.message);
            } else {
                console.log('Browser visit successful');
            }
        });
}

var server = http.createServer(function (req, res) {
    var pathname = '';

    if(is_host) {
        dest = req.url.substr(0, req.url.length);
        pathname = dest;
        is_host = false;
    } else {
        pathname = req.url.substr(0, req.url.length);
        if(pathname.charAt(0) == "/") {
            console.log('new request');
            console.log(pathname);
            pathname = dest + pathname;
        }
    }

    request.get({uri: pathname}, function (err, response, html) {
        console.log('The pathname is:::::::::: ' + pathname);
        zombieFetching(pathname);
        res.end(html);
    });
});

server.listen(9000); 

I see the following error when I try to fetch the URL "www.yahoo.com":

/home/seed/Desktop/Cloud project/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:62 
       throw(e); 
    ^
Error: undefined: Invalid character in tag name: �� 
    at Object.createElement (/home/seed/Desktop/Cloud project/node_modules/zombie/node_modules/jsdom/lib/jsdom/level1/core.js:1174:13) 
    at TreeBuilder.createElement (/home/seed/Desktop/Cloud project/node_modules/zombie/node_modules/html5/lib/html5/treebuilder.js:29:25) 
    at TreeBuilder.insert_element_normal (/home/seed/Desktop/Cloud project/node_modules/zombie/node_modules/html5/lib/html5/treebuilder.js:61:21) 
    at TreeBuilder.insert_element (/home/seed/Desktop/Cloud project/node_modules/zombie/node_modules/html5/lib/html5/treebuilder.js:52:15) 
    at Object.startTagOther (/home/seed/Desktop/Cloud project/node_modules/zombie/node_modules/html5/lib/html5/parser/in_body_phase.js:483:12) 
    at Object.processStartTag (/home/seed/Desktop/Cloud project/node_modules/zombie/node_modules/html5/lib/html5/parser/phase.js:43:44) 
    at EventEmitter.do_token (/home/seed/Desktop/Cloud project/node_modules/zombie/node_modules/html5/lib/html5/parser.js:94:20) 
    at EventEmitter.<anonymous> (/home/seed/Desktop/Cloud project/node_modules/zombie/node_modules/html5/lib/html5/parser.js:112:30) 
    at EventEmitter.emit (events.js:64:17) 
    at EventEmitter.emitToken (/home/seed/Desktop/Cloud project/node_modules/zombie/node_modules/html5/lib/html5/tokenizer.js:84:7) 

In addition, the log output is as follows:

The pathname is:::::::::: http://www.yahoo.com/ 
The pathname is:::::::::: http://l1.yimg.com/a/i/ww/news/2011/05/06/zuckhouse-sm.jpg 
The pathname is:::::::::: http://l1.yimg.com/a/i/ww/news/2011/05/07/cable-sm.jpg 
The pathname is:::::::::: http://l.yimg.com/a/a/1-/flash/promotions/yahoo/081120/70x50iltlb_2.jpg 

Browser visit successful 

Browser visit successful 

Browser visit successful 

Browser visit successful 

The pathname is:::::::::: http://l.yimg.com/a/i/vm/2011may/bird74.jpg 
The pathname is:::::::::: http://www.yahoo.com/jserror?ad=1&target=cms&data=FPAD 

As far as I can tell, the first four GET requests succeeded. However, I don't know why Zombie is fetching this invalid request:

"http://www.yahoo.com/jserror?ad=1&target=cms&data=FPAD" 

Also, what causes the "Invalid character in tag name" error?

Thanks, Sony


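If I request the URL http://unixhelp.ed.ac.uk/CGI/man-cgi?grep, the error log is: throw(err.message); ^ Could not load resource http://unixhelp.ed.ac.uk/favicon.ico, got 404. That URL is not valid, and I am not sure why it is being requested. I am not sure whether this is a bug in Node/Zombie or a mistake in my code. – sony 2011-05-08 23:03:54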

Answer


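favicon.ico is always requested by browsers, and Zombie is simulating that behavior. It is not part of the HTTP protocol; it is simply something browsers tend to do so that they can show that nice icon in the address bar for sites that provide one. You may be seeing the jserror? request because Zombie received a 301 (redirect) to that URL at some point and followed it, or because some other element on the page references it. By default, Zombie's handler tries to follow everything, which is why you are fetching images and so on, just like a browser.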
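To illustrate where the favicon.ico request in the comment comes from, here is a minimal sketch of how a browser-like client could derive the conventional favicon location from a page URL. The helper name `faviconUrlFor` is hypothetical and is not part of Zombie; it only assumes the standard `URL` class available in current Node.js.

```javascript
// Sketch only: `faviconUrlFor` is a hypothetical helper, not a Zombie API.
// It mirrors the browser convention of looking for /favicon.ico at the
// root of the site that served the page.
function faviconUrlFor(pageUrl) {
  // Resolving an absolute path against the page URL keeps the scheme and
  // host, and discards the page's own path and query string.
  return new URL('/favicon.ico', pageUrl).href;
}

console.log(faviconUrlFor('http://unixhelp.ed.ac.uk/CGI/man-cgi?grep'));
// prints "http://unixhelp.ed.ac.uk/favicon.ico"
```

If the site has no icon at that location, the request returns a 404, which matches the message quoted in the comment.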

If you set browser.debug = true, I think you will get more information than your own log statements give you.
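Because the log shows image URLs (the .jpg entries) being passed to the same handler as the page itself, one thing that may be worth trying is to hand only HTML responses to Zombie. The sketch below is an assumption, not a confirmed diagnosis of the "Invalid character in tag name" error. The helper name `isHtmlResponse` is hypothetical, and it assumes the response headers object uses lower-cased header names, as Node's `http` module does.

```javascript
// Sketch only: `isHtmlResponse` is a hypothetical helper.
// It inspects a headers object (lower-cased keys, as Node's http module
// provides) and reports whether the body is declared as an HTML document.
function isHtmlResponse(headers) {
  // Tolerate a missing headers object or a missing Content-Type header.
  var contentType = String((headers && headers['content-type']) || '');
  // Drop parameters such as "; charset=utf-8" before comparing.
  var mediaType = contentType.split(';')[0].trim().toLowerCase();
  return mediaType === 'text/html' || mediaType === 'application/xhtml+xml';
}

console.log(isHtmlResponse({ 'content-type': 'text/html; charset=utf-8' }));
console.log(isHtmlResponse({ 'content-type': 'image/jpeg' }));
```

In the question's request.get callback, the call to zombieFetching(pathname) could then be made only when response exists and isHtmlResponse(response.headers) is true, so that binary image data is never handed to the HTML parser.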