
I am trying to use the script below. Why does it not retrieve the list of URLs for this site? It works for other sites, but here I cannot get a URL list.

At first I thought the problem was that robots.txt disallowed crawling, but no error is returned when I run the script.

import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize

url = "https://www.danmurphys.com.au"

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt

urls = [url]     # queue of pages still to crawl
visited = [url]  # pages already seen

while len(urls) > 0:
    try:
        br.open(urls[0])
        urls.pop(0)
        for link in br.links():
            # link.base_url is the page's base URL and link.url is usually
            # just the page name, so join them to get an absolute URL and
            # normalise it to "http://" + hostname + path.
            new_url = urlparse.urljoin(link.base_url, link.url)
            b1 = urlparse.urlparse(new_url).hostname
            b2 = urlparse.urlparse(new_url).path
            new_url = "http://" + b1 + b2

            # Only queue links that have not been seen and stay on the same host.
            if new_url not in visited and urlparse.urlparse(url).hostname in new_url:
                visited.append(new_url)
                urls.append(new_url)
                print new_url
    except:
        print "error"
        urls.pop(0)

Answer


You need to scrape this URL with something else, for example Scrapy, scrapyJS, or PhantomJS, because the mechanize library does not execute JavaScript. If you run:

r = br.open(urls[0]) 
html = r.read() 
print html 

you will see this output:

<noscript>Please enable JavaScript to view the page content.</noscript>
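By way of illustration only (not part of the original answer), here is a minimal sketch of the same idea using Selenium's PhantomJS driver to load the JavaScript-rendered page and BeautifulSoup to pull out same-site links; the driver choice and the parsing step are assumptions, and a Scrapy-based setup with a JS renderer would work just as well:

# Minimal sketch (assumptions: selenium installed and a PhantomJS binary on PATH).
from urlparse import urljoin, urlparse

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.danmurphys.com.au"

driver = webdriver.PhantomJS()   # headless browser that executes the page's JavaScript
driver.get(url)
html = driver.page_source        # HTML after JavaScript has rendered the content
driver.quit()

soup = BeautifulSoup(html, "html.parser")
host = urlparse(url).hostname
for a in soup.find_all("a", href=True):
    new_url = urljoin(url, a["href"])        # resolve relative links
    if urlparse(new_url).hostname == host:   # keep links on the same site only
        print new_url

The links collected this way could then feed the same visited/urls queue as in the question's script.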