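Scrapy recursive link crawler

The crawler starts with a URL on the web (for example, http://python.org), fetches the web page for that URL, and parses all the links on that page into a repository of links. Next, it fetches the content of any URL from the repository it just created, parses the links in this new content into the repository, and repeats this for every link in the repository until it is stopped or a given number of links has been fetched.

How can I do that using Python and Scrapy? I am able to scrape all the links on a web page, but how do I do it recursively, to a given depth?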
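A few remarks:
Below is a simple implementation. It does not fetch internal links (only absolute hyperlinks), it has no error handling (403, 404, pages without links, ...), and it is painfully slow (the multiprocessing module can help a lot in this case).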
# Python 2 code (print statement, urllib2, BeautifulSoup 3)
import BeautifulSoup
import urllib2
import itertools
import random


class Crawler(object):
    """Link crawler that picks the next page at random from the links found so far."""

    def __init__(self):
        self.soup = None                                # Beautiful Soup object
        self.current_page = "http://www.python.org/"    # Current page's address
        self.links = set()                              # Every link fetched so far
        self.visited_links = set()
        self.counter = 0                                # Simple counter for debug purposes

    def open(self):
        # Open url
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every link
        self.soup = BeautifulSoup.BeautifulSoup(html_code)
        page_links = []
        try:
            page_links = itertools.ifilter(  # Only deal with absolute links
                lambda href: href and 'http://' in href,  # skip <a> tags without href
                (a.get('href') for a in self.soup.findAll('a')))
        except Exception:  # Magnificent exception handling
            pass

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random url from the non-visited set (None when nothing is left)
        unvisited = self.links.difference(self.visited_links)
        self.current_page = random.choice(list(unvisited)) if unvisited else None
        self.counter += 1

    def run(self):
        # Crawl 3 webpages (or stop if every url has been fetched)
        while len(self.visited_links) < 3 and self.current_page is not None:
            self.open()

        for link in self.links:
            print link


if __name__ == '__main__':
    C = Crawler()
    C.run()
Output:
In [48]: run BFScrawler.py
0 : http://www.python.org/
1 : http://twistedmatrix.com/trac/
2 : http://www.flowroute.com/
http://www.egenix.com/files/python/mxODBC.html
http://wiki.python.org/moin/PyQt
http://wiki.python.org/moin/DatabaseProgramming/
http://wiki.python.org/moin/CgiScripts
http://wiki.python.org/moin/WebProgramming
http://trac.edgewall.org/
http://www.facebook.com/flowroute
http://www.flowroute.com/
http://www.opensource.org/licenses/mit-license.php
http://roundup.sourceforge.net/
http://www.zope.org/
http://www.linkedin.com/company/flowroute
http://wiki.python.org/moin/TkInter
http://pypi.python.org/pypi
http://pycon.org/#calendar
http://dyn.com/
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
http://www.pygame.org/news.html
http://www.turbogears.org/
http://www.openbookproject.net/pybiblio/
http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
http://support.flowroute.com/forums
http://www.pentangle.net/python/handbook/
http://dreamhost.com/?q=twisted
http://www.vrplumber.com/py3d.py
http://sourceforge.net/projects/mysql-python
http://wiki.python.org/moin/GuiProgramming
http://software-carpentry.org/
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
http://wiki.python.org/moin/WxPython
http://wiki.python.org/moin/PythonXml
http://www.pytennessee.org/
http://labs.twistedmatrix.com/
http://www.found.no/
http://www.prnewswire.com/news-releases/voip-innovator-flowroute-relocates-to-seattle-190011751.html
http://www.timparkin.co.uk/
http://docs.python.org/howto/sockets.html
http://blog.python.org/
http://docs.python.org/devguide/
http://www.djangoproject.com/
http://buildbot.net/trac
http://docs.python.org/3/
http://www.prnewswire.com/news-releases/flowroute-joins-voxbones-inum-network-for-global-voip-calling-197319371.html
http://www.psfmember.org
http://docs.python.org/2/
http://wiki.python.org/moin/Languages
http://sip-trunking.tmcnet.com/topics/enterprise-voip/articles/341902-grandstream-ip-voice-solutions-receive-flowroute-certification.htm
http://www.twitter.com/flowroute
http://wiki.python.org/moin/NumericAndScientific
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
http://freecode.com/projects/pykyra
http://www.xs4all.com/
http://blog.flowroute.com
http://wiki.python.org/moin/PyGtk
http://twistedmatrix.com/trac/
http://wiki.python.org/moin/
http://wiki.python.org/moin/Python2orPython3
http://stackoverflow.com/questions/tagged/twisted
http://www.pycon.org/
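For readers on Python 3, the same idea can be sketched with the standard library only. This is a minimal sketch, not the code above: it swaps the random pick for a breadth-first queue so that a depth limit is easy to enforce, and it resolves relative links with urljoin. The names LinkParser, extract_links, fetch and crawl, and the max_depth and max_pages parameters, are hypothetical and chosen for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return the absolute http(s) links found in html."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop any #fragment
        absolute = urldefrag(urljoin(base_url, href)).url
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def fetch(url):
    """Download a page as text (default fetcher, replaceable for testing)."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth=2, max_pages=10, fetch=fetch):
    """Breadth-first crawl; returns {url: depth} for every page fetched."""
    visited = {}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        try:
            html = fetch(url)
        except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
            continue
        visited[url] = depth
        if depth < max_depth:  # only follow links while below the depth limit
            for link in extract_links(url, html):
                if link not in visited:
                    queue.append((link, depth + 1))
    return visited
```

Calling crawl("http://www.python.org/", max_depth=1, max_pages=5) would fetch the start page and up to four pages linked from it. The queue plays the role of the "repository of links" from the question, and no actual recursion is needed.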
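Here is the main crawl method, written to scrape links from web pages recursively. The method crawls a URL and puts every URL it finds into a buffer. Multiple threads then wait to pop URLs from this global buffer and call the crawl method again.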
# Imports this excerpt relies on. They are assumed from the names used below:
# urlcon is taken to be an alias for urllib.request, BeautifulSoup to come from bs4.
import socket
import traceback
import urllib.request as urlcon
from urllib.error import HTTPError, URLError
from urllib.parse import urlparse, urlunparse

from bs4 import BeautifulSoup

# CRAWLED_URLS and VISITED_URLS (dicts) and CRAWL_BUFFER (a list-like buffer) are
# module-level globals defined elsewhere in the full source; crawl() is a method
# of the crawler class.


def crawl(self, urlObj):
    '''Main function to crawl URL's '''
    try:
        if urlObj.valid and urlObj.url not in CRAWLED_URLS:
            rsp = urlcon.urlopen(urlObj.url, timeout=2)
            hCode = rsp.read()
            soup = BeautifulSoup(hCode, "html.parser")
            links = self.scrap(soup)
            boolStatus = self.checkmax()
            if boolStatus:
                CRAWLED_URLS.setdefault(urlObj.url, "True")
            else:
                return
            for eachLink in links:
                if eachLink not in VISITED_URLS:
                    parsedURL = urlparse(eachLink)
                    if parsedURL.scheme and "javascript" in parsedURL.scheme:
                        # print("***************Javascript found in scheme " + str(eachLink) + "**************")
                        continue
                    # Handle internal URLs: fill in the missing scheme/host from the parent URL
                    try:
                        if not parsedURL.scheme and not parsedURL.netloc:
                            # print("No scheme and host found for " + str(eachLink))
                            newURL = urlunparse(parsedURL._replace(scheme=urlObj.scheme, netloc=urlObj.netloc))
                            eachLink = newURL
                        elif not parsedURL.scheme:
                            # print("Scheme not found for " + str(eachLink))
                            newURL = urlunparse(parsedURL._replace(scheme=urlObj.scheme))
                            eachLink = newURL
                        if eachLink not in VISITED_URLS:  # Check again for internal URLs
                            # print(" Found child link " + eachLink)
                            CRAWL_BUFFER.append(eachLink)
                            with self._lock:
                                self.count += 1
                                # print(" Count is =================> " + str(self.count))
                            boolStatus = self.checkmax()
                            if boolStatus:
                                VISITED_URLS.setdefault(eachLink, "True")
                            else:
                                return
                    except TypeError:
                        print("Type error occurred ")
        else:
            print("URL already present in visited " + str(urlObj.url))
    except socket.timeout as e:
        print("**************** Socket timeout occurred*******************")
    except URLError as e:
        if isinstance(e.reason, ConnectionRefusedError):
            print("**************** Conn refused error occurred*******************")
        elif isinstance(e.reason, socket.timeout):
            print("**************** Socket timed out error occurred***************")
        elif isinstance(e.reason, OSError):
            print("**************** OS error occurred*************")
        elif isinstance(e, HTTPError):
            print("**************** HTTP Error occurred*************")
        else:
            print("**************** URL Error occurred***************")
    except Exception as e:
        print("Unknown exception occurred while fetching HTML code" + str(e))
        traceback.print_exc()
The complete source code and instructions are available at https://github.com/tarunbansal/crawler
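The thread-and-buffer arrangement described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the linked repository: crawl_threaded, get_links, max_pages and num_workers are hypothetical names, and get_links stands for any function that takes a URL and returns the links found on that page. A queue.Queue replaces the global buffer so that worker threads can block while waiting for work.

```python
import queue
import threading


def crawl_threaded(start_url, get_links, max_pages=50, num_workers=4):
    """Crawl with worker threads sharing one queue; returns the set of crawled URLs."""
    todo = queue.Queue()      # shared buffer of URLs waiting to be crawled
    seen = {start_url}        # every URL that has ever been queued
    crawled = set()           # URLs actually handed to get_links
    lock = threading.Lock()   # protects seen and crawled
    todo.put(start_url)

    def worker():
        while True:
            url = todo.get()
            if url is None:   # sentinel: no more work
                todo.task_done()
                break
            try:
                with lock:
                    if len(crawled) >= max_pages:
                        continue          # limit reached, drop the URL
                    crawled.add(url)
                try:
                    links = get_links(url)
                except Exception:         # a failed page simply yields no links
                    links = []
                with lock:
                    for link in links:
                        if link not in seen:
                            seen.add(link)
                            todo.put(link)
            finally:
                todo.task_done()          # runs for every real URL, even on continue

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    todo.join()                           # wait until every queued URL is processed
    for _ in threads:
        todo.put(None)                    # one sentinel per worker
    for thread in threads:
        thread.join()
    return crawled
```

Child links are queued before the parent URL is marked done, so todo.join() cannot return while work is still pending. Reserving a slot in crawled under the lock keeps the number of fetched pages at or below max_pages even with several workers running.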
You should post the relevant code in your answer. – Bram
Works with Python 3 only! – erdomester
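Thanks a ton @georgesl for saving me a day of work. I was actually very confused about this recursive crawling. I am using this code in production, so it would be helpful if you could show how to make it faster with the multiprocessing module –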
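One possible way to speed up the fetching, sketched under assumptions: crawl level by level and fetch all pages of a level concurrently with multiprocessing.pool.ThreadPool, the thread-backed pool that ships with the multiprocessing package. The names crawl_in_batches, get_links, max_depth and pool_size are hypothetical, and get_links is assumed to take a URL and return the links on that page. Since crawling is mostly network-bound, threads are usually enough. The process-based multiprocessing.Pool offers the same map interface but requires get_links to be a picklable top-level function.

```python
from multiprocessing.pool import ThreadPool


def crawl_in_batches(start_url, get_links, max_depth=2, pool_size=8):
    """Fetch each depth level in parallel; returns the set of URLs fetched."""
    visited = set()
    frontier = [start_url]    # URLs to fetch at the current depth
    with ThreadPool(pool_size) as pool:
        for _ in range(max_depth + 1):
            if not frontier:
                break
            visited.update(frontier)
            # pool.map calls get_links on every frontier URL concurrently
            results = pool.map(get_links, frontier)
            next_frontier = set()
            for links in results:
                next_frontier.update(links)
            # Only URLs not fetched yet move on to the next level
            frontier = sorted(next_frontier - visited)
    return visited
```

With max_depth=1 this fetches the start page and every page it links to, and the second level is fetched in parallel rather than one page at a time.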