
EDIT: I tried to avoid just dumping the whole code block on the forum and asking for it to be fixed, but here it is. Boiled down to the failing step, the question is simply: how do I compare an item taken off the queue with the items in a set?

#! /usr/bin/python2.6
import threading
import Queue
import sys
import urllib
import urllib2
from urlparse import urlparse
from lxml.html import parse, tostring, fromstring

THREAD_NUMBER = 1


class Crawler(threading.Thread):

    def __init__(self, queue, mal_urls, max_depth):
        self.queue = queue
        self.mal_list = mal_urls
        self.crawled_links = []
        self.max_depth = max_depth
        self.count = 0
        threading.Thread.__init__(self)

    def run(self):
        while True:
            if self.count <= self.max_depth:
                self.crawled = set(self.crawled_links)
                url = self.queue.get()
                if url not in self.mal_list:
                    self.count += 1
                    self.crawl(url)
                else:
                    #self.queue.task_done()
                    print("Malicious Link Found: {0}".format(url))
                    continue
            else:
                self.queue.task_done()
                break
        print("\nFinished Crawling! Reached Max Depth!")
        sys.exit(2)

    def crawl(self, tgt):
        try:
            url = urlparse(tgt)
            self.crawled_links.append(tgt)
            print("\nCrawling {0}".format(tgt))
            request = urllib2.Request(tgt)
            request.add_header("User-Agent", "Mozilla/5,0")
            opener = urllib2.build_opener()
            data = opener.open(request)

        except: # TODO: write explicit exceptions the URLError, ValueERROR ...
            return

        doc = parse(data).getroot()
        for tag in doc.xpath("//a[@href]"):
            old = tag.get('href')
            fixed = urllib.unquote(old)
            self.queue_links(fixed, url)

    def queue_links(self, link, url):

        if link.startswith('/'):
            link = "http://" + url.netloc + link

        elif link.startswith("#"):
            return

        elif not link.startswith("http"):
            link = "http://" + url.netloc + "/" + link

        if link not in self.crawled_links:
            self.queue.put(link)
            self.queue.task_done()
        else:
            return


def make_mal_list():
    """Open various malware and phishing related blacklists and create a list
    of URLS from which to compare to the crawled links
    """

    hosts1 = "hosts.txt"
    hosts2 = "MH-sitelist.txt"
    hosts3 = "urls.txt"

    mal_list = []

    with open(hosts1) as first:
        for line1 in first:
            link = "http://" + line1.strip()
            mal_list.append(link)

    with open(hosts2) as second:
        for line2 in second:
            link = "http://" + line2.strip()
            mal_list.append(link)

    with open(hosts3) as third:
        for line3 in third:
            link = "http://" + line3.strip()
            mal_list.append(link)

    return mal_list


def main():
    x = int(sys.argv[2])
    queue = Queue.Queue()

    mal_urls = set(make_mal_list())
    for i in xrange(THREAD_NUMBER):
        cr = Crawler(queue, mal_urls, x)
        cr.start()

    queue.put(sys.argv[1])

    queue.join()


if __name__ == '__main__':
    main()

So my web spider first creates a set from the lines of text files containing "malicious links". It then starts a thread, passing it the set of bad links and sys.argv[1]. The started thread then calls the crawl function, which retrieves sys.argv[1] and parses it with lxml.html, and after parsing out all of the links on that initial page, puts them into the queue. The loop continues, each link placed in the queue being removed with self.queue.get(). The corresponding link is then SUPPOSED to be compared against the set of bad links. If a link is found to be bad, the loop is supposed to print that to the screen and then move on to the next link, unless it has already crawled that link.

If there are no problems, it crawls it, parses it, puts its links into the queue, and so on, incrementing a counter each time a link is crawled, until the counter reaches the value passed as sys.argv[2]. The problem is that items which should trigger the if/else on 'if url not in mal_list' are not doing so, and links that have already been placed in the "crawled_already" list are being crawled a 2nd, 3rd, and 4th time anyway.
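For illustration only, here is a minimal sketch (not part of the script above; the normalize_url helper is hypothetical) of normalizing both the blacklist entries and the queued URLs to one canonical string form before doing the 'in' test, on the assumption that the mismatch comes from scheme, case, or trailing-slash differences:

# Sketch only: normalize_url is a hypothetical helper, not part of the script above.
from urlparse import urlparse

def normalize_url(raw):
    """Force a scheme, lowercase the host, and drop any trailing slash."""
    if "://" not in raw:
        raw = "http://" + raw
    parts = urlparse(raw)
    return "http://" + parts.netloc.lower() + parts.path.rstrip("/")

# Build the blacklist and test the queued link in the same normalized form.
mal_list = set(normalize_url(line.strip()) for line in open("hosts.txt"))
url = normalize_url("http://BadSite.com/")
print(url not in mal_list)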

How exactly is it not working? This seems perfectly valid. – mikerobi 2010-11-01 16:12:36

Unless you have misdescribed your problem, the queue is completely irrelevant here; you are having trouble with the test 'a not in x'. Is 'a' a custom class with a modified '__hash__' or '__eq__' method? If not, the code is fine and you need to provide a better example. – katrielalex 2010-11-01 16:15:12

You've got me. That's why I turned to Stack for help :) X is created by a function that opens several txt files, adds the lines of those files to a list, makes a set from that list, and returns the set. I put the test string at the top of one of the text files and then ran the code. The do_something part is actually a web spider function. It keeps spidering correctly instead of calling do_something_else. – Stev0 2010-11-01 16:17:25
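(Aside, not from the thread: a minimal, self-contained sketch showing that membership in a set of strings is an exact byte-for-byte match, so a stray slash or newline is enough for the test to miss.)

import Queue

# Hypothetical example: "in" on a set of strings only matches exactly,
# so a trailing slash or newline on either side makes the test fail.
mal_list = set(["http://badsite.com"])
queue = Queue.Queue()
queue.put("http://badsite.com/")       # note the trailing slash

url = queue.get()
print(url in mal_list)                 # False: the strings differ by one char
print(url.rstrip("/") in mal_list)     # True once both sides agree exactly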

Answers

One detail of this code I don't understand: the queue is marked task_done if any new link is found in self.queue_links, but not as a matter of course in self.crawl. I would have thought this code would make more sense:

def crawl(self, tgt):
    try:
        url = urlparse(tgt)
        self.crawled_links.append(tgt)
        print("\nCrawling {0}".format(tgt))
        request = urllib2.Request(tgt)
        request.add_header("User-Agent", "Mozilla/5,0")
        opener = urllib2.build_opener()
        data = opener.open(request)
        doc = parse(data).getroot()
        for tag in doc.xpath("//a[@href]"):
            old = tag.get('href')
            fixed = urllib.unquote(old)
            self.queue_links(fixed, url)
        self.queue.task_done()
    except: # TODO: write explicit exceptions the URLError, ValueERROR ...
        pass

def queue_links(self, link, url):
    if not link.startswith("#"):
        if link.startswith('/'):
            link = "http://" + url.netloc + link
        elif not link.startswith("http"):
            link = "http://" + url.netloc + "/" + link
        if link not in self.crawled_links:
            self.queue.put(link)
I can't say, though, that I have a complete answer to your question.


Later: the docs for Queue.task_done suggest that task_done calls should be 1:1 with Queue.get calls:

Queue.task_done()

Indicate that a formerly enqueued task is complete. Used by queue consumer threads. For each get() used to fetch a task, a subsequent call to task_done() tells the queue that the processing on the task is complete.

If a join() is currently blocking, it will resume when all items have been processed (meaning that a task_done() call was received for every item that had been put() into the queue).

Raises a ValueError if called more times than there were items placed in the queue.

Are you getting an [uncaught] ValueError exception? It looks like that may be what is happening.
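For reference, a minimal sketch (not the poster's code; the worker body is deliberately empty) of the 1:1 pattern the docs describe, with every get() matched by exactly one task_done() so that join() can return and no ValueError is raised:

import threading
import Queue

def worker(queue):
    # One get() per loop iteration, matched by exactly one task_done(),
    # whether or not processing of that item succeeds.
    while True:
        url = queue.get()
        try:
            pass  # fetch/parse url and queue.put() any new links here
        finally:
            queue.task_done()

queue = Queue.Queue()
t = threading.Thread(target=worker, args=(queue,))
t.setDaemon(True)
t.start()
queue.put("http://example.com")
queue.join()  # returns once every queued item has been marked done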
