So my spider takes in a list of websites, and it crawls through each one via `start_requests`, which yields a `request` passing the `item` along as `meta`. If I don't know when the spider will finish, when can I yield the item?

The spider then explores all the internal links of a single website and collects all the external links into the `item`. The problem is that I don't know when the spider has finished crawling all the internal links, so I can't `yield` the `item`.
    import csv
    from urllib.parse import urlparse

    import scrapy
    from scrapy import Request
    from scrapy.exceptions import CloseSpider
    from scrapy.linkextractors import LinkExtractor

    # `Links` is the project's item class (import path assumed)
    from ..items import Links


    class WebsiteSpider(scrapy.Spider):
        name = "web"

        def start_requests(self):
            filename = "websites.csv"
            requests = []
            try:
                with open(filename, 'r') as csv_file:
                    reader = csv.reader(csv_file)
                    header = next(reader)  # skip the header row
                    for row in reader:
                        seed_url = row[1].strip()
                        item = Links(base_url=seed_url, on_list=[])
                        request = Request(seed_url, callback=self.parse_seed)
                        request.meta['item'] = item
                        requests.append(request)
                return requests
            except IOError:
                raise CloseSpider("A list of websites is needed")

        def parse_seed(self, response):
            item = response.meta['item']
            netloc = urlparse(item['base_url']).netloc

            # Collect links that point off the seed's domain into the item.
            external_le = LinkExtractor(deny_domains=netloc)
            for external_link in external_le.extract_links(response):
                item['on_list'].append(external_link)

            # Follow links within the seed's domain, passing the same item along.
            internal_le = LinkExtractor(allow_domains=netloc)
            for internal_link in internal_le.extract_links(response):
                # extract_links() returns Link objects, so pass .url to Request
                request = Request(internal_link.url, callback=self.parse_seed)
                request.meta['item'] = item
                yield request
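One common way to know when a per-seed crawl is done is to count outstanding requests: increment a pending counter for each internal link you schedule, decrement it each time a page is parsed, and emit the item when the counter reaches zero. Below is a minimal, Scrapy-free sketch of just that bookkeeping; `CrawlTracker`, `start`, and `page_parsed` are illustrative names, not Scrapy API, and in a real spider this state would live on the spider (or be keyed through `meta`).

```python
class CrawlTracker:
    """Per-seed bookkeeping: count pages still pending, emit when done."""

    def __init__(self):
        self.pending = {}    # seed_url -> number of pages not yet parsed
        self.collected = {}  # seed_url -> set of external links found so far

    def start(self, seed_url):
        # The seed page itself is the first pending request.
        self.pending[seed_url] = 1
        self.collected[seed_url] = set()

    def page_parsed(self, seed_url, external_links, new_internal_requests):
        """Call once per parsed page; returns the finished item or None.

        `external_links` are the off-domain links found on this page;
        `new_internal_requests` is how many internal links were scheduled.
        """
        self.collected[seed_url].update(external_links)
        # This page is done (-1), but it may have scheduled more pages (+n).
        self.pending[seed_url] += new_internal_requests - 1
        if self.pending[seed_url] == 0:
            return {"base_url": seed_url,
                    "on_list": sorted(self.collected[seed_url])}
        return None
```

With this pattern, `parse_seed` would call `page_parsed` at the end and `yield` the item only when it comes back non-`None`, so the item is emitted exactly once, after the last internal page of that seed has been processed.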
Can you show your current spider code? Thanks. – alecxe

@alecxe Edited! –