
I am trying to use Scrapy to fetch a list of results from a search engine, based on keywords read from a file.

Below is the error from the Scrapy output:

Redirecting (301) to <GET https://duckduckgo.com/?q=> from <GET https://www.duckduckgo.com/?q=> 
2014-07-18 16:23:39-0500 [wnd] DEBUG: Crawled (200) <GET https://duckduckgo.com/?q=> (referer: None) 

Here is the code:

import re 
import os 
import sys 
import json 

from scrapy.spider import Spider 
from scrapy.selector import Selector 

searchstrings = "wnd.config" 
searchoutcome = "searchResults.json" 


class wndSpider(Spider): 
    name = "wnd" 
    allowed_domains = ['google.com'] 
    url_prefix = [] 
    #start_urls = ['https://www.google.com/search?q='] 
    start_urls = ['https://www.duckduckgo.com/?q='] 
    for line in open(searchstrings, 'r').readlines(): 
     url_prefix = start_urls[0] + line 
     #url = url_prefix[0] + line 


     #f = open(searchstrings 
     #start_urls = [url_prefix] 
     #for f in f.readlines(): 
     #f.close() 


     def parse(self, response): 
      sel = Selector(response) 
      goog_search_list = sel.xpath('//h3/a/@href').extract() 
     #goog_search_list = [re.search('q=(.*&sa',n).group(1) for n in goog_search_list] 
     #if re.search('q=(.*)&sa',n)] 
     #title = sel.xpath('//title/text()').extract() 
     #if len(title)>0: title = tilstle[0] 
     #contents = sel.xpath('/html/head/meta[@name="description"] /@content').extract() 
     #if len(contents)>0: contents = contents[0]   

     ## dump output 
     #with open(searchoutcome, "w") as outfile: 
      #json.dump(searchoutcome ,outfile, indent=4) 

Why do you think this is an error? What were you expecting? The first line shows a 301 HTTP code, which is just a redirect. The next one is 200, which means success. So I don't see an error here. The behavior may not be what you expected, but you haven't told us what you expected, so there is no way to help. ;-) – Achim


'parse()' has the wrong indentation level. – kev

Answer


You need to append the url to start_urls inside the for loop.

start_urls = []
base_url = 'https://www.duckduckgo.com/?q='
for line in open(searchstrings, 'r'):
    url = base_url + line.strip()  # strip the trailing newline from each keyword
    start_urls.append(url)
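
For reference, here is a rough sketch of how that loop could fit into the spider from the question, with parse() dedented to class level (the indentation issue kev points out). It assumes the wnd.config file holds one search term per line; allowed_domains is left out so the DuckDuckGo URLs are not filtered out:

from scrapy.spider import Spider
from scrapy.selector import Selector

searchstrings = "wnd.config"


class wndSpider(Spider):
    name = "wnd"

    # build one start URL per keyword line in the file
    start_urls = []
    base_url = 'https://duckduckgo.com/?q='
    for line in open(searchstrings, 'r'):
        start_urls.append(base_url + line.strip())

    # parse() is a method of the class, not nested inside the for loop
    def parse(self, response):
        sel = Selector(response)
        for href in sel.xpath('//h3/a/@href').extract():
            self.log(href)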

If your keywords contain special characters, try urllib.urlencode.
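
For example (a minimal sketch, assuming Python 2 as in the question's code; the keyword is made up):

import urllib

base_url = 'https://duckduckgo.com/'
keyword = 'wnd.config error log'  # hypothetical keyword containing spaces
# urlencode escapes spaces and other special characters in the query string
url = base_url + '?' + urllib.urlencode({'q': keyword})
# -> https://duckduckgo.com/?q=wnd.config+error+log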
