How do I get start_urls in Scrapy to fetch URLs generated by another Python function?

This is my code that gets the item URLs from eBay, i.e. link3:

import urllib2
from bs4 import BeautifulSoup

def url_soup(url):
    # fetch and parse the search results page
    source = urllib2.urlopen(url).read()
    soup = BeautifulSoup(source)
    # build an absolute URL for each item link on the page
    link = soup.select('a.ListItemLink')
    for links in link:
        link3 = 'http://www.ebay.com/%s' % links['href']
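
Note that link3 is overwritten on every pass through the loop, so only the last link survives. A minimal variant that collects every link instead (the name collect_links is hypothetical):

import urllib2
from bs4 import BeautifulSoup

def collect_links(url):
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    # return an absolute URL for every item link on the page
    return ['http://www.ebay.com/%s' % a['href']
            for a in soup.select('a.ListItemLink')]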


Dept={"All Departments":"0","Apparel":"5438","Auto":"91083","Baby":"5427","Beauty":"1085666", 
"Books":"3920","Electronics":"3944","Gifts":"1094765","Grocery":"976759","Health":"976760", 
"Home":"4044","Home Improvement":"1072864","Jwelery":"3891","Movies":"4096","Music":"4104", 
"Party":"2637","Patio":"5428","Pets":"5440","Pharmacy":"5431","Photo Center":"5426", 
"Sports":"4125","Toys":"4171","Video Games":"2636"} 

def gen_url(keyword, domain):
    # only build and fetch a search URL for a known department
    if domain in Dept:
        main_url = ('http://www.ebay.com/search/search-ng.do?search_query=%s'
                    '&ic=16_0&Find=Find&search_constraint=%s') % (keyword, Dept[domain])
        url_soup(main_url)

gen_url('Bags', 'Apparel')
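
Building the query string by hand works for simple keywords, but urllib.urlencode escapes the values (spaces, ampersands and so on); a sketch of the same function using it, behaviour otherwise unchanged:

import urllib

def gen_url(keyword, domain):
    if domain in Dept:
        # urlencode handles escaping, e.g. spaces in the keyword become '+'
        params = urllib.urlencode({
            'search_query': keyword,
            'ic': '16_0',
            'Find': 'Find',
            'search_constraint': Dept[domain],
        })
        url_soup('http://www.ebay.com/search/search-ng.do?' + params)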

Now I want my spider's start_urls to pick up link3 each time. P.S. I'm new to Scrapy!

Answer

You need to define a start_requests() method to generate the spider's start URLs dynamically.

For example, you should have something like this:

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = "my_spider"
    domains = ['Auto']
    departments = {"All Departments": "0", "Apparel": "5438", "Auto": "91083", "Baby": "5427", "Beauty": "1085666",
                   "Books": "3920", "Electronics": "3944", "Gifts": "1094765", "Grocery": "976759", "Health": "976760",
                   "Home": "4044", "Home Improvement": "1072864", "Jwelery": "3891", "Movies": "4096", "Music": "4104",
                   "Party": "2637", "Patio": "5428", "Pets": "5440", "Pharmacy": "5431", "Photo Center": "5426",
                   "Sports": "4125", "Toys": "4171", "Video Games": "2636"}
    keyword = 'Auto'

    allowed_domains = ['ebay.com']

    def start_requests(self):
        # build one search request per configured department
        for domain in self.domains:
            if domain in self.departments:
                url = 'http://www.ebay.com/search/search-ng.do?search_query=%s&ic=16_0&Find=Find&search_constraint=%s' % (self.keyword, self.departments[domain])
                print "YIELDING"
                yield Request(url)

    def parse(self, response):
        print "IN PARSE"
        sel = Selector(response)
        links = sel.xpath('//a[@class="ListItemLink"]/@href')
        for link in links:
            # extract() on a single selector returns the string itself
            href = link.extract()
            yield Request('http://www.ebay.com/' + href, self.parse_data)

    def parse_data(self, response):
        # do your actual crawling here
        print "IN PARSE DATA"

Hope that helps.

Thanks for your help! What happens now is that the Request to the URL in parse_data doesn't work; it just gives me the URLs as output. That means the crawling isn't happening properly, or there's no response from those particular URLs. – user3488659

@user3488659 I've updated the code to show what I'm using now. There is at least one problem: eBay shows a 404 for the search page used in 'start_requests': 'http://www.ebay.com/search/search-ng.do ...'. Are you sure this is the correct search URL for what you need? – alecxe

Oh, I'm sorry about that. I had just signed up, so SO wouldn't let me paste more than two links; I changed them to example.com but still couldn't proceed, so I tried it with ebay. I'm really sorry. It's actually Walmart. Apologies for the inconvenience. – user3488659
