2017-06-26 102 views

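Importing start_urls from a csv file in Scrapy

I recently started using Scrapy for web scraping, and I have generated a list of URLs that I want to scrape, stored in a newline-separated txt file. This is my spider code: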

import scrapy
import csv
import sys
from realtor.items import RealtorItem

from scrapy.spider import BaseSpider
#from scrapy.selector import HtmlXPathSelector
#from realtor.items import RealtorItem
class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["realtor.com"]

    with open('realtor2.txt') as f:
        start_urls = [url.strip() for url in f.readlines()]


    def parse(self, response):
        #hxs = HtmlXPathSelector(response)
        #sites = hxs.select('//div/li/div/a/@href')
        sites = response.xpath('//a[contains(@href, "/realestateandhomes-detail/")]')
        items = []
        for site in sites:
            print(site.extract())
            item = RealtorItem()
            item['link'] = site.xpath('@href').extract()
            items.append(item)
        return items

My goal right now is just to read the links from the file realtor2.txt and start parsing them, but I get a ValueError saying the request URL is missing a scheme:

File "C:\Users\Ash\Anaconda2\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url 
    raise ValueError('Missing scheme in request url: %s' % self._url) 
ValueError: Missing scheme in request url: 
%FF%FEw%00w%00w%00.%00r%00e%00a%00l%00t%00o%00r%00.%00c%00o%00m%00/%00r%00e%00a%00l%00e%00s%00t%00a%00t%00e%00a%00n%00d%00h%00o%00m%00e%00s%00-%00d%00e%00t%00a%00i%00l%00/%005%000%00-%00M%00e%00n%00o%00r%00e%00s%00-%00A%00v%00e%00-%00A%00p%00t%00-%006%001%000%00_%00C%00o%00r%00a%00l%00-%00G%00a%00b%00l%00e%00s%00_%00F%00L%00_%003%003%001%003%004%00_%00M%005%003%008%000%006%00-%005%008%006%007%007%00%0D%00 
2017-06-25 22:28:35 [scrapy.core.engine] INFO: Closing spider (finished) 

I think there may be a problem with how start_urls is defined, but I don't know how to proceed.


Can you post the first few items of your csv? – Granitosaurus

Answers

"ValueError: Missing scheme in request url" means the URL has no scheme such as http://.
You can use urljoin to avoid this problem.
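
The bytes in the error message (%FF%FE at the start, then a %00 after every character) also suggest that realtor2.txt was saved as UTF-16 with a byte order mark, so adding a scheme alone may not be enough. Below is a minimal sketch, assuming that encoding and assuming http is the right scheme for these links. load_start_urls is a hypothetical helper name, not part of Scrapy, and it simply prepends the scheme rather than using urljoin:

```python
import io


def load_start_urls(path, encoding="utf-16", default_scheme="http"):
    """Read one URL per line, skip blank lines, add a scheme if it is missing.

    Assumptions: the file is UTF-16 with a BOM (as the %FF%FE in the error
    hints) and every bare URL should use `default_scheme`.
    """
    urls = []
    # io.open decodes text on both Python 2 and Python 3; the "utf-16"
    # codec reads the BOM to pick the byte order and then discards it.
    with io.open(path, encoding=encoding) as f:
        for line in f:
            url = line.strip()  # drops the trailing \r\n as well
            if not url:
                continue
            if "://" not in url:
                url = default_scheme + "://" + url
            urls.append(url)
    return urls
```

In the spider, the class attribute could then be set with start_urls = load_start_urls('realtor2.txt'). If the file turns out to be plain UTF-8, passing encoding="utf-8-sig" would be the equivalent choice.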