I want to run scrapy as a Python script, but I cannot figure out how to set the settings correctly, or how to provide them. I am not sure whether it is a settings problem, but I assume it is. How do I configure the default settings when running scrapy as a Python script?
My setup:
- Python 2.7 x86 (in a virtual environment)
- Scrapy 1.2.1
- Win 7 64-bit
I followed the advice from https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script to get it running. I have some problems with the following part of that advice:
"If you are inside a Scrapy project there are some additional helpers you can use to import those components within the project. You can automatically import your spiders passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings."
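As far as I understand it, the pattern that passage describes looks roughly like this (a minimal sketch; 'followall' is just a placeholder spider name, not one from my script):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() only finds the project settings when the script
# runs inside a project: it looks at the SCRAPY_SETTINGS_MODULE environment
# variable, or locates the nearest scrapy.cfg.
process = CrawlerProcess(get_project_settings())
process.crawl('followall')  # the spider is referenced by its name attribute
process.start()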
So what is meant by "inside a Scrapy project"? Of course I have to import the libraries and have the dependencies installed, but I want to avoid starting the crawling process with scrapy crawl xyz.
Here is myScrapy.py:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field
import os, argparse

# Initialization of directories
projectDir = os.path.dirname(os.path.realpath(__file__))  # __file__ without quotes, so the path is the script's, not the cwd's
generalOutputDir = os.path.join(projectDir, 'output')

parser = argparse.ArgumentParser()
parser.add_argument("url", help="The url which you want to scan", type=str)
args = parser.parse_args()
urlToScan = args.url

# Stripping of the given URL to get only the host + TLD
if "https" in urlToScan:
    urlToScanNoProt = urlToScan.replace("https://", "")
    print "used protocol: https"
elif "http" in urlToScan:  # elif, otherwise an https URL would match here too and stay unstripped
    urlToScanNoProt = urlToScan.replace("http://", "")
    print "used protocol: http"

class myItem(Item):
    url = Field()

class mySpider(CrawlSpider):
    name = "linkspider"
    allowed_domains = [urlToScanNoProt]
    start_urls = [urlToScan]
    rules = (Rule(LinkExtractor(), callback='parse_url', follow=True),)

    def generateDirs(self):
        # create output/<host> if it does not exist yet
        if not os.path.exists(generalOutputDir):
            os.makedirs(generalOutputDir)
        specificOutputDir = os.path.join(generalOutputDir, urlToScanNoProt)
        if not os.path.exists(specificOutputDir):
            os.makedirs(specificOutputDir)
        return specificOutputDir

    def parse_url(self, response):
        for link in LinkExtractor().extract_links(response):
            item = myItem()
            item['url'] = response.url
        # save the page body under output/<host>/<last path segment>.html
        specificOutputDir = self.generateDirs()
        filename = os.path.join(specificOutputDir, response.url.split("/")[-2] + ".html")
        with open(filename, "wb") as f:
            f.write(response.body)
        return CrawlSpider.parse(self, response)
        return item  # note: never reached, the method already returned on the previous line

process = CrawlerProcess(get_project_settings())
process.crawl(mySpider)
process.start()  # the script will block here until the crawling is finished
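Given the argparse setup above, the script is started from the command line like this (the URL is just a placeholder):

python myScrapy.py https://www.example.com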
Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider), the name given in the code? I think it is a settings problem, because in a "normal" scrapy project (where you have to run scrapy crawl xyz) the settings are set, while here the output is 2016-11-18 10:38:42 [scrapy] INFO: Overridden settings: {}
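For what it's worth, I understand that CrawlerProcess also accepts a plain dict of settings, so a standalone script could sidestep get_project_settings() entirely. A minimal sketch, with made-up setting values:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'linkspider (+http://www.example.com)',  # example value only
    'DOWNLOAD_DELAY': 0.5,                                 # example value only
})
process.crawl(mySpider)  # the spider class itself, so no project lookup is needed
process.start()

With that, the "Overridden settings" line in the log should list exactly those keys instead of {}.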
I hope you can understand my question (English is not my native language... ;)). Thanks in advance!
Thanks for your answer! I will try running my script with get_project_settings() inside a scrapy project. – R0rschach
My spider is working now, thank you! – R0rschach