How do I use a simple script together with a Scrapy project?
I have a sample Scrapy project. It is almost the default one. Its folder structure is:
craiglist_sample/
├── craiglist_sample
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── test.py
└── scrapy.cfg
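When I run scrapy crawl craigs -o items.csv -t csv
at the Windows command prompt, it writes the Craigslist items and links to the console.
I want to create an example.py in the main folder and print the items to the Python console from there.
I tried: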
from scrapy import cmdline
cmdline.execute("scrapy crawl craigs".split())
But it writes the same output as the Windows shell does. How can I make it print only the items and links?
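One possible way to get only the items, sketched under assumptions: silence the log with the `--nolog` option and let an item pipeline do the printing. The names `format_item` and `PrintItemPipeline` are hypothetical, and the sketch assumes each item carries the `title` and `link` lists that the spider fills in.

```python
def format_item(item):
    """Return one printable line for an item (hypothetical helper)."""
    # The spider stores the result of .extract(), which is a list of strings,
    # so join each list to get a single value per field.
    title = " ".join(item.get("title", []))
    link = " ".join(item.get("link", []))
    return "%s -> %s" % (title, link)


class PrintItemPipeline(object):
    """Hypothetical pipeline for craiglist_sample/pipelines.py."""

    # Assumed to be enabled in settings.py, for example:
    # ITEM_PIPELINES = {"craiglist_sample.pipelines.PrintItemPipeline": 300}
    # (older Scrapy releases expect a list of class paths instead of a dict).
    def process_item(self, item, spider):
        print(format_item(item))
        # Return the item so later pipelines and feed exports still receive it.
        return item
```

With the pipeline enabled, example.py could call cmdline.execute("scrapy crawl craigs --nolog".split()), so the log lines are suppressed and only the lines printed by the pipeline remain.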
test.py :
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craiglist_sample.items import CraiglistSampleItem


class MySpider(CrawlSpider):
    name = "craigs"
    ## allowed_domains = ["sfbay.craigslist.org"]
    ## start_urls = ["http://sfbay.craigslist.org/npo/"]
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.tr.craigslist.org/search/npo?"]
    ##search\/npo\?s=
    # Follow the "next" button to each results page and parse it.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r's=\d00',),
                               restrict_xpaths=('//a[@class="button next"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"]')
        ## titles = hxs.select("//p[@class='row']")
        items = []
        # Use a separate loop variable so the "titles" list is not shadowed.
        for title in titles:
            item = CraiglistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items
Thank you for your answer, but I need to run it from a script. I found this page: http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script. If I create a .py file in that directory, testspiders seems to work. Could you adapt this spider, https://github.com/scrapinghub/testspiders/blob/master/testspiders/spiders/followall.py, for my spider "MySpider"? – St3114 2015-01-21 10:32:34
The suggested modification has been integrated with your code. Use the script you wrote: "from scrapy import cmdline cmdline.execute("scrapy crawl craigs".split())" – aberna 2015-01-21 10:53:43
@St3114 Did the suggested solution work for you? – aberna 2015-01-23 09:58:08
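For the run-from-a-script request in the comments above, a minimal sketch of another route: launch the crawl as a subprocess, have it write a feed file, and read the items back in example.py. The function names `parse_feed` and `crawl_and_print` and the file name `items.jl` are hypothetical. The sketch also assumes that the `scrapy` command is on the PATH, that the script runs from the folder containing scrapy.cfg, and that a `.jl` output file is written as JSON Lines.

```python
import json
import subprocess


def parse_feed(lines):
    """Turn JSON Lines text (one JSON object per line) into a list of items."""
    items = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            items.append(json.loads(line))
    return items


def crawl_and_print(spider="craigs", feed_path="items.jl"):
    """Run the spider quietly, then print only the scraped items."""
    # Scrapy appends to an existing feed file, so start from an empty one.
    open(feed_path, "w").close()
    # --nolog suppresses the log output; -o names the feed file.
    # Assumption: old Scrapy versions may also need "-t jsonlines" here.
    subprocess.check_call(["scrapy", "crawl", spider, "--nolog", "-o", feed_path])
    with open(feed_path) as feed:
        for item in parse_feed(feed):
            print(item)
```

Calling crawl_and_print() from example.py would then print one dictionary per item, each holding the title and link lists, without the surrounding log output.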