Scrapy - how to extract all blog posts from one category?

I am using Scrapy to extract all the posts from my blog. The problem is that I don't know how to write a rule that reads all the posts in any given blog category.

Example: on my blog, the "Environment Setup" category has 17 posts. I could hard-code the URLs in the Scrapy code, but that is not a very practical approach:
start_urls=["https://edumine.wordpress.com/category/ide-configuration/environment-setup/","https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/","https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]
I have read the related questions posted here in similar posts, like 1, 2, , 4, 5, 6, 7, but I can't seem to find my answer. As you can see, the only difference between the URLs above is the page number. How can I write a rule in Scrapy that reads all the blog posts in a category? Another minor question: how can I configure the spider to crawl my blog so that when I publish a new blog post, the crawler immediately detects it and writes it to a file?
This is the spider class I have so far:
from BlogScraper.items import BlogscraperItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "nextpage"  # unique name used to invoke the spider
    # allowed_domains restricts crawling; it takes bare domains, not full URLs
    allowed_domains = ["edumine.wordpress.com"]
    # start_urls lists the URLs to begin crawling from
    start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/"]
    '''
    start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
                  "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/",
                  "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]
    rules = [
        Rule(SgmlLinkExtractor(
            allow=("https://edumine.wordpress.com/category/ide-configuration/environment-setup/\d"),
            unique=False, follow=True))
    ]
    '''
    # rules must be an iterable (tuple or list) of Rule objects
    rules = (
        Rule(LinkExtractor(allow='https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/'),
             follow=True, callback='parse_page'),
    )

    def parse_page(self, response):
        titles = response.xpath("//h1[@class='entry-title']")
        items = []
        # "a" appends across callbacks; mode "w" would overwrite the file on every page
        with open("itemLog.csv", "a") as f:
            for title in titles:
                item = BlogscraperItem()
                # a relative ".//" query is scoped to this title node only
                item["post_title"] = title.xpath(".//text()").extract()
                # note: absolute "//" queries select from the whole page, not this post
                item["post_time"] = title.xpath("//time[@class='entry-date']//text()").extract()
                item["text"] = title.xpath("//p//text()").extract()
                item["link"] = title.xpath("a/@href").extract()
                items.append(item)
                f.write('post title: {0}\n, post_time: {1}\n, post_text: {2}\n'.format(
                    item['post_title'], item['post_time'], item['text']))
        print("#### \tTotal number of posts =", len(items), "in category ####")
Any help or suggestions to solve this?
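For the pagination part, one possibility is a single `allow` pattern that matches both the category root and its `/page/N/` variants, so a single `Rule` follows every page of the category instead of a hard-coded list. The pattern below is a sketch inferred from the three example URLs above, not a verified rule:

```python
import re

# Hypothetical pattern: category root, optionally followed by page/N/
CATEGORY_PAGE_RE = re.compile(
    r"^https://edumine\.wordpress\.com/category/"
    r"ide-configuration/environment-setup/(?:page/\d+/)?$"
)

urls = [
    "https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
    "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/",
    "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/",
]

# All three example URLs match the one pattern
matches = [bool(CATEGORY_PAGE_RE.match(u)) for u in urls]
```

The same pattern string could then be passed as `allow=` to `LinkExtractor`, replacing the hard-coded `start_urls` list.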
Since WordPress stores the posts in a database, why not use the database instead of Scrapy? –
Because I want the spider to crawl not only my blog but other blogs as well, and I think Scrapy alone is the best choice for that. – Ashish
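On the follow-up question of detecting newly published posts: a common approach is to persist the set of already-scraped post URLs between runs and emit only the unseen ones. This is a minimal sketch under the assumption that post URLs are stable identifiers; the `seen_posts.json` filename is hypothetical:

```python
import json
import os

SEEN_FILE = "seen_posts.json"  # hypothetical state file persisted between runs

def load_seen(path=SEEN_FILE):
    """Return the set of post URLs recorded by previous runs."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def filter_new(post_urls, seen):
    """Keep only URLs that have not been scraped before."""
    return [u for u in post_urls if u not in seen]

def save_seen(seen, path=SEEN_FILE):
    """Persist the seen set for the next run."""
    with open(path, "w") as f:
        json.dump(sorted(seen), f)
```

Re-running the spider on a schedule (e.g. via cron) and applying `filter_new` to the extracted links would then write only newly published posts to the file.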