Scrapy: How to select head and body tags together

I have a crawler that needs to extract some data from the meta tags in the head and some from element tags in the body.

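When I try this

for courses in response.xpath("//html"):

and this

for courses in response.xpath("//head"):

it only extracts data from the meta tags in the <head> tag.

When I try this

for courses in response.xpath("//body"):

it only reads data from the tags inside the HTML <body>...</body> tag.

How do I combine these two selectors? I also tried

for courses in response.xpath("//head | //body"):

but it only returned the 'meta' tags from <head>...</head>, and nothing was extracted from the body.

I also tried this

for courses in response.xpath("//*"):

It works, but it is very inefficient and takes a long time to extract. I am sure there is a more efficient way to do this.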

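Here is the Scrapy code, if it helps...

The first 2 elements under yield (pagetype, pagefeatured) are in the <head>...</head> tag. The last 2 elements (coursetloc, coursetfees) are in the <body>...</body> tag.

Yes, it may look strange, but the website I am scraping also has 'meta' tags inside <body>...</body>.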

from scrapy import Request, Spider


class MySpider(Spider):  # Spider is the current name of the deprecated BaseSpider
    name = "dkcourses"
    start_urls = ['http://www.example.com/scrapy/all-courses-listing']
    allowed_domains = ["example.com"]

    def parse(self, response):
        # "//body" is the selection in question: only tags inside <body> are searched
        for courses in response.xpath("//body"):
            yield {
                'pagetype': ''.join(courses.xpath('.//meta[@name="dkpagetype"]/@content').extract()),
                'pagefeatured': ''.join(courses.xpath('.//meta[@name="dkpagefeatured"]/@content').extract()),
                'coursetloc': ''.join(courses.xpath('.//meta[@name="dkcoursetloc"]/@content').extract()),
                'coursetfees': ''.join(courses.xpath('.//meta[@name="dkcoursetfees"]/@content').extract()),
            }
        # Follow every course link and parse it with the same callback
        for url in response.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse)

Any help is much appreciated. Thanks


2017-02-10 Slyper

Post the URL or the HTML code –

@宏傑李 Posted the code... – Slyper

I mean the URL of the website –

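Use extract_first() to get the first value from extract(); don't use join().
Use [starts-with(@name, "dkn")] to find the meta tags; //meta matches every meta tag in the document, whether it sits in the head or the body.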

In [5]: for meta in response.xpath('//meta[starts-with(@name, "dkn")]'):
   ...:     name = meta.xpath('@name').extract_first()
   ...:     content = meta.xpath('@content').extract_first()
   ...:     print({name: content})

Output:

{'dknpagetype': 'Course'} 
{'dknpagefeatured': ''} 
{'dknpagedate': '2016-01-01'} 
{'dknpagebanner': 'http://www.deakin.edu.au/__data/assets/image/0006/757986/Banner_Cyber-Alt2.jpg'} 
{'dknpagethumbsquare': 'http://www.deakin.edu.au/__data/assets/image/0009/757989/SQ_Cyber1-2.jpg'} 
{'dknpagethumblandscape': 'http://www.deakin.edu.au/__data/assets/image/0007/757987/LS_Cyber1-1.jpg'} 
{'dknpagethumbportrait': 'http://www.deakin.edu.au/__data/assets/image/0008/757988/PT_Cyber1-3.jpg'} 
{'dknpagetitle': 'Graduate Diploma of Cyber Security'} 
{'dknpageurl': 'http://www.deakin.edu.au/course/graduate-diploma-cyber-security'} 
{'dknpagedescription': "Take your understanding of cyber security to the next level with Deakin's Graduate Diploma of Cyber Security and build your capacity to investigate and combat cyber-crime."} 
{'dknpageid': '723503'}


2017-02-10 06:09:26

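Thanks, but I want to store the values in variables to send to Elasticsearch, not just print them on screen, as you can see in my sample code above. – Slyper

Never mind, all I needed to change in my code was 'for courses in response.xpath("//body"):' to 'for courses in response.xpath("//meta"):'. All good now.... – Slyper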
