2017-08-01 84 views

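Scrapy results are duplicated

I am trying to use a link extractor to get the song names from this site, https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html, but the results come back duplicated.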

import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RedditSpider(CrawlSpider):
    name = 'pagalworld'
    allowed_domains = ["pagalworld.me"]
    start_urls = ['https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html']
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//div/ul'),
            follow=True,
            callback='parse_start_url',
        ),
    )

    def parse_start_url(self, response):
        # Runs on the start page and on every page reached through the rule.
        song_names = response.xpath('//li/b/a/text()').extract()
        for item in song_names:
            yield {"songName": item, "URL": response.url}

Output


Please post the output, and also the complete code (including the object instance) – TrakJohnson

Answer


Everything seems to be correct with your spider. However, if you look at a song page, it offers two versions of every song:

$ scrapy shell "https://pagalworld.me/files/12450/Babumoshai%20Bandookbaaz%20(2017)%20Movie%20Mp3%20Songs.html" 
>[1]: response.xpath('//li/b/a/text()').extract() 
<[1]: 
['03 Aye Saiyan - Babumoshai Bandookbaaz 190Kbps.mp3', 
'03 Aye Saiyan - Babumoshai Bandookbaaz 320Kbps.mp3', 
'01 Barfani - Male (Armaan Malik) 190Kbps.mp3', 
'01 Barfani - Male (Armaan Malik) 320Kbps.mp3', 
'02 Barfani - Female (Orunima Bhattacharya) 190Kbps.mp3', 
'02 Barfani - Female (Orunima Bhattacharya) 320Kbps.mp3'] 

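One version is the lower-quality 190kbps file and the other is the higher-quality 320kbps file.
In that case you probably want to keep just one of them: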

>[2]: response.xpath('//li/b/a/text()[contains(.,"320Kb")]').extract() 
<[2]: 
['03 Aye Saiyan - Babumoshai Bandookbaaz 320Kbps.mp3', 
'01 Barfani - Male (Armaan Malik) 320Kbps.mp3', 
'02 Barfani - Female (Orunima Bhattacharya) 320Kbps.mp3'] 
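
If you would rather filter in Python than in the XPath, the same idea can be sketched as a post-processing step on the extracted list. This is only a minimal sketch: the helper name keep_highest_bitrate and the regular expression are my own assumptions, and it relies on every file name ending in a tag such as "320Kbps.mp3", as in the shell output above.

```python
import re

# Assumed naming pattern: a trailing bitrate tag such as " 190Kbps.mp3".
BITRATE_SUFFIX = re.compile(r"\s*(\d+)Kbps\.mp3$")


def keep_highest_bitrate(names):
    """Return one file name per song, preferring the highest bitrate."""
    best = {}  # song title -> (bitrate, full file name)
    for name in names:
        match = BITRATE_SUFFIX.search(name)
        if match is None:
            # No bitrate tag: treat the whole name as the title.
            title, bitrate = name, 0
        else:
            title, bitrate = name[:match.start()], int(match.group(1))
        if title not in best or bitrate > best[title][0]:
            best[title] = (bitrate, name)
    # Dicts keep insertion order, so songs come back in first-seen order.
    return [full_name for _, full_name in best.values()]


names = [
    "03 Aye Saiyan - Babumoshai Bandookbaaz 190Kbps.mp3",
    "03 Aye Saiyan - Babumoshai Bandookbaaz 320Kbps.mp3",
    "01 Barfani - Male (Armaan Malik) 190Kbps.mp3",
    "01 Barfani - Male (Armaan Malik) 320Kbps.mp3",
]
print(keep_highest_bitrate(names))
```

Inside the spider you could call such a helper on the list returned by extract() before yielding the items. The XPath filter is shorter, but this variant still works if some pages only offer the lower bitrate.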

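Edit: There also seems to be a genuine duplication problem. Try removing follow=True from the Rule that wraps the link extractor, because in this case you do not want to follow the links found on the song pages.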
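
To see why following links can repeat results, here is a toy model in plain Python. It is not Scrapy code: the SITE dictionary is a made-up three-page layout and crawl is a hypothetical helper that only imitates a single rule with a callback. It assumes that a deeper page lists the same song name again, which is my guess at what happens on the real site.

```python
from collections import deque

# Hypothetical site layout: page -> (song names on the page, links on the page).
SITE = {
    "category": ([], ["movie"]),
    "movie": (["Barfani 320Kbps.mp3"], ["song"]),
    "song": (["Barfani 320Kbps.mp3"], []),
}


def crawl(start, follow):
    """Collect song names roughly the way one rule with a callback would."""
    found = []
    seen = {start}
    # Links on the start page are always extracted, whatever `follow` is.
    queue = deque([(start, True)])
    while queue:
        page, extract_links = queue.popleft()
        names, links = SITE[page]
        found.extend(names)  # the callback runs on every fetched page
        if extract_links:
            for link in links:
                if link not in seen:  # each page is requested only once
                    seen.add(link)
                    # With follow=True the rule is applied again on that page.
                    queue.append((link, follow))
    return found


print(crawl("category", follow=False))
print(crawl("category", follow=True))
```

In this model the second call returns the same name twice, even though no page is fetched twice, because the name appears on two different pages that are both reached when links are followed.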


What is the purpose of follow, and what is it used for? I looked here, https://docs.scrapy.org/en/latest/topics/link-extractors.html, but did not find anything – emon


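@emon follow keeps applying the LinkExtractor rule to the pages the spider reaches on its own. So in this case it would look for song URLs inside the song pages it has already found. – Granitosaurus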


Is its default value False? – emon