2017-03-05

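Scrapy: items as sub-items in JSON

How do I tell Scrapy to split all yielded items into two lists? For example, say I have two main types of items: article and author. I want to put them into two separate lists. Right now I get this JSON output: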

[ 
    { 
    "article_title":"foo", 
    "article_published":"1.1.1972", 
    "author": "John Doe" 
    }, 
    { 
    "name": "John Doe", 
    "age": 42, 
    "email": "[email protected]" 
    } 
] 

How can I convert it into something like this?

{ 
    "articles": [ 
    { 
     "article_title": "foo", 
     "article_published": "1.1.1972", 
     "author": "John Doe" 
    } 
    ], 
    "authors": [ 
    { 
     "name": "John Doe", 
     "age": 42, 
     "email": "[email protected]" 
    } 
    ] 
} 

The functions that produce this output are all very simple, similar to this one:

def parse_author(self, response): 
     name = response.css('div.author-info a::text').extract_first() 
     print("Parsing author: {}".format(name)) 

     yield { 
      'author_name': name 
     } 

Answers

2

Items reach the pipeline one at a time, so with this setup you can handle each one according to its type:

items.py

import scrapy

class Article(scrapy.Item):
    title = scrapy.Field()
    published = scrapy.Field()
    author = scrapy.Field()

class Author(scrapy.Item):
    name = scrapy.Field()
    age = scrapy.Field()

spider.py

def parse(self, response): 

    author = items.Author() 
    author['name'] = response.css('div.author-info a::text').extract_first() 
    print("Parsing author: {}".format(author['name'])) 
    yield author 

    article = items.Article() 
    article['title'] = response.css('article css').extract_first() 
    print("Parsing article: {}".format(article['title'])) 

    yield article 

pipelines.py

def process_item(self, item, spider):
    if isinstance(item, items.Author):
        pass  # Do something to authors
    elif isinstance(item, items.Article):
        pass  # Do something to articles
    return item  # hand the item on to any later pipeline
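
One way to fill in those two branches is to collect the items per type and write a single JSON object when the spider closes. This is only a sketch: the class name `GroupedJsonPipeline` and the output file name are made up, and `Article` / `Author` are plain `dict` stand-ins for the `scrapy.Item` classes from items.py so the snippet runs without Scrapy installed. `open_spider`, `process_item` and `close_spider` are the hook names Scrapy calls on a pipeline.

```python
import json


class Article(dict):
    """Stand-in for items.Article (a scrapy.Item in a real project)."""


class Author(dict):
    """Stand-in for items.Author (a scrapy.Item in a real project)."""


class GroupedJsonPipeline:
    """Hypothetical pipeline: groups items by type, writes one JSON object at the end."""

    def __init__(self, path="grouped.json"):
        self.path = path  # assumed output file name

    def open_spider(self, spider):
        # Fresh accumulators at the start of each crawl.
        self.data = {"articles": [], "authors": []}

    def process_item(self, item, spider):
        if isinstance(item, Author):
            self.data["authors"].append(dict(item))
        elif isinstance(item, Article):
            self.data["articles"].append(dict(item))
        return item  # keep the item flowing to later pipelines

    def close_spider(self, spider):
        # Called once when the crawl finishes: dump everything in one go.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.data, f, indent=2)
```

In a real project the pipeline would also have to be enabled through the `ITEM_PIPELINES` setting, with whatever module path your project uses.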

That said, I would suggest this schema instead:

[{ 
    "title": "foo", 
    "published": "1.1.1972", 
    "authors": [ 
     { 
     "name": "John Doe", 
     "age": 42, 
     "email": "[email protected]" 
     }, 
     { 
     "name": "Jane Doe", 
     "age": 21, 
     "email": "[email protected]" 
     }
    ] 
}] 

This puts everything into a single item.

items.py

class Article(scrapy.Item): 
    title = scrapy.Field() 
    published = scrapy.Field() 
    authors = scrapy.Field() 

spider.py

def parse(self, response): 

    authors = [] 
    author = {} 
    author['name'] = "John Doe" 
    author['age'] = 42 
    author['email'] = "[email protected]" 
    print("Parsing author: {}".format(author['name'])) 
    authors.append(author) 

    article = items.Article() 
    article['title'] = "foo" 
    article['published'] = "1.1.1972" 
    print("Parsing article: {}".format(article['title'])) 
    article['authors'] = authors 
    yield article 
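
With this nested schema, a flat list of authors can still be recovered from the exported articles afterwards. A rough sketch, assuming the export has been loaded into a list of dicts and that an author's `name` is enough to identify them (the helper name `all_authors` is made up):

```python
def all_authors(articles):
    """Return each distinct author once, in first-seen order, keyed by name."""
    seen = {}
    for article in articles:
        for author in article.get("authors", []):
            # setdefault keeps the first record seen for a given name.
            seen.setdefault(author["name"], author)
    return list(seen.values())
```

If two different authors can share a name, a different key (for example the email) would be needed.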

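Even with access to the pipeline, I'm still not sure how to group all items of a given type under one JSON key. Changing the pipeline to return '{'author':item}' still creates an 'author' key for every single item. I guess I need to accumulate all the items somewhere in a list of my own and then output them as JSON at the end, but I don't know how to do that. The schema you suggest is fine if I mainly want to iterate over articles, but listing all authors, for example, would become harder.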

@MartinMelka I edited my answer

1
raw = [ 
    { 
     "article_title":"foo", 
     "article_published":"1.1.1972", 
     "author": "John Doe" 
    }, 
    { 
     "name": "John Doe", 
     "age": 42, 
     "email": "[email protected]" 
    } 
] 

data = {'articles': [], 'authors': []}

for a in raw:
    # Entries with an 'article_title' key are articles; everything else is an author.
    if 'article_title' in a:
        data['articles'].append(a)
    else:
        data['authors'].append(a)
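
If the items never pass through your own code during the crawl, the same grouping could be run as a separate step on the file Scrapy exports (for example one produced with `scrapy crawl <spider> -o items.json`). A sketch only: the function name `regroup` and the file names are placeholders.

```python
import json


def regroup(in_path, out_path):
    """Read a flat JSON list of items and write them grouped by type."""
    with open(in_path, encoding="utf-8") as f:
        raw = json.load(f)

    data = {"articles": [], "authors": []}
    for entry in raw:
        # Same heuristic as the loop above: articles carry an 'article_title' key.
        key = "articles" if "article_title" in entry else "authors"
        data[key].append(entry)

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    return data
```

Calling `regroup("items.json", "grouped.json")` after the crawl would then leave the grouped object in `grouped.json`.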

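I don't know how to work with a dictionary like that in Scrapy. Whatever my parse functions yield is passed straight to Scrapy, so I never get a chance to process it at the end. Could you expand your answer?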

@MartinMelka Where do you mean by "process"? Sorry, I didn't get your question... My understanding is that your data should be accessible via 'item['articles']' – Umair