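Scrapy: putting items as sub-items in JSON

How can I tell Scrapy to split all yielded items into two lists? For example, say I have two main types of items: article and author. I want them in two separate lists. Right now I get this JSON output: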

[ 
    { 
    "article_title":"foo", 
    "article_published":"1.1.1972", 
    "author": "John Doe" 
    }, 
    { 
    "name": "John Doe", 
    "age": 42, 
    "email": "[email protected]" 
    } 
]

How can I turn it into something like this?

{ 
    "articles": [ 
    { 
     "article_title": "foo", 
     "article_published": "1.1.1972", 
     "author": "John Doe" 
    } 
    ], 
    "authors": [ 
    { 
     "name": "John Doe", 
     "age": 42, 
     "email": "[email protected]" 
    } 
    ] 
}

The functions that yield these items are all simple, similar to this one:

def parse_author(self, response): 
     name = response.css('div.author-info a::text').extract_first() 
     print("Parsing author: {}".format(name)) 

     yield { 
      'author_name': name 
     }


2017-03-05 Martin Melka

Items reach the pipeline one at a time, so with this setup you can handle each type accordingly:

items.py

import scrapy

class Article(scrapy.Item): 
    title = scrapy.Field() 
    published = scrapy.Field() 
    author = scrapy.Field() 

class Author(scrapy.Item): 
    name = scrapy.Field() 
    age = scrapy.Field()

spider.py

def parse(self, response): 

    author = items.Author() 
    author['name'] = response.css('div.author-info a::text').extract_first() 
    print("Parsing author: {}".format(author['name'])) 
    yield author 

    article = items.Article() 
    # 'article css' is a placeholder; substitute the real CSS selector for the title.
    article['title'] = response.css('article css').extract_first() 
    print("Parsing article: {}".format(article['title'])) 

    yield article

pipelines.py

def process_item(self, item, spider): 
    if isinstance(item, items.Author): 
        pass  # Do something to authors 
    elif isinstance(item, items.Article): 
        pass  # Do something to articles 
    return item  # Scrapy expects process_item to return the item

I would suggest this schema instead, though:

[{ 
    "title": "foo", 
    "published": "1.1.1972", 
    "authors": [ 
     { 
     "name": "John Doe", 
     "age": 42, 
     "email": "[email protected]" 
     }, 
     { 
     "name": "Jane Doe", 
     "age": 21, 
     "email": "[email protected]" 
     }
    ] 
}]

This keeps everything together in a single item.

items.py

import scrapy

class Article(scrapy.Item): 
    title = scrapy.Field() 
    published = scrapy.Field() 
    authors = scrapy.Field()

spider.py

def parse(self, response): 

    authors = [] 
    author = {} 
    author['name'] = "John Doe" 
    author['age'] = 42 
    author['email'] = "[email protected]" 
    print("Parsing author: {}".format(author['name'])) 
    authors.append(author) 

    article = items.Article() 
    article['title'] = "foo" 
    article['published'] = "1.1.1972" 
    print("Parsing article: {}".format(article['title'])) 
    article['authors'] = authors 
    yield article
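
With the nested schema, authors live inside articles, so listing every author takes one extra pass over the data. A minimal sketch of that pass, assuming the author's name is enough to identify them (the helper name `list_authors` and the second sample article "bar" are made up for illustration):

```python
def list_authors(articles):
    """Return each distinct author once, in first-seen order."""
    seen = set()
    authors = []
    for article in articles:
        for author in article.get("authors", []):
            # Assumption: the name uniquely identifies an author.
            if author["name"] not in seen:
                seen.add(author["name"])
                authors.append(author)
    return authors


# Hypothetical sample data in the nested schema.
articles = [
    {"title": "foo", "published": "1.1.1972",
     "authors": [{"name": "John Doe", "age": 42},
                 {"name": "Jane Doe", "age": 21}]},
    {"title": "bar", "published": "2.2.1973",
     "authors": [{"name": "John Doe", "age": 42}]},
]

print([a["name"] for a in list_authors(articles)])
# prints ['John Doe', 'Jane Doe']
```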


2017-03-06 13:45:47

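With the pipeline approach I am still not sure how to group all items of a given type under one JSON key. Changing the pipeline to return `{'author': item}` still creates an 'author' key for every item. I guess I need to accumulate all the items in a list of my own somewhere and then output them as JSON at the end, but I don't know how to do that.

The schema you suggest is fine if I mainly want to iterate over articles. Listing all authors, for example, becomes harder.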
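
One possible way to do the accumulation described in this comment is a pipeline that collects items while the spider runs and writes a single JSON object when it closes. This is only a sketch: `open_spider`, `process_item` and `close_spider` are Scrapy's pipeline hooks, but the class name `GroupedJsonPipeline`, the output file name, and the mapping from class names to JSON keys are assumptions, not something from the thread.

```python
import json


class GroupedJsonPipeline:
    """Accumulate items per type and write one JSON object when the spider closes."""

    # Assumption: item classes are named Article and Author, as in items.py above.
    group_keys = {"Article": "articles", "Author": "authors"}
    # Hypothetical output file name; not part of the original thread.
    output_path = "grouped.json"

    def open_spider(self, spider):
        # Start every crawl with an empty list for each known group.
        self.groups = {key: [] for key in self.group_keys.values()}

    def process_item(self, item, spider):
        # Pick the target list from the item's class name; unknown types go to "other".
        key = self.group_keys.get(type(item).__name__, "other")
        self.groups.setdefault(key, []).append(dict(item))
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once at the end of the crawl: dump everything in one go.
        with open(self.output_path, "w", encoding="utf-8") as f:
            json.dump(self.groups, f, ensure_ascii=False, indent=2)
```

The pipeline would be enabled through `ITEM_PIPELINES` in settings.py, for example `{"myproject.pipelines.GroupedJsonPipeline": 300}`, where `myproject` stands in for the real project name. Note that this keeps every item in memory until the crawl ends, which may not suit very large crawls.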

@MartinMelka I have edited my answer.

raw = [ 
    { 
     "article_title":"foo", 
     "article_published":"1.1.1972", 
     "author": "John Doe" 
    }, 
    { 
     "name": "John Doe", 
     "age": 42, 
     "email": "[email protected]" 
    } 
] 

# Sort the raw items into two lists, depending on which keys they carry.
data = {'articles': [], 'authors': []}

for a in raw:
    if 'article_title' in a:
        data['articles'].append(a)
    else:
        data['authors'].append(a)


2017-03-06 00:07:44 Umair

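I don't know how to process a dictionary like that in Scrapy. Whatever the parse function yields is passed straight to Scrapy, so I never get the chance to process it at the end. Could you expand your answer?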

@MartinMelka Process it where, exactly? Sorry, I didn't get your question... My understanding is that your data should be accessible through `item['articles']`. - Umair
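
If the regrouping does not have to happen inside Scrapy, the loop from this answer could also run as a separate step on the exported file. The sketch below assumes the crawl was exported as a flat JSON list with Scrapy's feed export (for example `scrapy crawl <spider> -o items.json`); the file names and the helper names `regroup` and `regroup_file` are hypothetical.

```python
import json


def regroup(raw):
    """Split a flat list of item dicts into articles and authors."""
    data = {"articles": [], "authors": []}
    for item in raw:
        # Assumption: only article items carry an 'article_title' key.
        key = "articles" if "article_title" in item else "authors"
        data[key].append(item)
    return data


def regroup_file(src_path, dst_path):
    """Read the flat JSON export and write the grouped version."""
    with open(src_path, encoding="utf-8") as src:
        raw = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(regroup(raw), dst, ensure_ascii=False, indent=2)
```

After the crawl finishes, calling `regroup_file("items.json", "grouped.json")` would produce the two-list layout asked for in the question.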
