2016-08-05

I am currently trying to download a large number of New York Times articles through their API, on Python 2.7. To do this I was able to reuse code I found online: Downloading a dict to CSV - accessing the New York Times API via Python.

from nytimesarticle import articleAPI

api = articleAPI('...')

articles = api.search(q='Brazil',
    fq={'headline': 'Brazil', 'source': ['Reuters', 'AP', 'The New York Times']},
    begin_date='20090101')

def parse_articles(articles):
    '''
    This function takes in a response to the NYT api and parses
    the articles into a list of dictionaries
    '''
    news = []
    for i in articles['response']['docs']:
        dic = {}
        dic['id'] = i['_id']
        if i['abstract'] is not None:
            dic['abstract'] = i['abstract'].encode("utf8")
        dic['headline'] = i['headline']['main'].encode("utf8")
        dic['desk'] = i['news_desk']
        dic['date'] = i['pub_date'][0:10]  # cutting time of day.
        dic['section'] = i['section_name']
        if i['snippet'] is not None:
            dic['snippet'] = i['snippet'].encode("utf8")
        dic['source'] = i['source']
        dic['type'] = i['type_of_material']
        dic['url'] = i['web_url']
        dic['word_count'] = i['word_count']
        # locations
        locations = []
        for x in range(0, len(i['keywords'])):
            if 'glocations' in i['keywords'][x]['name']:
                locations.append(i['keywords'][x]['value'])
        dic['locations'] = locations
        # subject
        subjects = []
        for x in range(0, len(i['keywords'])):
            if 'subject' in i['keywords'][x]['name']:
                subjects.append(i['keywords'][x]['value'])
        dic['subjects'] = subjects
        news.append(dic)
    return news

def get_articles(date, query):
    '''
    This function accepts a year in string format (e.g. '1980')
    and a query (e.g. 'Amnesty International') and it will
    return a list of parsed articles (in dictionaries)
    for that year.
    '''
    all_articles = []
    for i in range(0, 100):  # NYT limits the pager to the first 100 pages. But rarely will you find over 100 pages of results anyway.
        articles = api.search(q=query,
            fq={'headline': 'Brazil', 'source': ['Reuters', 'AP', 'The New York Times']},
            begin_date=date + '0101',
            end_date=date + '1231',
            page=str(i))
        articles = parse_articles(articles)
        all_articles = all_articles + articles
    return all_articles

Download_all = []
for i in range(2009, 2010):
    print 'Processing ' + str(i) + '...'
    Amnesty_year = get_articles(str(i), 'Brazil')
    Download_all = Download_all + Amnesty_year

import csv
keys = Download_all[0].keys()
with open('brazil-mentions.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(Download_all)

Without the last part (from "import csv" onwards) it seems to work fine. If I just print my results ("print Download_all") I can see them, but in a very unstructured way. When I run the actual code, however, I get the message:

File "C:\Users\xxx.yyy\AppData\Local\Continuum\Anaconda2\lib\csv.py", line 148, in _dict_to_list 
    + ", ".join([repr(x) for x in wrong_fields])) 

ValueError: dict contains fields not in fieldnames: 'abstract' 

Since I am rather a newbie at this, I would greatly appreciate your help in guiding me towards downloading the news articles into a CSV file in a structured way.

Thanks in advance! Best regards

Answer


If you have:

keys = Download_all[0].keys() 

This takes the CSV column headers from the dictionary for the first article. The problem is that the article dictionaries do not all have the same keys, so it fails when you reach the first one that has the extra key abstract.

It looks like you will have problems with abstract and snippet: they are only added to the dictionary if they exist in the response.
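To see why that raises the error, note that csv.DictWriter, with its default extrasaction='raise', refuses any row dict containing a key that is not in fieldnames. A minimal sketch of the failure, using two made-up article dicts (Python 3 syntax here, but the csv behaviour is the same in 2.7):

```python
import csv
import io

# Two article dicts: only the second has the optional 'abstract' key.
rows = [{'id': 'a1'}, {'id': 'a2', 'abstract': 'only here'}]

buf = io.StringIO()
# Header taken from the first dict only, as in the question's code.
writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
writer.writeheader()
try:
    writer.writerows(rows)
    message = None
except ValueError as err:
    message = str(err)  # "dict contains fields not in fieldnames: 'abstract'"
```

The first row is written without complaint; the second triggers exactly the ValueError from the traceback above.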

You need to make keys equal to the superset of all possible keys:

keys = Download_all[0].keys() + ['abstract', 'snippet'] 
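A more general version of this fix, rather than listing the optional fields by hand, is to build keys from the union of the keys of every downloaded article. A sketch, where the two dicts are made-up stand-ins for entries of Download_all:

```python
# Hypothetical stand-ins for entries of Download_all.
Download_all = [
    {'id': 'a1', 'headline': 'First'},
    {'id': 'a2', 'headline': 'Second', 'abstract': 'optional', 'snippet': 'optional'},
]

# Union of every key that appears in any article; sorted for a stable column order.
keys = sorted(set().union(*(article.keys() for article in Download_all)))
# keys -> ['abstract', 'headline', 'id', 'snippet']
```

This way the header never goes stale if the API starts returning another optional field.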

Alternatively, make sure that every dict has a value for every field:

def parse_articles(articles):
    ...
    if i['abstract'] is not None:
        dic['abstract'] = i['abstract'].encode("utf8")
    else:
        dic['abstract'] = ""
    ...
    if i['snippet'] is not None:
        dic['snippet'] = i['snippet'].encode("utf8")
    else:
        dic['snippet'] = ""
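A third option, instead of patching parse_articles, is to let the csv module fill the gaps: DictWriter accepts a restval argument whose value is written for any fieldname missing from a row dict. A sketch with made-up rows (Python 3 syntax; the parameter exists in 2.7 as well):

```python
import csv
import io

# Made-up article rows; 'snippet' is missing everywhere, 'abstract' sometimes.
rows = [{'id': 'a1'}, {'id': 'a2', 'abstract': 'x'}]
fieldnames = ['id', 'abstract', 'snippet']

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames, restval='')  # missing keys become ''
writer.writeheader()
writer.writerows(rows)
# buf.getvalue() ->
# id,abstract,snippet
# a1,,
# a2,x,
```

Note that restval only covers keys missing from a row; a key absent from fieldnames still raises ValueError unless you also pass extrasaction='ignore'.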