2014-09-06 32 views
-1

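I'm trying to loop through multiple articles on Reddit, go through each article, extract the top relevant entity (found by filtering for the highest relevance score), and then add it to the list master_locations. How do I empty a list on each loop iteration in Python?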

from __future__ import print_function 
from alchemyapi import AlchemyAPI 
import json 
import urllib2 
from bs4 import BeautifulSoup 

alchemyapi = AlchemyAPI() 
reddit_url = 'http://www.reddit.com/r/worldnews' 
urls = [] 
locations = [] 
relevance = [] 
master_locations = [] 

def get_all_links(page):
    html = urllib2.urlopen(page).read()
    soup = BeautifulSoup(html)
    for a in soup.find_all('a', 'title may-blank ', href=True):
        urls.append(a['href'])
        run_alchemy_entity_per_link(a['href'])

def run_alchemy_entity_per_link(articleurl):
    response = alchemyapi.entities('url', articleurl)
    if response['status'] == 'OK':
        for entity in response['entities']:
            if entity['type'] in entity == 'Country' or entity['type'] == 'Region' or entity['type'] == 'City' or entity['type'] == 'StateOrCountry' or entity['type'] == 'Continent':
                if entity.get('disambiguated'):
                    locations.append(entity['disambiguated']['name'])
                    relevance.append(entity['relevance'])
                else:
                    locations.append(entity['text'])
                    relevance.append(entity['relevance'])
            else:
                locations.append('No Location')
                relevance.append('0')
        max_pos = relevance.index(max(relevance)) # get nth position of the highest relevancy score
        master_locations.append(locations[max_pos]) # Use n to get nth position of location and store that location name to master_locations
        del locations[0] # RESET LIST
        del relevance[0] # RESET LIST
    else:
        print('Error in entity extraction call: ', response['statusInfo'])

get_all_links('http://www.reddit.com/r/worldnews') # Gets all URLs per article, then analyzes entity 

for item in master_locations: 
    print(item) 

But I think that, for some reason, the lists locations and relevance are not being reset. Am I doing this wrong?

The result of printing this is:

Holland 
Holland 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Beirut 
Mogadishu 
Mogadishu 
Mogadishu 
Mogadishu 
Mogadishu 
Mogadishu 
Mogadishu 
Mogadishu 
Johor Bahru 

(probably because the lists are not being cleared)

+0

I've downvoted because this is a long piece of code, most of it irrelevant, that could have been simplified a lot. http://sscce.org/ – Davidmh 2014-09-06 10:05:46

Answers

0

del list[0] deletes only the first item of the list.

If you want to delete all the items, use one of the following:

del list[:] 

list[:] = [] 
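
As a minimal illustration of the difference, the sketch below runs all three forms on a made-up list (the name `items` and its contents are assumptions for this example, not taken from the question):

```python
# Hypothetical sample data, not from the question.
items = ['a', 'b', 'c']
alias = items                    # a second name bound to the same list object

del items[0]                     # removes only the first element
after_del_first = list(items)    # snapshot of what is left

del items[:]                     # removes every element, in place
after_del_all = list(items)

items.extend(['x', 'y'])
items[:] = []                    # slice assignment also empties the list in place
same_object = alias is items     # both names still refer to the one list
```

Both `del items[:]` and `items[:] = []` empty the existing list object, so any other reference to it (such as a module-level name used inside a function) sees the change.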
+0

I tried changing the lists to 'locations[:] = []' and 'relevance[:] = []', but I get a 'ValueError: max() arg is an empty sequence' error. – 2014-09-06 09:33:25

+0

@PhillipeDongwooHan, guard the two lines before the 'del' statements with 'if relevance:'. – falsetru 2014-09-06 09:35:07
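
The guard is needed because max() raises ValueError when given an empty sequence, which happens for any article that yields no entities once the lists really are emptied each time. A rough sketch of the guarded logic follows; `pick_top` is a hypothetical helper standing in for the tail of run_alchemy_entity_per_link, and the sample values are made up:

```python
def pick_top(locations, relevance, master_locations):
    """Append the most relevant location, then empty both working lists."""
    if relevance:  # an empty list is falsy, so max() is skipped for it
        max_pos = relevance.index(max(relevance))
        master_locations.append(locations[max_pos])
    del locations[:]
    del relevance[:]

master = []
pick_top(['Beirut', 'Holland'], [0.3, 0.9], master)
pick_top([], [], master)  # no entities: nothing appended, no ValueError
```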

+0

Thanks! That fixed it! But could you briefly explain why it works? Why add an if condition? – 2014-09-06 09:51:19

0

In your case, don't reuse the lists; just create new ones:

from __future__ import print_function 
from alchemyapi import AlchemyAPI 
import json 
import urllib2 
from bs4 import BeautifulSoup 

alchemyapi = AlchemyAPI() 
reddit_url = 'http://www.reddit.com/r/worldnews' 

def get_all_links(page):
    html = urllib2.urlopen(page).read()
    soup = BeautifulSoup(html)
    urls = []
    master_locations = []
    for a in soup.find_all('a', 'title may-blank ', href=True):
        urls.append(a['href'])
        master_locations.append(run_alchemy_entity_per_link(a['href']))
    return urls, master_locations

def run_alchemy_entity_per_link(articleurl):
    response = alchemyapi.entities('url', articleurl)
    if response['status'] != 'OK':
        print('Error in entity extraction call: ', response['statusInfo'])
        return
    # A fresh list on every call, so nothing carries over between articles
    locations_with_relevance = []
    for entity in response['entities']:
        if entity['type'] in ('Country', 'Region', 'City', 'StateOrCountry', 'Continent'):
            if entity.get('disambiguated'):
                location = entity['disambiguated']['name']
            else:
                location = entity['text']
            # Assumes relevance is a decimal value or string such as '0.85';
            # int() would fail on such a string, so parse it as a float
            locations_with_relevance.append((float(entity['relevance']), location))
        else:
            locations_with_relevance.append((0.0, 'No Location'))
    # max() raises ValueError on an empty list (an article with no entities)
    if not locations_with_relevance:
        return 'No Location'
    return max(locations_with_relevance)[1]

def main():
    _urls, master_locations = get_all_links(reddit_url) # Gets all URLs per article, then analyzes entity

    for item in master_locations:
        print(item)

if __name__ == '__main__':
    main()

When you have several related values to store, put them into a tuple and put the tuples into a single list, instead of keeping two or more separate lists.
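
A small sketch of why this works with max(): tuples compare element by element, so putting the score first makes max() return the highest-scoring pair. The pairs below are invented for illustration and are not real API output:

```python
# Hypothetical (relevance, location) pairs, made up for this example.
pairs = [(0.33, 'Holland'), (0.91, 'Beirut'), (0.0, 'No Location')]

# The first element (the score) decides the comparison; the second element
# is only consulted when two scores are equal.
score, location = max(pairs)

# Alternative that compares on the score alone and ignores the name.
by_score_only = max(pairs, key=lambda pair: pair[0])
```

Because the score and the name travel together, there is no index bookkeeping and no way for the two sequences to fall out of step.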

+0

Hmm.. I tried running your code and I get 'TypeError: 'list' object is not callable'? – 2014-09-06 09:32:09

+0

@PhillipeDongwooHan: Corrected. Anyway, it's more about reading the code and spotting the differences. – Daniel 2014-09-06 10:03:10