How to identify occurrences of items from one list in another list


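I have a file loaded with a column of text. I want to check for occurrences of country names in the loaded text. I have loaded the Wikipedia countries CSV file, and I am using the code below to count how many times each country name occurs in the loaded text.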

My code does not work correctly.

Here is my code:

text = pd.read_sql(select_string, con)
text['tokenized_text'] = mail_text.apply(lambda col: nltk.word_tokenize(col['SomeText']), axis=1)
country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')
ccs = set(country_codes['English short name lower case'])
count_occurrences = Counter(country for country in text['tokenized_text'] if country in ccs)

Source

2016-09-20 JayDoe

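Is 'country_codes' a 'dictionary'? –

Your current code has an indentation error - you should look at that first. –

No, the indentation is just a result of my cutting and pasting here – JayDoe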

In your original code the line

dic[country]= dic[country]+1

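should raise a KeyError, because the key is not yet in the dictionary the first time a country is encountered. Instead, you should check whether the key exists and, if it does not, initialize the value to 1.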
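A minimal sketch of that failure and two ways around it. The `words` list is made-up sample data standing in for the real text, and the variable names are only for illustration:

```python
# Hypothetical sample tokens standing in for the real text.
words = ["Finland", "Japan", "Finland"]

# Plain dict: incrementing a missing key raises KeyError.
dic = {}
try:
    dic["Finland"] = dic["Finland"] + 1
except KeyError as exc:
    print("KeyError:", exc)

# Fix 1: check for the key and initialise it to 1 on first sight.
counts = {}
for country in words:
    if country in counts:
        counts[country] += 1
    else:
        counts[country] = 1

# Fix 2: dict.get with a default of 0 does the same in one line.
counts_get = {}
for country in words:
    counts_get[country] = counts_get.get(country, 0) + 1

print(counts)      # {'Finland': 2, 'Japan': 1}
print(counts_get)  # {'Finland': 2, 'Japan': 1}
```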

On the other hand, it never gets that far, because the check

if country in country_codes['English short name lower case']:

yields False for every value: the __contains__ of a Series object works with the indices instead of the values. You should, for example, check

if country in country_codes['English short name lower case'].values:

if your list of values is short.
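The difference is easy to see on a small Series. The sketch below uses a hand-made `names` Series as a stand-in for the CSV column, so the data is an assumption:

```python
import pandas as pd

# Hypothetical stand-in for country_codes['English short name lower case'].
names = pd.Series(["Finland", "Japan", "Russian Federation"])

in_series = "Finland" in names          # tests the index labels 0, 1, 2
in_index = 0 in names                   # True: 0 is an index label
in_values = "Finland" in names.values   # tests the underlying values
in_set = "Finland" in set(names)        # set lookup, cheap when repeated

print(in_series, in_index, in_values, in_set)  # False True True True
```

The `.values` check scans the whole array on every lookup, which is why it only suits a short list; converting to a `set` once is likely the better choice when many lookups follow.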

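For general counting tasks, Python provides collections.Counter, which behaves a bit like defaultdict(int) but brings extra benefits. It removes the need for manual key checks and the like.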
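As a rough comparison of the two, again on made-up sample data:

```python
from collections import Counter, defaultdict

words = ["Finland", "Japan", "Finland", "Finland"]  # hypothetical sample

# defaultdict(int) supplies 0 for missing keys, so += 1 just works.
dd = defaultdict(int)
for w in words:
    dd[w] += 1

# Counter does the same and can also be built straight from an iterable.
c = Counter(words)

print(dict(dd))          # {'Finland': 3, 'Japan': 1}
print(c.most_common(1))  # [('Finland', 3)]
print(c["Vietnam"])      # 0, and the key is not inserted
```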

Since you already have DataFrame objects, you can use the tools that pandas provides:

In [12]: country_codes = pd.read_csv('wikipedia-iso-country-codes.csv') 

In [13]: text = pd.DataFrame({'SomeText': """Finland , Finland , Finland 
    ...: The country where I want to be 
    ...: Pony trekking or camping or just watch T.V. 
    ...: Finland , Finland , Finland 
    ...: It's the country for me 
    ...: 
    ...: You're so near to Russia 
    ...: so far away from Japan 
    ...: Quite a long way from Cairo 
    ...: lots of miles from Vietnam 
    ...: 
    ...: Finland , Finland , Finland 
    ...: The country where I want to be 
    ...: Eating breakfast or dinner 
    ...: or snack lunch in the hall 
    ...: Finland , Finland , Finland 
    ...: Finland has it all 
    ...: 
    ...: Read more: Monty Python - Finland Lyrics | MetroLyrics 
    ...: """.split()}) 

In [14]: text[text['SomeText'].isin(
    ...:  country_codes['English short name lower case'] 
    ...:)]['SomeText'].value_counts().to_dict() 
    ...: 
Out[14]: {'Finland': 14, 'Japan': 1}

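This finds the rows of text where the value of the SomeText column is in the 'English short name lower case' column of country_codes, counts the unique values of SomeText, and converts the result to a dictionary. The same with descriptive intermediate variables: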

In [49]: where_sometext_isin_country_codes = text['SomeText'].isin(
    ...:  country_codes['English short name lower case']) 

In [50]: filtered_text = text[where_sometext_isin_country_codes] 

In [51]: value_counts = filtered_text['SomeText'].value_counts() 

In [52]: value_counts.to_dict() 
Out[52]: {'Finland': 14, 'Japan': 1}

The same with Counter:

In [23]: from collections import Counter 

In [24]: dic = Counter() 
    ...: ccs = set(country_codes['English short name lower case']) 
    ...: for country in text['SomeText']:
    ...:     if country in ccs:
    ...:         dic[country] += 1
    ...: 

In [25]: dic 
Out[25]: Counter({'Finland': 14, 'Japan': 1})

Or simply:

In [30]: ccs = set(country_codes['English short name lower case']) 

In [31]: Counter(country for country in text['SomeText'] if country in ccs) 
Out[31]: Counter({'Finland': 14, 'Japan': 1})
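The examples above assume one token per row. In the question's code each row of `tokenized_text` is a list of tokens, and a list cannot be looked up in a set because it is unhashable. One way to handle that is to flatten the rows first. This is sketched with plain lists and made-up sample data rather than a real DataFrame column:

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-ins: each inner list is one row of tokens.
tokenized_rows = [
    ["Finland", ",", "Finland", "has", "it", "all"],
    ["so", "far", "away", "from", "Japan"],
]
ccs = {"Finland", "Japan"}

# chain.from_iterable walks every token of every row in order.
tokens = chain.from_iterable(tokenized_rows)
count_occurrences = Counter(tok for tok in tokens if tok in ccs)

print(count_occurrences)  # Counter({'Finland': 2, 'Japan': 1})
```

With a real DataFrame the same call should work on `text['tokenized_text']`, since iterating a Series yields its values.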

Source

2016-09-20 08:45:57

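So what happened to Russia and Vietnam? Are they no longer countries? I think the source data could be better... – Frangipanes

Russia is there, not as just "Russia" but as "Russian Federation". Vietnam, on the other hand, is not. The OP's data and approach could use some improvement. –

Good point about Russia. Since it is never called "Russian Federation" in the text, only "Russia", maybe I need to find another source file for the country codes? – JayDoe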
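Short of switching source files, one option might be to map common names onto the names used in the CSV before counting. The alias table below is a hypothetical, hand-written example and is not part of the Wikipedia file. The `ccs` set is likewise an assumed sample:

```python
from collections import Counter

# Names as they might appear in the CSV (assumed sample, not the full file).
ccs = {"Finland", "Japan", "Russian Federation"}

# Hypothetical hand-maintained aliases: common name -> name in the CSV.
aliases = {"Russia": "Russian Federation"}

tokens = ["Finland", "Russia", "Japan", "Cairo", "Finland"]

# Normalise each token through the alias table, then count the matches.
normalised = (aliases.get(tok, tok) for tok in tokens)
counts = Counter(name for name in normalised if name in ccs)

print(counts["Finland"], counts["Russian Federation"], counts["Cairo"])  # 2 1 0
```

Other mismatches would need their own entries in `aliases`, so this only scales as far as someone is willing to maintain the table.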
