Extracting common elements from several lists


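In general, what I want to do is extract the common elements in the shared 'word' column of several csv files (2008.csv, 2009.csv, 2010.csv ... 2015.csv).

All files have the same format: 'word', 'count'

'word' contains all the frequent words in a document from a given year.

Here is a snapshot of one file: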

file 2008.csv

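Whenever any two of the 8 files have elements in common, I want to know those shared elements and which files they are in (this is very much like a tf-idf calculation, btw).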
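One way to get both the shared words and the files they occur in is to map each word to the years that contain it. The following is only a minimal sketch: it assumes the 'word' column of each file has already been loaded into a Python set, and the sample data and the helper names `index_words` and `shared_words` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: year -> set of words from that file's 'word' column
words_by_year = {
    "2008": {"bank", "crisis", "market"},
    "2009": {"crisis", "recovery", "market"},
    "2010": {"recovery", "growth"},
}

def index_words(words_by_year):
    """Map every word to the sorted list of years whose file contains it."""
    years_by_word = defaultdict(list)
    for year in sorted(words_by_year):
        for word in words_by_year[year]:
            years_by_word[word].append(year)
    return dict(years_by_word)

def shared_words(words_by_year, min_files=2):
    """Keep only the words that occur in at least `min_files` files."""
    index = index_words(words_by_year)
    return {w: years for w, years in index.items() if len(years) >= min_files}

shared = shared_words(words_by_year)
for word in sorted(shared):
    print(word, shared[word])
```

Raising `min_files` to 3 or 4 would narrow the result to words shared by at least that many files, without writing a separate comparison for each case.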

Anyway, my goal is to find the frequent words that appear across these files. (As far as I know, an element can appear in at most five files.)

I also want to know when these words first appear, i.e. a word that is in file C but not in files B and A.
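The "first appearance" of a word can be found by walking through the years in chronological order and recording the first year in which each word is seen. This is a rough sketch under the assumption that each file's 'word' column is already available as a set; the sample data and the helper name `first_appearance` are hypothetical.

```python
# Hypothetical sample data: year -> set of words from that file's 'word' column
words_by_year = {
    "2008": {"bank", "crisis"},
    "2009": {"crisis", "recovery"},
    "2010": {"recovery", "growth", "bank"},
}

def first_appearance(words_by_year):
    """Return a dict mapping each word to the earliest year containing it."""
    first_seen = {}
    # Sorting four-digit year strings gives chronological order
    for year in sorted(words_by_year):
        for word in words_by_year[year]:
            # setdefault keeps the year already stored, so later years never overwrite it
            first_seen.setdefault(word, year)
    return first_seen

first_seen = first_appearance(words_by_year)
# Words that are in 2010 but in none of the earlier files
new_in_2010 = sorted(w for w, year in first_seen.items() if year == "2010")
print(new_in_2010)
```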

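I know that `if` checks could possibly solve the problem here, but that is very tedious: I would need to compare 2 of the 8, 3 of the 8, or 4 of the 8 columns in this case to find the shared elements.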
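The "2 of 8, 3 of 8, 4 of 8" comparisons do not have to be written out by hand; `itertools.combinations` can enumerate every group of k files. The sketch below assumes the words of each year are already loaded into sets; the sample data and the helper name `common_to_groups` are invented for the example.

```python
from itertools import combinations

# Hypothetical sample data: year -> set of words from that file's 'word' column
words_by_year = {
    "2008": {"bank", "crisis", "market"},
    "2009": {"crisis", "recovery", "market"},
    "2010": {"recovery", "growth", "market"},
}

def common_to_groups(words_by_year, k):
    """Return {tuple of k years: words shared by all of them}, skipping empty results."""
    result = {}
    for years in combinations(sorted(words_by_year), k):
        shared = set.intersection(*(words_by_year[y] for y in years))
        if shared:
            result[years] = shared
    return result

pairs = common_to_groups(words_by_year, 2)
triples = common_to_groups(words_by_year, 3)
print(pairs)
print(triples)
```

With 8 files there are 28 pairs, 56 triples and 70 groups of four, so the loop stays cheap for lists of this size.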

This is the code I have worked out so far, which is far from what I need... it only compares two of the 8 files: code

Can anyone help?

Source

2016-02-16 ShirleyWang

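You forgot to post the code you have so far. –

Please provide the relevant information in the question itself. Links can go dead, and we are here to help *you*. We would appreciate it if you made that easier for us. – zondo

How is this like TFxIDF? You have obtained the DF, but that is where it ends. – tripleee

Using set intersection can help: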

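import pandas as pd

# Assumed: years stored as strings, matching files named 'filename_<year>.csv'
year_list = [str(y) for y in range(2008, 2016)]

for i in range(len(year_list)):
    # Words of the reference year
    datai = set(pd.read_csv('filename_' + year_list[i] + '.csv')['word'])
    tocompare = []
    for j in range(i + 1, len(year_list)):
        dataj = set(pd.read_csv('filename_' + year_list[j] + '.csv')['word'])
        print("Two files:", i, j)
        print(datai.intersection(dataj))  # words shared by the two years
        tocompare.append(dataj)
    print("All compare:")
    # Words present in the reference year and in every later year
    print(datai.intersection(*tocompare))
    break  # stops after the first year; remove it to repeat this for every year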

Source

2016-02-16 02:59:10 platinhom

Thanks! But this approach is still limited to comparing the keywords of two years (or files). Is there any way to compare across all eight files? – ShirleyWang

The 'intersection' method can accept multiple arguments! So you just need to read the other files and pass them all to the method, like: 'datai.intersection(dataj, datak, datam, ...)' – platinhom
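As a small illustration of the multi-argument form, the sketch below uses three made-up sets in place of the per-year word sets; the variable names are hypothetical.

```python
# Hypothetical word sets standing in for the 'word' column of three files
data_2008 = {"bank", "crisis", "market"}
data_2009 = {"crisis", "recovery", "market"}
data_2010 = {"market", "growth", "crisis"}

# Words present in all three sets at once
common_all = data_2008.intersection(data_2009, data_2010)
print(sorted(common_all))

# The same call with the other sets collected in a list and unpacked
others = [data_2009, data_2010]
same_result = data_2008.intersection(*others)

# With nothing to unpack, intersection() simply returns a copy of the set
no_others = data_2008.intersection(*[])
```

The last line matters when the list of other sets is empty, for example for the final year in a forward-only loop: every word of that year is then reported as "common".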

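There is still a problem with the code: the "All compare" step only looks forward, which means 2012 is compared with the combined data of 2013 to 2015, but not with 2011. This causes a problem when I try to find the words unique to a particular year. For example, a word that appeared in 2011 but not in 2013 would be treated as unique to 2012. – ShirleyWang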
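The forward-only issue described here can be avoided by comparing a year against the union of all other years, earlier and later alike. The following is a sketch with made-up data, assuming the words of each year are already in sets; `unique_to_year` is a hypothetical helper name.

```python
# Hypothetical sample data: year -> set of words from that file's 'word' column
words_by_year = {
    "2011": {"debt", "euro"},
    "2012": {"debt", "election", "olympics"},
    "2013": {"election", "budget"},
}

def unique_to_year(words_by_year, year):
    """Words of `year` that occur in no other year, earlier or later."""
    others = set().union(
        *(words for y, words in words_by_year.items() if y != year)
    )
    return words_by_year[year] - others

unique_2012 = unique_to_year(words_by_year, "2012")

# Forward-only comparison for contrast: 'debt' looks unique although 2011 has it
forward_only = words_by_year["2012"] - words_by_year["2013"]

print(sorted(unique_2012))
print(sorted(forward_only))
```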

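The first answer works well in general. But for some reason the intersection function did not return exactly the result I expected, so I modified the provided code to make the printed output more accurate and better formatted.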

import pandas as pd

# Assumed: years stored as strings, matching files named 'most_50_common_words_<year>.csv'
year_list = [str(y) for y in range(2008, 2016)]

for i in range(0, 8):
    otheryears = []  # words seen in any year other than year i
    # Collect the words of all earlier years
    for y in range(0, i):
        datay = set(pd.read_csv("most_50_common_words_" + year_list[y] + '.csv')["word"])
        for word in datay:
            if word not in otheryears:
                otheryears.append(word)
    uniquei = []
    datai = set(pd.read_csv("most_50_common_words_" + year_list[i] + '.csv')["word"])
    print("\nCompare year %d with:\n" % int(year_list[i]))
    # Pairwise comparison with each later year, collecting its words as well
    for j in range(i + 1, 8):
        dataj = set(pd.read_csv("most_50_common_words_" + year_list[j] + '.csv')['word'])
        print(year_list[j], ':')
        shared = list(datai.intersection(dataj))
        print(shared, '\n', "%d common words with year %d" % (len(shared), int(year_list[j])))
        for word in dataj:
            if word not in otheryears:
                otheryears.append(word)

    # Words of year i that also occur in at least one other year
    common = []
    for x in datai:
        if x in otheryears:
            common.append(x)
    print("\nAll compare:")
    print("%d year has %d words in common with other years. They are as follows:\n%s" % (int(year_list[i]),
                                                                                          len(common), common), '\n')
    # Words of year i that occur in no other year
    for x in datai:
        if x not in otheryears:
            uniquei.append(x)
    print("%d Frequent words unique in year %d:\n%s \n" % (len(uniquei), int(year_list[i]), uniquei))

Source

2016-02-16 18:37:39 ShirleyWang
