2017-04-17

PySpark - top-N words from multiple files: I have a Python dictionary mapping file names to their text:

diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'} 

I have created an RDD like this:

docNameToText = sc.parallelize(diction) 

I need to find the top-2 words occurring in each document, so the result should look like this:

1.csv, test, is 
2.txt, test, that 

I am new to pyspark. I know the algorithm, but not how to do it in pyspark. I need to:

- convert the file-to-string => file-to-wordFreq 
- arrange wordFreq in non-increasing order - if two words have the same freq, arrange them in alphabetical order 
- display the top 2 

How do I do this?
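For reference, the three steps above can be sketched in plain Python first, without Spark; the helper name `top_n` is made up for illustration and is not part of any API:

```python
from collections import Counter

def top_n(text, n=2):
    """Return the n most frequent words; ties broken alphabetically."""
    counts = Counter(text.split())
    # Sort by descending frequency, then ascending word
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

diction = {'1.csv': 'this is is a test test test ',
           '2.txt': 'that that was a test test test'}
result = {name: top_n(text) for name, text in diction.items()}
print(result)  # {'1.csv': ['test', 'is'], '2.txt': ['test', 'that']}
```

Once this per-document function works, the remaining problem is applying it to each value of the RDD.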

Answer


Just use Counter:

from collections import Counter 

(sc 
    .parallelize(diction.items()) 
    # Split each document on whitespace
    .mapValues(lambda s: s.split()) 
    # Count word occurrences per document
    .mapValues(Counter) 
    # Keep the two most common words
    .mapValues(lambda c: [x for (x, _) in c.most_common(2)])) 

Thanks! One small clarification: what if I also want the results in alphabetical order when two words have the same count? For example, for '1.csv' the result should be ['test', 'that'] rather than ['that', 'test']. Thanks. – stfd1123581321
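`Counter.most_common` does not guarantee alphabetical order among equal counts, so one way to handle ties (a sketch, not from the answer above; the name `top2_alpha` is hypothetical) is to sort explicitly by `(-count, word)` and use that function in the final `mapValues`:

```python
from collections import Counter

def top2_alpha(counter):
    """Top-2 words by descending count, alphabetical on ties."""
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:2]]

# In the pipeline above, this would replace the last step:
# .mapValues(top2_alpha)
print(top2_alpha(Counter('b a a b c'.split())))  # ['a', 'b']
```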