2017-04-17

PySpark - top-N words from multiple files: I have a Python dictionary mapping file names to their text:

diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'} 

I have created an RDD like this:

docNameToText = sc.parallelize(diction) 

I need to find the top-2 words occurring in each document, so the result should look like this:

1.csv, test, is 
2.txt, test, that 

I am new to pyspark. I know the algorithm, but not how to do it in pyspark. I need to:

- convert the file-to-string => file-to-wordFreq 
- arrange wordFreq in non-increasing order - if two words have the same freq, arrange them in alphabetical order 
- display the top 2 

How do I do this?
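For reference, the three steps above can be sketched in plain Python first, without Spark; the helper name `top_n` is made up for illustration and is not part of any API:

```python
from collections import Counter

def top_n(text, n=2):
    """Return the n most frequent words; ties broken alphabetically."""
    counts = Counter(text.split())
    # Sort by descending frequency, then ascending word
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

diction = {'1.csv': 'this is is a test test test ',
           '2.txt': 'that that was a test test test'}
result = {name: top_n(text) for name, text in diction.items()}
print(result)  # {'1.csv': ['test', 'is'], '2.txt': ['test', 'that']}
```

Once this per-document function works, the remaining problem is applying it to each value of the RDD.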

Answer


Just use Counter:

from collections import Counter 

(sc 
    .parallelize(diction.items()) 
    # Split each document on whitespace
    .mapValues(lambda s: s.split()) 
    # Count word occurrences per document
    .mapValues(Counter) 
    # Keep the two most common words
    .mapValues(lambda c: [x for (x, _) in c.most_common(2)])) 

Thanks! One small clarification: what if I also want the results in alphabetical order when two words have the same count? For example, for '1.csv' the result should be ['test', 'that'] rather than ['that', 'test']. Thanks. – stfd1123581321
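`Counter.most_common` does not guarantee alphabetical order among equal counts, so one way to handle ties (a sketch, not from the answer above; the name `top2_alpha` is hypothetical) is to sort explicitly by `(-count, word)` and use that function in the final `mapValues`:

```python
from collections import Counter

def top2_alpha(counter):
    """Top-2 words by descending count, alphabetical on ties."""
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:2]]

# In the pipeline above, this would replace the last step:
# .mapValues(top2_alpha)
print(top2_alpha(Counter('b a a b c'.split())))  # ['a', 'b']
```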