PySpark - top N words from multiple files
I have a Python dictionary:
diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'}
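I have created an RDD from it like this (using the dictionary's items, since parallelizing the dict itself would keep only the keys):
docNameToText = sc.parallelize(list(diction.items()))
I need to find the top 2 most frequent strings in each document. So the result should look something like this: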
1.csv, test, is
2.txt, test, that
I am new to PySpark. I know the algorithm, but I don't know how to do it in PySpark. I need to:
- convert the file-to-string => file-to-wordFreq
- arrange wordFreq in non-increasing order - if two words have the same freq, arrange them in alphabetical order
- display the top 2
How can I do this?
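One possible way to express the three steps above is sketched below. This is only a minimal sketch, not a definitive answer: `top_n_words` and `top_words_per_doc` are hypothetical helper names introduced here for illustration, and the sketch assumes the RDD holds (filename, text) pairs and that words are separated by whitespace. The per-document logic is plain Python, so it can be checked without a Spark cluster; the only Spark call used is `mapValues`.

```python
from collections import Counter

# Sample data from the question.
diction = {'1.csv': 'this is is a test test test ',
           '2.txt': 'that that was a test test test'}

def top_n_words(text, n=2):
    """Return the n most frequent words in text (hypothetical helper).

    Words with equal counts are ordered alphabetically.
    """
    # Step 1: file-to-string => word frequencies (whitespace split is an assumption).
    counts = Counter(text.split())
    # Step 2: sort by descending count, then by word in ascending order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Step 3: keep only the top n words.
    return [word for word, _ in ranked[:n]]

def top_words_per_doc(doc_rdd, n=2):
    """Apply top_n_words to every value of a (filename, text) pair RDD."""
    return doc_rdd.mapValues(lambda text: top_n_words(text, n))

# Local check of the per-document step, no Spark required.
local_result = {name: top_n_words(text) for name, text in diction.items()}
```

With a live SparkContext `sc`, something like `top_words_per_doc(sc.parallelize(list(diction.items()))).collect()` should then return one (filename, [word, word]) pair per document. Sorting on the key `(-count, word)` is what handles the alphabetical tie-break, so no separate pass is needed for that.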
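Thanks! A small clarification: what if I also want the results in alphabetical order when two words have the same count? For example, for '1.csv', the result should be ['test','that'] rather than ['that','test']. Thanks. – stfd1123581321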