How do I reduce by key with 3 values?

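I am trying to iterate over an RDD built from a text file, count each unique word in the file, and then accumulate all the words that follow each unique word, along with their counts. So far, this is what I have: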

import org.apache.spark.{SparkConf, SparkContext}

// Configure and connect to a local Spark driver
val conf = new SparkConf().setAppName("WordStats").setMaster("local")
val sparkContext = new SparkContext(conf) // Creates a new SparkContext object

// Loads the specified file into an RDD of lines
val lines = sparkContext.textFile(System.getProperty("user.dir") + "/" + "basketball_words_only.txt")

// Emits a (word, nextWord, 1) triple for every pair of adjacent words on a line
val words = lines.flatMap(line => {
    val wordList = line.split(" ")
    for {i <- 0 until wordList.length - 1}
    yield (wordList(i), wordList(i + 1), 1)
})

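In case I have not been clear so far: what I am trying to do is accumulate, for each word in the file, the set of words that follow it, together with the number of times each of those words follows the preceding word, in the form:

(PrecedingWord, (FollowingWord, numberOfTimesWordFollows))

whose data type is (String, (String, Int)).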

Source

2017-04-23 JGT

You probably want something along these lines:

(for { 
    line <- lines 
    Array(word1, word2) <- line.split("\\s+").sliding(2) 
} yield ((word1, word2), 1)) 
.reduceByKey(_ + _) 
.map({ case ((word1, word2), count) => (word1, (word2, count)) })
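
The idea in the snippet above is to use the (word1, word2) pair as a composite key, sum the counts per pair, and only then re-key by the first word. As a rough way to check that logic without a cluster, here is a minimal sketch that uses plain Scala collections as a stand-in for the RDD. The names WordPairSketch and followerCounts are made up for illustration, and groupBy plus a size count is assumed to play the role of reduceByKey(_ + _).

```scala
// Plain-Scala stand-in for the RDD pipeline; no Spark required.
// WordPairSketch and followerCounts are hypothetical names used only for illustration.
object WordPairSketch {
  def followerCounts(lines: Seq[String]): Seq[(String, (String, Int))] = {
    // Build (word, nextWord) pairs within each line; never across lines.
    val pairs = for {
      line <- lines
      window <- line.trim.split("\\s+").sliding(2)
      if window.length == 2 // a one-word or empty line yields a short window; skip it
    } yield (window(0), window(1))

    pairs
      .groupBy(identity) // composite key: (word1, word2)
      .toSeq             // leave the Map so re-keying by word1 cannot drop entries
      .map { case ((word1, word2), occurrences) => (word1, (word2, occurrences.size)) }
  }
}
```

Calling WordPairSketch.followerCounts(Seq("the ball is in the net", "the ball bounced")) should produce one (String, (String, Int)) entry per distinct adjacent pair, which is the shape asked for in the question.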

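By the way, you may want to make sure that each "line" in the lines RDD corresponds to one sentence, so that you do not count word pairs across sentence boundaries. Also, if you have not already, you may want to look at a natural language processing library such as OpenNLP or CoreNLP for sentence boundary detection, tokenization, and so on.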
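
For a rough idea of what sentence splitting could look like before reaching for OpenNLP or CoreNLP, here is a small sketch based on the JDK's java.text.BreakIterator. This is a swapped-in, much simpler technique than either library, and SentenceSplitSketch and sentences are hypothetical names chosen for this example.

```scala
import java.text.BreakIterator
import java.util.Locale
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: splits text into sentences using the JDK's rule-based
// BreakIterator, a lightweight substitute for a full NLP library.
object SentenceSplitSketch {
  def sentences(text: String): Seq[String] = {
    val boundaries = BreakIterator.getSentenceInstance(Locale.ENGLISH)
    boundaries.setText(text)
    val result = ArrayBuffer.empty[String]
    var start = boundaries.first()
    var end = boundaries.next()
    while (end != BreakIterator.DONE) {
      val sentence = text.substring(start, end).trim
      if (sentence.nonEmpty) result += sentence
      start = end
      end = boundaries.next()
    }
    result.toList
  }
}
```

Something like lines.flatMap(line => SentenceSplitSketch.sentences(line)) could then feed the pair-counting step, though this only helps when sentences do not span line breaks in the input file.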

Source

2017-04-23 09:41:11
