I want to compare every word in my data file against an external word list file using Spark. Please see the example below.
My data file is:
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring 2008 issue moody audio mortgage backed.
omg left gotta wrap review order asap . understand hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
buffered lightning thousand volts cables burned revivification place .
cables cables finally able hear auditory gem long rumored music .
...
And the external word list file is:
thump,1
man,-1
small,-1
surprise,-1
system,1
wrap,1
left,1
lives,-1
place,-1
lightning,-1
long,1
...
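For each document, every word that also appears in the external word list contributes its value, and these values are summed, so that in the end there is one score per document. The expected output is: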
-2 ; surprise heard thump opened door small seedy man clasping package wrapped.
1 ; upgrading system found review spring 2008 issue moody audio mortgage backed.
2 ; omg left gotta wrap review order asap . understand hand delivered dali lama
0 ; speak hands wear earplugs lives . listen maintain link long .
-2 ; buffered lightning thousand volts cables burned revivification place .
1 ; cables cables finally able hear auditory gem long rumored music .
...
This is what I have tried:
import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)

    // Load the external word list as (word, value) pairs
    val searchList = sc.textFile("data/words.txt")
    val sentilex = searchList.map { line =>
      val Array(a, b) = line.split(",").map(_.trim)
      (a, b.toInt)
    }.collect().toVector
    val lex = sentilex.map(a => a._1)
    val lab = sentilex.map(b => b._2)

    // Split each document into words
    val sample1 = sc.textFile("data/data.txt")
    val sample2 = sample1.map(line => line.split(" "))
    // Here elem is a whole Array[String], not a single word, so contains never matches
    val sample3 = sample2.map(elem => if (lex.contains(elem)) ("1") else elem)
    sample3.foreach(println)
  }
}
Can anyone help me?
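One possible way to get a score per document is sketched below. It is only a sketch under a few assumptions: the word list is small enough to collect on the driver and broadcast, words are compared by exact match after splitting on whitespace (so "wrapped." does not match "wrap"), and the names `ScoreDocs`, `scoreLine` and `lexiconBc` are made up for illustration. The file paths are the ones used in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ScoreDocs {
  // Sum the lexicon values of all tokens in a line; unknown tokens count as 0.
  def scoreLine(lexicon: Map[String, Int], line: String): Int =
    line.split("\\s+").map(token => lexicon.getOrElse(token, 0)).sum

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)

    // Read "word,value" lines into a local Map (assumes the list fits in driver memory).
    val lexicon: Map[String, Int] = sc.textFile("data/words.txt")
      .map(_.split(",").map(_.trim))
      .filter(_.length == 2) // skip lines that are not exactly "word,value"
      .map(parts => (parts(0), parts(1).toInt))
      .collect()
      .toMap

    // Ship the lookup table to the executors once instead of with every task.
    val lexiconBc = sc.broadcast(lexicon)

    // Pair every document with its score.
    val scored = sc.textFile("data/data.txt")
      .map(line => (scoreLine(lexiconBc.value, line), line))

    // Print in the "score ; document" layout from the question.
    scored.collect().foreach { case (score, line) => println(s"$score ; $line") }

    sc.stop()
  }
}
```

Keeping the scoring in a plain function such as `scoreLine` makes it easy to check on a single sentence without starting Spark.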
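Thanks @jlopezmat for your reply. I have one question: how can I remove the parentheses around each document? For example, one line of output from your code is: (-2,surprise heard thump opened door small seedy man clasping package wrapped.) – Rozita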
@Rozita That is because you have a tuple, and its toString method adds the parentheses. If you want to erase them, your code should be: 'sample2.collect.foreach(tuple => println(tuple._1 + " " + tuple._2))' I hope this helps. – jlopezmat
Dear @jlopezmat, if I want to remove those matching words from each document after scoring, how can I do that? – Rozita
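A minimal sketch of one way to do this, assuming "the matching words" means the tokens that were found in the external word list. The names `StripScored` and `scoreAndStrip` are hypothetical, and the small in-memory `lexicon` only stands in for the real word list. The function is plain Scala, so it could be called inside an RDD `map` in the same way as any other per-line function.

```scala
object StripScored {
  // Returns the score plus the line with every lexicon word removed.
  def scoreAndStrip(lexicon: Map[String, Int], line: String): (Int, String) = {
    val tokens = line.split("\\s+").filter(_.nonEmpty)
    // Separate tokens found in the lexicon from the remaining ones.
    val (matched, rest) = tokens.partition(lexicon.contains)
    (matched.map(lexicon).sum, rest.mkString(" "))
  }

  def main(args: Array[String]): Unit = {
    // Illustrative subset of the word list from the question.
    val lexicon = Map("thump" -> 1, "man" -> -1, "small" -> -1, "surprise" -> -1)
    val (score, rest) = scoreAndStrip(
      lexicon,
      "surprise heard thump opened door small seedy man clasping package wrapped.")
    println(s"$score ; $rest")
    // prints: -2 ; heard opened door seedy clasping package wrapped.
  }
}
```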