2015-05-12
1

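Comparing the contents of two files with Spark

I want to compare every word in my documents against an external word-list file. Please look at the following example.

My data file is: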

surprise heard thump opened door small seedy man clasping package wrapped. 

upgrading system found review spring 2008 issue moody audio mortgage backed. 

omg left gotta wrap review order asap . understand hand delivered dali lama 

speak hands wear earplugs lives . listen maintain link long . 

buffered lightning thousand volts cables burned revivification place . 

cables cables finally able hear auditory gem long rumored music . 
... 

And the external word file is:

thump,1 
man,-1 
small,-1 
surprise,-1 
system,1 
wrap,1 
left,1 
lives,-1 
place,-1 
lightning,-1 
long,1 
... 

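When comparing, every word in a document that also appears in the external word list contributes its value, and the values are summed, so that we end up with one score per document. The expected output is: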

-2 ; surprise heard thump opened door small seedy man clasping package wrapped. 

1 ; upgrading system found review spring 2008 issue moody audio mortgage backed. 

2 ; omg left gotta wrap review order asap . understand hand delivered dali lama 

0 ; speak hands wear earplugs lives . listen maintain link long . 

-2 ; buffered lightning thousand volts cables burned revivification place . 

1 ; cables cables finally able hear auditory gem long rumored music . 
... 

I have tried:

object test { 

def main(args: Array[String]): Unit = { 
val conf = new SparkConf().setAppName("prep").setMaster("local") 
val sc = new SparkContext(conf) 
val searchList = sc.textFile("data/words.txt") 

val sentilex = searchList.map({ (line) => 
    val Array(a,b) = line.split(",").map(_.trim) 
    (a,b.toInt) 
}).collect().toVector 

val lex=sentilex.map(a=>a._1) 
val lab=sentilex.map(b=>b._2) 
val sample1 = sc.textFile("data/data.txt") 
val sample2 = sample1.map(line=>line.split(" ")) 
val sample3 = sample2.map(elem => if (lex.contains(elem)) ("1") else elem) 
sample3.foreach(println) 
} 
} 

Can anyone help me?

Answers

4

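Hi, I think the best way to do what you want is to use a broadcast variable to send sentilex to the executors, and then use a map function to compute the sum for each line. The code would look like this: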

import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)
    val searchList = sc.textFile("data/words.txt")

    // Parse the "word,score" lines, collect them on the driver and broadcast the resulting map.
    val sentilex = sc.broadcast(searchList.map { line =>
      val Array(a, b) = line.split(",").map(_.trim)
      (a, b.toInt)
    }.collect().toMap)

    val sample1 = sc.textFile("data/data.txt")
    // Pair each line with the sum of its word scores; words missing from the lexicon count as 0.
    val sample2 = sample1.map(line => (line.split(" ").map(word => sentilex.value.getOrElse(word, 0)).sum, line))
    sample2.collect.foreach(println)
  }
}

I hope this helps.
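The per-line scoring in the answer above can be tried without a Spark context. The following is a minimal sketch in plain Scala, assuming an in-memory copy of the lexicon from the question; the names `ScoreSketch`, `lexicon` and `scoreLine` are illustrative and not part of the original answer.

```scala
object ScoreSketch {
  // Lexicon entries copied from the question (word -> score).
  val lexicon: Map[String, Int] = Map(
    "thump" -> 1, "man" -> -1, "small" -> -1, "surprise" -> -1,
    "system" -> 1, "wrap" -> 1, "left" -> 1, "lives" -> -1,
    "place" -> -1, "lightning" -> -1, "long" -> 1
  )

  // Sum the scores of all whitespace-separated tokens; unknown tokens count as 0.
  // Tokens are matched exactly, so "wrapped." does not match "wrap".
  def scoreLine(line: String): Int =
    line.split("\\s+").map(word => lexicon.getOrElse(word, 0)).sum

  def main(args: Array[String]): Unit = {
    val line = "surprise heard thump opened door small seedy man clasping package wrapped."
    // surprise(-1) + thump(1) + small(-1) + man(-1) = -2
    println(scoreLine(line))
  }
}
```

In the Spark version the same function body runs inside `sample1.map(...)`, with the map read from `sentilex.value` instead of a local `val`.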

+0

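Thank you @jlopezmat for your reply. I have one question: how can I remove the parentheses around each document? For example, one line of output from the code above is: (-2,surprise heard thump opened door small seedy man clasping package wrapped.) – Rozita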

+0

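@Rozita That is because you have tuples, and a tuple's toString method adds the parentheses. If you want to erase them, your code should be: `sample2.collect.foreach(tuple => println(tuple._1 + " " + tuple._2))` I hope this helps. – jlopezmat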
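To match the `score ; document` layout shown in the question's expected output, the tuple can be rendered with a string interpolator. This is only a sketch: `FormatSketch` and `format` are hypothetical names, and the " ; " separator is an assumption taken from the expected output rather than from the comment above.

```scala
object FormatSketch {
  // Render a (score, document) pair as "score ; document", without tuple parentheses.
  def format(scored: (Int, String)): String = {
    val (score, doc) = scored
    s"$score ; $doc"
  }

  def main(args: Array[String]): Unit = {
    // Two scored lines as they would come back from sample2.collect.
    val scored = Seq(
      (-2, "surprise heard thump opened door small seedy man clasping package wrapped."),
      (1, "upgrading system found review spring 2008 issue moody audio mortgage backed.")
    )
    scored.map(format).foreach(println)
  }
}
```

With the Spark job from the answer, this would be used as `sample2.collect.map(FormatSketch.format).foreach(println)`.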

+0

Dear @jlopezmat, if I want to remove the matched words after scoring, how can I do that? – Rozita
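This last question has no reply in the thread. One possible approach, offered only as a sketch, is to compute the score and filter out the lexicon words in the same pass. The names `StripSketch` and `scoreAndStrip` are hypothetical, the lexicon is a small in-memory subset of the one in the question, and splitting on a single space is an assumption carried over from the answer's code.

```scala
object StripSketch {
  // Small subset of the lexicon from the question (word -> score).
  val lexicon: Map[String, Int] = Map(
    "thump" -> 1, "man" -> -1, "small" -> -1, "surprise" -> -1
  )

  // Returns the score together with the line after removing every lexicon word.
  def scoreAndStrip(line: String): (Int, String) = {
    val words = line.split(" ")
    val score = words.map(w => lexicon.getOrElse(w, 0)).sum
    val remaining = words.filterNot(w => lexicon.contains(w)).mkString(" ")
    (score, remaining)
  }

  def main(args: Array[String]): Unit = {
    val (score, rest) =
      scoreAndStrip("surprise heard thump opened door small seedy man clasping package wrapped.")
    println(s"$score ; $rest")
  }
}
```

In the Spark job, the same body could replace the function passed to `sample1.map`, reading the map from `sentilex.value`.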