2015-06-05 49 views

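Compare documents and remove duplicates with Spark and Scala

Suppose I have these documents, one per line, and I want to remove the duplicates: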

buy sansa view sell product player charger world charge player charger receive 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold 
thourghly sansa view delete song time wont wont connect-computer computer put time 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold 

This is the expected output:

buy sansa view sell product player charger world charge player charger receive 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold 
thourghly sansa view delete song time wont wont connect-computer computer put time 

Is there a way to do this with Scala and Spark?

Answers


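You appear to be reading a line-based file, so textFile will read it correctly into an RDD of strings, one element per line. After that, distinct reduces the RDD to its unique elements.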

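// Read one element per line, drop duplicate lines, write the result.
// saveAsTextFile creates a directory named "distinct.txt" that holds part files.
sc.textFile("yourfile.txt") 
    .distinct() 
    .saveAsTextFile("distinct.txt") 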
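
The expected output in the question keeps the first occurrence of each line in its original position, whereas distinct shuffles the data and does not promise any particular order. If the order matters, one possible approach is sketched below. This is only a sketch under assumptions: the names `DistinctKeepOrder` and `distinctKeepOrder` are hypothetical, and the `local[*]` master is assumed purely for trying it on a single machine.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DistinctKeepOrder {
  // Keep the first occurrence of each line, in input order.
  def distinctKeepOrder(lines: RDD[String]): RDD[String] =
    lines.zipWithIndex()                      // (line, position in the input)
      .reduceByKey((a, b) => math.min(a, b))  // earliest position of each line
      .sortBy(_._2)                           // restore the input order
      .keys                                   // drop the positions

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for a single-machine trial run
    val conf = new SparkConf().setAppName("DistinctKeepOrder").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Sample lines taken from the question
      val lines = sc.parallelize(Seq(
        "buy sansa view sell product player charger world charge player charger receive",
        "oldest daughter teen daughter player christmas so daughter life line listen sooo hold",
        "thourghly sansa view delete song time wont wont connect-computer computer put time",
        "oldest daughter teen daughter player christmas so daughter life line listen sooo hold",
        "oldest daughter teen daughter player christmas so daughter life line listen sooo hold"
      ), 2)
      distinctKeepOrder(lines).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The extra sort costs a second shuffle, so plain distinct is the cheaper choice when the order of the output lines is irrelevant.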

You can achieve this with the reduceByKey function.

You can use this code:

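// sc is the SparkContext
val textFile = sc.textFile("hdfs://...") 
// Key each line by its own text; reduceByKey collapses equal lines into one entry
val uLine = textFile.map(line => (line, 1)) 
       .reduceByKey(_ + _) 
       .map { case (line, _) => line }  // keep the line, drop the count
uLine.saveAsTextFile("hdfs://...") 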

Alternatively, you can use:

// sc is the SparkContext
val uLine = sc.textFile("hdfs://...").distinct() 
uLine.saveAsTextFile("hdfs://...")
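
One reason to pick the reduceByKey form over distinct is that the counts are available before they are thrown away, so you can also see which lines were duplicated and how often. A minimal sketch, assuming a local master and a small in-memory sample; the names `LineCounts` and `lineCounts` are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineCounts {
  // Count how many times each line occurs.
  def lineCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines.map(line => (line, 1)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for a single-machine trial run
    val conf = new SparkConf().setAppName("LineCounts").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("a b", "c d", "a b", "a b"))
      val counts = lineCounts(lines)

      // Report the lines that appeared more than once
      counts.filter(_._2 > 1).collect().foreach {
        case (line, n) => println(s"$n occurrences: $line")
      }

      // The keys are the unique lines, the same set that lines.distinct() yields
      counts.keys.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```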