Spark Scala cannot parse Wikipedia data: enwiki_latest_articles xml bz2

I want to do topic modeling on Wikipedia data with Spark's LDA algorithm. The input is essentially one large bz2 file containing a large number of XML documents. I am using the basic Spark Scala code from the Spark website:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

val sc: SparkContext = new SparkContext(conf)
val ssqlc: SQLContext = new org.apache.spark.sql.SQLContext(sc)
val shsqlc: HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Load and parse the data
val data = sc.textFile("/user/enwiki-latest-pages-articles.xml.bz2")

//val datanew = data.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }

// Each line is assumed to be a row of space-separated numbers (this is the step that fails)
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
// Cluster the documents into 25 topics using LDA
val ldaModel = new LDA().setK(25).run(corpus)

// Output topics. Each is a distribution over words (matching word count vectors)
println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize + " words):")
val topics = ldaModel.topicsMatrix
for (topic <- Range(0, 25)) {
  print("Topic " + topic + ":")
  for (word <- Range(0, ldaModel.vocabSize)) { print(" " + topics(word, topic)) }
  println()
}
// val newtopics = ldaModel.describeTopics(5).foreach(println)

It does not process the data and throws errors such as:

ERROR executor.Executor: Exception in task 5.0 in stage 0.0 (TID 2) java.lang.NumberFormatException: empty String
16/07/28 09:24:35 ERROR executor.Executor: Exception in task 10.0 in stage 0.0 (TID 5) java.lang.NumberFormatException: For input string: "|"
16/07/28 09:24:35 ERROR executor.Executor: Exception in task 7.0 in stage 0.0 (TID 3) java.lang.NumberFormatException: For input string: "|}"

Could someone please help me with this? A short piece of code that improves on the above would help. Thank you in advance.

Source

2016-07-28 user2122466

Your problem is that your data contains strings that are not numbers. That is why this fails:

s.trim.split(' ').map(_.toDouble)
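To see the failure outside Spark, here is a minimal plain-Scala sketch. The object name `ToDoubleFailure`, the helper `parseLine`, and the sample strings are made up for illustration; only the parsing expression comes from the question. It wraps that expression in `Try` so the exception is captured instead of thrown. A blank line still yields one empty token, which would explain the "empty String" error, and wiki table markup such as `|` is simply not a number.

```scala
import scala.util.Try

object ToDoubleFailure {
  // Same per-line parsing as in the question, wrapped in Try
  // so a NumberFormatException becomes a Failure value.
  def parseLine(s: String): Try[Array[Double]] =
    Try(s.trim.split(' ').map(_.toDouble))

  def main(args: Array[String]): Unit = {
    // Hypothetical sample lines: one numeric row, one blank line,
    // and two fragments of the kind found in a wiki XML dump.
    val samples = Seq("1 2 3", "", "|", "<page>")
    samples.foreach { s =>
      val result = parseLine(s).map(_.mkString(",")).toOption
      println(s"'$s' -> $result")
    }
  }
}
```

Only the first sample parses; the others fail in the same way as the tasks in the log.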

You need to clean your data, or extract only the numeric fields you are interested in.
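For text input like a Wikipedia dump, "cleaning" usually means turning words into counts rather than parsing numbers. The following is a rough plain-Scala sketch of that idea, not a complete solution: `WordCounts`, `tokenize`, `buildVocab`, and `countVector` are hypothetical names, the tokenizer is deliberately crude (lowercase ASCII words longer than two letters), and it does not parse the XML structure at all. It only shows how lines of text could become the word-count arrays that `Vectors.dense` in the question expects.

```scala
object WordCounts {
  // Lowercase the text and keep alphabetic tokens longer than two
  // letters; markup fragments such as "|" or "|}" are dropped.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.length > 2).toSeq

  // Give every distinct token a fixed column index.
  def buildVocab(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.sorted.zipWithIndex.toMap

  // Count how often each vocabulary word occurs in one document.
  def countVector(tokens: Seq[String], vocab: Map[String, Int]): Array[Double] = {
    val counts = Array.fill(vocab.size)(0.0)
    tokens.foreach(t => vocab.get(t).foreach(i => counts(i) += 1.0))
    counts
  }

  def main(args: Array[String]): Unit = {
    // Two made-up "documents" containing wiki-style markup.
    val docs = Seq("Spark parses text | Spark", "{| text |} table").map(tokenize)
    val vocab = buildVocab(docs)
    docs.foreach(d => println(countVector(d, vocab).mkString(" ")))
  }
}
```

On the full dump, an in-memory vocabulary and dense arrays like these would likely be too large, so sparse vectors and a distributed vocabulary count would be the more realistic choice. Each article also spans many lines of XML, so documents would need to be grouped per page before counting.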

Source

2016-07-28 17:28:42

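They are a set of XML files, and I am very new to Scala. Do you have any input on how to turn the set of XML files inside a bz2 file into word-count vectors representing the document corpus? – user2122466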

@user2122466 Show some effort. Being new is not an excuse for not trying. – Dikei
