2014-11-15

Maybe this question is a bit strange... but I'll try to ask it anyway: word normalization using RDDs.

Everyone who has written applications with the Lucene API has seen something like this:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public static String removeStopWordsAndGetNorm(String text, String[] stopWords, Normalizer normalizer) throws IOException
{
    TokenStream tokenStream = new ClassicTokenizer(Version.LUCENE_44, new StringReader(text));
    tokenStream = new StopFilter(Version.LUCENE_44, tokenStream, StopFilter.makeStopSet(Version.LUCENE_44, stopWords, true));
    tokenStream = new LowerCaseFilter(Version.LUCENE_44, tokenStream);
    tokenStream = new StandardFilter(Version.LUCENE_44, tokenStream);
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    tokenStream.reset();
    StringBuilder result = new StringBuilder();
    while (tokenStream.incrementToken())
    {
        try
        {
            // normalizer.getNormalForm(...) - stemmer or lemmatizer
            result.append(normalizer.getNormalForm(token.toString())).append(" ");
        }
        catch (Exception e)
        {
            // if normalization fails, skip the word
        }
    }
    tokenStream.end();
    tokenStream.close();
    return result.toString();
}

Is it possible to rewrite this word normalization using RDDs? Maybe someone has an example of such a transformation, or can point me to online resources about it?

Thanks.

Answer


I recently used a similar example for a talk. It shows how to remove the stop words. It has no normalization phase, but if that normalizer.getNormalForm comes from a library that can be reused, it should be easy to integrate.

This code could be a starting point:

// source text
val rdd = sc.textFile(...)
// stop words src
val stopWordsRdd = sc.textFile(...)
// bring stop words to the driver to broadcast => more efficient than rdd.subtract(stopWordsRdd)
val stopWords = stopWordsRdd.collect.toSet
val stopWordsBroadcast = sc.broadcast(stopWords)
// split on non-word characters and drop the empty strings the split can produce
val words = rdd.flatMap(line => line.split("\\W+").map(_.toLowerCase).filter(_.nonEmpty))
val cleaned = words.mapPartitions { iterator =>
    val stopWordsSet = stopWordsBroadcast.value
    iterator.filter(elem => !stopWordsSet.contains(elem))
}
// plug the normalizer function here
val normalized = cleaned.map(normalForm(_))

Note: this is written from the point of view of making it work in Spark; I am not familiar with Lucene.
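As a rough sketch of the integration mentioned above, the last line could be replaced with a mapPartitions call so that one normalizer instance is built per partition instead of per word. The createNormalizer() factory below is hypothetical and only stands in for however the normalizer library is instantiated:

import scala.util.Try

// hedged sketch: one normalizer per partition, words that fail to normalize are dropped
// (mirrors the try/catch in the Java version from the question)
val normalizedWords = cleaned.mapPartitions { iterator =>
    val normalizer = createNormalizer() // hypothetical factory, not part of the original answer
    iterator.flatMap(word => Try(normalizer.getNormalForm(word)).toOption)
}

Constructing the normalizer inside mapPartitions also sidesteps the question of whether the normalizer class is serializable, since nothing has to be shipped from the driver.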


Thanx Man! I will try to use it and report the results! – dimson


Man! I need advice... Which approach do you think is more efficient: distributing the documents across the nodes and then tokenizing and normalizing the words of each document, or fetching each document sequentially, tokenizing it, and distributing the words across the nodes, so that each node has a copy of the normalization function? Thanks! – dimson
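For reference, a minimal sketch of the first alternative described in that comment (distributing whole documents and letting each partition tokenize and normalize locally), reusing removeStopWordsAndGetNorm from the question and the broadcast stop words from the answer; the input path and the createNormalizer() factory are placeholders:

// hedged sketch: one whole document per element, normalized on the executors
val docs = sc.wholeTextFiles("hdfs:///path/to/docs").values // placeholder path
val normalizedDocs = docs.mapPartitions { iterator =>
    val stopWords = stopWordsBroadcast.value.toArray
    val normalizer = createNormalizer() // hypothetical factory
    iterator.map(doc => removeStopWordsAndGetNorm(doc, stopWords, normalizer))
}

Which of the two layouts is faster depends on the document sizes and on how expensive the normalizer is to build, so it would need to be measured; the sketch only shows the shape of the document-level option.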