2014-11-15

Maybe this question is a bit strange... but I'll try to ask it anyway: word normalization using RDDs.

Everyone who has written applications with the Lucene API has seen something like this:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public static String removeStopWordsAndGetNorm(String text, String[] stopWords, Normalizer normalizer) throws IOException
{
    TokenStream tokenStream = new ClassicTokenizer(Version.LUCENE_44, new StringReader(text));
    tokenStream = new StopFilter(Version.LUCENE_44, tokenStream, StopFilter.makeStopSet(Version.LUCENE_44, stopWords, true));
    tokenStream = new LowerCaseFilter(Version.LUCENE_44, tokenStream);
    tokenStream = new StandardFilter(Version.LUCENE_44, tokenStream);
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    tokenStream.reset();
    StringBuilder result = new StringBuilder();
    while (tokenStream.incrementToken())
    {
        try
        {
            // normalizer.getNormalForm(...) - stemmer or lemmatizer
            result.append(normalizer.getNormalForm(token.toString())).append(" ");
        }
        catch (Exception e)
        {
            // if normalization fails, skip the word
        }
    }
    tokenStream.end();
    tokenStream.close();
    return result.toString();
}

Is it possible to rewrite this word normalization using RDDs? Maybe someone has an example of such a transformation, or can point me to online resources about it?

Thanks.

Answer


I recently used a similar example for a talk. It shows how to remove the stop words. It has no normalization phase, but if that normalizer.getNormalForm comes from a library that can be reused, it should be easy to integrate.

This code could be a starting point:

// source text
val rdd = sc.textFile(...)
// stop words src
val stopWordsRdd = sc.textFile(...)
// bring stop words to the driver to broadcast => more efficient than rdd.subtract(stopWordsRdd)
val stopWords = stopWordsRdd.collect.toSet
val stopWordsBroadcast = sc.broadcast(stopWords)
// split on non-word characters and drop the empty strings the split can produce
val words = rdd.flatMap(line => line.split("\\W+").map(_.toLowerCase).filter(_.nonEmpty))
val cleaned = words.mapPartitions { iterator =>
    val stopWordsSet = stopWordsBroadcast.value
    iterator.filter(elem => !stopWordsSet.contains(elem))
}
// plug the normalizer function here
val normalized = cleaned.map(normalForm(_))

Note: this is written from the point of view of making it work in Spark; I am not familiar with Lucene.
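As a rough sketch of the integration mentioned above, the last line could be replaced with a mapPartitions call so that one normalizer instance is built per partition instead of per word. The createNormalizer() factory below is hypothetical and only stands in for however the normalizer library is instantiated:

import scala.util.Try

// hedged sketch: one normalizer per partition, words that fail to normalize are dropped
// (mirrors the try/catch in the Java version from the question)
val normalizedWords = cleaned.mapPartitions { iterator =>
    val normalizer = createNormalizer() // hypothetical factory, not part of the original answer
    iterator.flatMap(word => Try(normalizer.getNormalForm(word)).toOption)
}

Constructing the normalizer inside mapPartitions also sidesteps the question of whether the normalizer class is serializable, since nothing has to be shipped from the driver.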


Thanx Man! I will try to use it and report the results! – dimson


Man! I need advice... Which approach do you think is more efficient: distributing the documents across the nodes and then tokenizing and normalizing the words of each document, or fetching each document sequentially, tokenizing it, and distributing the words across the nodes, so that each node has a copy of the normalization function? Thanks! – dimson
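For reference, a minimal sketch of the first alternative described in that comment (distributing whole documents and letting each partition tokenize and normalize locally), reusing removeStopWordsAndGetNorm from the question and the broadcast stop words from the answer; the input path and the createNormalizer() factory are placeholders:

// hedged sketch: one whole document per element, normalized on the executors
val docs = sc.wholeTextFiles("hdfs:///path/to/docs").values // placeholder path
val normalizedDocs = docs.mapPartitions { iterator =>
    val stopWords = stopWordsBroadcast.value.toArray
    val normalizer = createNormalizer() // hypothetical factory
    iterator.map(doc => removeStopWordsAndGetNorm(doc, stopWords, normalizer))
}

Which of the two layouts is faster depends on the document sizes and on how expensive the normalizer is to build, so it would need to be measured; the sketch only shows the shape of the document-level option.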