Text mining with Scala

I have a .txt file with the following data:

L666371 +++$+++ u9030 +++$+++ m616 +++$+++ DURNFORD +++$+++ Lord Chelmsford seems to want me to stay back with my Basutos. 
L666370 +++$+++ u9034 +++$+++ m616 +++$+++ VEREKER +++$+++ I'm to take the Sikali with the main column to the river 
L666369 +++$+++ u9030 +++$+++ m616 +++$+++ DURNFORD +++$+++ Your orders, Mr Vereker? 
L666257 +++$+++ u9030 +++$+++ m616 +++$+++ DURNFORD +++$+++ Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot 
L666256 +++$+++ u9034 +++$+++ m616 +++$+++ VEREKER +++$+++ Colonel Durnford... William Vereker. I hear you 've been seeking Officers?

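I want to import the text file into Scala (which I have done) and then extract all of the text so I can work with it. After that I want to tokenize it, lowercase it, ignore word forms, and separate punctuation from words. Finally, I want to count the words as unigram, bigram and trigram counts, with the results sorted by highest count.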

Can someone tell me how to do this? I have made the following attempt, but it does not seem to work:

import io.Source
val s = Source.fromFile("movie_lines.txt")("ISO-8859-1")
val lines = s.getLines
val str = s.mkString

val Pattern = "([A-Z]+.!)".r

Pattern.findAllIn(str).foreach { x => println(x) }

println("\n This is the result\n")

Source

2015-02-12 Neeraj Kumar

Can anyone answer this? – 2015-02-22 05:12:59

You can use the Epic library from the ScalaNLP suite to preprocess the text (tokenization), and then to parse it, tag it and extract entities.
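If pulling in a library is more than you need, the counting part can also be sketched with the Scala standard library alone. The following is a minimal sketch, not Epic's API: the object `NGramCounts` and its methods `utterance`, `tokenize` and `ngramCounts` are hypothetical names made up for this illustration. It assumes that every line has five fields separated by ` +++$+++ ` with the spoken text in the last one, as in the sample data, and that n-grams should not cross line boundaries. The tokenizer is a simple regex that keeps apostrophes inside words, and it does not normalise word forms (no stemming or lemmatization).

```scala
import scala.io.Source

object NGramCounts {
  // Field separator seen in the sample data; quoted because split() takes a regex.
  private val Sep = java.util.regex.Pattern.quote(" +++$+++ ")

  // A token is either a run of letters, digits and apostrophes,
  // or one single punctuation/symbol character.
  private val Token = """[\p{L}\p{N}']+|[^\s\p{L}\p{N}']""".r

  /** Returns the spoken text (fifth field) of a line, or None if the line is malformed. */
  def utterance(line: String): Option[String] = {
    val fields = line.split(Sep, -1)
    if (fields.length >= 5) Some(fields(4).trim) else None
  }

  /** Lowercases the text and splits it into word and punctuation tokens. */
  def tokenize(text: String): Vector[String] =
    Token.findAllIn(text.toLowerCase).toVector

  /** Counts n-grams per utterance and sorts them by descending count (ties alphabetically). */
  def ngramCounts(utterances: Seq[Vector[String]], n: Int): Seq[(Seq[String], Int)] = {
    require(n >= 1, "n must be at least 1")
    utterances
      // sliding(n) yields one short window when an utterance has fewer than n tokens; drop it.
      .flatMap(tokens => tokens.sliding(n).filter(_.length == n))
      .groupBy(identity)
      .map { case (gram, occurrences) => gram -> occurrences.size }
      .toSeq
      .sortBy { case (gram, count) => (-count, gram.mkString(" ")) }
  }

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile("movie_lines.txt", "ISO-8859-1")
    // Force the lazy iterator into a Vector before the file is closed.
    val tokenized =
      try source.getLines().flatMap(line => utterance(line)).map(tokenize).toVector
      finally source.close()

    for (n <- 1 to 3) {
      println(s"Top $n-grams:")
      ngramCounts(tokenized, n).take(10).foreach { case (gram, count) =>
        val text = gram.mkString(" ")
        println(s"$count\t$text")
      }
    }
  }
}
```

Note that in your attempt `s.getLines` and `s.mkString` both read from the same `Source`, which can only be traversed once, so it is safer to read the lines a single time as above. Whether punctuation tokens should be included in the counts is a design choice; filtering them out before calling `ngramCounts` would be a small change.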

Source

2015-03-02 07:56:17 jepemo
