2012-08-08 87 views
1

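Scala/Java - Library to parse some text and remove punctuation?

I'm using a BreakIterator implementation in Java to remove punctuation from a String. I need to rewrite this in Scala, so I figured this might be a good opportunity to replace it with a better library (my implementation is very naive and I'm sure it fails on edge cases).

Does any such library exist that I could use?

Edit: Here's my quick solution in Scala: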

private val getWordsFromLine = (line: String) => { 
    line.split(" ") 
     .map(_.toLowerCase()) 
     .map(word => word.filter(Character.isLetter(_))) 
     .filter(_.length() > 1) 
     .toList 
    } 

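And given this List[String] (one entry per line ... yes ... it's the Bible, which makes a good test case):

THE SECOND BOOK OF MOSES, CALLED EXODUS

CHAPTER 1 1 Now these [are] the names of the children of Israel, which came into Egypt; every man and his household came with Jacob. 2 Reuben, Simeon, Levi, and Judah, 3 Issachar, Zebulun, and Benjamin, 4 Dan, and Naphtali, Gad, and Asher.

You get a List[String] like so: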

List(the, second, book, of, moses, called, exodus, chapter, now, these, are, the, names, of, the, children, of, israel, which, came, into, egypt, every, man, and, his, household, came, with, jacob, reuben, simeon, levi, and, judah, issachar, zebulun, and, benjamin, dan, and, naphtali, gad, and, asher) 
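For comparison, here is a minimal sketch of how the BreakIterator approach mentioned above could be called directly from Scala. This is not the original Java implementation: the helper name wordsViaBreakIterator and the Locale.ENGLISH default are assumptions made for illustration. Unlike the split-based version, it drops any token containing a non-letter instead of stripping the non-letters out of it.

```scala
import java.text.BreakIterator
import java.util.Locale
import scala.collection.mutable.ListBuffer

// Hypothetical helper (not from the original post): walk the word
// boundaries reported by java.text.BreakIterator and keep letter-only tokens.
def wordsViaBreakIterator(line: String, locale: Locale = Locale.ENGLISH): List[String] = {
  val it = BreakIterator.getWordInstance(locale)
  it.setText(line)
  val words = ListBuffer.empty[String]
  var start = it.first()
  var end = it.next()
  while (end != BreakIterator.DONE) {
    val token = line.substring(start, end)
    // Boundaries also delimit whitespace, digits and punctuation, so keep
    // only tokens made entirely of letters and longer than one character.
    if (token.length > 1 && token.forall(ch => Character.isLetter(ch)))
      words += token.toLowerCase
    start = end
    end = it.next()
  }
  words.toList
}
```

Calling wordsViaBreakIterator on each line and concatenating the results should give a word list of the same shape as the one above.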
+5

Why not use the Java implementation from Scala? The two are interoperable. You can still add some Scala goodness on top of the Java API to make it easier to use. – 2012-08-08 09:03:18

+0

I could. I just don't want to rewrite it if I don't have to. – 2012-08-08 09:13:24

+4

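It would help if you provided an example of what you are looking for. From the current description, I think a regular expression should be able to do the job. – 2012-08-08 09:19:48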

Answers

0

Here is an approach using regular expressions. Note that it does not yet filter out single-character words.

val s = """ 
THE SECOND BOOK OF MOSES, CALLED EXODUS 

CHAPTER 1 1 Now these [are] the names of the children of Israel, 
which came into Egypt; every man and his household came with 
Jacob. 2 Reuben, Simeon, Levi, and Judah, 3 Issachar, Zebulun, 
and Benjamin, 4 Dan, and Naphtali, Gad, and Asher. 
""" 

/* \p{L} denotes Unicode letters */ 
var items = """\b\p{L}+\b""".r findAllIn s 

println(items.toList) 
    /* List(THE, SECOND, BOOK, OF, MOSES, CALLED, EXODUS, 
      CHAPTER, Now, these, are, the, names, of, the, 
      children, of, Israel, which, came, into, Egypt, 
      every, man, and, his, household, came, with, 
      Jacob, Reuben, Simeon, Levi, and, Judah, 
      Issachar, Zebulun, and, Benjamin, Dan, and, 
      Naphtali, Gad, and, Asher) 
    */ 

/* \w denotes word characters */ 
items = """\b\w+\b""".r findAllIn s 
println(items.toList) 
    /* List(THE, SECOND, BOOK, OF, MOSES, CALLED, EXODUS, 
      CHAPTER, 1, 1, Now, these, are, the, names, of, 
      the, children, of, Israel, which, came, into, 
      Egypt, every, man, and, his, household, came, 
      with, Jacob, 2, Reuben, Simeon, Levi, and, Judah, 
      3, Issachar, Zebulun, and, Benjamin, 4, Dan, and, 
      Naphtali, Gad, and, Asher) 
    */ 

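The word boundary \b is described in the Javadoc for java.util.regex.Pattern, which also documents the rest of the regular expression syntax.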
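To also drop single-character words and lowercase the result, as the question's own solution does, the \p{L} variant could be extended roughly like this. The helper name lettersOnlyWords is made up for this sketch.

```scala
// Hypothetical helper: Unicode-letter words, lowercased,
// with single-character matches filtered out.
def lettersOnlyWords(text: String): List[String] =
  """\b\p{L}+\b""".r
    .findAllIn(text)          // iterator over every run of letters
    .filter(_.length > 1)     // drop one-letter words such as "I" or "a"
    .map(_.toLowerCase)
    .toList

val sample = "CHAPTER 1 1 Now these [are] the names"
println(lettersOnlyWords(sample))
// List(chapter, now, these, are, the, names)
```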

2

For this particular case, I would go with a regular expression.

def toWords(lines: List[String]) = lines flatMap { line => 
    "[a-zA-Z]+".r findAllIn line map (_.toLowerCase) 
}
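One thing to keep in mind with [a-zA-Z] is that it only covers ASCII letters, which is fine for the English text here but splits words containing accented characters. The sketch below contrasts it with \p{L}; the names asciiWords and unicodeWords are hypothetical, and the sample string uses Unicode escapes for ï and é.

```scala
// Hypothetical helpers contrasting the two character classes.
def asciiWords(line: String): List[String] =
  "[a-zA-Z]+".r.findAllIn(line).map(_.toLowerCase).toList

def unicodeWords(line: String): List[String] =
  """\p{L}+""".r.findAllIn(line).map(_.toLowerCase).toList

val accented = "na\u00efve caf\u00e9"   // "naïve café"
println(asciiWords(accented))     // List(na, ve, caf)
println(unicodeWords(accented))   // List(naïve, café)
```

For plain English input both versions return the same words.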