Capturing strings that start with "[" and end with "]" while splitting

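Currently I use the regex str.toLowerCase.split("[\\s\\W]+") to get rid of whitespace and punctuation, but there is one special class of strings that I want to keep in one piece and exclude from this processing: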

[[...multiple words...]]

Example:

[[Genghis Khan]]

should remain as

[[Genghis Khan]]

What regular expression should I use?

Source

2012-05-01 Kenneth Tiong

FYI, the character set matched by '\W' already includes whitespace characters, so you don't need '\s'. '"\\W+"' is enough.
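A quick way to check this is to run both patterns on the same input and compare the results. The sample string below is made up for illustration, and this is only a sketch, not part of the original comment.

```scala
// Hypothetical sample input, chosen to contain spaces and punctuation.
val sample = "Hello, World!  Foo-bar"

// The pattern from the question: whitespace OR non-word characters.
val withBoth = sample.toLowerCase.split("[\\s\\W]+").toList

// The shorter pattern: \W alone, since whitespace is already a non-word character.
val onlyNonWord = sample.toLowerCase.split("\\W+").toList

println(withBoth == onlyNonWord) // both patterns produce the same tokens for this input
```

Neither pattern does anything special for [[...]], which is what the answers below address.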

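Here is a function that first splits on either [[ or ]]. Doing so ensures that the split items alternate between unquoted and quoted strings (i.e. the 2nd, 4th, etc. items are "quoted"). We can then traverse this list and split any unquoted item on whitespace, while leaving the quoted items unchanged.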

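// Splitting on either delimiter leaves unquoted text at even indices and quoted text at odd ones.
def mySplit(s: String): List[String] =
    """(\[\[)|(\]\])""".r.split(s).toList.zipWithIndex.flatMap {
        case (unquoted, i) if i % 2 == 0 => unquoted.trim.split("\\s+").toList // split plain text on whitespace
        case (quoted, _) => List(quoted) // keep bracketed text as a single token
    }.filter(_.nonEmpty)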

mySplit("this [[is]] the first [[test string]].") // List(this, is, the, first, test string, .) 
mySplit("[[this]] and [[that]]")   // List(this, and, that) 
mySplit("[[this]][[that]][[the other]]") // List(this, that, the other)

If you want the [[ ]] in the final output, just change List(quoted) above to List("[[" + quoted + "]]").
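For reference, a minimal sketch of that variant might look like the following. The name mySplitKeepBrackets is made up here; apart from the change described above, it follows the same split-and-alternate idea and, like the original, it splits unquoted text on whitespace only, so punctuation is left in place.

```scala
// Hypothetical variant of mySplit that keeps the [[ ]] delimiters on quoted items.
def mySplitKeepBrackets(s: String): List[String] =
  """\[\[|\]\]""".r.split(s).toList.zipWithIndex.flatMap {
    // Even positions are plain text: split them on whitespace.
    case (unquoted, i) if i % 2 == 0 => unquoted.trim.split("\\s+").toList
    // Odd positions were between [[ and ]]: keep them whole and restore the brackets.
    case (quoted, _) => List("[[" + quoted + "]]")
  }.filter(_.nonEmpty)

println(mySplitKeepBrackets("[[Genghis Khan]] founded the [[Mongol Empire]]."))
```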

Source

2012-05-01 18:25:36 dhg

Your regular expression isn't far off:

def tokenize(s: String) = """\w+|(\[\[[^\]]+\]\])""".r.findAllIn(s).toList

And then:

scala> tokenize("[[Genghis Khan]] founded the [[Mongol Empire]].") 
res1: List[String] = List([[Genghis Khan]], founded, the, [[Mongol Empire]])

This is a nice use case for Scala's parser combinators, though:

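// Since Scala 2.11 this package ships separately as the scala-parser-combinators module.
import scala.util.parsing.combinator._

object Tokenizer extends RegexParsers {
    val punc = "[,;:\\.]*".r                          // optional trailing punctuation
    val word = "\\w+".r                               // a single ordinary word
    val multiWordToken = "[[" ~> "[^\\]]+".r <~ "]]"  // everything between [[ and ]]
    val token = (word | multiWordToken) <~ punc       // a token, dropping any punctuation after it
    def apply(s: String) = parseAll(rep1(token), s)   // rep1(token) is the method form of token+
}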

This likewise gives us:

scala> Tokenizer("[[Genghis Khan]] founded the [[Mongol Empire]].").get 
res2: List[String] = List(Genghis Khan, founded, the, Mongol Empire)

Personally, I prefer the parser combinator version: it is practically self-documenting and easier to extend and maintain.

Source

2012-05-01 18:34:51

Split is not the way to handle this, because it does not deal with context. You might write this:

str.toLowerCase.split("(?<!\\[\\[([^]]|\\][^]])*\\]?)[\\s\\W]+")

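This would split on any whitespace that is not preceded by [[ followed by anything other than ]], but Java does not like variable-size lookbehinds.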

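In my opinion, the best way to handle this is to write a parser, unless you really need the speed. If you do, use a regex like the one suggested by Travis Brown (who also shows a parser in his answer).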
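As a rough sketch of the regex route applied to the original goal (lowercase ordinary words, leave bracketed tokens untouched), something like the following might work. The name tokenizeLower is made up here, and the pattern is the one from the tokenize function in the other answer.

```scala
// Hypothetical helper: match tokens instead of splitting, then lowercase
// only the tokens that are not wrapped in [[ ]].
def tokenizeLower(s: String): List[String] =
  """\w+|(\[\[[^\]]+\]\])""".r.findAllIn(s).map { tok =>
    if (tok.startsWith("[[")) tok // bracketed token: leave as-is
    else tok.toLowerCase          // ordinary word: normalize case
  }.toList

println(tokenizeLower("[[Genghis Khan]] Founded the [[Mongol Empire]]."))
```

Because this matches tokens rather than separators, no lookbehind is needed: whatever lies between matches (spaces, punctuation) is simply skipped.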

Source

2012-05-02 21:26:14
