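Regex to split a string, but also capture the delimiters?

I have a book in a .txt file and I'm trying to split it into individual words. In this case, a word is any run of the characters A-Z, a-z, or '.

So far I have this: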

String[] words = bookStr.split("[^a-zA-Z']+");

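This splits out the words successfully. However, I also want to capture all of the delimiters and the number of times each one occurs. Is this possible with Pattern, or do I actually need to loop through the whole string to count the data I need?

Example: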

String bookStr = "I just can't figure this out.\nI wonder why LOST ended?";


String[] words = bookStr.split("[^a-zA-Z']+"); 

// Using the regex I already have, I have gathered the words I want. 

// ["I", "just", "can't", "figure", "this", "out", "I", "wonder", "why", "LOST", "ended"] 

// Is there any way to gather these as well using the Pattern class or with split()? 

// [" ", " ", " ", " ", " ", ".", "\n", " ", " ", " ", " ", "?"]

Source

2013-11-25 user3032301

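Do you want the whitespace *and* the words in your array, or do you literally want an array of just the whitespace sequences? – Bohemian

I want an array that contains both the content between the delimiters and the delimiters themselves. So perhaps just one array holding the contents of both arrays above. – user3032301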

Split like this:

String[] words = bookStr.split("(?<!^|[a-zA-Z'])|(?![a-zA-Z'])");

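Because split() consumes the delimiter, you need a non-consuming regex; lookaheads and lookbehinds are zero-width assertions, and therefore non-consuming.

The regex-fu breaks down as an alternation (separated by |) of negative look-arounds on either side for the character class you consider to be part of a "word". The lookbehind carries an extra alternative for start-of-input; without it, a split would occur before the first character and the first element would be blank.

Here's some test code: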

// needs: import java.util.Arrays;
String bookStr = "I just can't figure this out.\nI wonder why LOST ended?";
String[] words = bookStr.split("(?<!^|[a-zA-Z'])|(?![a-zA-Z'])");
System.out.println(Arrays.toString(words));

Output:

[I,  , just,  , can't,  , figure,  , this,  , out, ., 
, I,  , wonder,  , why,  , LOST,  , ended, ?]
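The question also asks how often each delimiter occurs, which split() on its own does not report. One possible approach is a single pass with Matcher.find() instead of split(). The following is only a sketch under that assumption: the class and method names (DelimiterTally, tokens, delimiterCounts) are made up for illustration and do not come from the thread, and each delimiter character is treated as its own token.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterTally {
    // Same word definition as the question: letters and the apostrophe.
    // A token is either a whole word or exactly one delimiter character.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z']+|[^a-zA-Z']");

    // Returns words and single-character delimiters in the order they appear.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Counts how often each delimiter character occurs (hypothetical helper).
    static Map<String, Integer> delimiterCounts(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens(text)) {
            if (!token.matches("[a-zA-Z']+")) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String bookStr = "I just can't figure this out.\nI wonder why LOST ended?";
        System.out.println(tokens(bookStr));
        System.out.println(delimiterCounts(bookStr));
    }
}
```

For the example string this should report nine spaces and one each of ".", the newline and "?". A LinkedHashMap is used only so the counts print in first-seen order.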

Source

2013-11-25 19:51:11 Bohemian

If you use [a-zA-Z']+ then you will get your result.
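Read as "split on the words instead of on the delimiters", this suggestion could look like the sketch below. The class and method names (DelimitersOnly, delimiters) are illustrative only. Note that this is not quite the array the question asks for: the result starts with an empty string when the text begins with a word, and adjacent delimiters such as ".\n" stay together in one element.

```java
import java.util.Arrays;

class DelimitersOnly {
    // Splitting on runs of word characters leaves only the text between words.
    static String[] delimiters(String text) {
        return text.split("[a-zA-Z']+");
    }

    public static void main(String[] args) {
        String bookStr = "I just can't figure this out.\nI wonder why LOST ended?";
        String[] parts = delimiters(bookStr);
        // parts[0] is "" because the text starts with a word;
        // ".\n" comes back as a single element, not as "." and "\n".
        System.out.println(Arrays.toString(parts));
        System.out.println(parts.length);
    }
}
```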

Source

2013-11-25 12:49:37

String[] wordsAndDelimiters = bookStr.split("\\b");

Source

2013-11-25 13:30:41 Holger

That's close, but it will split "can't" into three parts, and will create an initial blank element :( – Bohemian
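The behaviour described in the comment above can be checked with a small sketch (the class and method names are illustrative). The apostrophe is not a \w character, so \b also matches on both sides of it. As far as the leading blank element goes, the String.split documentation since Java 8 states that a zero-width match at the start never produces an empty leading substring, so that part should only show up on older runtimes.

```java
import java.util.Arrays;

class WordBoundarySplit {
    static String[] splitOnBoundaries(String text) {
        // \b is a zero-width word boundary, so nothing is consumed
        // and both words and delimiters survive the split.
        return text.split("\\b");
    }

    public static void main(String[] args) {
        // On Java 8 or later this prints: [can, ', t,  , out, .]
        System.out.println(Arrays.toString(splitOnBoundaries("can't out.")));
    }
}
```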

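[a-zA-Z']+(?=\s)|\s+(?=[a-zA-Z']) uses lookahead to match words that come before whitespace, or whitespace that comes before words, like this (splitting out the words). I think using a negative lookaround on your word characters will get you what you want.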

Source

2013-11-25 14:02:20 llCorvinuSll
