2013-11-01 63 views
8

R string removes punctuation on split

Say I have a string such as the following.

x <- 'The world is at end. What do you think? I am going crazy! These people are too calm.' 

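I need to split only on the punctuation marks !?. and the whitespace that follows them, and keep each punctuation mark with its sentence.

This, however, removes the punctuation and leaves leading spaces in the split parts: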

vec <- strsplit(x, '[!?.][:space:]*') 

How can I split the sentences while keeping the punctuation?

Answers

14

You can switch on PCRE by using perl=TRUE and then use a lookbehind assertion.

strsplit(x, '(?<![^!?.])\\s+', perl=TRUE) 

Regular expression:

(?<!   look behind to see if there is not: 
[^!?.]  any character except: '!', '?', '.' 
)    end of look-behind 
\s+   whitespace (\n, \r, \t, \f, and " ") (1 or more times) 
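As a quick check of the explanation above, here is a minimal sketch that runs the pattern on the string from the question and confirms that each piece keeps its closing punctuation. The variable names `parts` and `ends_ok` are only illustrative.

```r
x <- 'The world is at end. What do you think? I am going crazy! These people are too calm.'

# Negative lookbehind: split on a run of whitespace unless the character
# just before it is something other than '!', '?' or '.'.
# '\\s+' in the R string is the regex \s+ (the backslash is escaped for R).
parts <- strsplit(x, '(?<![^!?.])\\s+', perl = TRUE)[[1]]
print(parts)

# Every piece should still end in one of the three punctuation marks.
ends_ok <- all(grepl('[!?.]$', parts))
print(ends_ok)
```

The whitespace is consumed by the split while the punctuation is only inspected by the lookbehind, which is why it stays attached to the sentence.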


+0

Does the '.' not need to be escaped because it's inside a '[]' group, or for some other reason? –

+0

For PCRE and other so-called Perl-compatible flavors, escape '.^$*+?()[{\|' outside character classes, and '^', '-', ']', '\' inside character classes. – hwnd

+0

Great, thanks for clarifying –

5

The sentSplit function in the qdap package was created just for this task:

library(qdap) 
sentSplit(data.frame(text = x), "text") 

##   tot                       text
## 1 1.1       The world is at end.
## 2 2.2         What do you think?
## 3 3.3          I am going crazy!
## 4 4.4 These people are too calm.
2

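Take a look at this question. Character classes like [:space:] are only defined inside bracket expressions, so you need to wrap it in another set of brackets. Try: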

vec <- strsplit(x, '[!?.][[:space:]]*') 
vec 
# [[1]] 
# [1] "The world is at end"  "What do you think"   
# [3] "I am going crazy"   "These people are too calm" 
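To make the bracket-expression point concrete, here is a small sketch using only base R. It checks that the doubly bracketed class matches whitespace and that, with it in the split pattern, no piece begins with a space. The names `has_space`, `pieces` and `no_leading_space` are illustrative.

```r
x <- 'The world is at end. What do you think? I am going crazy! These people are too calm.'

# '[[:space:]]' is a bracket expression containing the POSIX class,
# so it matches one whitespace character (space, tab, newline, ...).
has_space <- grepl('[[:space:]]', c('a b', 'ab', 'a\tb'))
print(has_space)

# With the class written correctly, the whitespace after each punctuation
# mark is part of the delimiter and is dropped from the pieces.
pieces <- strsplit(x, '[!?.][[:space:]]*')[[1]]
no_leading_space <- !any(grepl('^[[:space:]]', pieces))
print(no_leading_space)
```

The symptom in the question (leading spaces left behind) is consistent with a bare '[:space:]' being read as an ordinary set of literal characters rather than as the whitespace class.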

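Putting the class inside brackets gets rid of the leading spaces. To keep the punctuation as well, use a positive lookbehind assertion with perl = TRUE: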

vec <- strsplit(x, '(?<=[!?.])[[:space:]]*', perl = TRUE) 
vec 
# [[1]] 
# [1] "The world is at end."  "What do you think?"   
# [3] "I am going crazy!"   "These people are too calm." 
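One possible variation, not taken from the answers: because `[[:space:]]*` can match zero characters, the pattern above appears to allow a split directly after any '.', '!' or '?', even with no whitespace following (for example inside "3.14" or "..."). The sketch below requires at least one whitespace character by using `+` instead. The helper name `split_sentences` is hypothetical.

```r
# Hypothetical helper: split each element of a character vector after
# '.', '!' or '?', but only where whitespace actually follows.
split_sentences <- function(text) {
  strsplit(text, '(?<=[!?.])[[:space:]]+', perl = TRUE)
}

examples <- c('Wait... what? Pi is 3.14. Nice!', 'No punctuation here')
result <- split_sentences(examples)
print(result)
```

This is still a plain regex split, so abbreviations such as "Mr. Smith" would be cut after the period. For text like that, a dedicated sentence splitter is likely the better tool.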
+0

He also wants to keep the punctuation after the split. – hwnd

+0

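Ah, got it. I'll edit. It ends up looking a lot like your answer, just with '[[:space:]]' instead of '\\s'. The answers don't overlap 100%, so if you have no problem with it, neither do I. –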

1

You can replace the whitespace that follows each punctuation mark with a marker string, e.g. zzzzz, and then split on that marker.

x <- gsub("([!?.])[[:space:]]*","\\1zzzzz","The world is at end. What do you think? I am going crazy! These people are too calm.") 
strsplit(x, "zzzzz") 

\1 in the replacement string refers to the parenthesized subexpression of the pattern.
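A slightly more defensive sketch of the same marker idea, under the assumption that the chosen marker never occurs in the text. The marker value `<<SPLIT>>` and the variable names are hypothetical. Splitting with fixed = TRUE means the marker is treated literally rather than as a regular expression.

```r
x <- 'The world is at end. What do you think? I am going crazy! These people are too calm.'

# Hypothetical marker; the approach only works if it is absent from the text.
marker <- '<<SPLIT>>'
stopifnot(!grepl(marker, x, fixed = TRUE))

# Keep the punctuation (\\1) and replace the trailing whitespace with the marker.
marked <- gsub('([!?.])[[:space:]]*', paste0('\\1', marker), x)

# Split on the literal marker; a marker at the very end adds no empty piece.
sentences <- strsplit(marked, marker, fixed = TRUE)[[1]]
print(sentences)
```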

1

As of qdap version 1.1.0 you can use the sent_detect function as follows:

library(qdap) 
sent_detect(x) 

## [1] "The world is at end."  "What do you think?"   
## [3] "I am going crazy!"   "These people are too calm." 
+0

Also, as of 2.2.1, there is sent_detect_nlp – demongolem