2014-09-29 45 views
1

R stringr and str_extract_all: capturing contractions

I am doing some NLP with R, and I am using the stringr package to tokenize some text.

I would like to be able to capture contractions, e.g. "won't", so that it is tokenized into "wo" and "n't".

Here is what I have so far:

library(stringr) 

s = "won't you buy my raspberries?" 

foo = str_extract_all(s, "(n't)|[[:punct:]]")   # captures the contraction OK... 
foo[[1]] 
>[1] "n't" "?" 

foo = str_extract_all(s, "(n't)|\\w+|[[:punct:]]")  # gets all words, 
                # but splits the contraction! 
foo[[1]] 
>[1] "won" "'" "t" "you" "buy" "my" "raspberries" "?" 

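I am trying to tokenize the sentence above into "wo", "n't", "you", "buy", "my", "raspberries", "?".

I am not sure whether I can do this with the default, extended regular expressions, or whether I need to find some way to do it with Perl-like patterns.

Does anyone know how to do this kind of tokenization with the stringr package?

EDIT: To clarify, I am interested in Treebank tokenization.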

Answers

2

You can do this with a lookahead, which is supported by the PCRE library.

> s = "won't you buy my raspberries?" 
> s 
[1] "won't you buy my raspberries?" 
> m <- gregexpr("\\w+(?=n[[:punct:]]t)|n?[[:punct:]]t?|\\w+", s, perl=TRUE) 
> regmatches(s, m) 
[[1]] 
[1] "wo"   "n't"   "you"   "buy"   "my"   
[6] "raspberries" "?" 

OR

> m <- gregexpr("\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+", s, perl=TRUE) 
> regmatches(s, m) 
[[1]] 
[1] "wo"   "n't"   "you"   "buy"   "my"   
[6] "raspberries" "?" 

OR

Using the stringr library:

> s <- "won't you buy my raspberries?" 
> str_extract_all(s, perl("\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+"))[[1]] 
[1] "wo"   "n't"   "you"   "buy"   "my"   
[6] "raspberries" "?" 
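
The `perl()` wrapper belongs to the stringr version current when this answer was written. As far as I know, stringr 1.0.0 and later deprecated it and interpret plain patterns with the ICU regex engine, which supports lookaheads directly. Assuming such a version is installed, the same call can be sketched without the wrapper:

```r
library(stringr)

s <- "won't you buy my raspberries?"

# Assumption: stringr >= 1.0.0, where a plain string is treated as an
# ICU regular expression and lookaheads need no wrapper.
pattern <- "\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+"

tokens <- str_extract_all(s, pattern)[[1]]
tokens
```

One difference worth keeping in mind: under ICU, `[[:punct:]]` covers Unicode punctuation, so some ASCII symbols such as `+` or `$` may not be matched the way they are with PCRE.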

Thanks for your answer, @Avinash. Is there any advantage to choosing the base gregexpr and regmatches approach over the stringr approach? – buruzaemon 2014-09-29 13:15:03

2

When working with stringr package functions, you can try the perl wrapper function.

s <- "won't you buy my raspberries?" 
pattern <- "(?=[a-z]'[a-z])|(\\s+)|(?=[!?.])" 
library(stringr) 
str_split(s, perl(pattern))[[1]] 
# [1] "wo"   "n't"   "you"   "buy"   "my"   
# [6] "raspberries" "?" 

There are also other wrappers, such as fixed and ignore.case.
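
As a rough illustration of what such wrappers do, here is a small sketch. It assumes a recent stringr release, where `fixed()` is still available and case-insensitive matching is written as `regex(..., ignore_case = TRUE)` rather than with the older `ignore.case()` wrapper; the example strings are made up.

```r
library(stringr)

# fixed(): treat the pattern as a literal string, so "." is a plain dot
# and not the regex "any character" metacharacter.
n_dots <- str_count("a.b.c", fixed("."))
n_dots
# [1] 2

# regex(ignore_case = TRUE): assumed modern replacement for ignore.case().
has_wont <- str_detect("WON'T you buy?", regex("won't", ignore_case = TRUE))
has_wont
# [1] TRUE
```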


Thanks for your reply, @Richard. I will take a closer look at the wrapper functions you mentioned. – buruzaemon 2014-09-29 13:12:04