2016-09-11

I have a text document with a million words. How can I use R to find the words that lead and trail a given word?

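For example, suppose I want to find the words that appear before and after the word "error". The leading words could be, for instance: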

"typo error" 
"manual error" 
"system error" 

and the trailing words could be, for instance:

"error corrected" 
"error found" 
"error occurred" 

Any ideas on how to do something like this? Thanks in advance for your input.

Answers

3

For the words that come before "error":

x <- "no error and no error and some error" # input 

library(gsubfn) 
rx <- "(\\w+) error" 
table(strapplyc(x, rx)[[1]]) 

giving:

  no some 
   2    1 

For the words that come after "error", replace rx with the following:

rx <- "error (\\w+)" 
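
If installing gsubfn is not an option, roughly the same counts can be obtained with base R alone. The following is a minimal sketch, not part of the answer above: it assumes words are separated by whitespace, that the word of interest is the lowercase literal "error", and that exactly one space follows it in the trailing case (the lookbehind needs a fixed length). The variable names are illustrative only.

```r
x <- "no error and no error and some error found"  # toy input

# Word immediately before "error": the lookahead leaves "error" unconsumed
before <- regmatches(x, gregexpr("\\w+(?=\\s+error\\b)", x, perl = TRUE))[[1]]
# Word immediately after "error": lookbehind on "error" plus a single space
after <- regmatches(x, gregexpr("(?<=\\berror )\\w+", x, perl = TRUE))[[1]]

before_counts <- table(before)  # no: 2, some: 1
after_counts  <- table(after)   # and: 2, found: 1
```

Because neither pattern consumes "error" itself, every occurrence is counted in both directions, including occurrences that sit next to each other.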

1

How about:

test <- "I don't want to match error this This is a random error what I want to match" 
# split the string into a character vector of words 
words <- strsplit(test, " ")[[1]] 
# get the positions of the words that are exactly 'error' 
indexes <- which(words == "error") 

# select words that come after 'error' (NA if 'error' is the last word) 
words[indexes + 1] 
# before 'error' (skip position 1, which has no preceding word) 
words[indexes[indexes > 1] - 1] 
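
For the million-word document in the question, the same index-based idea can be applied to words read straight from a file. The sketch below is one possible way to do it, under stated assumptions: `neighbor_counts` is a hypothetical helper (not from any package), the temporary file stands in for the real document, and stripping punctuation and lowercasing are choices made here that the question does not specify.

```r
# Hypothetical helper: tabulate the words directly before and after `target`.
neighbor_counts <- function(words, target = "error") {
  words <- tolower(gsub("[[:punct:]]", "", words))  # normalize case, drop punctuation
  words <- words[nzchar(words)]                     # drop tokens that were only punctuation
  idx <- which(words == target)                     # exact matches only, so "errors" is skipped
  list(
    before = table(words[idx[idx > 1] - 1]),             # skip a target at position 1
    after  = table(words[idx[idx < length(words)] + 1])  # skip a target at the last position
  )
}

# Stand-in for the real document: write a small sample to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c("Typo error found.", "A manual error occurred; system error found"), path)

# scan() splits on whitespace; quote = "" keeps apostrophes from being read as quotes.
words <- scan(path, what = character(), quote = "", quiet = TRUE)
res <- neighbor_counts(words)
res$before  # manual, system, typo: 1 each
res$after   # found: 2, occurred: 1
```

Working on a word vector rather than one long string keeps the edge cases (target at the very start or end) explicit, at the cost of holding all words in memory at once.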

2

My solution uses str_match_all:

library(stringr) 
txt <- "system error corrected hardcore error detected wtf error holymoly" 
str_match_all(txt, "\\s*(\\w+)\\serror\\s*(\\w+)") 

[[1]] 
     [,1]                       [,2]       [,3]        
[1,] "system error corrected"   "system"   "corrected" 
[2,] " hardcore error detected" "hardcore" "detected"  
[3,] " wtf error holymoly"      "wtf"      "holymoly"  
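
One caveat with capturing both neighbors in a single pattern: each match consumes its trailing word, so that word cannot also serve as the leading word of the next match. The sketch below illustrates this on a made-up string and shows a possible workaround using two separate passes; `txt2` and the two patterns are assumptions for illustration, not part of the answer above.

```r
library(stringr)

txt2 <- "typo error found error fixed"  # "found" is both a trailing and a leading word

# The combined pattern consumes "found", so the second "error" goes unreported.
combined <- str_match_all(txt2, "\\s*(\\w+)\\serror\\s*(\\w+)")[[1]]
nrow(combined)  # 1

# Two separate passes, each capturing only one neighbor.
before <- str_match_all(txt2, "(\\w+)\\s+error\\b")[[1]][, 2]  # "typo"  "found"
after  <- str_match_all(txt2, "\\berror\\s+(\\w+)")[[1]][, 2]  # "found" "fixed"
```

Wrapping either result in table() then gives frequency counts for the leading or trailing words.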