2016-07-08 37 views
1

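Digging deeper into text mining: a client recently asked whether it is possible to get the 5 words preceding and the 5 words following a key term. In other words, extract the words to the left and to the right of the key term. Example: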

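To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing.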

Key term=twisters 
Preceding 5 words=the full effect of tongue 
Following 5 words=you should repeat them several

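The long-term plan is to take the 10 most frequent terms, together with their preceding and following words, and load them into a data.frame. I have fiddled with gsub, but to no avail.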

Any ideas, guidance, etc. would be much appreciated.

Answers

2

The quanteda package has a function designed specifically for returning keywords in context: kwic. It uses stringi under the hood.

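library(quanteda)
# txt is the character vector defined under "Sample text" below.
# Note: this is the call as written for the quanteda version of the time. As far
# as I know, newer releases name the argument `pattern` and expect tokens(txt).
kwic(txt, keywords = "twisters", window = 5, case_insensitive = TRUE)
#                           contextPre   keyword    contextPost
# [text1, 8] the full effect of tongue [ twisters ] you should repeat them several
# [text2, 2]                       The [ twisters ] are always twisting
# [text3, 9]  for those guys, they are [ twisters ] of words and will tell
# [text4, 1]                           [ Twisters ] will ruin your life.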

Sample text:

# sample text 
txt <- c("To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing.", 
     "The twisters are always twisting", 
     "watch out for those guys, they are twisters of words and will tell a yarn a mile long", 
     "Twisters will ruin your life.") 
0

Use strsplit to split the string into a vector of words, then use grep to find the index of the key term. If you do this a lot, you should wrap it in a function.

x <- "To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing." 
x_split <- strsplit(x, " ")[[1]] 
key <- "twisters" 
key_index <- grep(key, x_split)  # search the split words, not the original string
before <- x_split[(key_index - 5):(key_index - 1)] 
after <- x_split[(key_index + 1):(key_index + 5)] 
before 
#[1] "the" "full" "effect" "of"  "tongue" 
after 
#[1] "you"  "should" "repeat" "them" "several" 
paste(before, collapse = " ") 
#[1] "the full effect of tongue" 
paste(after, collapse = " ") 
#[1] "you should repeat them several" 
3

You can use word() from the stringr package:

library(stringr) 
ind <- sapply(strsplit(x, ' '), function(i) which(i == 'twisters')) 
word(x, ind-5, ind-1) 
#[1] "the full effect of tongue" 
word(x, ind+1, ind+5) 
#[1] "you should repeat them several" 
+1

Thanks a lot, exactly what I needed. – Atwp67