2017-08-09

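Extracting sentences from a character vector that satisfy two conditions in R

Suppose we have a full text file loaded into R as a character vector. I am looking for code that pulls out all the text between two periods (".") where the text between those periods contains "and the" as well as at least one "%".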

character <- as.character("Walmart stocks remained the same. Sony reported an increase, and the percent was posted at 1.0%. And the google also remained the same. And the percent of increase for Best Buy was 2.5%.") 

Considering this simple example, I am hoping for output along the lines of:

[1] Sony reported an increase, and the percent was posted at 1.0%. 
[2] And the percent of increase for Best Buy was 2.5%. 

Answer

A quick solution:

library(magrittr) 
"Walmart stocks remained the same. Sony reported an increase, and the percent was posted at 1.0%. And the google also remained the same. And the percent of increase for Best Buy was 2.5%." %>% 
    ## split the string at the sentence boundaries 
    gsub("\\.\\s", "\\.\t", .) %>% 
    strsplit("\\t") %>% unlist() %>% 
    ## keep only sentences that contain "and the" (irrespective of case) 
    grep("and the", x = ., value = TRUE, ignore.case = TRUE) %>% 
    ## keep only the sentences that end with %. 
    grep("%\\.$", x = ., value = TRUE) %>% 
    ## remove leading white spaces 
    gsub("^\\s?", "", x = .) 
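The pipeline above keeps only sentences that end in "%.", which is slightly stricter than the "at least one %" condition in the question. As a minimal base-R sketch (not the answerer's method; the variable names `sentences`, `has_phrase`, and `has_percent` are made up for illustration), the two conditions can also be checked independently with `grepl`:

```r
text <- "Walmart stocks remained the same. Sony reported an increase, and the percent was posted at 1.0%. And the google also remained the same. And the percent of increase for Best Buy was 2.5%."

# Split on whitespace that follows a period, so "1.0%" and "2.5%" stay intact.
sentences <- unlist(strsplit(text, "(?<=\\.)\\s+", perl = TRUE))

# Condition 1: the sentence contains "and the" (case-insensitive).
has_phrase <- grepl("and the", sentences, ignore.case = TRUE)

# Condition 2: the sentence contains at least one literal "%".
has_percent <- grepl("%", sentences, fixed = TRUE)

result <- sentences[has_phrase & has_percent]
print(result)
```

Keeping the two tests as separate logical vectors makes it easy to add or relax a condition later without touching the splitting step.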

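Works like a charm! The only problem came up when I used large text files from the web in my application: the files are long enough that sentences get cut off and continue on the next line. So I wrapped my readLines call in paste to turn the whole text file into a single character vector, as follows: 'paste(readLines("websiteurl.txt"), collapse = "") %>%'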
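The line-wrapping issue described in this comment can be reproduced with a small temporary file. This is only a sketch under stated assumptions: the file contents are invented, and the lines are joined with a single space (`collapse = " "`) so that words on either side of a line break do not fuse together, which may differ from what the commenter actually used.

```r
# Hypothetical stand-in for a downloaded text file whose sentences wrap across lines.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Sony reported an increase, and the percent",
             "was posted at 1.0%. Walmart stocks",
             "remained the same."), tmp)

# Join all lines into one string; the single-space separator is an assumption.
text <- paste(readLines(tmp), collapse = " ")

# Split into sentences, then keep those with "and the" and at least one "%".
sentences <- unlist(strsplit(text, "(?<=\\.)\\s+", perl = TRUE))
keep <- grepl("and the", sentences, ignore.case = TRUE) &
  grepl("%", sentences, fixed = TRUE)
result <- sentences[keep]
print(result)

unlink(tmp)  # remove the temporary file
```

Collapsing before splitting means sentence boundaries are decided by punctuation rather than by where the file happens to break its lines.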