Splitting text vectors with strsplit(...) in R

Please help me with my small project: splitting text vectors with strsplit(...) in R.

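I have a large list of text elements. Each element should be split into a small list of sentence fragments. Each small list should then be saved as a single element in a new column of the original large list, at the same position ('row') as the text element it came from.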

The split criteria are "/$", "und/KON" and "oder/KON". Each of these should be kept at the head of the new small element that follows it.

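I have tried regular expressions such as "/$|und/KON|oder/KON" and many combinations of escaping "$", "|" and "/". I also tried changing the parameters perl = TRUE and fixed = TRUE or FALSE. Every time I try, nothing happens. It seems that the | is not interpreted correctly. How would you suggest solving this?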

library(stringr) # don't know if it's required 

# Input list to be split at each 
#  "/$", "und/KON", "oder/KON" 
#  but should keep the expression at the start of the next list element 
#  
#  Would be nice but not necessary: The small-list to be named after the ID in the first column 

r <- list(ID=c(01, 02, 03), 
      elements=c("This should become my first small-list :/$. the first element ,/$, the second element ,/$, and the third element ./$.", 
         "This should become my second small-list :/$. Element eins und/KON Element zwei oder/KON Element drei ./$.", 
         "This should become my third small-list :/$. Element Alpha und/KON Element Beta oder/KON Element Gamma ./$.")) 

# Would look something like 
r$small_lists <- sapply(r$elements, function(x) as.list(strsplit(x, "/$|und/KON|oder/KON", fixed=TRUE))) 
> r$small_lists 

$01 
[1] "This should become my first small-list " 
[2] ":/$. the first element " 
[3] ",/$, the second element " 
[4] ",/$, and the third element " 
[5] "./$." 

$02 
[1] "This should become my second small-list " 
[2] ":/$. Element eins " 
[3] "und/KON Element zwei " 
[4] "oder/KON Element drei" 
[5] "./$." 

$03 
[1] "This should become my third small-list " 
[2] ":/$. Element Alpha " 
[3] "und/KON Element Beta " 
[4] "oder/KON Element Gamma " 
[5] "./$." 

> class(r) 
[1] "list" 
> class(r$small_lists) 
[1] "list"

Source

2013-08-28 alex

I don't see a question here. – A5C1D2H2I1M1N2O1R2T1

@AnandaMahto: Sorry, thanks, done :) – alex

Thanks! :) For my better understanding, could you explain what '"^&*\\1"' and '"^&*"' each do? – alex

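Actually, if this is your desired output, you have more split patterns than you indicated. Note that my patterns differ from yours: all special characters have been escaped with \\.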
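A minimal sketch of why the escaping matters, using a made-up sample string x (not taken from the question). In a regular expression an unescaped $ anchors the end of the string, and with fixed = TRUE the | is an ordinary character rather than alternation, so neither variant finds "/$":

```r
# Hypothetical sample string, for illustration only
x <- "eins ,/$, zwei und/KON drei"

# Unescaped: "/$" means "a slash at the very end of the string",
# so only "und/KON" is found
unescaped <- strsplit(x, "/$|und/KON")[[1]]

# fixed = TRUE: the whole pattern, pipes included, is searched for literally;
# it never occurs in x, so nothing is split
literal <- strsplit(x, "/$|und/KON", fixed = TRUE)[[1]]

# Escaped: "/\\$" matches a literal "/$"
escaped <- strsplit(x, "/\\$|und/KON")[[1]]

length(unescaped)  # 2 pieces
length(literal)    # 1 piece (unchanged input)
escaped            # "eins ,"  ", zwei "  " drei"
```

Note that a plain strsplit drops the delimiters, which is why the approach described here marks them first.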

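To keep things manageable, I will create a separate vector of the patterns to split on, paste them together into one main pattern, search for matches, prefix each match with some text that you know won't occur in your data, and then split on that text.

Here are the "patterns" I have identified: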

Pattern <- c(":/\\$", ",/\\$", "\\./\\$", 
      "und/KON", "oder/KON")

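We can paste these patterns together to get the main pattern. The inner paste joins them with the pipe symbol, which lets the regex match any one of the patterns. The whole pattern is placed inside parentheses so that we can refer to the match later.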

Pattern <- paste("(", paste(Pattern, collapse = "|"), ")", sep = "")
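It can help to look at the assembled pattern before using it. The sketch below rebuilds it from the same five pieces (the names pieces, x and found are only illustrative) and checks which delimiters it finds in a made-up sample string:

```r
# Rebuild the main pattern from the five pieces listed above
pieces <- c(":/\\$", ",/\\$", "\\./\\$", "und/KON", "oder/KON")
Pattern <- paste("(", paste(pieces, collapse = "|"), ")", sep = "")

# cat() prints the string as the regex engine sees it (single backslashes)
cat(Pattern, "\n")

# Hypothetical sample string: list every delimiter the pattern matches
x <- "Element eins und/KON Element zwei ./$."
found <- regmatches(x, gregexpr(Pattern, x))[[1]]
found  # "und/KON" "./$"
```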

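We can now use gsub to add a "prefix" in front of each match of the pattern (the match itself is what \\1 refers to). We need to re-insert the match because you want to keep the expressions you split on.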

## Insert some text pattern you know doesn't occur in your text 
## Here, I've prepended the matched patterns with "^&*" 
## You now have something on which you can split 
strsplit(gsub(Pattern, "^&*\\1", r$elements), "^&*", fixed = TRUE) 
# [[1]] 
# [1] "This should become my first small-list " 
# [2] ":/$. the first element "     
# [3] ",/$, the second element "    
# [4] ",/$, and the third element "    
# [5] "./$."         
# 
# [[2]] 
# [1] "This should become my second small-list " 
# [2] ":/$. Element eins "      
# [3] "und/KON Element zwei "     
# [4] "oder/KON Element drei "     
# [5] "./$."          
# 
# [[3]] 
# [1] "This should become my third small-list " 
# [2] ":/$. Element Alpha "      
# [3] "und/KON Element Beta "     
# [4] "oder/KON Element Gamma "     
# [5] "./$."
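To isolate the two steps, here is a reduced sketch with a single delimiter and a made-up input string. The \\1 in the replacement re-inserts whatever the parenthesised group matched, so the delimiter survives and only the marker is consumed by the split:

```r
# Hypothetical one-delimiter example to isolate the mechanics
x <- "eins und/KON zwei"

# Step 1: put the marker "^&*" in front of the match; "\\1" is the match itself
marked <- gsub("(und/KON)", "^&*\\1", x)
marked  # "eins ^&*und/KON zwei"

# Step 2: split on the marker. fixed = TRUE is needed because ^ and *
# are regex metacharacters and must be taken literally here
parts <- strsplit(marked, "^&*", fixed = TRUE)[[1]]
parts   # "eins "  "und/KON zwei"
```

Any marker works as long as it cannot occur in the real text.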

Continuing from the above, to get the named list you describe:

out <- strsplit(gsub(Pattern, "^&*\\1", r$elements), "^&*", fixed = TRUE) 
setNames(lapply(out, `[`, -1), lapply(out, `[`, 1)) 
# $`This should become my first small-list ` 
# [1] ":/$. the first element "  
# [2] ",/$, the second element " 
# [3] ",/$, and the third element " 
# [4] "./$."      
# 
# $`This should become my second small-list ` 
# [1] ":/$. Element eins "  
# [2] "und/KON Element zwei " 
# [3] "oder/KON Element drei " 
# [4] "./$."     
# 
# $`This should become my third small-list ` 
# [1] ":/$. Element Alpha "  
# [2] "und/KON Element Beta " 
# [3] "oder/KON Element Gamma " 
# [4] "./$."
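The question also asks for the small lists to be named after the ID and stored back in r. One possible way, sketched on a shortened stand-in for the question's data (two rows only), and assuming the two-digit zero-padded names shown in the desired output are what is wanted:

```r
# Shortened stand-in for the question's data, for illustration only
r <- list(ID = c(1, 2),
          elements = c("First :/$. a ,/$, b ./$.",
                       "Second :/$. x und/KON y oder/KON z ./$."))

Pattern <- "(:/\\$|,/\\$|\\./\\$|und/KON|oder/KON)"
out <- strsplit(gsub(Pattern, "^&*\\1", r$elements), "^&*", fixed = TRUE)

# Name each small list after its ID, zero-padded to two digits ("01", "02")
r$small_lists <- setNames(out, sprintf("%02d", r$ID))

r$small_lists[["02"]]
```

This keeps the leading text as the first element of each small list instead of turning it into the name.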

Source

2013-08-28 17:56:10 A5C1D2H2I1M1N2O1R2T1

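Thank you very much. For better understanding, could you explain what the '\\1' parts mean? Are they a random sequence of characters, or are they significant? – alex

@alex, those are backreferences. Matches in a regular expression can be grouped inside parentheses ('()'). The first group is referenced as '\\1', the second as '\\2', and so on. Here we have only one group, so it is '\\1', and it should stay that way. – A5C1D2H2I1M1N2O1R2T1

Thank you very much! :) – alex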
