How do I split multi-line text in R?

I have an input file that contains one paragraph. I need to split that paragraph into two sub-paragraphs based on a pattern.

paragraph.xml

<Text> 
     This is first line. 
     This is second line. 
     \delemiter\new\one 
     This is third line. 
     This is fourth line. 
</Text>

My R code:

library(XML)
doc <- xmlTreeParse("paragraph.xml")
top <- xmlRoot(doc)
text <- top[[1]]

I need to split this paragraph into 2 paragraphs.

Paragraph 1

This is first line. 
This is second line.

Paragraph 2

This is third line. 
This is fourth line.

I found the strsplit function very useful, but it never splits the multi-line text.

Source

2013-03-20 Manish

Is this 'character' a vector of length one with embedded newlines, a list or vector, or a text file you have not read in yet? – 2013-03-20 04:34:59

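Please edit your question to show the exact structure of your data (or some sample data). For example, paste the result of 'dput(head(yourdata))'. It is currently unclear how the new lines are determined. – Ben 2013-03-20 04:36:07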

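Since you have an XML file, it is better to use the facilities of the XML package. I see you have already started using it here, so this continues what you started.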

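library(XML)
doc <- xmlParse('paragraph.xml') ## equivalent to xmlTreeParse(..., useInternalNodes = TRUE)
## extract the text of the Text node
mytext <- xpathSApply(doc, '//Text/text()', xmlValue)
## convert it to a vector of lines using scan
lines <- scan(text = mytext, sep = '\n', what = 'character')
## trim surrounding whitespace and drop empty lines, so the exact
## comparison below still matches when the XML content is indented
lines <- trimws(lines)
lines <- lines[nzchar(lines)]
## get the delimiter index (spelled as it is in the input file)
delim <- which(lines == "\\delemiter\\new\\one")
## get the 2 paragraphs
p1 <- lines[seq_len(delim - 1)]
p2 <- lines[seq(delim + 1, length(lines))]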

You can then use paste or write to get the paragraph structure. For example, using write:

write(p1,"",sep='\n') 

This is first line. 
This is second line.
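The same steps can be sketched in base R without the XML file, assuming the node text is already held in a single multi-line string. Here `mytext` is a hypothetical stand-in for the value extracted above, and `para1` and `para2` are names introduced only for illustration. The sketch uses strsplit to break the string on newlines and paste to rebuild each paragraph:

```r
# Hypothetical stand-in for the text extracted from paragraph.xml
mytext <- "
     This is first line.
     This is second line.
     \\delemiter\\new\\one
     This is third line.
     This is fourth line.
"

# Split the single string on newlines, trim whitespace, drop empty lines
lines <- trimws(strsplit(mytext, "\n", fixed = TRUE)[[1]])
lines <- lines[nzchar(lines)]

# Locate the delimiter line (spelled as it is in the input file)
delim <- which(lines == "\\delemiter\\new\\one")

# Lines before and after the delimiter
p1 <- lines[seq_len(delim - 1)]
p2 <- lines[seq(delim + 1, length(lines))]

# Rebuild each paragraph as one multi-line string
para1 <- paste(p1, collapse = "\n")
para2 <- paste(p2, collapse = "\n")
```

This assumes exactly one delimiter line that is neither the first nor the last line. With several delimiters, `which` returns more than one index and the indexing would need to be adapted.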

Source

2013-03-20 06:26:23 agstudy

Can I use cat instead of the write function to get the paragraph structure? – Manish 2013-03-20 06:35:39

@user15662 Of course. Replace 'write' with 'cat'. – agstudy 2013-03-20 06:37:37
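A minimal sketch of the cat variant mentioned in this comment, with `p1` re-created by hand here so the snippet runs on its own:

```r
# p1 re-created by hand; in the answer it comes from the split lines
p1 <- c("This is first line.", "This is second line.")

# Print one element per line to the console
cat(p1, sep = "\n")
```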

Here is a roundabout possibility, using split, grepl, and cumsum.

Some sample data:

temp <- c("This is first line.", "This is second line.", 
      "\\delimiter\\new\\one", "This is third line.", 
      "This is fourth line.", "\\delimiter\\new\\one", 
      "This is fifth line") 
# [1] "This is first line." "This is second line." "\\delimiter\\new\\one" 
# [4] "This is third line." "This is fourth line." "\\delimiter\\new\\one" 
# [7] "This is fifth line"

Use split, with cumsum applied to grepl to generate the "groups":

temp1 <- split(temp, cumsum(grepl("delimiter", temp))) 
temp1 
# $`0` 
# [1] "This is first line." "This is second line." 
# 
# $`1` 
# [1] "\\delimiter\\new\\one" "This is third line." "This is fourth line." 
# 
# $`2` 
# [1] "\\delimiter\\new\\one" "This is fifth line"

If further cleanup is needed, here is one option:

lapply(temp1, function(x) { 
    x[grep("delimiter", x)] <- NA 
    x[complete.cases(x)] 
}) 
# $`0` 
# [1] "This is first line." "This is second line." 
# 
# $`1` 
# [1] "This is third line." "This is fourth line." 
# 
# $`2` 
# [1] "This is fifth line"
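As a possible variation, the delimiter lines can be dropped before splitting, so that no separate cleanup pass is needed. This is only a sketch on the same sample data; `is_delim`, `groups`, and `paragraphs` are names introduced here for illustration:

```r
temp <- c("This is first line.", "This is second line.",
          "\\delimiter\\new\\one", "This is third line.",
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")

# TRUE for every delimiter line
is_delim <- grepl("delimiter", temp, fixed = TRUE)

# The running count of delimiters seen so far acts as a group id
groups <- cumsum(is_delim)

# Split only the non-delimiter lines by their group id
paragraphs <- split(temp[!is_delim], groups[!is_delim])
```

Because the grouping vector is also subset with `!is_delim`, two consecutive delimiter lines would not produce an empty paragraph. The group between them simply does not appear in the result.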

Source

2013-03-20 04:58:52 A5C1D2H2I1M1N2O1R2T1
