2013-03-20
1

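How do I split multi-line text in R?

I have an input file that contains one paragraph. I need to split the paragraph into two sub-paragraphs according to a pattern.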

paragraph.xml

<Text> 
     This is first line. 
     This is second line. 
     \delemiter\new\one 
     This is third line. 
     This is fourth line. 
</Text> 

Inside the R code:

library(XML)
doc <- xmlTreeParse("paragraph.xml")
top <- xmlRoot(doc)
text <- top[[1]]

I need to split this paragraph into 2 paragraphs.

Paragraph 1

This is first line. 
This is second line. 

Paragraph 2

This is third line.
This is fourth line.

I found the strsplit function very useful, but it never splits the multi-line text.

Is this a 'character' vector of length one with embedded newlines, a list, or a text file you have not read in yet? – 2013-03-20 04:34:59

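Please edit your question to show the exact structure of your data (or some sample data). For example, paste the result of 'dput(head(yourdata))'. It is currently unclear how the new lines are determined. – Ben 2013-03-20 04:36:07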

Answers

2

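Since you have an XML file, it is better to use the facilities of the XML package. I see you have already started using it, so this continues from what you began.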

library(XML)
doc <- xmlParse('paragraph.xml') ## equivalent to xmlTreeParse(..., useInternalNodes = TRUE)
## extract the text of the Text node
mytext <- xpathSApply(doc, '//Text/text()', xmlValue)
## convert it to a vector of lines using scan; strip.white drops the indentation
lines <- scan(text = mytext, sep = '\n', what = 'character', strip.white = TRUE)
## drop any empty lines left after stripping
lines <- lines[nzchar(lines)]
## get the delimiter index
delim <- which(lines == "\\delemiter\\new\\one")
## get the 2 paragraphs (seq_len is safe even if the delimiter is the first line)
p1 <- lines[seq_len(delim - 1)]
p2 <- lines[seq(delim + 1, length(lines))]

You can then use 'paste' or 'write' to get the paragraph structure. For example, using 'write':

write(p1,"",sep='\n') 

This is first line. 
This is second line. 
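
The same steps can be tried without an XML file. The following is only a minimal sketch in base R, assuming the text is already held in a single string: the names raw_text, para1, and para2 are hypothetical, and trimws needs R 3.2.0 or later. It shows that strsplit does handle multi-line text once the newline itself is used as the split pattern, and that 'paste' with collapse rebuilds each paragraph.

```r
# Hypothetical stand-in for the text node read from paragraph.xml
raw_text <- "
     This is first line.
     This is second line.
     \\delemiter\\new\\one
     This is third line.
     This is fourth line.
"

# Split the single string into lines by using the newline as the pattern
lines <- strsplit(raw_text, "\n", fixed = TRUE)[[1]]
lines <- trimws(lines)          # drop the indentation around each line
lines <- lines[nzchar(lines)]   # drop empty lines

# Locate the delimiter line by exact comparison
delim <- which(lines == "\\delemiter\\new\\one")

# Rebuild each paragraph as one string with embedded newlines
para1 <- paste(lines[seq_len(delim - 1)], collapse = "\n")
para2 <- paste(lines[seq(delim + 1, length(lines))], collapse = "\n")

# Print the two paragraphs separated by a blank line
cat(para1, "\n\n", para2, "\n", sep = "")
```

This sketch assumes the delimiter occurs exactly once and is not the last line; other inputs would need extra checks.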

Can I use 'cat' instead of the 'write' function to get the paragraph structure? – Manish 2013-03-20 06:35:39

@user15662 Of course. Replace 'write' with 'cat'. – agstudy 2013-03-20 06:37:37

1

Here is a roundabout possibility, using split, grepl, and cumsum.

Some sample data:

temp <- c("This is first line.", "This is second line.", 
      "\\delimiter\\new\\one", "This is third line.", 
      "This is fourth line.", "\\delimiter\\new\\one", 
      "This is fifth line") 
# [1] "This is first line." "This is second line." "\\delimiter\\new\\one" 
# [4] "This is third line." "This is fourth line." "\\delimiter\\new\\one" 
# [7] "This is fifth line" 

Use split, after using cumsum and grepl to generate the "groups":

temp1 <- split(temp, cumsum(grepl("delimiter", temp))) 
temp1 
# $`0` 
# [1] "This is first line." "This is second line." 
# 
# $`1` 
# [1] "\\delimiter\\new\\one" "This is third line." "This is fourth line." 
# 
# $`2` 
# [1] "\\delimiter\\new\\one" "This is fifth line" 
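
To see why this grouping works, it may help to look at the intermediate values. The sketch below repeats the sample data so it runs on its own; the names is_delim and group_id are hypothetical and not part of the answer above.

```r
temp <- c("This is first line.", "This is second line.",
          "\\delimiter\\new\\one", "This is third line.",
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")

# TRUE wherever a line contains the word "delimiter"
is_delim <- grepl("delimiter", temp)
is_delim
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

# The running count of TRUE values goes up by one at each delimiter,
# so every line after a delimiter shares that delimiter's group number
group_id <- cumsum(is_delim)
group_id
# [1] 0 0 1 1 1 2 2
```

Because the count increases on the delimiter line itself, each delimiter ends up as the first element of its own group, which is why it shows up in the split output.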

If further cleanup is needed, here is one option:

lapply(temp1, function(x) { 
    x[grep("delimiter", x)] <- NA 
    x[complete.cases(x)] 
}) 
# $`0` 
# [1] "This is first line." "This is second line." 
# 
# $`1` 
# [1] "This is third line." "This is fourth line." 
# 
# $`2` 
# [1] "This is fifth line"
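
The grouping and cleanup steps could also be combined by dropping the delimiter lines before splitting. This is only a sketch of one possible variant: the helper split_paragraphs is hypothetical, and it assumes the delimiter occupies a whole line, so it matches by exact comparison rather than with grepl.

```r
# Hypothetical helper, not part of the original answer
split_paragraphs <- function(lines, delimiter) {
  is_delim <- lines == delimiter   # exact match on whole lines
  groups <- cumsum(is_delim)       # group number for every line
  # Remove the delimiter lines first, then split the rest by group
  unname(split(lines[!is_delim], groups[!is_delim]))
}

temp <- c("This is first line.", "This is second line.",
          "\\delimiter\\new\\one", "This is third line.",
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")

paragraphs <- split_paragraphs(temp, "\\delimiter\\new\\one")
length(paragraphs)
# [1] 3
paragraphs[[2]]
# [1] "This is third line."  "This is fourth line."
```

Exact matching avoids treating an ordinary sentence that merely mentions the word "delimiter" as a separator, at the cost of requiring the delimiter text to be known in full.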