2017-04-05

tm_map: conditionally merging lines

I extracted text from a PDF file and created a corpus object from it.

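In the text, I have lines that end in "," or "-", and I want to append the following line to each of them, because it belongs to the same sentence.

For example, I have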

[1566] "this and other southeastern states (Eukerria saltensis,"  
[1567] "Sparganophilus helenae, Sp. tennesseensis). In the" 

and I would like to have this instead:

[1566] "this and other southeastern states (Eukerria saltensis, Sparganophilus helenae, Sp. tennesseensis). In the" 

I tried things like replacing the line break, but without success:

tm_map(myCorpus, content_transformer(gsub), pattern =",$\n",replacement = "") 

Any ideas on how to do this in R?

Answers

Thanks, it works!

I had to wrap it in a function to make it work with tm_map, though:

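clean.X <- function(X) {
    # Collapse all lines into one string so the line breaks can be matched.
    X2 <- paste0(X, collapse = "\n")
    # A line ending in "," is joined to the next line with a space.
    X2 <- gsub(",\\n", ", ", X2)
    # A line ending in "-" is joined directly to the next line.
    X2 <- gsub("\\-\\n", "-", X2)
    # Split back into one element per remaining line.
    X2 <- unlist(strsplit(X2, "\\n"))
    return(X2)
}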

txt2 <- tm_map(txt, content_transformer(clean.X)) 
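
As a quick sanity check, the same idea can be tried on the two lines from the question using base R only. This is a minimal sketch, not the exact function above: the name `merge_continued_lines` is made up for illustration, and it uses `fixed = TRUE` so the patterns are treated as literal strings instead of regular expressions.

```r
# Hypothetical helper: same collapse / replace / split idea, with literal patterns.
merge_continued_lines <- function(lines) {
    joined <- paste0(lines, collapse = "\n")
    joined <- gsub(",\n", ", ", joined, fixed = TRUE)  # comma at end of line
    joined <- gsub("-\n", "-", joined, fixed = TRUE)   # hyphen at end of line
    unlist(strsplit(joined, "\n", fixed = TRUE))
}

# The two lines shown in the question.
lines <- c(
    "this and other southeastern states (Eukerria saltensis,",
    "Sparganophilus helenae, Sp. tennesseensis). In the"
)

merged <- merge_continued_lines(lines)
print(merged)
```

The comma inside "helenae, Sp." is left alone, because it is followed by a space and not by a line break.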

Here is one approach, building on your idea of working with the line breaks...

txt <- c("aaa","bbc,","df","fgh-","jkh-","dfsf","gghf") 

txt2 <- paste0(txt,collapse="\n") 
txt2 <- gsub(",\\n",", ",txt2) 
txt2 <- gsub("\\-\\n","-",txt2) 
txt2 <- unlist(strsplit(txt2,"\\n")) 

txt2 
[1] "aaa" "bbc, df" "fgh-jkh-dfsf" "gghf"
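
The collapse step matters because each line is a separate element of the character vector, so no single element contains a line break for the pattern to match. The short sketch below illustrates this with a made-up three-line vector; the variable names are only for illustration.

```r
txt <- c("aaa", "bbc,", "df")

# Applied element by element, the pattern never matches:
# none of the strings contains "\n".
unchanged <- gsub(",\\n", ", ", txt)
print(identical(unchanged, txt))

# After collapsing, the line breaks sit inside one string and can be replaced.
joined <- paste0(txt, collapse = "\n")
merged2 <- unlist(strsplit(gsub(",\\n", ", ", joined), "\\n"))
print(merged2)
```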