2015-01-03 32 views
5

Shorten (limit) the length of a sentence

I have a column with very long names, and I want to cut them to a maximum length of 40 characters.

Sample data:

x <- c("This is the longest sentence in world, so now just make it longer", 
"No in fact, this is the longest sentence in entire world, world, world, world, the whole world") 

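I want to shorten each sentence to a length of about 40 characters (-/+ 3 nchar), so that a sentence is never cut in the middle of a word. (The cut point therefore depends on the whitespace between words.)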

I would also like to add three dots after each shortened sentence.

The desired output would look like this:

c("This is the longest sentence...","No in fact, this is the longest...") 

This function just blindly cuts at 40 characters:

strtrim(x, 40) 
+1

Have you tried putting a solution together yourself yet? 'strsplit', 'nchar', 'cumsum' and 'substr' are the components you need to use... –

+0

Yes, I tried all sorts of things, but none of them worked as expected. Actually, splitting the sentence with strsplit is the way to go... – Maximilian

+0

'strwrap(x,width = 40)'? – lukeA
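The components named in the first comment ('strsplit', 'nchar', 'cumsum', 'substr') can be combined roughly as follows. This is only a minimal base R sketch: the function name truncate_words and its defaults are made up for illustration, it splits on single spaces only, and it does not handle NA values.

```r
# Hypothetical helper (name and defaults are illustrative, not from the thread).
truncate_words <- function(x, width = 40, dots = "...") {
  vapply(x, function(s) {
    if (nchar(s) <= width) return(s)            # short enough: leave untouched
    words <- strsplit(s, " ", fixed = TRUE)[[1]]
    # Position in s at which each word ends (separating spaces included).
    ends <- cumsum(nchar(words) + 1) - 1
    keep <- max(1, sum(ends <= width))          # always keep at least one word
    paste0(substr(s, 1, ends[keep]), dots)
  }, character(1), USE.NAMES = FALSE)
}

x <- c("This is the longest sentence in world, so now just make it longer",
       "No in fact, this is the longest sentence in entire world, world, world, world, the whole world")
truncate_words(x)
## [1] "This is the longest sentence in world,..."
## [2] "No in fact, this is the longest sentence..."
```

It cuts at the last word boundary at or before 40 characters, so it keeps more words than the desired output in the question; lowering width would bring it closer. Being an sapply-style loop, it is not expected to be fast on large inputs.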

Answers

5

OK, I have a better solution now :)

x <- c("This is the longest sentence in world, so now just make it longer","No in fact, this is the longest sentence in entire world, world, world, world, the whole world") 

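library(stringi)

# Match at most 40 characters that are followed by a space or the end of the string
extract <- function(x){ 
    result <- stri_extract_first_regex(x, "^.{0,40}( |$)") 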
    longer <- stri_length(x) > 40 
    result[longer] <- stri_paste(result[longer], "...") 
    result 
} 
extract(x) 
## [1] "This is the longest sentence in world, ..." "No in fact, this is the longest sentence ..." 

Benchmark, new vs. old (32 000 sentences):

microbenchmark(sapply(x, cutAndAddDots, USE.NAMES = FALSE), extract(x), times=5) 
Unit: milliseconds 
                                       expr        min         lq     median         uq      max neval
sapply(x, cutAndAddDots, USE.NAMES = FALSE) 3762.51134 3762.92163 3767.87134 3776.03706 3788.139     5
                                 extract(x)   56.01727   57.18771   58.50321   79.55759   97.924     5

Old version

This solution requires the stringi package and always adds three dots at the end of the string...

require(stringi) 
sapply(x, function(x) stri_paste(stri_wrap(x, 40)[1],"..."),USE.NAMES = FALSE) 
## [1] "This is the longest sentence in world..." "No in fact, this is the longest..." 

This one adds the three dots only to sentences that are longer than 40 characters:

require(stringi) 
cutAndAddDots <- function(x){ 
    w <- stri_wrap(x, 40) 
    if(length(w) > 1){ 
        stri_paste(w[1],"...") 
    }else{ 
        w[1] 
    } 
} 
sapply(x, cutAndAddDots, USE.NAMES = FALSE) 
## [1] "This is the longest sentence in world" "No in fact, this is the longest..." 

Performance note: setting normalize=FALSE in stri_wrap can speed this up about 3 times (tested on 30 000 sentences).

Test data:

x <- stri_rand_lipsum(3000) 
x <- unlist(stri_split_regex(x,"(?<=\\.) ")) 
head(x) 
[1] "Lorem ipsum dolor sit amet, vel commodo in."              
[2] "Ultricies mauris sapien lectus dignissim."              
[3] "Id pellentesque semper turpis habitasse egestas rutrum ligula vulputate laoreet mollis id."  
[4] "Curabitur volutpat efficitur parturient nibh sociosqu, faucibus tellus, eleifend pretium, quis." 
[5] "Feugiat vel mollis ultricies ut auctor."               
[6] "Massa neque auctor lacus ridiculus."                
stri_length(head(x)) 
[1] 43 41 90 95 39 35 

cutAndAddDots <- function(x){ 
    w <- stri_wrap(x, 40, normalize = FALSE) 
    if(length(w) > 1){ 
        stri_paste(w[1],"...") 
    }else{ 
        w[1] 
    } 
} 
cutAndAddDotsNormalize <- function(x){ 
    w <- stri_wrap(x, 40, normalize = TRUE) 
    if(length(w) > 1){ 
        stri_paste(w[1],"...") 
    }else{ 
        w[1] 
    } 
} 
require(microbenchmark) 
microbenchmark(sapply(x, cutAndAddDots, USE.NAMES = FALSE),sapply(x, cutAndAddDotsNormalize, USE.NAMES = FALSE),times=3) 
Unit: seconds 
                                                expr       min        lq    median        uq       max
         sapply(x, cutAndAddDots, USE.NAMES = FALSE)  3.917858  3.967411  4.016964  4.055571  4.094178
sapply(x, cutAndAddDotsNormalize, USE.NAMES = FALSE) 13.493732 13.651451 13.809170 13.917854 14.026538
+0

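This solution is nice, but I need the dots only on sentences that were actually shortened. I also need a faster solution, so I will wait a bit. Thanks! – Maximilian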

+4

@Max I would think 'stringi' is the fastest.. – akrun

+0

@akrun: If you say so... I believe you :) – Maximilian

4

Base R solution:

baseR <- function(x){ 
    # at most 40 characters, followed by a space or the end of the string
    m <- regexpr("^.{0,40}( |$)", x) 
    result <- regmatches(x,m) 
    longer <- nchar(x)>40 
    result[longer] <- paste(result[longer],"...",sep = "") 
    result 
} 
baseR(x)==extract(x) 
[1] TRUE TRUE 

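It works just like @bartektartanus's extract :) but it is slower... I tested it on the data generated by his code. Still, if you don't want to use any external packages, this one works!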

microbenchmark(baseR(x), extract(x)) 
Unit: milliseconds 
      expr       min       lq    median        uq      max neval
  baseR(x) 101.20905 107.0264 108.79086 111.03229 162.6375   100
extract(x)  52.83951  54.6931  55.46628  59.37808 103.0631   100
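One edge case worth keeping in mind with the regex approach: with a pattern such as "^.{0,40}( |$)" (at most 40 characters followed by a space or the end of the string), a string longer than 40 characters with no space among its first 41 characters produces no match at all, and regmatches() silently drops non-matching elements, so the result can end up shorter than the input. Below is a possible guard, as a sketch only; the name baseR_safe and the hard-cut fallback are my own assumptions, not part of the answer above.

```r
# Hypothetical variant of baseR(); name and fallback rule are assumptions.
baseR_safe <- function(x, width = 40) {
  pattern <- sprintf("^.{0,%d}( |$)", width)
  m <- regexpr(pattern, x)
  # Fallback for strings without a usable word boundary: plain hard cut.
  result <- strtrim(x, width)
  matched <- m != -1
  # regmatches() returns one value per matching element only.
  result[matched] <- regmatches(x, m)
  longer <- nchar(x) > width
  result[longer] <- paste0(result[longer], "...")
  result
}

baseR_safe(c("short one", strrep("a", 50)))
## [1] "short one"
## [2] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa..."
```

For inputs where every long string has a space early enough, this should return the same values as the regex version; it only differs when there is no match. NA inputs are not handled here.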
2

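I thought I would post this as well. It is definitely not at stringi speed, but it's not too shabby. I needed something to get around the print method of str, so I wrote this.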

charTrunc <- function(x, width, end = " ...") { 
    ncw <- nchar(x) >= width 
    trm <- strtrim(x[ncw], width - nchar(end)) 
    trimmed <- gsub("\\s+$", "", trm) 
    replace(x, ncw, paste0(trimmed, end)) 
} 

Tested on the strings from @bartektartanus's answer:

x <- stri_rand_lipsum(3000) 
x <- unlist(stri_split_regex(x,"(?<=\\.) ")) 

library(microbenchmark) 
microbenchmark(charTrunc = { 
    out <- charTrunc(x, 40L) 
    }, 
    times = 3 
) 

Unit: milliseconds 
     expr     min      lq     mean  median       uq      max neval
charTrunc 506.553 510.988 513.4603 515.423 516.9139 518.4049     3

head(out) 
# [1] "Lorem ipsum dolor sit amet, venenati ..." 
# [2] "Tincidunt at pellentesque id sociosq ..." 
# [3] "At etiam quis et mauris non tincidun ..." 
# [4] "In viverra aenean nisl ex aliquam du ..." 
# [5] "Dui mi mauris ac lacus sit hac."   
# [6] "Ultrices faucibus sed justo ridiculu ..."