2017-07-30

Can I sensibly split these numeric strings?

I have a vector of strings like this:

x <- c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%") 

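Each one is a fraction followed by the percentage that the fraction represents, and somewhere along the way the two got mashed together. So the first element in this example means that 4 out of 7 is 57.1%. I can easily get the first number, the one before the / (e.g. with stringr::word(x, 1, sep = "/")), but the second number can be one or two characters long, so I'm struggling to come up with a way to extract it. I don't need the % value, since it is easy to recalculate once I have the two numbers.

Can anyone see a way to do this?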

Answers

1

A somewhat ugly solution, but it seems to do what you want:

x <- c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%") 

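split_perc <- function(x, signif_digits = 1) {
    # Drop the percent sign
    x <- gsub("%", "", x)
    # A "-" marks an undefined percentage (e.g. 0/0), so there is nothing to split
    if (grepl("-", x)) return(list(NA, NA))
    # Candidate end positions for the denominator: from the first character
    # after "/" up to the last position that still leaves a percentage digit
    index1 <- gregexpr("/", x)[[1]][1] + 1
    index2 <- gregexpr("\\.", x)[[1]][1] - 2
    if (index2 == -3) { index2 <- nchar(x) - 1 }  # no decimal point in x

    found <- FALSE
    indices <- seq(index1, index2)
    k <- 1
    while (!found && k <= length(indices)) {
        str1 <- substr(x, 1, indices[k])
        num1 <- as.numeric(strsplit(str1, "/")[[1]][1])
        num2 <- as.numeric(strsplit(str1, "/")[[1]][2])
        # Percentage implied by the candidate fraction vs. the one in the string,
        # both rounded to signif_digits decimal places
        value1 <- round(num1 / num2 * 100, signif_digits)
        value2 <- round(as.numeric(substr(x, indices[k] + 1, nchar(x))), signif_digits)
        if (value1 == value2) {
            found <- TRUE
        } else {
            k <- k + 1
        }
    }
    if (found) return(list(num1, num2))
    list(NA, NA)
}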

do.call(rbind,lapply(x,split_perc)) 

Output:

 [,1] [,2] 
[1,] 4 7 
[2,] 0 1 
[3,] 6 10 
[4,] NA NA 
[5,] 11 20 

A few more examples:

y = c("11/2055.003%","11/2055.2%","40/7057.1%") 
do.call(rbind,lapply(y,split_perc)) 

    [,1] [,2] 
[1,] 11 20 # values are rounded to 1 decimal place by default, so 55.003 matches 55.0
[2,] NA NA # no match found, since 55.0 != 55.2
[3,] 40 70 

Thanks a lot! – Mart

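Oddly enough, I only found a bug a month later: "11/11100%" seems to be a problem. It should give 11 and 11, but this function returns 11 and 1. My bad for not including a 100% example at the start. So far, though, all the other cases have worked perfectly: 10, 10, 10, 11, 11, 12, 12 and 12. – Mart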
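The "11/11100%" case is genuinely ambiguous rather than a plain coding slip: both 11/1 = 1100% and 11/11 = 100% reproduce the digits, and a loop that stops at the first match picks the shorter denominator. Here is a minimal base R sketch of one way around it. It lists every split that is consistent with the stated percentage and then, assuming the numerator never exceeds the denominator, drops anything above 100%. The helper name `candidate_splits` and that assumption are illustrative, not part of the answer above.

```r
# Hypothetical helper: every split of "num/denpct%" whose fraction reproduces
# the stated percentage (both sides rounded to `digits` decimal places).
candidate_splits <- function(s, digits = 1) {
    s <- sub("%$", "", s)
    parts <- strsplit(s, "/", fixed = TRUE)[[1]]
    num <- as.numeric(parts[1])
    rest <- parts[2]
    empty <- data.frame(numerator = numeric(0), denominator = numeric(0),
                        percent = numeric(0))
    if (is.na(rest) || nchar(rest) < 2) return(empty)

    # Cut positions that leave at least one character on each side
    k <- seq_len(nchar(rest) - 1)
    den <- suppressWarnings(as.numeric(substring(rest, 1, k)))
    pct <- suppressWarnings(as.numeric(substring(rest, k + 1)))
    implied <- round(num / den * 100, digits)

    # Unparseable pieces (NA) and undefined fractions (NaN, Inf) never match
    ok <- !is.na(pct) & is.finite(implied) & abs(implied - round(pct, digits)) < 1e-9
    data.frame(numerator = rep(num, sum(ok)), denominator = den[ok], percent = pct[ok])
}

both <- candidate_splits("11/11100%")
both  # two consistent readings: 11/1 and 11/11

# Assumption: numerator <= denominator, so percentages above 100 are ruled out
kept <- both[both$percent <= 100, ]
kept
```

If the data can legitimately contain percentages above 100, this filter is not valid and the string cannot be split without extra information.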

0

As you point out, once you have the fraction, the percentage can be recalculated. Can you use that fact to work out where the split should be?

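library(stringr)  # for word()

GuessSplit <- function(string) {

    tolerance <- 0.001 # How close should the fraction be to the stated percentage?
    numerator <- as.numeric(word(string, 1, sep = "/"))
    second.half <- word(string, 2, sep = "/")
    second.half <- strsplit(second.half, '')[[1]]

    # assuming they all end in percent signs
    possibilities <- length(second.half) - 1

    # Stop one short so that at least one character is left for the percentage
    for (position in seq_len(possibilities - 1)) {

        denom.guess <- suppressWarnings(as.numeric(paste0(second.half[1:position], collapse = '')))
        percent.guess <- suppressWarnings(as.numeric(paste0(second.half[(position + 1):possibilities], collapse = ''))) / 100

        value <- numerator / denom.guess

        # isTRUE() treats NA/NaN comparisons (e.g. "0/0-%") as "no match"
        if (isTRUE(abs(value - percent.guess) < tolerance)) {
            return(list(numerator = numerator, denominator = denom.guess))
        }
    }

    # No split reproduces the stated percentage
    list(numerator = numerator, denominator = NA)
}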

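This needs a little love to handle the weird cases, and it could probably be more graceful when it can't find an answer. I'm also not sure what the best return type is. Maybe you only need the denominator, since the numerator is easy to get, but I figured a list of both would be the most general. I hope it's a reasonable start?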
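One possible way to settle the two open points, the tolerance and the return type, is sketched below in base R. The tolerance is derived from the number of decimals printed in the percentage (half a unit in the last place), and a failed search returns NA so the results fit in a plain numeric vector. The name `guess_denominator` is hypothetical, and the sketch assumes the percentages were rounded, not truncated. It still takes the first matching split, so ambiguous inputs such as "11/11100%" resolve to the shorter denominator.

```r
# Hypothetical helper: denominator of one "num/denpct%" string, or NA if no
# split reproduces the stated percentage.
guess_denominator <- function(s) {
    s <- sub("%$", "", s)
    parts <- strsplit(s, "/", fixed = TRUE)[[1]]
    num <- as.numeric(parts[1])
    rest <- parts[2]
    if (is.na(rest) || nchar(rest) < 2) return(NA_real_)

    # The decimals shown in the percentage bound how far rounding can move it
    decimals <- if (grepl(".", rest, fixed = TRUE)) nchar(sub(".*\\.", "", rest)) else 0
    tolerance <- 0.5 * 10^(-decimals) + 1e-9

    for (k in seq_len(nchar(rest) - 1)) {
        den <- suppressWarnings(as.numeric(substr(rest, 1, k)))
        pct <- suppressWarnings(as.numeric(substr(rest, k + 1, nchar(rest))))
        # isTRUE() turns NA/NaN comparisons (e.g. "0/0-%") into "no match"
        if (isTRUE(abs(num / den * 100 - pct) <= tolerance)) return(den)
    }
    NA_real_
}

# The question's data plus one made-up case ("1/333%", i.e. 1/3 = 33%)
x <- c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%", "1/333%")
numerator <- as.numeric(sub("/.*", "", x))
denominator <- vapply(x, guess_denominator, numeric(1), USE.NAMES = FALSE)
data.frame(numerator, denominator)
```

Returning a data frame keeps the numerator and denominator aligned with the input order, and the NA rows are easy to filter or inspect afterwards.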

1

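A solution using tidyverse and stringr. We can define a function that splits the second number at every possible position, then calculate the percentage for each split to see which one makes sense. dt2 is a data frame showing the best split position, and the numbers you need are in column V3.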

library(tidyverse) 
library(stringr) 

x <- c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%") 

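dt <- str_split_fixed(x, pattern = "/", n = 2) %>% 
    # as_data_frame() is deprecated; as.data.frame() names the columns V1 and V2
    as.data.frame(stringsAsFactors = FALSE) %>% 
    mutate(ID = 1:n()) %>% 
    select(ID, V1, V2) 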

# Design a function to split the second column based on position 
split_df <- function(position, dt){ 
    dt_temp <- dt %>% 
    mutate(V3 = str_sub(V2, 1, position)) %>% 
    mutate(V4 = str_sub(V2, position + 1)) %>% 
    mutate(Pos = position) 

    return(dt_temp) 
} 

# Process the data 
dt2 <- map_df(1:3, split_df, dt = dt) %>% 
    # Remove % in V4 
    mutate(V4 = str_replace(V4, "%", "")) %>% 
    # Convert V1, V3 and V4 to numeric 
    # (funs() is deprecated; across() requires dplyr >= 1.0.0) 
    mutate(across(c(V1, V3, V4), as.numeric)) %>% 
    # Calculate possible percentage 
    mutate(V5 = V1/V3 * 100) %>% 
    # Calculate the difference between V4 and V5 
    mutate(V6 = abs(V4 - V5)) %>% 
    # Select the smallest difference based on V6 for each group 
    group_by(ID) %>% 
    arrange(ID, V6) %>% 
    slice(1) 

# The best split is now in V3 
dt2$V3 
[1] 7 1 10 0 20