2017-04-08
1

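Incorporating a regex as the separator in tidyr

I'm trying to tidy my data, but it gets hung up on certain cell strings, so I want to adjust the separator so that it matches only the following sequence: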

-End of word
-";"
-1 space
-Capital letter

I'm new to regex, but this seems to capture what I'm looking for:

";\s[A-Z]" 

However, it also includes the first letter of the second word, which I don't want to be part of the separator. I'm also not sure how to incorporate it into my separate_rows statement.

    library(tidyverse)

    # Create test data
    mydata <- as.data.frame(c("Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2","Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2","Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2"))
    names(mydata) <- "TEST"
    mydata$TEST <- as.character(mydata$TEST)

    # Convert to 2 columns with a row counter
    mydata %>%
      mutate(row = row.names(mydata)) %>%
      separate_rows(TEST, sep = '; ')

Current output:

row|TEST 
1|Column1 = answer1 
1|Column2 = answer2 
1|incorrectly formatted - should be connected with answer2 
2|Column1 = answer1 
2|Column2 = answer2 
2|incorrectly formatted - should be connected with answer2 
3|Column1 = answer1 
3|Column2 = answer2 
3|incorrectly formatted - should be connected with answer2 

The output I'm looking for:

row|TEST 
1|Column1 = answer1 
1|Column2 = answer2; incorrectly formatted - should be connected with answer2 
2|Column1 = answer1 
2|Column2 = answer2; incorrectly formatted - should be connected with answer2 
3|Column1 = answer1 
3|Column2 = answer2; incorrectly formatted - should be connected with answer2 

Any help is greatly appreciated!

Answers

3

You can use a positive lookaround (in your case, a lookahead) to solve the problem:

See: http://www.regular-expressions.info/lookaround.html

    library(tidyverse)

    mydata <- as.data.frame(c("Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2","Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2","Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2"))
    names(mydata) <- "TEST"
    mydata$TEST <- as.character(mydata$TEST)

    # Split on ";" only when it is followed by a space and a capital letter.
    # The lookahead (?=...) checks for that text without consuming it.
    mydata %>%
      mutate(row = row.names(mydata)) %>%
      separate_rows(TEST, sep = ';(?=\\s[A-Z])')

Output:

  row TEST
1   1 Column1 = answer1
2   1 Column2 = answer2; incorrectly formatted - should be connected with answer2
3   2 Column1 = answer1
4   2 Column2 = answer2; incorrectly formatted - should be connected with answer2
5   3 Column1 = answer1
6   3 Column2 = answer2; incorrectly formatted - should be connected with answer2

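The regex inside the parentheses checks for the pattern but does not capture it, so the text matched inside the lookahead never becomes part of the match (and therefore is not removed as part of the separator).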
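To make the difference concrete, here is a minimal sketch in base R that compares the consuming pattern from the question with a lookahead version. It uses strsplit with perl = TRUE instead of separate_rows so that it runs without extra packages; the variable names x, consumed and kept are illustrative only. Note that this sketch moves the \\s outside the lookahead so the space is consumed along with the semicolon. With the pattern above, ';(?=\\s[A-Z])', only the semicolon is consumed, so the second piece keeps its leading space.

```r
x <- "Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2"

# Consuming pattern from the question: the capital letter is part of the
# delimiter, so it is removed from the second piece.
consumed <- strsplit(x, ";\\s[A-Z]", perl = TRUE)[[1]]

# Lookahead variant: ";" and the space are consumed, while the capital
# letter is only checked and stays in the second piece.
kept <- strsplit(x, ";\\s(?=[A-Z])", perl = TRUE)[[1]]

print(consumed)
print(kept)
```

The second semicolon is followed by a lowercase "i", so neither pattern splits there. Assuming separate_rows accepts the same lookahead syntax (as the answer above shows it does), the variant pattern could be passed as sep in the same way.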

1

We can use mutate to substitute a different delimiter and then call separate_rows:

    library(tidyverse)

    # Replace the "; " that precedes "Column" with a comma, then split on the comma.
    # sub() replaces only the first match in each string, which is enough for this data.
    rownames_to_column(mydata, 'rn') %>%
      mutate(TEST = sub(";\\s+(?=Column)", ",", TEST, perl = TRUE)) %>%
      separate_rows(TEST, sep = ",")
    #   rn TEST
    # 1  1 Column1 = answer1
    # 2  1 Column2 = answer2; incorrectly formatted - should be connected with answer2
    # 3  2 Column1 = answer1
    # 4  2 Column2 = answer2; incorrectly formatted - should be connected with answer2
    # 5  3 Column1 = answer1
    # 6  3 Column2 = answer2; incorrectly formatted - should be connected with answer2
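
The substitution approach relies on two things: each string has only one real delimiter (sub replaces just the first match), and the data contains no commas. A hedged variant for data where either assumption fails is sketched below in base R. The sample string x and the choice of the ASCII unit separator ("\x1F") as a temporary marker are assumptions made for illustration, not part of the original data.

```r
# Hypothetical input with a comma in a value and more than two fields.
x <- "Column1 = a, b; Column2 = c; not a new field; Column3 = d"

# gsub() replaces every "; " that precedes "Column", not just the first one.
# "\x1F" (ASCII unit separator) is assumed never to occur in the data.
marked <- gsub(";\\s+(?=Column)", "\x1F", x, perl = TRUE)

# Split on the marker literally (fixed = TRUE), so the comma is left alone.
parts <- strsplit(marked, "\x1F", fixed = TRUE)[[1]]

print(parts)
```

In the pipeline above, the same idea would mean using gsub in the mutate step and passing the marker as sep to separate_rows.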