2017-04-07 102 views
1

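I have two data frames, occupation and data. I want to match each entry in the occupation data frame against data and assign the corresponding class by adding a column to the occupation data frame. How can I match sentences against sentences in R?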

occupation <- c("I am Civil Engineer human being", "Graphic Designer too late", "Architect by profession", "Sales Manager Bank", "Love my profession of Professor", "NA") 

occupation <- data.frame(occupation) 

data <- data.frame(class = c("Engineers","Designer","Artist","Designer","Poetry","Banker and Prof"), Occupation = c("Civil Engineer", "Graphic Designer", "Painter","Poetry","Architect(prof)", "Sales Manager Bank")) 

I want output like this:

occupation                         class
I am Civil Engineer human being    Engineers
Painter Architect Poetry           Artists
Graphic Designer too late          Designers
Architect by Painter profession    Architect
Sales Manager Bank                 Banker and Prof
Love my profession of Professor    NA
NA                                 NA

I tried the following, but it does not return anything useful:

occupation$value <- sapply(data$occupation, grepl, x = occupation) 
+2

Try searching for "R fuzzy matching" until you find something you like. – MrFlick

Answers

1

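I don't know how complex your data is, but this approach is useful for low-complexity strings. The agrep function lets you set a tolerance parameter (max.distance), so strings that are not exactly equal can still match: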

occupation <- data.frame(occupation = c("I am Civil Engineer human being", "Graphic Designer too late", "Architect by profession", "Sales Manager Bank"), 
         stringsAsFactors = FALSE) 
data <- data.frame(class = c("Engineers","Designer","Architect","Banker and Prof"), 
        occupation = c("Civil Engineer", "Graphic Designer", "Architect(prof)", "Sales Manager Bank"), 
        stringsAsFactors = FALSE) 

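occupation$value <- sapply(occupation$occupation, function(x) { 
    # agrepl returns TRUE/FALSE, so match.class is a logical vector with one entry per class
    match.class <- sapply(data$class, function(y) agrepl(y, x, max.distance = 0.2)) 
    # keep the classes whose name approximately occurs in the sentence x
    data$class[which(match.class)] 
    } 
) 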

If you raise max.distance you can also catch the last string, but then the previous strings start producing extra matches as well.

     occupation   value 
1 I am Civil Engineer human being Civil Engineer 
2  Graphic Designer too late Graphic Designer 
3   Architect by profession Architect(prof) 
4    Sales Manager Bank  
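To make the effect of the tolerance concrete, here is a minimal sketch using agrepl, the logical-valued sibling of agrep. The variable names (sentence, classes, exact_only, tolerant) are only for illustration and are not part of the code above.

```r
sentence <- "I am Civil Engineer human being"
classes  <- c("Engineers", "Designer", "Architect", "Banker and Prof")

# max.distance = 0 allows no edits, so the pattern must occur verbatim.
# "Engineers" (with the trailing "s") is not a substring of the sentence.
exact_only <- vapply(classes, agrepl, logical(1),
                     x = sentence, max.distance = 0)

# max.distance = 0.2 allows edits up to 20% of the pattern length,
# which is enough to drop the trailing "s" of "Engineers".
tolerant <- vapply(classes, agrepl, logical(1),
                   x = sentence, max.distance = 0.2)

# Compare the two tolerances side by side, one row per class
print(data.frame(class = classes, exact_only, tolerant, row.names = NULL))
```

The larger max.distance is, the more unrelated strings start to match, so it is worth checking the tolerance on a few known sentences before applying it to the whole column.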

A second option is to match word by word, but for a case like "I am Civil Engineer human being" the words "I" and "am" match everything.

occupation$value <- sapply(occupation$occupation, function(x) { 
    match.class <- sapply(data$class, function(y) { 
     # split the sentence into words and test each word z against the class name y
     any(sapply(strsplit(x, ' ')[[1]], function(z) 
     agrepl(z, y, max.distance = 0.2))) 
    }) 
    data$class[which(match.class)] 
    } 
) 

So this is the result:

     occupation                 value 
1 I am Civil Engineer human being Civil Engineer, Graphic Designer, Architect(prof), Sales Manager Bank 
2  Graphic Designer too late              Graphic Designer 
3   Architect by profession              Architect(prof) 
4    Sales Manager Bank             Sales Manager Bank 
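One possible way around the short-word problem is to drop very short words before the fuzzy comparison. The following is only a sketch: the helper match_by_word and its min_chars cut-off are hypothetical names introduced here, and the threshold of 4 characters is an assumption that may need tuning for real data.

```r
classes <- c("Engineers", "Designer", "Architect", "Banker and Prof")

# Hypothetical helper: word-by-word fuzzy matching that ignores short words.
# min_chars is an assumed cut-off, not something taken from the answer above.
match_by_word <- function(sentence, classes, min_chars = 4, max_dist = 0.2) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  # words such as "I" and "am" are dropped because, with a fractional
  # tolerance, such short patterns match almost any string
  words <- words[nchar(words) >= min_chars]
  hit <- vapply(classes, function(cl) {
    any(vapply(words, function(w) agrepl(w, cl, max.distance = max_dist),
               logical(1)))
  }, logical(1))
  classes[hit]
}

res <- match_by_word("I am Civil Engineer human being", classes)
print(res)
```

Filtering by length is crude; a stop-word list would be another option, but the idea is the same: keep only the words that can realistically identify a class.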

Here is the link where you can download the code

+0

That is not what I asked. I think there has been a misunderstanding. I want to get the class from the data data frame, not the occupation. –

+0

Yes, sorry. I have changed the corresponding line in both the first and the second code block. –

+0

Changed to 'data$class[which(match.class)]'. Thanks. If the occupation field contains more than one occupation, as in my edit above, and I only want the class of the first occupation, what would the code be? –

1

agrep gives very similar output. I could not get it to work with Architect(prof), but if you remove the parentheses, it works:

data$Occupation <- sub("\\(.*", "", data$Occupation) 
data 
      class   Occupation 
1  Engineers  Civil Engineer 
2  Designer Graphic Designer 
3  Designer   Architect 
4 Banker and Prof Sales Manager Bank 

# occupation$occupation is the column of sentences in the occupation data frame
occ.class <- data$class[unlist(sapply(data$Occupation, function(x) agrep(x, occupation$occupation)))] 
occ.class 
[1] Engineers  Designer  Designer  Banker and Prof 
Levels: Banker and Prof Designer Engineers 

If you want the third one to show Architect, you should change it accordingly in your data data.frame.

As for the edit:

occ.class <- unlist(sapply(data$Occupation, function(x) agrep(x, occupation$occupation))) 
# ifelse() with a length-one test returns a single element, so use a plain if/else
if (length(occ.class)) data$class[occ.class] else NA
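For the case where a sentence may contain several occupations or none at all, a per-sentence lookup keeps the result aligned with the rows of occupation. The sketch below is one possible way to do it, not the only one: first_class is a hypothetical helper, "first" is taken to mean the first matching row of data (an assumption), and the data frames are rebuilt here with stringsAsFactors = FALSE so the block runs on its own.

```r
occupation <- data.frame(
  occupation = c("I am Civil Engineer human being", "Graphic Designer too late",
                 "Architect by profession", "Sales Manager Bank",
                 "Love my profession of Professor", NA),
  stringsAsFactors = FALSE)

data <- data.frame(
  class = c("Engineers", "Designer", "Designer", "Banker and Prof"),
  Occupation = c("Civil Engineer", "Graphic Designer", "Architect",
                 "Sales Manager Bank"),
  stringsAsFactors = FALSE)

# Hypothetical helper: class of the first row of `lookup` whose Occupation
# approximately occurs in `sentence`, or NA when nothing matches.
first_class <- function(sentence, lookup, max_dist = 0.1) {
  if (is.na(sentence)) return(NA_character_)
  hit <- vapply(lookup$Occupation,
                function(occ) agrepl(occ, sentence, max.distance = max_dist),
                logical(1))
  if (any(hit)) lookup$class[which(hit)[1]] else NA_character_
}

# One class (or NA) per row of occupation
occupation$class <- vapply(occupation$occupation, first_class, character(1),
                           lookup = data, USE.NAMES = FALSE)
print(occupation)
```

Because the function returns exactly one value per sentence, the new column is an ordinary character vector, and sentences without any match end up as NA instead of disappearing from the result.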
+0

I get output like this: > occ.class [1] "Engineers" "Engineers" "Engineers" "Engineers" –

+0

Sorry, I got the answer. Thanks. –

+0

What about the case where there is no match? I have edited my question with an example. Please have a look. –