2017-10-21 108 views

Create columns from tagged words

I have a vector of tagged words, such as c('#142#856#856.2#745', NA, '#856#855', NA, '#685', '#663', '#965.23', '#855#658#744#122').

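The words are separated by a sharp sign (#). I want to create a data frame with one column for each distinct code, then write 1 or 0 (or NA) depending on whether the code is in that row.

The idea is that each element becomes a row and each code becomes a column. If the code is in that element, the column is marked 1; if the code is not in that element, it is marked 0.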

ID | 142 | 856 | 856.2 | ... | 122
1  | 1   | 1   | 1     | ... | 0
2  | 0   | 0   | 0     | ... | 0
...

I know how to do this with a complicated algorithm and a lot of loops. But is there a simpler way to do it?

Could you explain this part: 'depending on whether the code is in that row or which'? And where does the ID come from? – PoGibas

Answers

You can do this quite easily using stringr:

# First we load the package 
library(stringr) 
# Then we create your example data vector 
tagged_vector <- c('#142#856#856.2#745', NA, '#856#855', NA, '#685', '#663', 
        '#965.23', '#855#658#744#122') 
# Next we need to get all the unique codes 
# stringr's str_extract_all() can do this: 
all_codes <- str_extract_all(string=tagged_vector, pattern='(?<=#)[0-9\\.]+') 
# We just looked for one or more numbers and/or dots following a '#' character 
# Now we just want the unique ones: 
unique_codes <- unique(na.omit(unlist(all_codes))) 
# Then we can use grepl() to check whether each code occurs in any element 
# I've also used as.numeric() since you want 0/1 instead of TRUE/FALSE 
result <- data.frame(sapply(unique_codes, function(x){ 
    as.numeric(grepl(x, tagged_vector)) 
})) 
# Then we add in your ID column and move it to the front: 
result$ID <- 1:nrow(result) 
result <- result[ , c(ncol(result), 1:(ncol(result)-1))] 

The result is:

  ID X142 X856 X856.2 X745 X855 X685 X663 X965.23 X658 X744 X122
1  1    1    1      1    1    0    0    0       0    0    0    0
2  2    0    0      0    0    0    0    0       0    0    0    0
3  3    0    1      0    0    1    0    0       0    0    0    0
4  4    0    0      0    0    0    0    0       0    0    0    0
5  5    0    0      0    0    0    1    0       0    0    0    0
6  6    0    0      0    0    0    0    1       0    0    0    0
7  7    0    0      0    0    0    0    0       1    0    0    0
8  8    0    0      0    0    1    0    0       0    1    1    1
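
One caveat about the grepl() step: it treats each code as a regular expression and matches substrings, so a hypothetical code such as 85 would also be flagged in rows that only contain 856 or 855, and the dot in 856.2 would match any character. That does not change the output for the example data. If exact matching matters for your codes, a minimal sketch in base R, assuming the same example vector and swapping the regex search for strsplit() plus %in%, might look like this (the names codes_per_row and exact_result are made up for illustration):

```r
tagged_vector <- c('#142#856#856.2#745', NA, '#856#855', NA, '#685', '#663',
                   '#965.23', '#855#658#744#122')

# Split each element on the literal '#'; NA elements stay NA
codes_per_row <- strsplit(tagged_vector, '#', fixed = TRUE)

# Drop the empty string produced by the leading '#' and drop NA entries
codes_per_row <- lapply(codes_per_row, function(codes) {
  codes[!is.na(codes) & nzchar(codes)]
})

# Distinct codes, in order of first appearance
unique_codes <- unique(unlist(codes_per_row))

# One column per code: 1 if the code is exactly one of the row's codes, else 0
exact_result <- data.frame(
  ID = seq_along(tagged_vector),
  sapply(unique_codes, function(code) {
    as.numeric(vapply(codes_per_row, function(codes) code %in% codes, logical(1)))
  })
)
```

Because membership is tested against whole codes rather than substrings, a code only counts for a row when it appears there in full.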

You may notice an "X" before each code in the column names of the result. This is because in R a variable name may not begin with a number.
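
The prefix is added by data.frame(), which by default (check.names = TRUE) passes column names through make.names(). If you would rather keep the raw codes as column names, one option is to pass check.names = FALSE. The sketch below uses a small toy matrix invented purely for illustration, not the data from the question:

```r
# Toy 0/1 matrix with code-like column names (illustrative data only)
m <- matrix(c(1, 0, 1, 1), nrow = 2,
            dimnames = list(NULL, c('142', '856.2')))

with_prefix <- data.frame(m)                       # names become X142, X856.2
raw_names   <- data.frame(m, check.names = FALSE)  # names stay 142, 856.2

names(with_prefix)
names(raw_names)

# Non-syntactic names need backticks or [[ ]] when accessed
raw_names$`856.2`
raw_names[['142']]
```

The trade-off is that columns with non-syntactic names are more awkward to use in formulas and with $, which is why the default prefixing is often left in place.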

Thanks, that's it. I just changed the description to make it clearer. – Xbel

Glad this helped. – duckmayr