How to split text in R into a frequency matrix?

I import the data from a CSV file:

dati<- (read.csv(file='C:...csv', header=TRUE, sep=";"))

I select two variables:

id<-dati$post_visid_low 
item<-dati$event_list

Then I convert them to character:

id<-as.character(id) 
item<-as.character(item)

dataT <- data.table(id, item)

The structure of dataT is:

id item 
1 102, 104, 108,401 
2 405, 103, 650, 555, 450 
3 305, 109

I would like to obtain this frequency matrix, with the columns in sorted order:

id 102 103 104 108 109 305 401 405 450 555 650
1    1       1   1           1
2        1                       1   1   1   1
3                    1   1

How can I do this? I tried:

library(Matrix) 
id<-as.character(id) 
item<-as.character(item) 
dataT <- data.table(id, item) 
lst <- strsplit(dataT$item, '\\s*,\\s*') 
Un1 <- sort(unique(unlist(lst))) 
sM <- sparseMatrix(rep(dataT$id, length(lst)), 
        match(unlist(lst), Un1), x= 1, 
        dimnames=list(dataT$id, Un1))

But I receive this error:

Error in i + (!(m.i || i1)) : non-numeric argument to binary operator

How can I make this work?
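A likely cause of the error above is that `sparseMatrix()` expects integer row indices, while `dataT$id` has been converted to character. In addition, `rep(dataT$id, length(lst))` repeats the whole id vector instead of repeating each row once per item, so the row and column index vectors end up with different lengths. The following is a minimal sketch of a corrected call, assuming small example data shaped like the table in the question (the values of `id` and `item` here are illustrative, not the asker's real file):

```r
library(Matrix)

# Illustrative data mirroring the structure shown in the question
id   <- c("1", "2", "3")
item <- c("102, 104, 108,401", "405, 103, 650, 555, 450", "305, 109")

# Split each item string on commas, ignoring surrounding whitespace
lst <- strsplit(item, "\\s*,\\s*")
Un1 <- sort(unique(unlist(lst)))

# Integer row index: row number repeated once per item found in that row
i <- rep(seq_along(lst), lengths(lst))
# Integer column index: position of each item among the sorted unique items
j <- match(unlist(lst), Un1)

sM <- sparseMatrix(i = i, j = j, x = 1,
                   dims = c(length(lst), length(Un1)),
                   dimnames = list(id, Un1))

as.matrix(sM)
```

The ids are used only as row names here, so they can stay character. If an item occurs more than once in a row, `sparseMatrix()` sums the repeated entries, which yields a count rather than a 0/1 flag.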

Source

2016-02-06 user2609451

Extending your approach of splitting the items, you could do 'idx <- with(d, sort(unique(as.numeric(unlist(strsplit(item, ",")))))); s <- sapply(idx, function(x) grepl(x, d$item)) + 0L; colnames(s) <- idx' [this is *nice* because it uses almost every function in base R] – user20650
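One caveat with the `grepl()` approach in the comment is that it matches substrings, so an item such as 10 would also match 102. A minimal base R sketch that avoids this by tabulating the split values directly is shown below; it assumes the example data frame `dat` from the answer further down, and the names `parts`, `long` and `freq` are introduced here only for illustration:

```r
# Example data (same shape as the data used in the answer)
dat <- data.frame(id = 1:3,
                  item = c("102, 104, 108,401",
                           "405, 103, 650, 555, 450",
                           "305, 109"),
                  stringsAsFactors = FALSE)

# Split each string on commas, ignoring surrounding whitespace
parts <- strsplit(dat$item, "\\s*,\\s*")

# Long format: one row per (id, item) pair
long <- data.frame(id   = rep(dat$id, lengths(parts)),
                   item = as.numeric(unlist(parts)))

# Cross-tabulate ids against items; numeric items give numerically sorted columns
freq <- table(long$id, long$item)
freq
```

Because `table()` counts exact values, repeated items within a row would show up as counts greater than 1.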

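We can use the splitstackshape package to help with the splitting; a combination of melting and dcasting then reshapes the data into the format you specified (note that having numeric column names is not always practical).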

library(splitstackshape)
library(data.table)  # provides melt() and dcast() for the data.table returned by cSplit()

# split the item column into one column per item
step1 <- cSplit(dat, splitCols = "item")
step1
#    id item_1 item_2 item_3 item_4 item_5
# 1:  1    102    104    108    401     NA
# 2:  2    405    103    650    555    450
# 3:  3    305    109     NA     NA     NA

# reshape to long format and remove missing values
step2 <- melt(step1, id.vars = "id")[!is.na(value), ]

# reshape to wide format, counting occurrences of each item per id
output <- dcast(step2, id ~ value, fun.aggregate = length)

# or in one line
output <- dcast(melt(cSplit(dat, splitCols = "item"), id.vars = "id")[!is.na(value), ],
                id ~ value, fun.aggregate = length)

output
#    id 102 103 104 108 109 305 401 405 450 555 650
# 1:  1   1   0   1   1   0   0   1   0   0   0   0
# 2:  2   0   1   0   0   0   0   0   1   1   1   1
# 3:  3   0   0   0   0   1   1   0   0   0   0   0

Alternatively, you can use cSplit_e from the same package:

cSplit_e(dat, "item", ",", type = "character", fill = 0, drop = TRUE)
#   id item_102 item_103 item_104 item_108 item_109 item_305 item_401 item_405 item_450 item_555 item_650
# 1  1        1        0        1        1        0        0        1        0        0        0        0
# 2  2        0        1        0        0        0        0        0        1        1        1        1
# 3  3        0        0        0        0        1        1        0        0        0        0        0
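The same split, long format, wide format sequence can also be sketched with data.table alone, without splitstackshape. This is only an illustration under the assumption that the data look like the example used in this answer; the names `long` and `wide` are introduced here and are not part of the original answer:

```r
library(data.table)

# Example data with the same shape as in this answer
dat <- data.table(id = 1:3,
                  item = c("102, 104, 108,401",
                           "405, 103, 650, 555, 450",
                           "305, 109"))

# Long format: split each string on commas and trim whitespace, one row per item
long <- dat[, .(value = as.integer(trimws(unlist(strsplit(item, ","))))), by = id]

# Wide format: one column per item, counting occurrences per id
wide <- dcast(long, id ~ value, fun.aggregate = length, value.var = "value")
wide
```

Converting the items to integer before reshaping keeps the resulting columns in numeric rather than alphabetical order.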

Data used:

dat <- data.frame(id=1:3, item=c("102, 104, 108,401","405, 103, 650, 555, 450","305, 109"))

Source

2016-02-06 12:07:14 Heroka
