2013-04-09

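R: split a string and assign variables based on the split

I have a field of semantic tags and semantic tag types. Each tag type/tag pair is separated from the next by a comma, and within a pair the tag type and the tag are separated by a colon (see below).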

ID | Semantic Tags
1 | Person:mitch mcconnell, Person:ashley judd, Position:senator
2 | Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics
3 | Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican
4 | Person:ashley judd, topicname:politics
5 | URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc

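I would like to split each tag type (the term before the colon) and tag (the term after the colon) into two separate fields: "Tag Type" and "Tag". The final file should look like this: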

ID | Tag Type | Tag
1 | Person | mitch McConnell
1 | Person | ashley judd
1 | Position | senator
2 | Person | mitch McConnell
2 | Position | senator
2 | State | kentucky

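Here is the code I have so far...

But after that, I'm lost! I believe I need to use lapply or sapply for this, but I don't know where either one comes into play...

My apologies if this has already been answered elsewhere on the site in some other form. I'm new to R, and this is still a bit complicated for me.

Thanks in advance for any help.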

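Could you please provide a reproducible example using 'dput(emtable)' (or 'dput(head(emtable))' if that is too much data)? – 2013-04-09 15:03:42

I have reformatted the data so that it looks like a table layout. – NiuBiBang 2013-04-09 15:18:27

Why don't you use 'dput'? It makes things easier for the people answering. – 2013-04-09 15:21:40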
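
For readers who have not used dput: it prints an R object as code that recreates the object when pasted into a session, which is why the commenters ask for it. A minimal sketch follows. The name emtable comes from the comment above, but its columns and contents here are assumptions based on the table in the question.

```r
# Hypothetical stand-in for the asker's data; the column names are assumptions.
emtable <- data.frame(
  ID = 1:2,
  Semantic.Tags = c(
    "Person:mitch mcconnell, Person:ashley judd, Position:senator",
    "Person:ashley judd, topicname:politics"
  ),
  stringsAsFactors = FALSE
)

# dput() writes a structure(...) expression to the console;
# that text is what gets pasted into a question.
dput(head(emtable))

# Round trip: capture the printed code, then parse and evaluate it
# to rebuild the object in a fresh variable.
rebuilt <- eval(parse(text = capture.output(dput(emtable))))
all.equal(rebuilt, emtable)
```

Anyone answering can then paste the printed expression and work with exactly the same object, including column types.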

Answers

4

Here is another (slightly different) approach:

## dat <- readLines(n=5) 
## Person:mitch mcconnell, Person:ashley judd, Position:senator 
## Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 
## Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 
## Person:ashley judd, topicname:politics 
## URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info 

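# Split each row on commas, then trim leading/trailing whitespace
dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
# Optional: keep only certain tag types
# dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)])
# Split on ":" unless it is followed by "/" (keeps "http://" intact)
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE)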
dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)), 
    do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) 
) 

colnames(dat3)[-1] <- c("Tag Type", "Tag") 

##    ID        Tag Type                                 Tag
## 1   1          Person                     mitch mcconnell
## 2   1          Person                         ashley judd
## 3   1        Position                             senator
## 4   2          Person                     mitch mcconnell
## 5   2        Position                             senator
## 6   2 ProvinceOrState                            kentucky
## 7   2       topicname                            politics
## 8   3          Person                     mitch mcconnell
## 9   3          Person                         ashley judd
## 10  3    Organization                              senate
## 11  3    Organization                          republican
## 12  4          Person                         ashley judd
## 13  4       topicname                            politics
## 14  5             URL              www.huffingtonpost.com
## 15  5             URL http://www.regular-expressions.info

A detailed explanation:

## dat <- readLines(n=5) 
## Person:mitch mcconnell, Person:ashley judd, Position:senator 
## Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 
## Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 
## Person:ashley judd, topicname:politics 
## URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info 

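# Split each row on commas, then trim leading/trailing whitespace
dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
# Optional: keep only certain tag types
# dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)])
# Split on ":" unless it is followed by "/" (keeps "http://" intact)
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE)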

# Let the explanation begin... 

# Here I have a short list of the variables from the rows 
# of the original dataframe; in this case the row numbers: 

seq_along(dat3)  #row variables 

# then I use sapply and length to figure out how long the 
# split variables in each row (now a list) are 

sapply(dat3, length) #n times 

# this tells me how many times to repeat the short list of 
# variables. This is because I stretch the dat3 list to a vector 
# Here I rep the row variables n times 

rep(seq_along(dat3), sapply(dat3, length)) 

# better assign that for later: 

ID <- rep(seq_along(dat3), sapply(dat3, length)) 

#============================================ 
# Now to explain the next chunk... 
# I take dat3 

dat3 

# Each of the 5 elements of the list is itself a list of 
# character vectors of length 2: a Tag_Type and its Tag. 
# For instance, here's element 5: a list of two 
# character vectors, each of length 2 

## [[5]] 
## [[5]][[1]] 
## [1] "URL" "www.huffingtonpost.com" 
## 
## [[5]][[2]] 
## [1] "URL" "http://www.regular-expressions.info" 

# Use str to look at this structure: 

dat3[[5]] 
str(dat3[[5]]) 

## List of 2 
## $ : chr [1:2] "URL" "www.huffingtonpost.com" 
## $ : chr [1:2] "URL" "http://www.regular-expressions.info" 

# I use lapply (list apply) to apply an anonymous function: 
# function(x) do.call(rbind, x) 
# 
# to each of the 5 elements. This basically glues the list 
# of vectors together to make a matrix. Observe just on element 5: 

do.call(rbind, dat3[[5]]) 

##      [,1]  [,2]
## [1,] "URL" "www.huffingtonpost.com"
## [2,] "URL" "http://www.regular-expressions.info"

# We use lapply to do that to all elements: 

lapply(dat3, function(x) do.call(rbind, x)) 

# We then call do.call(rbind, ...) on this list of matrices 
# to stack them into one matrix 

do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) 

# Let's assign that for later: 

the_mat <- do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) 

#============================================  
# Now we put it all together with data.frame: 

data.frame(ID, the_mat) 
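
This seems to do the trick. However, when I run the third command, 'dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)), do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))))', I get the following message: Error in function (..., deparse.level = 1) : number of columns of matrices must match (see arg 2). In addition: There were 50 or more warnings (use warnings() to see the first 50) – NiuBiBang 2013-04-10 18:23:54

This problem is specific to your data, which is not like the data you have shown here. You can use a debugging tool such as debug to track down the first problem. For the second, I would do as the message says and use 'warnings()' to see more specifically why you are getting those warnings. – 2013-04-10 18:58:11

Yes, I see that one of my tag types is URL, which often contains "http:", so splitting on ":" ended up giving the matrices a non-uniform number of columns. So I just added a line of code to remove "http:" between the first and second strsplit calls. – NiuBiBang 2013-04-14 01:36:55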
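
To make the failure described in these comments concrete, here is a small sketch. The two-element tags vector is made-up sample data, not the asker's. It first counts the pieces each tag yields when split on every colon, then shows one alternative to deleting "http:", namely splitting at the first colon only.

```r
# Made-up sample: one ordinary tag and one URL tag containing a second colon.
tags <- c("Person:chuck todd", "URL:http://www.regular-expressions.info")

# Splitting on every ":" gives 2 pieces for the first tag but 3 for the URL,
# so the per-row matrices end up with different numbers of columns and the
# outer rbind() fails.
pieces <- strsplit(tags, ":", fixed = TRUE)
n_pieces <- lengths(pieces)
tags[n_pieces != 2]   # the entries that break the two-column layout

# Alternative: split at the first colon only, leaving the URL untouched.
first_colon <- regexpr(":", tags, fixed = TRUE)
tag_type <- substr(tags, 1, first_colon - 1)
tag <- substr(tags, first_colon + 1, nchar(tags))
cbind(tag_type, tag)
```

Running the lengths() check on the full data is a quick way to find every offending row before building the matrix. The lookahead pattern ":(?!/)" in the answer above is another way to keep such URLs whole.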

3
DF 
##   ID                                                                            Semantic.Tags
## 1  1                             Person:mitch mcconnell, Person:ashley judd, Position:senator
## 2  2   Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics
## 3  3 Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican
## 4  4                                                   Person:ashley judd, topicname:politics
## 5  5          URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc


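# as.character() guards against Semantic.Tags being a factor,
# which strsplit() does not accept
ll <- lapply(strsplit(as.character(DF$Semantic.Tags), ","), strsplit, split = ":")

# Stack a list of vectors (or matrices) row-wise
f <- function(x) do.call(rbind, x)

# Whitespace is not trimmed here, hence the padded strings in the output
f(lapply(ll, f))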
##       [,1]               [,2]
##  [1,] "  Person"         "mitch mcconnell"
##  [2,] " Person"          "ashley judd"
##  [3,] " Position"        "senator"
##  [4,] "  Person"         "mitch mcconnell"
##  [5,] " Position"        "senator"
##  [6,] " ProvinceOrState" "kentucky"
##  [7,] " topicname"       "politics "
##  [8,] "  Person"         "mitch mcconnell"
##  [9,] " Person"          "ashley judd"
## [10,] " Organization"    "senate"
## [11,] " Organization"    "republican "
## [12,] "  Person"         "ashley judd"
## [13,] " topicname"       "politics"
## [14,] "  URL"            "www.huffingtonpost.com"
## [15,] " Company"         "usa today"
## [16,] " Person"          "chuck todd"
## [17,] " Company"         "msnbc"

(+1) Or 'matrix(rapply(ll, rbind), ncol = 2, byrow = TRUE)' for the last two steps. – Henrik 2013-04-09 15:25:44

Or, more transparently: 'matrix(rapply(ll, identity), ncol = 2, byrow = TRUE)' – Henrik 2013-04-09 15:31:24
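
A short sketch of what the rapply suggestion does, using a made-up two-row input rather than the asker's data: rapply visits every leaf of the nested list and, with its default how = "unlist", returns one flat character vector, which matrix() then folds into two columns row by row.

```r
# Made-up input: two rows of "type:tag" items, already free of stray spaces.
ll <- lapply(
  strsplit(c("Person:ashley judd,topicname:politics", "Position:senator"), ","),
  strsplit, split = ":"
)

# Each leaf is a length-2 character vector; identity() returns it unchanged
# and rapply() concatenates all leaves into a single vector.
flat <- rapply(ll, identity)

# Filling by row pairs each tag type with its tag.
m <- matrix(flat, ncol = 2, byrow = TRUE)
m

# unlist() flattens the same nested list, so it can stand in for rapply() here.
all.equal(unname(flat), unlist(ll))
```

This only works when every item splits into exactly two pieces. A tag with an extra colon would shift all later pairs by one position without raising an error.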

Thanks guys, I actually ended up using a combination of the code from the three approaches above. It ended up working. – NiuBiBang 2013-04-14 01:33:46
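
The asker's combined code is not shown, so the following is only a sketch of how the ideas in this thread could fit together, not a reconstruction of it. The helper name split_tags, the output column names and the sample data are all hypothetical. It trims whitespace, splits each item at its first colon, and repeats each ID once per item.

```r
# split_tags() is a hypothetical helper, not code from the thread.
split_tags <- function(ids, tag_strings) {
  # One character vector of "type:tag" items per input row, whitespace trimmed.
  items <- lapply(strsplit(as.character(tag_strings), ",", fixed = TRUE), trimws)
  n <- lengths(items)
  flat <- unlist(items)
  # Split at the first colon only, so values such as "http://..." stay whole.
  pos <- regexpr(":", flat, fixed = TRUE)
  data.frame(
    ID = rep(ids, n),
    Tag.Type = substr(flat, 1, pos - 1),
    Tag = substr(flat, pos + 1, nchar(flat)),
    stringsAsFactors = FALSE
  )
}

# Made-up sample data with stray spaces and a URL containing a second colon.
DF <- data.frame(
  ID = 1:2,
  Semantic.Tags = c(
    " Person:mitch mcconnell, Position:senator ",
    "URL:http://www.regular-expressions.info, topicname:politics"
  ),
  stringsAsFactors = FALSE
)

result <- split_tags(DF$ID, DF$Semantic.Tags)
result
```

Working on one flat vector avoids the nested rbind calls and the ragged-matrix error discussed above, at the cost of assuming every item contains at least one colon.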