2017-02-28 64 views

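Replacing words in text with the words generated by all_words

I am fairly new to qdap and am not sure whether this functionality exists, but it would be great to have something like what is described below.

My initial data set: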

ID   Keywords 
1   112 mills, open heart surgery, great, great job 
2   Ausie, open, heart out 
3   opened, heartily, 56mg)_job, orders12 
4   order, macD 

Using all_words() I end up with the following data.

WORD  FREQ 
1 great  2 
2 heart  2 
3 open  2 
4 ausie  1 
5 heartily 1 
6 job   1 
7 macd  1 
8 mgjob  1 
9 mills  1 
10 opened  1 
11 order  1 
12 orders  1 
13 out   1 
14 surgery  1 

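Is there a way to replace the words in the main data set with the exact words as they appear in the all_words() output? In other words, the list produced by all_words() should replace the original words in the data frame, e.g. 112 mills should become mills, and 56mg)_job should become mgjob.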

Are you saying you don't want any strings with numbers? – akrun

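Could you please try to explain it better? It is not clear what you need. Also, making your example reproducible would go a long way towards getting help. Bonus points for detailing the 'all_words' function – Sotos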

Try 'un1 <- unlist(strsplit(df1$Keywords, "[,]")); as.data.frame(table(grep("^[A-Za-z]+$", un1, value = TRUE)))' – akrun

Answer


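This is a bit manual, and I don't know the format of your data, but with some tinkering it should do the job:

Edit: this does not use qdap, but I don't think that is the key part of the question.

2nd edit: I forgot the replacement step; the code below has been corrected.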

library(data.table) 
library(tm) # Functions with tm:: below 
library(magrittr) 

dt <- data.table(
    ID = 1L:4L, 
    Keywords = c(
    paste('112 mills', 'open heart', 'surgery', 'great', 'great job', sep = ' '), 
    paste('Ausie', 'open', 'heart out', sep = ' '), 
    paste('opened', 'heartily', '56mg)_job', 'orders12', sep = ' '), 
    paste('order', 'macD', sep = ' '))) 

# dt_2 <- data.table(Tokens = tm::scan_tokenizer(dt[, Keywords])) 
dt_2 <- dt[, .(Tokens = unlist(strsplit(Keywords, split = ' '))), by = ID] 

dt_2[, Words := tm::scan_tokenizer(Tokens) %>% 
     tm::removePunctuation() %>% 
     tm::removeNumbers() 
    ] 
dt_2[, Stems := tm::stemDocument(Words)] 

dt_2 
#     ID    Tokens    Words    Stems
#  1:  1       112
#  2:  1     mills    mills     mill
#  3:  1      open     open     open
#  4:  1     heart    heart    heart
#  5:  1   surgery  surgery  surgeri
#  6:  1     great    great    great
#  7:  1     great    great    great
#  8:  1       job      job      job
#  9:  2     Ausie    Ausie     Ausi
# 10:  2      open     open     open
# 11:  2     heart    heart    heart
# 12:  2       out      out      out
# 13:  3    opened   opened     open
# 14:  3  heartily heartily heartili
# 15:  3 56mg)_job    mgjob    mgjob
# 16:  3  orders12   orders    order
# 17:  4     order    order    order
# 18:  4      macD     macD     macD

# Frequencies 
dt_2[, .N, by = Words] 
#        Words N
#  1:          1
#  2:    mills 1
#  3:     open 2
#  4:    heart 2
#  5:  surgery 1
#  6:    great 2
#  7:      job 1
#  8:    Ausie 1
#  9:      out 1
# 10:   opened 1
# 11: heartily 1
# 12:    mgjob 1
# 13:   orders 1
# 14:    order 1
# 15:     macD 1
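
To get from these counts to a table shaped like the all_words() output in the question (lowercase WORD, FREQ sorted in descending order), a base-R sketch might look like the following. It only mimics the layout shown in the question and is not qdap's implementation; the `words` vector is typed in by hand from the cleaned tokens above, with the empty string left out.

```r
# Cleaned tokens from the example, typed in by hand (empty strings omitted)
words <- c("mills", "open", "heart", "surgery", "great", "great", "job",
           "Ausie", "open", "heart", "out",
           "opened", "heartily", "mgjob", "orders",
           "order", "macD")

# Count lowercase words; name the columns like the table in the question
freq <- as.data.frame(table(WORD = tolower(words)),
                      responseName = "FREQ",
                      stringsAsFactors = FALSE)

# Most frequent first, ties broken alphabetically
freq <- freq[order(-freq$FREQ, freq$WORD), ]
rownames(freq) <- NULL
freq
```

Lowercasing before counting is what merges spellings such as macD and macd into one row.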

2nd edit, the replacement step:

res <- dt_2[, .(Keywords = paste(Words, collapse = ' ')), by = ID] 
res 
#    ID                                 Keywords
# 1:  1 mills open heart surgery great great job
# 2:  2                     Ausie open heart out
# 3:  3             opened heartily mgjob orders
# 4:  4                               order macD
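
If tm and data.table are not available, roughly the same cleaning can be sketched in base R. This assumes that stripping digits and punctuation is all that is needed; `clean_keywords` is a hypothetical helper name, not part of qdap or tm.

```r
# Hypothetical helper: strip digits and punctuation from each token,
# drop tokens that become empty, and glue the rest back together.
clean_keywords <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
    words <- gsub("[[:digit:][:punct:]]", "", tokens)
    paste(words[nzchar(words)], collapse = " ")
  }, character(1))
}

keywords <- c(
  "112 mills open heart surgery great great job",
  "Ausie open heart out",
  "opened heartily 56mg)_job orders12",
  "order macD"
)

cleaned <- clean_keywords(keywords)
cleaned
```

Unlike the paste() call above, this version drops tokens that end up empty (such as 112), so the first row does not start with a stray space.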

3rd edit, in case your keywords come as a list and you want to keep them that way:

library(data.table) 
library(tm) # Functions with tm:: below 
library(magrittr) 

dt <- data.table(
    ID = 1L:4L, 
    Keywords = list(
    c('112 mills', 'open heart', 'surgery', 'great', 'great job'), 
    c('Ausie', 'open', 'heart out'), 
    c('opened', 'heartily', '56mg)_job', 'orders12'), 
    c('order', 'macD'))) 

dt_2 <- dt[, .(Keywords = unlist(Keywords)), by = ID] 
dt_2[, ID_temp := .I] 

dt_3 <- dt_2[, .(ID, Tokens = unlist(strsplit(unlist(Keywords), split = ' '))), by = ID_temp] 

dt_3[, Words := tm::scan_tokenizer(Tokens) %>% 
     tm::removePunctuation() %>% 
     tm::removeNumbers() %>% 
     stringr::str_to_lower() 
    ] 
dt_3[, Stems := tm::stemDocument(Words)] 
dt_3 

res <- dt_3[, .(
    ID = first(ID), 
    Keywords = paste(Words, collapse = ' ') %>% stringr::str_trim()), 
    by = ID_temp] 
res <- res[, .(Keywords = list(Keywords)), by = ID] 

# Confirm format (a list of keywords in every element) 
dt[1, Keywords] %T>% {print(class(.))} %T>% {print(length(.[[1]]))} 
res[1, Keywords] %T>% {print(class(.))} %T>% {print(length(.[[1]]))} 
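
When the keywords arrive as comma-separated strings, as in the question's sample data, a multi-word phrase such as "open heart surgery" can be kept as a single keyword by splitting on commas instead of spaces. The following is a minimal base-R sketch under that assumption; `clean_phrases` is a hypothetical helper name.

```r
# Hypothetical helper: split on commas only, so multi-word phrases stay whole,
# then strip digits/punctuation, tidy whitespace, and lowercase each phrase.
clean_phrases <- function(x) {
  lapply(strsplit(x, ",", fixed = TRUE), function(phrases) {
    cleaned <- gsub("[[:digit:][:punct:]]", "", phrases)
    cleaned <- gsub("\\s+", " ", cleaned)   # collapse runs of whitespace
    cleaned <- tolower(trimws(cleaned))     # trim ends, lowercase
    cleaned[nzchar(cleaned)]                # drop phrases that became empty
  })
}

keywords <- c(
  "112 mills, open heart surgery, great, great job",
  "Ausie, open, heart out",
  "opened, heartily, 56mg)_job, orders12",
  "order, macD"
)

phrases <- clean_phrases(keywords)
lengths(phrases)  # number of keywords per row
```

The result is a list with one character vector per row, which matches the list-column layout used in the 3rd edit.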

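@m-dz The 2nd edit alone made my life easy. Thanks a lot..... The only catch is that my csv contains keywords made of several space-separated words that should be treated as a single word, i.e. [heart surgery, specificity, nausea] should be counted as 3 words rather than 6. – NinjaR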