2017-10-13 22 views

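Opposite of unnest_tokens

This is most likely a stupid question, but I have googled and googled and can't find a solution. I think it's because I don't know the right way to word my question for a search.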

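I have a data frame that I converted to tidy text format in R to get rid of stop words. I would now like to "untidy" that data frame back to its original format.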

What is the opposite/inverse command of unnest_tokens?

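Edit: here is what the data I'm working with look like. I'm trying to replicate the analyses from Silge and Robinson's book Tidy Text, but using Italian opera librettos.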

character = c("FIGARO", "SUSANNA", "CONTE", "CHERUBINO") 
line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre", "Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.", "Susanna, mi sembri agitata e confusa.", "Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!") 
# Keep line as character (not factor) so it can be tokenized
sample_df = data.frame(character, line, stringsAsFactors = FALSE) 
sample_df 

character line 
FIGARO Cinque... dieci.... venti... trenta... trentasei...quarantatre 
SUSANNA Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello. 
CONTE  Susanna, mi sembri agitata e confusa. 
CHERUBINO Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia! 

I turned it into tidy text so I could get rid of the stop words:

library(dplyr) 
library(tidytext) 

# One row per word, keeping the character column 
tribble <- sample_df %>% 
      unnest_tokens(word, line) 
# Get rid of stop words 
# I had to make my own list of stop words for 18th century Italian opera 
# (mystopwords is my own character vector of stop words) 
itstopwords <- tibble(word = mystopwords) 
tribble2 <- tribble %>% 
      anti_join(itstopwords, by = "word") 

Now I have something like this:

character word 
FIGARO cinque 
FIGARO dieci 
FIGARO venti 
FIGARO trenta 
... 

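I would like to get it back into the format of character name and associated line so I can look at other things. Basically, I want the text in the same format as before, but with the stop words removed.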


Hi, please read [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and edit your question. Knowing more about what your data look like and what you have done will enable other users to help you. – shea

Answer


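Not a stupid question! The answer depends a bit on exactly what you are trying to do, but here is my typical approach when I want to get text back to something like its original form after some processing in its tidied form, using the map functions from purrr.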

First, let's go from raw text to a tidied format.

library(tidyverse) 
library(tidytext) 


tidy_austen <- janeaustenr::austen_books() %>% 
    group_by(book) %>% 
    mutate(linenumber = row_number()) %>% 
    ungroup() %>% 
    unnest_tokens(word, text) 

tidy_austen 
#> # A tibble: 725,055 x 3 
#>     book linenumber  word 
#>     <fctr>  <int>  <chr> 
#> 1 Sense & Sensibility   1  sense 
#> 2 Sense & Sensibility   1   and 
#> 3 Sense & Sensibility   1 sensibility 
#> 4 Sense & Sensibility   3   by 
#> 5 Sense & Sensibility   3  jane 
#> 6 Sense & Sensibility   3  austen 
#> 7 Sense & Sensibility   5  1811 
#> 8 Sense & Sensibility   10  chapter 
#> 9 Sense & Sensibility   10   1 
#> 10 Sense & Sensibility   13   the 
#> # ... with 725,045 more rows 

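The text is now tidy! But we can untidy it, back to something like its original form. I typically approach this using nest from tidyr, and then some map functions from purrr.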

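nested_austen <- tidy_austen %>% 
    # Newer tidyr versions expect a name for the list column: nest(data = word) 
    nest(data = word) %>% 
    # Flatten each nested tibble to a character vector, then paste it into one string 
    mutate(text = map(data, unlist), 
     text = map_chr(text, paste, collapse = " ")) 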

nested_austen 
#> # A tibble: 62,272 x 4 
#>     book linenumber    data 
#>     <fctr>  <int>   <list> 
#> 1 Sense & Sensibility   1 <tibble [3 x 1]> 
#> 2 Sense & Sensibility   3 <tibble [3 x 1]> 
#> 3 Sense & Sensibility   5 <tibble [1 x 1]> 
#> 4 Sense & Sensibility   10 <tibble [2 x 1]> 
#> 5 Sense & Sensibility   13 <tibble [12 x 1]> 
#> 6 Sense & Sensibility   14 <tibble [13 x 1]> 
#> 7 Sense & Sensibility   15 <tibble [11 x 1]> 
#> 8 Sense & Sensibility   16 <tibble [12 x 1]> 
#> 9 Sense & Sensibility   17 <tibble [11 x 1]> 
#> 10 Sense & Sensibility   18 <tibble [15 x 1]> 
#> # ... with 62,262 more rows, and 1 more variables: text <chr> 
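
The same untidying can probably also be done without a list column. The following is only a minimal sketch, assuming dplyr (1.0 or later, for the `.groups` argument), tidytext and janeaustenr are installed; it rebuilds `tidy_austen` as above, and the name `untidy_austen` is made up for illustration. It groups by the identifying columns and pastes the words of each group back together.

```r
library(dplyr)
library(tidytext)

# Rebuild the tidy data frame from above: one row per word
tidy_austen <- janeaustenr::austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Collapse the words of each (book, linenumber) group into one string.
# untidy_austen is a hypothetical name, not part of the original answer.
untidy_austen <- tidy_austen %>%
  group_by(book, linenumber) %>%
  summarise(text = paste(word, collapse = " "), .groups = "drop")

head(untidy_austen$text, 3)
```

Either way, lines that produced no tokens (blank lines, for example) have no rows in the tidy data, so they do not reappear after collapsing.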

What does the text look like at the end, in this particular case?

nested_austen %>% 
    select(text) 
#> # A tibble: 62,272 x 1 
#>                 text 
#>                 <chr> 
#> 1            sense and sensibility 
#> 2              by jane austen 
#> 3                1811 
#> 4               chapter 1 
#> 5 the family of dashwood had long been settled in sussex their estate 
#> 6 was large and their residence was at norland park in the centre of 
#> 7  their property where for many generations they had lived in so 
#> 8 respectable a manner as to engage the general good opinion of their 
#> 9 surrounding acquaintance the late owner of this estate was a single 
#> 10 man who lived to a very advanced age and who for many years of his 
#> # ... with 62,262 more rows
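
Applied to the opera data from the question, the same idea might look like the sketch below. It is an illustration under assumptions rather than a tested recipe for the full libretto: it uses only two of the sample lines, `mystopwords` is a two-word stand-in for the asker's own stop word list, and `line_id`, `tidy_opera` and `untidy_opera` are hypothetical names. The `line_id` column plays the role of `linenumber` above, so that several lines sung by the same character are not merged into one.

```r
library(dplyr)
library(tidytext)

# Two of the sample lines from the question
sample_df <- tibble(
  character = c("FIGARO", "CONTE"),
  line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre",
           "Susanna, mi sembri agitata e confusa.")
)

# Hypothetical stand-in for the asker's own stop word vector
mystopwords <- c("mi", "e")

# Tidy: tag each line with an id, split into words, drop stop words
tidy_opera <- sample_df %>%
  mutate(line_id = row_number()) %>%
  unnest_tokens(word, line) %>%
  anti_join(tibble(word = mystopwords), by = "word")

# Untidy: paste the remaining words of each line back together
untidy_opera <- tidy_opera %>%
  group_by(character, line_id) %>%
  summarise(line = paste(word, collapse = " "), .groups = "drop") %>%
  arrange(line_id)

untidy_opera
```

For the CONTE line this gives "susanna sembri agitata confusa". As in the Austen output, punctuation and capitalization are gone, because unnest_tokens strips punctuation and lowercases by default. Passing `to_lower = FALSE` to unnest_tokens keeps the capitalization, but the stop word matching in anti_join then becomes case-sensitive.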