Creating a data frame by looping through text

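Thanks in advance! I have been working on this for a few days and I am a bit stuck. I am trying to loop through a text file (imported as a list) and create a data frame from it. The data frame should start a new row whenever an item in the list contains a day of the week, and that item goes into the first column (V1). I want to put the rest of the comments in the second column (V2), which probably means concatenating strings together. I tried using a conditional statement with grepl(), but after setting up the initial data frame I got a bit lost in the logic.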

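Here is the sample text I am bringing into R (it is Facebook data from a text file). The [] indicates the list number. It is a long file (50K+ lines), but I have the date column set up.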

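[1] Thursday, August 25, 2016 at 3:57pm EDT

[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???

[3]Sunday, August 14, 2016 at 9:17am EDT

[4]Michael shared Jason post.

[5]This bird is a lot smarter than the majority of political posts I have read recently here

[6]Sunday, August 14, 2016 at 8:44am EDT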

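[7]Michael and Kurt are now friends.

The end result would be a data frame in which a day of the week starts a new row and the rest of the list items are concatenated into the second column. So the final data frame would be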

Row 1 ([1] in V1 and [2] in V2)

Row 2 ([3] in V1 and [4], [5] in V2)

Row 3 ([6] in V1 and [7] in V2)

Here is the start of my code. I can get V1 to populate correctly, but not the second column of the data frame.

### Read in the text file 
temp <- readLines("C:/Program Files/R/Text Mining/testa.txt") 

### Remove empty lines from the text file 
temp <- temp[temp!=""] 

### Create the temp char file as a list file 
tmp <- as.list(temp) 

### A days vector for searching through the list of days. 
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday") 
df <- {} 

### Loop through the list 
for (n in 1:length(tmp)){ 

    ### Search to see if there is a day in the list item 
    for(i in 1:length(days)){ 
      if(grepl(days[i], tmp[n])==1){ 
    ### Bind the row to the df if there is a day in the list item 
        df<- rbind(df, tmp[n]) 
      } 
    } 
### I know this is wrong, I am trying to create a vector to concatenate and add to the data frame, but I am struggling here.  
d <- c(d, tmp[n]) 
}

Source

2016-11-22 Michael Harris

Please share your data using 'dput'.

Here is an option using the tidyverse:

library(tidyverse) 

text <- "[1] Thursday, August 25, 2016 at 3:57pm EDT 

[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking??? 

[3]Sunday, August 14, 2016 at 9:17am EDT 

[4]Michael shared Jason post. 

[5]This bird is a lot smarter than the majority of political posts I have read recently here 

[6]Sunday, August 14, 2016 at 8:44am EDT 

[7]Michael and Kurt are now friends." 

df <- tibble(lines = read_lines(text)) %>% # read data, set up a tibble
    filter(lines != '') %>% # filter out empty lines
    # set grouping by cumulative number of rows with weekdays in them;
    # weekdays() needs dates (not bare numbers) and returns names in the current locale
    group_by(grp = cumsum(grepl(paste(weekdays(Sys.Date() + 1:7), collapse = '|'), lines))) %>%
    # collapse each group to two columns
    summarise(V1 = lines[1], V2 = list(lines[-1]))

df 
## # A tibble: 3 × 3 
##  grp           V1  V2 
## <int>          <chr> <list> 
## 1  1 [1] Thursday, August 25, 2016 at 3:57pm EDT <chr [1]> 
## 2  2 [3]Sunday, August 14, 2016 at 9:17am EDT <chr [2]> 
## 3  3 [6]Sunday, August 14, 2016 at 8:44am EDT <chr [1]>

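This approach uses a list column for V2, which is probably the best option in terms of preserving your data, but use 'paste' or 'toString' if you need a plain string.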
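If a plain character column is preferred over a list column, each element of V2 can be collapsed into a single string. The following is a minimal sketch in base R; it assumes a small hand-built list that mirrors the shape of the V2 column (the names v2, v2_comma and v2_space are made up for illustration).

```r
# Hypothetical stand-in for the V2 list column: one character vector per row
v2 <- list(
  "[2] Football time!!",
  c("[4]Michael shared Jason post.", "[5]This bird is a lot smarter"),
  "[7]Michael and Kurt are now friends."
)

# toString() joins the pieces of each element with ", "
v2_comma <- vapply(v2, toString, character(1))

# paste() with collapse lets you pick the separator, here a single space
v2_space <- vapply(v2, paste, character(1), collapse = " ")

v2_comma[2]
v2_space[2]
```

In the pipeline above, the same effect should be achievable by writing V2 = toString(lines[-1]) inside summarise instead of wrapping the lines in list().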

A rough equivalent in base R:

df <- data.frame(V2 = readLines(textConnection(text)), stringsAsFactors = FALSE) 

df <- df[df$V2 != '', , drop = FALSE] 

df$grp <- cumsum(grepl(paste(weekdays(Sys.Date() + 1:7), collapse = '|'), df$V2)) 

df$V1 <- ave(df$V2, df$grp, FUN = function(x){x[1]}) 

df <- aggregate(V2 ~ grp + V1, df, FUN = function(x){x[-1]}) 

df 
## grp           V1 
## 1 1 [1] Thursday, August 25, 2016 at 3:57pm EDT 
## 2 2 [3]Sunday, August 14, 2016 at 9:17am EDT 
## 3 3 [6]Sunday, August 14, 2016 at 8:44am EDT 
##                                         V2 
## 1 [2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking??? 
## 2          [4]Michael shared Jason post., [5]This bird is a lot smarter than the majority of political posts I have read recently here 
## 3                                [7]Michael and Kurt are now friends.
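For comparison, here is a loop-based sketch that stays closer to the attempt in the question. It is not the approach used above, only an illustration of the same logic written step by step; the sample lines are shortened, and the names lines, pattern and result are assumptions made for this example. Note that, like the versions above, it treats any line containing a weekday name as a date line.

```r
# Sample lines, already stripped of blanks (shortened from the text above)
lines <- c(
  "[1] Thursday, August 25, 2016 at 3:57pm EDT",
  "[2] Football time!!",
  "[3]Sunday, August 14, 2016 at 9:17am EDT",
  "[4]Michael shared Jason post.",
  "[5]This bird is a lot smarter",
  "[6]Sunday, August 14, 2016 at 8:44am EDT",
  "[7]Michael and Kurt are now friends."
)

days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
pattern <- paste(days, collapse = "|")

V1 <- character(0)  # one date line per row
V2 <- character(0)  # comments belonging to that date, pasted together

for (line in lines) {
  if (grepl(pattern, line)) {
    # A weekday starts a new row with an empty comment
    V1 <- c(V1, line)
    V2 <- c(V2, "")
  } else if (length(V1) > 0) {
    # Otherwise append the line to the comment of the current row
    n <- length(V2)
    V2[n] <- if (V2[n] == "") line else paste(V2[n], line)
  }
}

result <- data.frame(V1 = V1, V2 = V2, stringsAsFactors = FALSE)
result
```

Growing vectors inside a loop gets slow on a file with 50K+ lines, so the vectorised versions above are likely the better choice for the real data.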

Source

2016-11-22 05:54:34 alistaire

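Thank you so much for the reply! I am getting an error from the weekdays function: Error in grepl(paste(weekdays(1:7, abbreviate = FALSE), collapse = "|")): argument "x" is missing, with no default. If I try to use weekdays on its own I get a similar error. Reading up on it now.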

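The 'paste' call is there to create a string for 'grepl': "Friday|Saturday|Sunday|Monday|Tuesday|Wednesday|Thursday". You can create that string any way you like. 'weekdays(Sys.Date() + 1:7)' should work; honestly, it is a bit ambiguous which method gets called when you pass it a number anyway. Also make sure the 'x' parameter of 'grepl' (the column of lines) is there too; it could throw the same error. – alistaire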
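To illustrate the comment above, here is a small sketch that builds the pattern by hand instead of calling weekdays, which also keeps it independent of the locale. The vector x and the names is_date and grp are hypothetical and stand in for the column of lines.

```r
# Weekday names written out by hand, so the result does not depend on the locale
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
pattern <- paste(days, collapse = "|")
pattern

# Hypothetical sample lines
x <- c("Thursday, August 25, 2016", "a comment",
       "Sunday, August 14, 2016", "another comment", "and one more")

# grepl() needs both the pattern and the vector to search (its 'x' argument)
is_date <- grepl(pattern, x)

# cumsum() turns the TRUE/FALSE flags into a running group number,
# so every comment gets the number of the date line above it
grp <- cumsum(is_date)
grp
```

Here grp comes out as 1 1 2 2 2, which is the grouping that both the tidyverse and the base R versions rely on.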

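Yes! Thanks a million! I am just learning R, but this code is great! I like how you used summarise and collected the second column into a list. I am reading up on tibbles, cumsum and summarise to fully understand the code. Thanks again, this is awesome!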
