How to compute variables for each subject in a dataset

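I have reaction time and accuracy data that need to be scored for each subject, and I would like to know which R package or functions would best suit my needs. Below is a snippet of the data for 2 subjects. Each row represents a single trial in which the subject responded to a stimulus.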

date subject trialn blockcode  trialtype latency response correct 
32913  15  1 practice taskswitch 1765  205  1 
32913  15  2 practice  cueswitch 4372  203  1 
32913  15  3 practice cuerepetition 2523  203  0 
32913  15  1  test  cueswitch 2239  205  1 
32913  15  2  test cuerepetition 1244  203  1 
32913  15  3  test taskswitch 1472  203  0 
32913  15  4  test  cueswitch 1877  205  1 
32913  15  5  test taskswitch 2271  203  1 
30413  16  1 practice taskswitch 1377  203  1 
30413  16  2 practice taskswitch 1648  203  1 
30413  16  3 practice  cueswitch 1181  205  1 
30413  16  1  test  cueswitch 1045  205  1 
30413  16  2  test cuerepetition  969  203  0 
30413  16  3  test  cueswitch  857  203  1 
30413  16  4  test taskswitch 1038  205  1 
30413  16  5  test cuerepetition  836  203  0

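Here are the instructions I was given:

Looking only at the "test" trials, for each subject, compute
- the total number of trials
- the number of trials with a latency (i.e. reaction time) below 300 ms
- the mean latency
- the mean of correct
Then, looking only at trials whose latency is within 3 standard deviations of that subject's mean latency, compute the mean latency for each trialtype.
Finally, create a new data frame with all of these variables plus the subject ID and date.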


2013-05-27 AlexR

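SO isn't really meant for tutorials, so be sure to check out the great online resources about data.table. The website is a good place to start, and there are plenty of questions about the package on SO that cover almost everything.

Here, I just want to show you how easy things become once you get used to the package's syntax.

First, let's load the package and read in the data: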

library(data.table) 
str <- "date subject trialn blockcode  trialtype latency response correct 
     32913  15  1 practice taskswitch 1765  205  1 
     32913  15  2 practice  cueswitch 4372  203  1 
     32913  15  3 practice cuerepetition 2523  203  0 
     32913  15  1  test  cueswitch 2239  205  1 
     32913  15  2  test cuerepetition 1244  203  1 
     32913  15  3  test taskswitch 1472  203  0 
     32913  15  4  test  cueswitch 1877  205  1 
     32913  15  5  test taskswitch 2271  203  1 
     30413  16  1 practice taskswitch 1377  203  1 
     30413  16  2 practice taskswitch 1648  203  1 
     30413  16  3 practice  cueswitch 1181  205  1 
     30413  16  1  test  cueswitch 1045  205  1 
     30413  16  2  test cuerepetition  969  203  0 
     30413  16  3  test  cueswitch  857  203  1 
     30413  16  4  test taskswitch 1038  205  1 
     30413  16  5  test cuerepetition  836  203  0" 
DT <- as.data.table(read.table(text=str, header=TRUE))

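Now, here is one of the things you asked for:

Looking only at the "test" trials, for each unique subject compute the total number of trials, the number of trials with a latency (i.e. reaction time) below 300 ms, the mean latency, and the mean of correct (i.e. accuracy).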

DT[blockcode=="test",
   list(TotalNr = .N,
        NrTrailLat = sum(latency < 300),
        MeanLat = mean(latency),
        MeanCor = mean(correct)),
   by="subject"]
   subject TotalNr NrTrailLat MeanLat MeanCor
1:      15       5          0  1820.6     0.8
2:      16       5          0   949.0     0.6

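Basically, these few lines of code answer all of those questions, and in my opinion the syntax is simple too. From our DT, we only want to look at the observations where blockcode=="test". Next, we want to run all the analyses separately for each subject, which is easily done with the by="subject" statement. The cool thing: if you want to split along several dimensions, just add them. So instead of ignoring the practice trials, let's look at them separately: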

DT[,
   list(TotalNr = .N,
        NrTrailLat = sum(latency < 300),
        MeanLat = mean(latency),
        MeanCor = mean(correct)),
   by="subject,blockcode"]
   subject blockcode TotalNr NrTrailLat  MeanLat   MeanCor
1:      15  practice       3          0 2886.667 0.6666667
2:      15      test       5          0 1820.600 0.8000000
3:      16  practice       3          0 1402.000 1.0000000
4:      16      test       5          0  949.000 0.6000000

Now don't tell me that isn't awesome!
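Since by accepts several columns, one way to keep the date next to each subject in the summary is to group on both. This is only a minimal sketch on a hypothetical four-row extract of the test trials (the table name mini is made up for illustration), and it assumes date is constant within a subject, as it is in the sample data:

```r
library(data.table)

# Hypothetical miniature of the question's data (test trials only)
mini <- data.table(
  date    = c(32913, 32913, 30413, 30413),
  subject = c(15, 15, 16, 16),
  latency = c(2239, 1244, 1045, 969),
  correct = c(1, 1, 1, 0)
)

# Grouping on subject and date keeps both columns in the result
res <- mini[, list(TotalNr = .N,
                   MeanLat = mean(latency),
                   MeanCor = mean(correct)),
            by = list(subject, date)]
print(res)
```

If date could vary within a subject, this would produce one row per subject and date combination rather than one row per subject.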

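Let's try another question:

Also, create a variable containing the last (or first) value of date and subjectID (this is to get the date and subjectID into the new data frame).

I'm not sure what you mean here, because date does not change within a subject in your example. So let's make it a bit harder. Suppose we want to know the latency of the first trial for each subject,blockcode combination. To do that, we should first sort DT so that we know the first row of each group is always trialn 1. (For this sample data that isn't really necessary, since it already appears to be sorted.)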

setkey(DT, subject, blockcode, trialn)
DT[, list(FirstLat = latency[1]), by="subject,blockcode"]
   subject blockcode FirstLat
1:      15  practice     1765
2:      15      test     2239
3:      16  practice     1377
4:      16      test     1045

However, you wanted to add this as a new column in DT. To do that, you can use the := operator:

DT[, FirstLat := latency[1], by="subject,blockcode"]
DT
     date subject trialn blockcode     trialtype latency response correct FirstLat
 1: 32913      15      1  practice    taskswitch    1765      205       1     1765
 2: 32913      15      2  practice     cueswitch    4372      203       1     1765
 3: 32913      15      3  practice cuerepetition    2523      203       0     1765
 4: 32913      15      1      test     cueswitch    2239      205       1     2239
 5: 32913      15      2      test cuerepetition    1244      203       1     2239
 6: 32913      15      3      test    taskswitch    1472      203       0     2239
 7: 32913      15      4      test     cueswitch    1877      205       1     2239
 8: 32913      15      5      test    taskswitch    2271      203       1     2239
 9: 30413      16      1  practice    taskswitch    1377      203       1     1377
10: 30413      16      2  practice    taskswitch    1648      203       1     1377
11: 30413      16      3  practice     cueswitch    1181      205       1     1377
12: 30413      16      1      test     cueswitch    1045      205       1     1045
13: 30413      16      2      test cuerepetition     969      203       0     1045
14: 30413      16      3      test     cueswitch     857      203       1     1045
15: 30413      16      4      test    taskswitch    1038      205       1     1045
16: 30413      16      5      test cuerepetition     836      203       0     1045
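The question also asks for the mean latency per trialtype using only trials within 3 standard deviations of each subject's mean latency, and for everything to end up in one table. The following is only a rough sketch of how that might look with the same by and := ideas. The names test, keep, trimmed, wide and final are made up for illustration, and it assumes the mean and SD are taken over each subject's test trials:

```r
library(data.table)

# Test trials from the question's sample data
txt <- "date subject trialn blockcode trialtype latency response correct
32913 15 1 test cueswitch 2239 205 1
32913 15 2 test cuerepetition 1244 203 1
32913 15 3 test taskswitch 1472 203 0
32913 15 4 test cueswitch 1877 205 1
32913 15 5 test taskswitch 2271 203 1
30413 16 1 test cueswitch 1045 205 1
30413 16 2 test cuerepetition 969 203 0
30413 16 3 test cueswitch 857 203 1
30413 16 4 test taskswitch 1038 205 1
30413 16 5 test cuerepetition 836 203 0"
test <- as.data.table(read.table(text = txt, header = TRUE))

# Per-subject summary, carrying date along (assumed constant per subject)
summ <- test[, list(TotalNr = .N,
                    NrTrailLat = sum(latency < 300),
                    MeanLat = mean(latency),
                    MeanCor = mean(correct)),
             by = list(subject, date)]

# Flag trials within 3 SDs of each subject's own mean latency
test[, keep := abs(latency - mean(latency)) <= 3 * sd(latency), by = subject]

# Mean latency per subject and trialtype, using flagged trials only
trimmed <- test[keep == TRUE, list(MeanLat = mean(latency)),
                by = list(subject, date, trialtype)]

# One row per subject, one column per trialtype
wide <- dcast(trimmed, subject + date ~ trialtype, value.var = "MeanLat")

# Combine both pieces into a single table
final <- merge(summ, wide, by = c("subject", "date"))
print(final)
```

With only five test trials per subject, no trial can lie more than 3 SDs from the mean, so nothing is dropped on this sample; the filter only starts to matter on the full data.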

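So these are just a few ideas to get you started. I went to this effort because I wanted to show you that once you understand the basics, most things become fairly easy. That should be motivation to work through the manual, even if it feels like overkill at first. It is worth the effort, trust me! And I haven't even mentioned the best part: data.table is also very fast. Good luck with your analysis.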


2013-05-28 07:34:00

Thank you very much for the quick tutorial. I think data.table and plyr are both good options, and I'm eager to try both. I'll start reading :) – AlexR

The plyr package is handy for this sort of thing (so is data.table, but I don't know its syntax). Here is an example to get you started:

library(plyr)

# d is assumed to be a data frame holding the data from the question
my_function <- function(tmp){
    data.frame(n_trials = nrow(tmp),                      # number of trials
               n_trialslat = sum(tmp[,'latency'] < 300),  # trials faster than 300 ms
               mean_latency = mean(tmp[,'latency']))
}
ddply(subset(d, blockcode == "test"), 'subject', my_function)
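For simple per-group summaries like these, ddply can also be combined with plyr's summarise, which avoids writing a helper function. The following is only a sketch on a small hypothetical stand-in for d (six rows taken from the question's sample); it adds the mean of correct and groups on date as well, so that date shows up in the result:

```r
library(plyr)

# Hypothetical stand-in for the data frame `d`
d <- data.frame(
  date      = c(32913, 32913, 32913, 30413, 30413, 30413),
  subject   = c(15, 15, 15, 16, 16, 16),
  blockcode = c("practice", "test", "test", "practice", "test", "test"),
  latency   = c(1765, 2239, 1244, 1377, 1045, 969),
  correct   = c(1, 1, 1, 1, 1, 0)
)

# One row per subject (and date), computed on test trials only
res <- ddply(subset(d, blockcode == "test"), c("subject", "date"), summarise,
             n_trials     = length(latency),
             n_trialslat  = sum(latency < 300),
             mean_latency = mean(latency),
             mean_correct = mean(correct))
print(res)
```

Grouping on date as well assumes it is constant within each subject, which holds for the sample data.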


2013-05-27 23:30:24 baptiste

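I added some thoughts on 'data.table'. –

Thanks for the demonstration and for suggesting plyr and data.table. Choosing which answer to accept was tough, because both were very helpful. – AlexR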
