R: Getting a third data frame from two data frames with different structures

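Specifically, I have two data frames. df1 holds information about publications by year: the year, the outlet name, the total number of articles the outlet published in that year, and the cumulative sum of articles over the period I am studying. df2 holds a random sample of article IDs, with possible values ranging from 1 to the total number of articles given by df1$cumsum.

What I need to do is take each article ID in df2 and use the information in df1 to determine which publication and which year it belongs to.

Here is a minimal reproducible example: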

set.seed(890) 
df1 <- NULL 
df1$year <- c(2000:2009, 2000:2009) 
df1$outlet <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,2,2,2,2,2,2,2,2,2) 
df1$article_total <- sample(1:200, 20, replace = T) 
df1$cumsum <- cumsum(df1$article_total) 
df1 <- as.data.frame(df1) 

df2 <- NULL 
df2$art_num <- sample(1:2102, 100, replace = T) # get random sample of article IDs for the total number of articles I have in this db 
df2 <- as.data.frame(df2)

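Ideally, I would also like to compute each article's number within its year. For example, in the data above, outlet 1 has 14 articles in 2000 and 168 in 2001 (cumsum = 183). If my article ID is 156, I want to know that it is the 142nd article of publication 1 in 2001. And so on for every article ID I have in this database.

I think I should do this with a for loop, but I am 100% lost on how to write it. This is what I started writing, but I have a feeling I am not on the right track: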

for (i in seq_along(df2$art_num)) { 
    article_number <- df2$art_num[i] 
    if (article_number %in% df1$cumsum) { # note: cumsum should be an interval before doing this? 
    # get article number, year, publication in new df 
    # also calculate article ID in each year/publication 
    } 
}

Thanks in advance for your help! I am still lost when it comes to writing loops in R...

####################### EDITED example, per Frank's suggestion

set.seed(890) 
df1 <- NULL 
df1$year <- c(2000:2002, 2000:2002) 
df1$outlet <- c(1, 1, 1, 2,2,2) 
df1$article_total <- sample(1:50, 6, replace = T) 
df1$cumsum <- cumsum(df1$article_total) 
df1 <- as.data.frame(df1) 

df2 <- NULL 
df2$art_id <- c(66, 120, 77, 156, 24) 
df2 <- as.data.frame(df2)

Here is the output I am looking for:

  art_id outlet year article_number
1     66      1 2002             19
2    120      2 2000             35
3     77      1 2002             30
4    156      2 2001             35
5     24      1 2000             20

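This example shows my ideal output, df3, which I calculated and built by hand. It has a column with the article ID, the corresponding outlet, the year, and a new variable, article_number. This differs from the article ID: I compute it from df1$cumsum and df3$art_id. In this example, the first row shows that the first article in my database has an ID of 66. I get an article_number of 19 because this article (ID = 66) was the 19th article published by outlet 1 in 2002. I calculated this value by using the article ID and df1$cumsum to locate the year and outlet, then subtracting the previous year's df1$cumsum value from the art_id value. So for this particular article, I calculated df3$article_number[1] = df3$art_id[1] - df1$cumsum[2].

I need to do this calculation for every article in my database, so I would rather not keep doing it by hand.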

Source

2017-09-14 rowbust

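@Frank My desired output is a third df containing the following columns: 'df2$art_num', 'df1$outlet' (based on df2's article number), 'df1$year' (based on df2's article number) and, ideally, 'df3$article_location'. The last one is based on 'df2$art_num' and 'df1$cumsum'. I did not show exactly what I want because I cannot write the code to show what I need in a reproducible example... Ideally, cumsum would be an interval rather than a running total, so that I could find a specific article ID within it. I am not sure I am making sense, so let me know if I can clarify further. – rowbust

@Frank I have included a new output for you to look at... hope this helps! – rowbust

@Frank You are right, I did not set a seed for the sampling of 'df2'. I have written it out by hand instead. – rowbust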

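I think your data structure makes sense, though it would be easier to work with if it had one additional column: the ID of the first article in each year and outlet.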

library(data.table) 
setDT(df1); setDT(df2) 

df1[, art_cstart := shift(cumsum(article_total), fill=0L) + 1L] 

   year outlet article_total cumsum art_cstart
1: 2000      1             4      4          1
2: 2001      1            43     47          5
3: 2002      1            38     85         48
4: 2000      2            36    121         86
5: 2001      2            39    160        122
6: 2002      2             8    168        161

Now we can do a rolling update join, "rolling" each art_id to the nearest cumsum at or above it (roll = -Inf) and computing each desired column:

# the order of the values in .() must match the order of the new column names
df2[, c("outlet", "year", "art_num") := df1[df2, on=.(cumsum = art_id), roll=-Inf, .(
    x.outlet, 
    x.year, 
    i.art_id - x.art_cstart + 1L 
)]] 

   art_id outlet year art_num
1:     66      1 2002      19
2:    120      2 2000      35
3:     77      1 2002      30
4:    156      2 2001      35
5:     24      1 2001      20

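How it works

x[i, on=, roll=, j] is the syntax for a join, with each row of i looked up in x.
In this join, j evaluates to a list of columns; .(...) is shorthand for list(...).
Columns are assigned with (colnames) := .(...).

The assignment is made to the existing table df2, rather than unnecessarily creating a new table.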
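
To make the syntax concrete, here is a minimal sketch of the same kind of rolling join on a tiny made-up lookup table. The tables bins and ids and their columns are hypothetical and are not part of the question's data; only the x[i, on=, roll=, j] form and := come from the explanation above. It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical lookup table: each row covers the IDs up to and including `upper`
bins <- data.table(upper = c(10L, 20L, 30L), label = c("a", "b", "c"))
ids  <- data.table(id = c(3L, 10L, 11L, 30L))

# For each row of `ids` (the i table), look up the row of `bins` (the x table)
# whose `upper` is the nearest value at or above `id`; roll = -Inf rolls backward
res <- bins[ids, on = .(upper = id), roll = -Inf, .(id = i.id, label = x.label)]
print(res)

# Add the looked-up column to `ids` by reference, without copying the table
ids[, label := res$label]
print(ids)
```

IDs 3 and 10 both land in the first bin, 11 in the second and 30 in the third. Keeping both key columns as integers avoids any type coercion during the join.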

To learn more about how data.table syntax works, see the message shown at startup...

> library(data.table) 
data.table 1.10.4 
    The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way 
    Documentation: ?data.table, example(data.table) and browseVignettes("data.table") 
    Release notes, videos and slides: http://r-datatable.com

Source

2017-09-14 19:16:09 Frank

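I can't upvote this enough. This code is exactly what I needed, Frank. Thank you so much! I will definitely look into 'data.table'! – rowbust

This is the code you need, I think: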

df3 <- data.frame(matrix(ncol = 3, nrow = 0)) 
colnames(df3) <- c("articleNumber", "year", "publication") 
# lower bound of each interval: the previous row's cumsum (0 for the first row) 
lower <- c(0, head(df1$cumsum, -1)) 
for (i in seq_along(df2$art_num)) { 
for (j in seq_along(df1$cumsum)) { 
    # article i belongs to row j of df1 if lower[j] < ID <= cumsum[j] 
    if ((df2$art_num[i] > lower[j]) && (df2$art_num[i] <= df1$cumsum[j])) { 
    # get article number, year, publication in new df 
    df3[i, 1] <- df2$art_num[i] 
    df3[i, 2] <- df1$year[j] 
    df3[i, 3] <- df1$outlet[j] 
    # also calculate article ID in each year/publication ISN'T THIS 
    # art_num? NOT REALLY SURE WHAT YOU NEED HERE 
    } 
} 
}
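
The same interval lookup can also be done without explicit loops in base R. The following is a minimal sketch and not the answerer's code: it uses the edited example from the question, with the article totals hard-coded from the table printed in the data.table answer so that the result does not depend on how sample() behaves in a given R version. The helper vectors row and prev are names introduced here for illustration.

```r
# Edited example from the question, with article totals hard-coded (assumption:
# these are the values shown in the printed df1 in the data.table answer)
df1 <- data.frame(
  year          = c(2000:2002, 2000:2002),
  outlet        = c(1, 1, 1, 2, 2, 2),
  article_total = c(4, 43, 38, 36, 39, 8)
)
df1$cumsum <- cumsum(df1$article_total)
df2 <- data.frame(art_id = c(66, 120, 77, 156, 24))

# Row of df1 whose interval (previous cumsum, cumsum] contains each ID.
# left.open = TRUE makes an ID equal to a cumsum fall in that row, not the next.
row <- findInterval(df2$art_id, c(0, df1$cumsum), left.open = TRUE)

# IDs below 1 or above the last cumsum have no matching row
row[row < 1 | row > nrow(df1)] <- NA

# Cumulative total reached just before each row of df1 starts
prev <- c(0, head(df1$cumsum, -1))

df3 <- data.frame(
  art_id         = df2$art_id,
  outlet         = df1$outlet[row],
  year           = df1$year[row],
  article_number = df2$art_id - prev[row]
)
print(df3)
```

With these totals, ID 24 falls in 2001 for outlet 1, since outlet 1 has only 4 articles in 2000; the other four rows match the hand-built table in the question.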

Source

2017-09-14 17:50:38 JenniferHL3

Thanks for this. I have updated the example in the original question to show exactly what I need to get... – rowbust
