2011-11-08 38 views
4

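How can I create a table by restructuring a MALLET output file?

I'm using MALLET for topic analysis, which writes its results to a text file ("topics.txt") of several thousand lines, each with a hundred or so tab-separated variables, like this: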

Num1 text1 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc. 
Num2 text2 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc. 
Num3 text3 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc. 

Here is a snippet of the actual data:

> dat[1:5,1:10] 

    V1 V2 V3 V4 V5  V6 V7  V8 V9  V10 
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624 
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928 
3 2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119 
4 3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091 
5 4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521 

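I'm trying to use R to convert this output into a data table where the topics are the column headers and each topic column holds the value of the 'proportion' variable that sits directly to the right of the matching 'topic' variable, for each value of 'text'. Like this: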

 topic1  topic2  topic3 
text1 proportion1 proportion2 proportion3 
text2 proportion1 proportion2 proportion3 

Or, with the data in the snippet above, like this:

          0          2          7          8          10         12         13         16         18         20         21         23         24         27
10.txt    0          0          0          0          0          0          0          0          0          0.1315621  0.03632624 0.3040853  0          0.4560785
1001.txt  0          0          0          0.1699586  0          0.2099153  0.1692292  0          0          0.2660085  0          0          0          0
1002.txt  0          0.1747023  0          0          0.1360454  0.0750711  0          0.3341721  0          0          0          0          0          0
1003.txt  0.0186709  0          0          0.2255179  0          0.5366148  0          0          0.138856   0          0          0          0          0
1005.txt  0.2214441  0          0.1776052  0          0          0          0          0.2363206  0          0          0          0          0.1914769  0

Here is the R code I have for the job, sent by a friend, but it doesn't work for me (and I don't know enough about it to fix it myself):

########################################## 
dat<-read.table("topics.txt", header=F, sep="\t") 
datnames<-subset(dat, select=2) 
dat2<-subset(dat, select=3:length(dat)) 
y <- data.frame(topic=character(0),proportion=character(0),text=character(0)) 
for(i in seq(1, length(dat2), 2)){ 
z<-i+1 
x<-dat2[,i:z] 
x<-cbind(x, datnames) 
colnames(x)<-c("topic","proportion", "text") 
y<-rbind(y, x) 
} 

# Right at this step at the end of the block 
# I get this message that may indicate the problem: 
# Error in c(in c("topic", "proportion", "text") : unused argument(s) ("text") 

y[is.na(y)] <- 0 
xdat<-xtabs(proportion ~ text+topic, data=y) 
write.table(xdat, file="topicMatrix.txt", sep="\t", eol = "\n", quote=TRUE, col.names=TRUE, row.names=TRUE) 
########################################## 

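I'd be very grateful for any suggestions on how to get this code working. My problem may be related to this one, and possibly to this one, but I don't yet have the skills to make immediate use of the answers to those questions.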

+1

You won't get much help unless you provide the real data structure ... one that has numbers for those proportions. Use dput(head(dat, 20)). –

+0

Thanks for the tip, I've added some more detail. – Ben

+0

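I should also add that running `rm(list = ls(all = TRUE))` first changed the problem slightly: at the end of the loop block the error message is now "Error in `[.data.frame`(dat2, , i:z) : undefined columns selected". In any case, I think @Ramnath's answer is a promising option. – Ben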

Answers

4

Here is one approach to your problem:

# Placeholder data in the same layout as the MALLET output
dat <- read.table(as.is = TRUE, header = FALSE, textConnection(
    "Num1 text1 topic1 proportion1 topic2 proportion2 topic3 proportion3 
    Num2 text2 topic1 proportion1 topic2 proportion2 topic3 proportion3 
    Num3 text3 topic1 proportion1 topic2 proportion2 topic3 proportion3")) 

NTOPICS <- 3 
# Column names: num, text, topic1, proportion1, topic2, proportion2, ...
nam <- c('num', 'text', 
    paste(c('topic', 'proportion'), rep(1:NTOPICS, each = 2), sep = "")) 

# Wide to long: one row per text and topic slot
dat_l <- reshape(setNames(dat, nam), varying = 3:length(nam), direction = 'long', 
    sep = "") 
# Long to wide: one column per topic, filled with the proportions
reshape2::dcast(dat_l, num + text ~ topic, value.var = 'proportion') 

num text  topic1  topic2  topic3 
1 Num1 text1 proportion1 proportion2 proportion3 
2 Num2 text2 proportion1 proportion2 proportion3 
3 Num3 text3 proportion1 proportion2 proportion3 

Edit: This works whether the proportions are text or numbers. You can also change NTOPICS to match the number of topics you have.
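The example above runs on placeholder strings. As a rough sketch only, the same naming trick can be tried on the numeric snippet from the question. Two things here are assumptions that are not part of the answer: `NTOPICS` is derived from the column count instead of being set by hand, and base R's `xtabs()` is swapped in for `reshape2::dcast()` so that no extra package is needed. The name `topic_matrix` is just illustrative.

```r
# Snippet of the real data from the question (space-separated here; the real
# topics.txt is tab-separated, so sep = "\t" would be needed for the file).
dat <- read.table(header = FALSE, as.is = TRUE, text = "
0 10.txt   27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
1 1001.txt 20 0.2660085 12 0.2099153  8 0.1699586 13 0.16922928
2 1002.txt 16 0.3341721  2 0.1747023 10 0.1360454 12 0.07507119
3 1003.txt 12 0.5366148  8 0.2255179 18 0.1388561  0 0.01867091
4 1005.txt 16 0.2363206  0 0.2214441 24 0.1914769  7 0.17760521")

# Derive the number of topic/proportion pairs rather than hard-coding it.
# Integer division ignores a single trailing empty column, if there is one.
NTOPICS <- (ncol(dat) - 2) %/% 2
nam <- c("num", "text",
         paste(c("topic", "proportion"), rep(1:NTOPICS, each = 2), sep = ""))

# Keep only the columns that received a name, then go from wide to long.
dat_l <- reshape(setNames(dat[, seq_along(nam)], nam),
                 varying = 3:length(nam), direction = "long", sep = "")

# Cross-tabulate: documents in rows, topics in columns, 0 where a topic is
# absent. xtabs() sums the proportions if a document lists a topic twice.
topic_matrix <- xtabs(proportion ~ text + topic, data = dat_l)
print(topic_matrix)
```

Unlike `dcast()`, `xtabs()` needs numeric proportions, so this variant would not work on the placeholder strings used in the answer.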

+0

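Thanks for the suggestion, I can reproduce your example and get it to work on my full data set. If we change `dat_l <- reshape(setNames(dat, nam), varying = 3:8, direction = 'long', sep = "")` to `dat_l <- reshape(setNames(dat, nam), varying = 3:((NTOPICS * 2) + 2), direction = 'long', sep = "")`, it seems to become more general and efficient when working with different numbers of topics. – Ben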

+1

You're right. I've edited my solution to reflect that. – Ramnath

+0

Even better, thanks very much! – Ben

2

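You can get it into long format, but going further requires the real data. Edit, after the data was provided: I'm still not sure about the overall structure of what MALLET produces, but at least the R functions are demonstrated. This approach has the "feature" that the proportions are summed if a document has overlapping topics. Depending on the data layout, that may be an advantage or a disadvantage.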

dat <-read.table(textConnection(" V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624 
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928 
3 2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119 
4 3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091 
5 4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521 
"), 
      header=TRUE) 
ldat <- reshape(dat, idvar=1:2, varying=list(topics=c("V3", "V5", "V7", "V9"), 
              props=c("V4", "V6", "V8", "V10")), 
         direction="long") 
####------------------#### 
    > ldat 
      V1  V2 time V3   V4 
0.10.txt.1 0 10.txt 1 27 0.45607850 
1.1001.txt.1 1 1001.txt 1 20 0.26600850 
2.1002.txt.1 2 1002.txt 1 16 0.33417210 
3.1003.txt.1 3 1003.txt 1 12 0.53661480 
4.1005.txt.1 4 1005.txt 1 16 0.23632060 
0.10.txt.2 0 10.txt 2 23 0.30408530 
1.1001.txt.2 1 1001.txt 2 12 0.20991530 
2.1002.txt.2 2 1002.txt 2 2 0.17470230 
3.1003.txt.2 3 1003.txt 2 8 0.22551790 
4.1005.txt.2 4 1005.txt 2 0 0.22144410 
0.10.txt.3 0 10.txt 3 20 0.13156210 
1.1001.txt.3 1 1001.txt 3 8 0.16995860 
2.1002.txt.3 2 1002.txt 3 10 0.13604540 
3.1003.txt.3 3 1003.txt 3 18 0.13885610 
4.1005.txt.3 4 1005.txt 3 24 0.19147690 
0.10.txt.4 0 10.txt 4 21 0.03632624 
1.1001.txt.4 1 1001.txt 4 13 0.16922928 
2.1002.txt.4 2 1002.txt 4 12 0.07507119 
3.1003.txt.4 3 1003.txt 4 0 0.01867091 
4.1005.txt.4 4 1005.txt 4 7 0.17760521 

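Now I can show you how to use xtabs(), since those "proportions" are numeric. Something like this may turn out to be what you want. I'm surprised that the topics are also integers, but perhaps there is a mapping from the topic numbers to topic names?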

> xtabs(V4 ~ V3 + V2, data=ldat) 
    V2 
V3  10.txt 1001.txt 1002.txt 1003.txt 1005.txt 
    0 0.00000000 0.00000000 0.00000000 0.01867091 0.22144410 
    2 0.00000000 0.00000000 0.17470230 0.00000000 0.00000000 
    7 0.00000000 0.00000000 0.00000000 0.00000000 0.17760521 
    8 0.00000000 0.16995860 0.00000000 0.22551790 0.00000000 
    10 0.00000000 0.00000000 0.13604540 0.00000000 0.00000000 
    12 0.00000000 0.20991530 0.07507119 0.53661480 0.00000000 
    13 0.00000000 0.16922928 0.00000000 0.00000000 0.00000000 
    16 0.00000000 0.00000000 0.33417210 0.00000000 0.23632060 
    18 0.00000000 0.00000000 0.00000000 0.13885610 0.00000000 
    20 0.13156210 0.26600850 0.00000000 0.00000000 0.00000000 
    21 0.03632624 0.00000000 0.00000000 0.00000000 0.00000000 
    23 0.30408530 0.00000000 0.00000000 0.00000000 0.00000000 
    24 0.00000000 0.00000000 0.00000000 0.00000000 0.19147690 
    27 0.45607850 0.00000000 0.00000000 0.00000000 0.00000000 
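The summing behaviour mentioned at the top of this answer is easiest to see on a tiny made-up long table. The sketch below uses invented data (the `long` data frame and its column names are purely illustrative) in which one document lists the same topic twice, and then shows one possible way to flag such pairs before trusting the cross-tabulated values.

```r
# Made-up long table: document "a.txt" lists topic 3 twice.
long <- data.frame(
  doc   = c("a.txt", "a.txt", "a.txt", "b.txt"),
  topic = c(3, 3, 5, 3),
  prop  = c(0.2, 0.1, 0.7, 1.0)
)

# xtabs() adds the two topic-3 entries for a.txt together (0.2 + 0.1).
# Documents are put in rows here by listing doc first in the formula.
tab <- xtabs(prop ~ doc + topic, data = long)
print(tab)

# Flag every (doc, topic) pair that occurs more than once.
pair <- long[, c("doc", "topic")]
dups <- long[duplicated(pair) | duplicated(pair, fromLast = TRUE), ]
print(dups)
```

If `dups` comes back empty for the real data, the cross-tabulation simply rearranges the proportions and no summing takes place.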
+0

Thanks for the quick suggestion. I can reproduce your results. How can it be generalized to 30 (or 100 or more) topics? – Ben

+1

If the column names are very regular, then the "varying" argument could be `topics = paste("V", seq(1, 100, by = 2), sep = "")` and `props = paste("V", seq(2, 100, by = 2), sep = "")` –

+0

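Thanks for the quick help. Unfortunately I can't see why your suggestion doesn't work for me, but @Ramnath's code did the job, so I'm happy to close the case. Thanks again. – Ben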

2

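Coming back to this problem, I found that the reshape function was too demanding on memory, so I used a data.table approach instead. It takes more steps, but it is faster and less memory-intensive.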

dat <- read.table(text = "V1 V2 V3 V4 V5  V6 V7  V8 V9  V10 
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624 
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928 
3 2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119 
4 3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091 
5 4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521") 

dat$V11 <- rep(NA, 5) # my real data has this extra unwanted col 
library(data.table) # must be loaded before data.table() is called
dat <- data.table(dat) 

# get document number 
docnum <- dat$V1 
# get text number 
txt <- dat$V2 

# remove doc num and text num so we just have topic and props 
dat1 <- dat[ ,c("V1","V2", paste0("V", ncol(dat))) := NULL] 
# get topic numbers 
n <- ncol(dat1) 
tops <- apply(dat1, 1, function(i) i[seq(1, n, 2)]) 
# get props 
props <- apply(dat1, 1, function(i) i[seq(2, n, 2)]) 

# put topics and props together 
tp <- lapply(1:ncol(tops), function(i) data.frame(tops[,i], props[,i])) 
names(tp) <- txt 
# make into long table 
dt <- data.table::rbindlist(tp) 
dt$doc <- unlist(lapply(txt, function(i) rep(i, ncol(dat1)/2))) 
dt$docnum <- unlist(lapply(docnum, function(i) rep(i, ncol(dat1)/2))) 

# reshape to wide 
library(data.table) 
setkey(dt, tops...i., doc) 
out <- dt[CJ(unique(tops...i.), unique(doc))][, as.list(props...i.), by=tops...i.] 
setnames(out, c("topic", as.character(txt))) 

# transpose to have table of docs (rows) and columns (topics) 
tout <- data.table(t(out)) 
setnames(tout, unname(as.character(tout[1,]))) 
tout <- tout[-1,] 
row.names(tout) <- txt 

# replace NA with zero 
tout[is.na(tout)] <- 0 

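And here is the output, with documents as rows and topics as columns. The document names are stored as row names, which are not printed but can be used later.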

tout 

      0   2   7   8  10   12  13  16  18 
1: 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000 
2: 0.00000000 0.0000000 0.0000000 0.1699586 0.0000000 0.20991530 0.1692293 0.0000000 0.0000000 
3: 0.00000000 0.1747023 0.0000000 0.0000000 0.1360454 0.07507119 0.0000000 0.3341721 0.0000000 
4: 0.01867091 0.0000000 0.0000000 0.2255179 0.0000000 0.53661480 0.0000000 0.0000000 0.1388561 
5: 0.22144410 0.0000000 0.1776052 0.0000000 0.0000000 0.00000000 0.0000000 0.2363206 0.0000000 
      20   21  23  24  27 
1: 0.1315621 0.03632624 0.3040853 0.0000000 0.4560785 
2: 0.2660085 0.00000000 0.0000000 0.0000000 0.0000000 
3: 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000 
4: 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000 
5: 0.0000000 0.00000000 0.0000000 0.1914769 0.0000000 
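For comparison, here is a base-R sketch of a different technique, not the method used above: it allocates the documents-by-topics matrix once and fills it with a two-column index matrix, which avoids both the long table and the transpose step. It has not been benchmarked against the data.table version, and it assumes the topic/proportion pairs start in column 3. All object names (`n_pairs`, `topic_ids`, `out`, and so on) are illustrative.

```r
dat <- read.table(header = FALSE, as.is = TRUE, text = "
0 10.txt   27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
1 1001.txt 20 0.2660085 12 0.2099153  8 0.1699586 13 0.16922928
2 1002.txt 16 0.3341721  2 0.1747023 10 0.1360454 12 0.07507119
3 1003.txt 12 0.5366148  8 0.2255179 18 0.1388561  0 0.01867091
4 1005.txt 16 0.2363206  0 0.2214441 24 0.1914769  7 0.17760521")

# Positions of the topic columns (3, 5, ...) and proportion columns (4, 6, ...).
# Integer division ignores a single trailing empty column, if there is one.
n_pairs   <- (ncol(dat) - 2) %/% 2
topic_col <- 2 + seq(1, by = 2, length.out = n_pairs)
prop_col  <- topic_col + 1

topics <- as.matrix(dat[, topic_col, drop = FALSE])  # documents x pairs
props  <- as.matrix(dat[, prop_col, drop = FALSE])

# One output row per document, one output column per distinct topic id.
topic_ids <- sort(unique(as.vector(topics)))
out <- matrix(0, nrow = nrow(dat), ncol = length(topic_ids),
              dimnames = list(dat[[2]], as.character(topic_ids)))

# Fill every cell in one step using (row, column) index pairs.
# Missing topic ids are skipped; if a document repeats a topic, the last
# value wins here (it is not summed, unlike xtabs()).
keep <- !is.na(as.vector(topics))
idx  <- cbind(as.vector(row(topics)), match(as.vector(topics), topic_ids))
out[idx[keep, , drop = FALSE]] <- as.vector(props)[keep]

print(out)
```

Because `out` is an ordinary matrix, the document names stay attached as row names and can be written out directly with `write.table()`.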