2012-11-06 43 views
1

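Aligning and alternating strings

I have been working on text analysis of data. The analysis often consists of coding a transcript on paper and then importing that information into R as numeric codes. I would like to output a transcript of words with their word numbers printed above them, wrapped to a given line width (let's use an arbitrary 80 characters).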

A minimal visual example:

#what we start with:

   person   text word.num
1    greg    The        1
2    greg    dog        2
3    greg   went        3
4    greg     to        4
5    greg    the        5
6    greg   zoo,        6
7    greg    but        7
8    greg    ate        8
9    greg first.        9
10  sally     He       10
11  sally  likes       11
12  sally  water       12
13  sally      a       13
14  sally    bit       14
15  sally   too.       15

#what I'd like:

1   2   3    4  5   6
The dog went to the zoo,

7   8   9      10 11
but ate first. He likes

12    13 14  15
water a  bit too.

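A further problem: with large word counts the numbers grow and can become wider than short words, in which case the word needs extra padding. I think this would be easy to handle during the pasting step by finding the number of characters (digits) in the largest number and padding every word shorter than that out to that width.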
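The padding idea can be sketched in base R as follows. This is only an illustration: the `words` and `nums` vectors are made-up data chosen so that the numbers are wider than some of the words, and `formatC()` with `flag = "-"` is one of several base functions that could do the left-justified padding.

```r
# Hypothetical data: short words paired with large word numbers.
words <- c("a", "to", "zoo,")
nums  <- c(998L, 999L, 1000L)

# Width of the widest number, in characters (digits).
max_digits <- max(nchar(as.character(nums)))

# Pad any word shorter than that width with trailing spaces.
short  <- nchar(words) < max_digits
padded <- words
padded[short] <- formatC(words[short], width = max_digits, flag = "-")

padded
```

Words that are already at least as wide as the widest number are left untouched.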

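My thoughts on how to solve this so far:

  1. Create a 1-column matrix of character strings, each row at most a certain length (strwrap may be useful here)
  2. Add the extra spaces as described above (nchar and gsub may be useful here)
  3. Work out the values for a companion matrix by using a word-count function, then use cumsum and seq to build a companion matrix of numbers (actually character values), also with 1 column. Its rows will match the rows of the character (word) matrix one for one.
  4. Now the two matrices need to be interleaved row by row (not sure how to do this)
  5. Align the numbers above the words (not sure how to do this either, but nchar may be useful here)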
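One way the steps above might fit together in base R is sketched below. It is not taken from any of the answers: `number_words()` is a hypothetical helper, it pads each word and its number to a shared column width instead of going through `strwrap`, and it uses a simple greedy rule to decide where lines break.

```r
# Hypothetical helper: word numbers printed above words, wrapped to `width`.
number_words <- function(words, width = 80) {
  words <- as.character(words)
  nums  <- as.character(seq_along(words))

  # Each column is as wide as the longer of the word and its number.
  w   <- pmax(nchar(words), nchar(nums))
  pad <- function(x) paste0(x, strrep(" ", w - nchar(x)))

  # Greedy line breaking: start a new line when the next column would overflow.
  line_id <- integer(length(words))
  used <- 0L
  current <- 1L
  for (i in seq_along(words)) {
    need <- w[i] + (used > 0L)          # one separating space unless first
    if (used > 0L && used + need > width) {
      current <- current + 1L
      used <- 0L
      need <- w[i]
    }
    used <- used + need
    line_id[i] <- current
  }

  # Paste the padded columns of each line together.
  join <- function(x) {
    vapply(split(x, line_id), paste, character(1), collapse = " ")
  }

  # Interleave rows: number line, word line, blank line.
  out <- as.vector(rbind(join(pad(nums)), join(pad(words)), ""))
  trimws(out, which = "right")
}

words <- c("The", "dog", "went", "to", "the", "zoo,", "but", "ate",
           "first.", "He", "likes", "water", "a", "bit", "too.")
cat(number_words(words, width = 24), sep = "\n")
```

The interleaving in step 4 relies on `rbind()` followed by `as.vector()`, which reads the matrix column by column and so alternates the number and word lines. The function returns a character vector, so the result can be passed to `cat()` or `writeLines()` as needed.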

I'd like to keep this in base tools. Although I'm sure Hadley's stringr would be useful, I want to avoid that dependency.

dput of the data above:

dat <- structure(list(person = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,       
    1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("greg", "sally"), class = "factor"),    
     text = structure(c(10L, 5L, 14L, 11L, 9L, 15L, 4L, 2L, 6L,        
     7L, 8L, 13L, 1L, 3L, 12L), .Label = c("a", "ate", "bit",         
     "but", "dog", "first.", "He", "likes", "the", "The", "to",        
     "too.", "water", "went", "zoo,"), class = "factor"), word.num = 1:15), row.names = c(NA, 
    -15L), .Names = c("person", "text", "word.num"), class = "data.frame") 

I couldn't come up with a title that I feel captures the problem and is also searchable for future users. Please suggest edits...

+0

The 'dput' doesn't look like what you show under "what we start with" – GSee

+0

My apologies. I'm sick. Will fix it as soon as possible. –

+0

Is there a reason the word "likes" isn't in the output? – GSee

Answers

3
> datmat <- matrix(c(1:length(dat$text), as.character(dat$text)), nrow=2, byrow=TRUE) 
> datmat 
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]  [,10] [,11] [,12] [,13] [,14] [,15] 
[1,] "1" "2" "3" "4" "5" "6" "7" "8" "9"  "10" "11" "12" "13" "14" "15" 
[2,] "The" "dog" "went" "to" "the" "zoo," "but" "ate" "first." "He" "likes" "water" "a" "bit" "too." 
> options(width=30) 
> datmat 
    [,1] [,2] [,3] [,4] 
[1,] "1" "2" "3" "4" 
[2,] "The" "dog" "went" "to" 
    [,5] [,6] [,7] [,8] 
[1,] "5" "6" "7" "8" 
[2,] "the" "zoo," "but" "ate" 
    [,9]  [,10] [,11] 
[1,] "9"  "10" "11" 
[2,] "first." "He" "likes" 
    [,12] [,13] [,14] 
[1,] "12" "13" "14" 
[2,] "water" "a" "bit" 
    [,15] 
[1,] "15" 
[2,] "too." 

The quotes can be removed by coercing the matrix to a table-classed object, so that print.table is used:

> class(datmat) <- "table" 
> datmat 
    [,1] [,2] [,3] [,4] [,5] 
[1,] 1 2 3 4 5 
[2,] The dog went to the 
    [,6] [,7] [,8] [,9] 
[1,] 6 7 8 9  
[2,] zoo, but ate first. 
    [,10] [,11] [,12] [,13] 
[1,] 10 11 12 13 
[2,] He likes water a  
    [,14] [,15] 
[1,] 14 15 
[2,] bit too. 

You might also be able to do something with this. It addresses the left-alignment issue that Gavin mentioned:

> gsub("\\[.*\\,.*\\]", "", capture.output(print(datmat, quote=FALSE))) 
[1] "  "      
[2] " 1 2 3 4 5 " 
[3] " The dog went to the " 
[4] "  "     
[5] " 6 7 8 9  " 
[6] " zoo, but ate first." 
[7] "  "      
[8] " 10 11 12 13 " 
[9] " He likes water a " 
[10] "  "      
[11] " 14 15 "    
[12] " bit too. " 

And one further refinement:

datlines <- gsub("\\[.*\\,.*\\]", "", capture.output(print(datmat, quote=FALSE))) 
for(i in seq_along(datlines)){ cat(datlines[i], "\n") } 
#----------------------------------# 
1 2 3 4 5  
The dog went to the 

6 7 8 9  
zoo, but ate first. 

10 11 12 13  
He likes water a  

14 15  
bit too. 
+1

'print(datmat, quote = FALSE)' is a simpler way of printing without quotes than changing the class. –

+0

Thanks. I had tried rownames = FALSE and didn't get any joy. –

+0

This is the answer I used to work out my own approach. Thanks for your work. +1 –

3

How about:

> tmp <- setNames(as.character(dat$text), dat$word.num) 
> print(tmp, quote=FALSE) 
    1  2  3  4  
likes water  a bit too. 
> options(width = 80) 
> print(tmp, quote=FALSE) 
    1  2  3  4  5  6  7  8  9  10  11 
    The dog went  to the zoo, but ate first.  He likes 
    12  13  14  15 
water  a bit too. 

You could put your own class on the object and add a print method:

class(tmp) <- "foo" 
print.foo <- function(x, quote = FALSE, ...) { 
    print(unclass(x), quote = quote, ...) 
} 
tmp 

> tmp 
    1  2  3  4  5  6  7  8  9  10  11 
    The dog went  to the zoo, but ate first.  He likes 
    12  13  14  15 
water  a bit too. 

One way to dump this representation to a file is capture.output(), which has a file argument:

capture.output(tmp, file = "foo.txt") 

The resulting text file contains:

$ cat foo.txt 
    1  2  3  4  5  6  7  8  9  10  11 
    The dog went  to the zoo, but ate first.  He likes 
    12  13  14  15
water  a bit too.

It isn't exactly what you asked for, because the word numbers are right-aligned, but it is close.
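As a quick check of the capture.output() route, the sketch below writes the printed representation to a temporary file and reads it back. The three-word vector and the `out_file` path are made up for illustration, and `print(tmp, quote = FALSE)` is called directly so the example does not depend on the custom print method.

```r
# Hypothetical small example: words named by their word numbers.
tmp <- setNames(c("The", "dog", "went"), 1:3)

# Temporary output path, used here only for illustration.
out_file <- tempfile(fileext = ".txt")

# Write the printed representation to the file instead of the console.
capture.output(print(tmp, quote = FALSE), file = out_file)

# Read it back and compare with what the console would have shown.
file_lines    <- readLines(out_file)
console_lines <- capture.output(print(tmp, quote = FALSE))
identical(file_lines, console_lines)
```

Because the file holds exactly what `print()` emits, the line width in the file follows `options(width = ...)` at the time of the call.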

+0

Awesome answer, and way easier than what I was doing. Thanks +1 –

1

For completeness, here is the approach I took (as a function), using DWin's solution and a bit of Gavin's from this thread:

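numbtext <- function(text.var, width=80, txt.file = NULL) { 
    # Row 1 holds the word numbers, row 2 the words themselves 
    zz <- matrix(c(seq_along(text.var), as.character(text.var)), 
     nrow=2, byrow=TRUE) 
    # Set the print width temporarily; restore it on exit, even after an error 
    OW <- options(width=width) 
    on.exit(options(OW)) 
    # Blank dimnames suppress the [1,] and [,1] labels 
    dimnames(zz) <- list(rep("", nrow(zz)), rep("", ncol(zz))) 
    print(zz, quote = FALSE) 
    if (!is.null(txt.file)){ 
     # Also append the same printout to the file 
     sink(file=txt.file, append = TRUE) 
     print(zz, quote = FALSE) 
     sink() 
    } 
} 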

numbtext(dat$text, 40, "foo.txt") 

Which produces:

1 2 3 4 5 6 7 8 
The dog went to the zoo, but ate 

9  10 11 12 13 14 15 
first. He likes water a bit too.