2016-12-13 21 views
0

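Creating a tm corpus with text (tweet) attributes from a data frame

I have a data frame containing tweets, creation date, tweet ID, and favorite and retweet counts. I want to create a corpus that carries the favorite and retweet counts as variables for each document. I would also like to identify documents by tweet id rather than by the arbitrary doc 001 etc. id.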

I am starting with the data below; the rest of the code follows it.

    id 
1: 737243856144629760 
2: 737242308261842945 
3: 737242189055594496 
4: 737242018687164416 
5: 737241411465170944 
6: 737239685295181824 
                                    text 
1:             Have a great Memorial Day and remember that we will soon MAKE AMERICA GREAT AGAIN! 
2:     "@NBCDFW: Trump rallies veterans at annual Rolling Thunder Gathering https://twitter.com/b08FcMlgkr https://twitter.com/RCDeLvHQqD" 
3:    "@FrankyLamouche: how many of donald's rolling thunder brigade will sign up and go to war for him in the middle east." 
4: "@MariaErnandez3b: Trump Supports Rolling Thunder Rally #TRUMP STRONG https://twitter.com/pfVXQ8NdZu" So true, and remember the M.I.A.'s! 
5:  "@ScottWRasmussen: Donald Trump and Bikers Share Affection at Rolling Thunder Rally https://twitter.com/ZZl2sc29dn" A great day in D.C.! 
6: "@TeaPartyNevada: #Trump2016 "Illegals are taken care of better than our veterans." https://twitter.com/KKIgM4rNma https://twitter.com/1cEZ8wG7Cy" 
    favorited favoriteCount replyToSN    created truncated replyToSID replyToUID 
1:  FALSE   25944  NA 2016-05-30 11:26:47  FALSE   NA   NA 
2:  FALSE   9268  NA 2016-05-30 11:20:38  FALSE   NA   NA 
3:  FALSE   6739  NA 2016-05-30 11:20:09  FALSE   NA   NA 
4:  FALSE   15417  NA 2016-05-30 11:19:29  FALSE   NA   NA 
5:  FALSE   7192  NA 2016-05-30 11:17:04  FALSE   NA   NA 
6:  FALSE   9834  NA 2016-05-30 11:10:12  FALSE   NA   NA 
                      statusSource  screenName retweetCount 
1: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump   9455 
2: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump   2744 
3: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump   1604 
4: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump   4237 
5: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump   2148 
6: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump   3545 
    isRetweet retweeted longitude latitude 
1:  FALSE  FALSE  NA  NA 
2:  FALSE  FALSE  NA  NA 
3:  FALSE  FALSE  NA  NA 
4:  FALSE  FALSE  NA  NA 
5:  FALSE  FALSE  NA  NA 
6:  FALSE  FALSE  NA  NA 
                                   cleantxt 
1:             have a great memorial day and remember that we will soon make america great again! 
2:     "@nbcdfw: trump rallies veterans at annual rolling thunder gathering https://twitter.com/b08fcmlgkr https://twitter.com/rcdelvhqqd" 
3:    "@frankylamouche: how many of donald's rolling thunder brigade will sign up and go to war for him in the middle east." 
4: "@mariaernandez3b: trump supports rolling thunder rally #trump strong https://twitter.com/pfvxq8ndzu" so true, and remember the m.i.a.'s! 
5:  "@scottwrasmussen: donald trump and bikers share affection at rolling thunder rally https://twitter.com/zzl2sc29dn" a great day in d.c.! 
6: "@teapartynevada: #trump2016 "illegals are taken care of better than our veterans." https://twitter.com/kkigm4rnma https://twitter.com/1cez8wg7cy" 

I tried to convert it to a corpus with:

myReader <- readTabular(mapping=list(content="cleantxt", id="id", created="created", retweet="retweetCount", fav="favoriteCount")) 
trumptweetsenhanced <- VCorpus(DataframeSource(trumptweets.df), readerControl=list(reader=myReader)) 

However, when I convert the corpus back to a data frame, the variables have not been added:

> head(trumptweetsenhanced_dataframe.df) 
     docs                   text 
1 doc 0001       great memori day rememb will soon make america great 
2 doc 0002       nbcdfw trump ralli veteran annual roll thunder gather 
3 doc 0003  frankylamouch mani donald roll thunder brigad will sign go war middl east 
4 doc 0004  mariaernandezb trump support roll thunder ralli trump strong true rememb ms 
5 doc 0005 scottwrasmussen donald trump biker share affect roll thunder ralli great day dc 
6 doc 0006       teapartynevada trump illeg taken care better veteran 

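So where are you stuck? I don't see a specific, answerable question here. Try asking a focused question. Show any code you have tried and describe exactly where you got stuck. Include sample data in a [reproducible format](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). That will make it easier to help you. – MrFlick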


I have provided more information and narrowed the question down to one specific problem. – idomeneus

Answers

1

You can add metadata to your tweets corpus with the tm::meta() function. See library(tm); example(meta).

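This metadata annotation can happen at the corpus level, where you may want to store "common" metadata such as the dates on which the tweets in this corpus were collected, the search query string, API call details, or whatever else you need.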
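As a rough illustration of corpus-level annotation, here is a minimal sketch. The two-tweet corpus and the tag names "query" and "collected" are made up for the example and are not part of the original data.

```r
library(tm)

# Hypothetical two-tweet corpus standing in for the real data
corp <- VCorpus(VectorSource(c("first tweet text", "second tweet text")))

# Corpus-level metadata: one value for the whole collection.
# The tag names "query" and "collected" are assumptions for this sketch.
meta(corp, tag = "query", type = "corpus") <- "from:realDonaldTrump"
meta(corp, tag = "collected", type = "corpus") <- "2016-05-30"

# Read a single tag back
corpus_query <- meta(corp, tag = "query", type = "corpus")
```

Calling meta(corp, type = "corpus") without a tag should return all corpus-level entries at once.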

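Annotation can also happen at the document level (here, at the level of each tweet): you can store tweet attributes from the trumptweets.df data frame in the corpus, such as retweet count, fav count, language, etc.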

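This calls for smart and careful housekeeping. You typically call meta() in both read and write fashion through a set of custom functions and the *apply family of functions (or purrr::walk* or purrr::map*).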
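The per-tweet housekeeping described above could look roughly like the sketch below. It builds the corpus from a plain character vector rather than from readTabular(), so that it does not depend on a particular tm version. The small `tweets` data frame is a hypothetical stand-in for trumptweets.df, and the tag names "retweet" and "fav" are taken from the mapping in the question.

```r
library(tm)

# Hypothetical stand-in for trumptweets.df. IDs are kept as character
# because they are too large to be stored exactly as doubles.
tweets <- data.frame(
  id = c("737243856144629760", "737242308261842945"),
  cleantxt = c("have a great memorial day", "trump rallies veterans"),
  retweetCount = c(9455, 2744),
  favoriteCount = c(25944, 9268),
  stringsAsFactors = FALSE
)

corp <- VCorpus(VectorSource(tweets$cleantxt))

# Document-level metadata: write the tweet ID into each document
for (i in seq_along(corp)) {
  meta(corp[[i]], "id") <- tweets$id[i]
}

# Indexed metadata: one column per variable, one row per document
meta(corp, "retweet") <- tweets$retweetCount
meta(corp, "fav") <- tweets$favoriteCount

# Rebuild a data frame that carries the IDs and the counts
result <- data.frame(
  id = vapply(seq_along(corp),
              function(i) meta(corp[[i]], "id"), character(1)),
  text = vapply(seq_along(corp),
                function(i) paste(as.character(corp[[i]]), collapse = " "),
                character(1)),
  meta(corp),
  stringsAsFactors = FALSE
)
```

The counts are stored here as indexed metadata because meta(corp) then returns them as a ready-made data frame; storing them inside each document, as is done for the ID, would work as well but needs a loop to read them back.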

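I am writing this from memory; it has been a while since I last used meta(). There may be a completely different way to do it (using nested data frames, or another text-mining package).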


Thanks a lot... this is very helpful, but in a sense it is what I was already trying to do. See the code I used... – idomeneus
