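Best and most efficient way to count tokens

I have a CSV that starts with 3 columns: a cumulative percentage of cost column, a cost column, and a keyword column. The R script works on small files but dies completely (it never finishes) when I feed it the actual file, which has a million rows. Can you help me make this script more efficient? Token.Count is the column I'm having trouble creating. Thanks!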
# Token Histogram
# Import CSV data from Report Downloader API Feed
Mydf <- read.csv("Output_test.csv.csv", sep=",", header = TRUE, stringsAsFactors=FALSE)
# Helps limit the dataframe according to the HTT (HEAD/TORSO/TAIL) segment
# Change number to:
# .99 for big picture
# .8 for HEAD
limitor <- Mydf$CumuCost <= .8
# De-comment to ONLY measure TORSO
#limitor <- (Mydf$CumuCost <= .95 & Mydf$CumuCost > .8)
# De-comment to ONLY measure TAIL
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .95)
# De-comment to ONLY measure Non-HEAD
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .8)
# Creates a vector of HTT segmentation labels
# (ifelse() already returns the full vector, so no empty data.frame is needed first)
HTT <- ifelse(Mydf$CumuCost <= .8,"HEAD",ifelse(Mydf$CumuCost <= .95,"TORSO","TAIL"))
# Add the column to Mydf and rename it HTT
Mydf <- transform(Mydf, HTT = HTT)
# Count all KWs in account by using the dimension function
KWportfolioSize <- dim(Mydf)[1]
# Percent of portfolio
PercentofPortfolio <- sum(limitor)/KWportfolioSize
# Length of Keyword -- TOO SLOW
# Uses the Tau package
# My function takes the row number and returns the number of tokens
library(tau)
Myfun <- function(n) {
  sum(sapply(Mydf$Keyword.text[n], textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L))
}
# Creates a dataframe to hold the results
Token.Count <- data.frame()
# Loops until last row and store it in data.frame
for (i in c(1:dim(Mydf)[1])) {Token.Count <- rbind(Token.Count,Myfun(i))}
# Add the column to Mydf
Mydf <- transform(Mydf, Token.Count = Token.Count)
# Not quite sure why but the column needs renaming in this case
colnames(Mydf)[dim(Mydf)[2]] <- "Token.Count"
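The loop above grows a data.frame with rbind() one row at a time, which copies the accumulated result on every iteration and is a likely reason the script never finishes on a million rows. A minimal sketch of a vectorized alternative in base R follows. It assumes that splitting on the same "[[:space:][:punct:]]+" pattern gives the same counts as textcnt() does, which has not been verified on the real file; `count_tokens` is a hypothetical helper name, not part of the original script or of the tau package.

```r
# Hypothetical helper (not from the original script): counts tokens per keyword
# by splitting on runs of whitespace/punctuation, the same pattern passed to textcnt().
count_tokens <- function(x) {
  pieces <- strsplit(x, "[[:space:][:punct:]]+")
  # Ignore empty strings, which strsplit() yields when a keyword starts with a separator
  vapply(pieces, function(p) sum(nzchar(p)), integer(1))
}

# The three sample keywords quoted in the comments
keywords <- c("north + face + outlet", "kinect sensor", "x box 360 250")
count_tokens(keywords)
# [1] 3 2 4
```

With this helper, the column could be added in a single step with `Mydf$Token.Count <- count_tokens(Mydf$Keyword.text)`, which also avoids the later renaming. Missing keywords (NA) would need separate handling.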
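Can you link to a sample of the data? Feel free to make it synthetic, just representative, so people can test their approaches and make sure they are faster. – 2010-12-10 21:12:56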
CumuCost      Cost    Keyword.text
0.004394288   678.5   north+face+outlet
0.006698245   80.05   kinect sensor
0.008738991   79.51   x box 360 250
– datayoda 2010-12-10 22:47:12
'data.frame': 74231 obs. of 5 variables:
 $ CumuCost    : num 0.00439 0.0067 0.00874 0.01067 0.01258 ...
 $ Cost        : num 1678 880 780 736 731 ...
 $ Keyword.text: chr "north + face + outlet" "kinect sensor" "x box 360 250" ...
 $ HTT         : Factor w/ 1 level "HEAD": 1 1 1 1 1 1 1 1 1 1 ...
 $ Token.Count : int 3 2 4 1 4 2 2 2 2 1 ...
– datayoda 2010-12-10 22:51:07
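For anyone who wants to time an approach without the real file, a synthetic data set with the same three input columns could be generated roughly as follows. This is only a sketch: `make_fake_keywords` is a hypothetical name, and the vocabulary, the exponential cost distribution, and the 1 to 5 tokens per keyword are all made-up assumptions, not properties of the actual data.

```r
# Hypothetical generator; vocabulary, cost distribution and sizes are invented.
make_fake_keywords <- function(n, seed = 1) {
  set.seed(seed)
  vocab <- c("north", "face", "outlet", "kinect", "sensor", "x", "box", "360", "250")
  # Each keyword gets between 1 and 5 space-separated tokens
  n_tokens <- sample(1:5, n, replace = TRUE)
  keyword <- vapply(n_tokens,
                    function(k) paste(sample(vocab, k, replace = TRUE), collapse = " "),
                    character(1))
  # Costs sorted from largest to smallest, so CumuCost rises from near 0 to 1
  cost <- sort(rexp(n, rate = 1 / 50), decreasing = TRUE)
  data.frame(CumuCost = cumsum(cost) / sum(cost),
             Cost = cost,
             Keyword.text = keyword,
             stringsAsFactors = FALSE)
}

fake <- make_fake_keywords(1e5)
str(fake)

# Time any candidate, for example a base-R split on whitespace/punctuation
system.time(lengths(strsplit(fake$Keyword.text, "[[:space:][:punct:]]+")))
```

Raising `n` toward a million gives a rough idea of how a candidate scales before it is run on the real report.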