2016-04-10 23 views
1

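R: Extracting times from an SRT (subtitle) file

I need to calculate the speech rate for each subtitle line. The content of the SRT (subtitle) file looks like this: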

1 
00:00:19,000 --> 00:00:21,989 
I'm Annita McVeigh and welcome to Election Today where we'll bring you 

2 
00:00:22,000 --> 00:00:23,989 
the latest from the campaign trail, plus debate and analysis. 

3 
00:00:24,000 --> 00:00:28,989 
The Liberal Democrats promise to protect the pay of millions 

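For example, it takes 4 seconds 989 milliseconds to say the 10 words "The Liberal Democrats promise to protect the pay of millions". The average speech rate for these 10 words is therefore 498.9 milliseconds per word.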
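As a quick check of that arithmetic in R (the variable names below are only illustrative, and the times are typed in by hand as seconds):

```r
# Subtitle 3 runs from 00:00:24,000 to 00:00:28,989
start_sec <- 24.000
end_sec   <- 28.989
n_words   <- 10

# duration in milliseconds, then milliseconds per word
duration_ms <- (end_sec - start_sec) * 1000
ms_per_word <- duration_ms / n_words

print(round(ms_per_word, 1))
```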

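How can I read the SRT file so that I get a data frame with startTime, endTime, textString and wordCount as columns and one row per subtitle entry, like the one below?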

startTime<-c("00:00:19,000", "00:00:22,000", "00:00:24,000") 

endTime<-c("00:00:21,989", "00:00:23,989", "00:00:28,989") 

textString<-c("I'm Annita McVeigh and welcome to Election Today where we'll bring you", "the latest from the campaign trail, plus debate and analysis.", "The Liberal Democrats promise to protect the pay of millions") 

wordCount<-c(12,10,10) 

rate.df<-data.frame(startTime, endTime, textString, wordCount) 

How do I subtract startTime from endTime in R when the times are given in the form hour:minute:second,millisecond?

+0

I managed to do this in MS Excel, but I have too much data to handle it there. – Ninjacat

Answer

2

Here is a possible solution (the code is pretty self-explanatory):

text=" 

1 
00:00:19,000 --> 00:00:21,989 
I'm Annita McVeigh and welcome to Election Today where we'll bring you 

2 
00:00:22,000 --> 00:00:23,989 
the latest from the campaign trail, 
plus debate 
and analysis. 



3 
00:00:24,000 --> 00:00:28,989 
The Liberal Democrats promise to protect 
the pay of millions" 

con<-textConnection(text) 
lines <- readLines(con) 

# the previous lines of code are just there to replicate your case; 
# in the real case, replace them with the following single line: 
# lines <- readLines(srtFileName) 

listOfEntries <- 
lapply(split(seq_along(lines),cumsum(grepl("^\\s*$",lines))),function(blockIdx){ 
    block <- lines[blockIdx] 
    # drop blank lines inside the block 
    block <- block[!grepl("^\\s*$",block)] 
    if(length(block) == 0){ 
        return(NULL) 
    } 
    if(length(block) < 3){ 
        warning("a block not respecting srt standards has been found") 
    } 
    # block[-(1:2)] is empty (not NA) when the entry has no text lines 
    return(data.frame(id=block[1], 
                      times=block[2], 
                      textString=paste0(block[-(1:2)],collapse="\n"), 
                      stringsAsFactors = FALSE)) 
}) 
m <- do.call(rbind,listOfEntries) 


# split start and end times 
tmp <- do.call(rbind,strsplit(m[,'times'],' --> ')) 
m$startTime <- tmp[,1] 
m$endTime <- tmp[,2] 

# parse times 
tmp <- do.call(rbind,lapply(strsplit(m$startTime,':|,'),as.numeric)) 
m$fromSeconds <- tmp %*% c(60*60,60,1,1/1000) 

tmp <- do.call(rbind,lapply(strsplit(m$endTime,':|,'),as.numeric)) 
m$toSeconds <- tmp %*% c(60*60,60,1,1/1000) 

# compute time difference in seconds 
m$timeDiffInSecs <- m$toSeconds - m$fromSeconds 

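# word count: number of runs of non-word characters, plus one 
# (note that apostrophes and trailing punctuation each add to the count) 
m$wordCount <- vapply(gregexpr("\\W+",m$textString),length,0) + 1 

# or, if you consider "I'm" a single word, you can remove the occurrences of ', e.g.: 
#m$wordCount <- vapply(gregexpr("\\W+",gsub("'","",m$textString)),length,0) + 1 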

m$millisecsPerWord <- m$timeDiffInSecs * 1000/m$wordCount 

Result:

> m 
  id                         times                                                             textString
2  1 00:00:19,000 --> 00:00:21,989 I'm Annita McVeigh and welcome to Election Today where we'll bring you
3  2 00:00:22,000 --> 00:00:23,989      the latest from the campaign trail, \nplus debate \nand analysis.
6  3 00:00:24,000 --> 00:00:28,989         The Liberal Democrats promise to protect \nthe pay of millions
     startTime      endTime fromSeconds toSeconds timeDiffInSecs wordCount millisecsPerWord
2 00:00:19,000 00:00:21,989          19    21.989          2.989        14         213.5000
3 00:00:22,000 00:00:23,989          22    23.989          1.989        11         180.8182
6 00:00:24,000 00:00:28,989          24    28.989          4.989        10         498.9000
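
The word counts of 14 and 11 in the first two rows are higher than the 12 and 10 given in the question, because `\W+` also matches the apostrophes in "I'm" and "we'll" and the trailing full stop. If that matters, one alternative is to count runs of word characters instead of the separators. The sketch below is only one way to do it; `srt_time_to_seconds` and `count_words` are hypothetical helper names, not part of any package, and treating letters, digits and apostrophes as "word characters" is an assumption you may want to adjust.

```r
# Hypothetical helper: convert "HH:MM:SS,mmm" strings to seconds
srt_time_to_seconds <- function(x) {
  parts <- strsplit(trimws(x), "[:,]")
  vapply(parts,
         function(p) sum(as.numeric(p) * c(3600, 60, 1, 1/1000)),
         numeric(1))
}

# Hypothetical helper: count words as runs of letters, digits and
# apostrophes, so "I'm" is one word and a trailing "." adds nothing
count_words <- function(x) {
  matches <- gregexpr("[[:alnum:]']+", x)
  # gregexpr returns -1 when there is no match, hence the m > 0 test
  vapply(matches, function(m) sum(m > 0), integer(1))
}

# usage on the third subtitle
start <- srt_time_to_seconds("00:00:24,000")
end   <- srt_time_to_seconds("00:00:28,989")
words <- count_words("The Liberal Democrats promise to protect \nthe pay of millions")
print((end - start) * 1000 / words)   # approximately 498.9
```

With these helpers, `m$wordCount <- count_words(m$textString)` could replace the `gregexpr("\\W+", ...)` line above.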
+1

Oh, awesome! Thank you so much, digEmAll! The code is beautiful! – Ninjacat

+0

Many thanks, @digemall – Ninjacat