2010-04-11 46 views
5

Parsing SRT subtitles

I want to parse SRT subtitles:

    1
    00:00:12,815 --> 00:00:14,509 
    Chlapi, jak to jde s 
    těma pracovníma světlama?. 

    2 
    00:00:14,815 --> 00:00:16,498 
    Trochu je zesilujeme. 

    3 
    00:00:16,934 --> 00:00:17,814 
    Jo, sleduj. 

with each entry going into a structure, using one of these regexes:

A:

RE_ITEM = re.compile(r'(?P<index>\d+).' 
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> ' 
    r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).' 
    r'(?P<text>.*?)', re.DOTALL) 

B:

RE_ITEM = re.compile(r'(?P<index>\d+).' 
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> ' 
    r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).' 
    r'(?P<text>.*)', re.DOTALL) 

and this code:

for i in Subtitles.RE_ITEM.finditer(text): 
    result.append((i.group('index'), i.group('start'), 
      i.group('end'), i.group('text'))) 

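With regex B I get only one item in the list (because the `.*` is greedy), and with regex A every 'text' is empty (because the non-greedy `.*?` has nothing after it that forces it to consume anything).

How can I fix this?

Thanks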
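
The behaviour described in the question can be reproduced with a short script. This is only a sketch: `RE_A` and `RE_B` are the two patterns from the question under shorter names, and `HEAD`, `TIMESTAMP`, `texts_a` and `texts_b` are names made up for the illustration.

```python
import re

TIMESTAMP = r'\d{2}:\d{2}:\d{2},\d{3}'
# Shared prefix of both patterns: index, start and end timestamps.
HEAD = r'(?P<index>\d+).(?P<start>%s) --> (?P<end>%s).' % (TIMESTAMP, TIMESTAMP)

RE_A = re.compile(HEAD + r'(?P<text>.*?)', re.DOTALL)  # non-greedy text
RE_B = re.compile(HEAD + r'(?P<text>.*)', re.DOTALL)   # greedy text

text = (
    "1\n00:00:12,815 --> 00:00:14,509\nChlapi, jak to jde s\n"
    "těma pracovníma světlama?.\n\n"
    "2\n00:00:14,815 --> 00:00:16,498\nTrochu je zesilujeme.\n\n"
    "3\n00:00:16,934 --> 00:00:17,814\nJo, sleduj.\n"
)

# A finds every entry, but '.*?' at the very end of a pattern matches nothing.
texts_a = [m.group('text') for m in RE_A.finditer(text)]
# B finds one entry whose text swallows the rest of the file.
texts_b = [m.group('text') for m in RE_B.finditer(text)]

print(len(texts_a), texts_a)
print(len(texts_b))
```

A non-greedy group needs something after it in the pattern to say where the text stops; otherwise the shortest possible match is the empty string.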

+0

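When I read this, I realized I had done this before but couldn't remember how. It turns out I did it procedurally rather than with a regular expression. The regular expression is quite elegant. If you're interested, the Python class I used for SRT subtitles is at https://svn.jaraco.com/jaraco/python/jaraco.media/jaraco/media/srt.py (note that the 'grouper' it imports from jaraco.util is just the 'grouper' from the itertools documentation). – 2010-04-11 12:59:12

Answers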

4

The text is followed by an empty line or the end of the file. So you can use:

r' .... (?P<text>.*?)(\n\n|$)' 
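
Spelled out in full, the idea might look like the sketch below. It assumes entries are separated by a blank line (which may contain whitespace) and that the file may or may not end with a newline. The `(?:\n\s*\n|\s*\Z)` terminator is a variation on the one above, not the exact pattern from this answer, and `items` is a name made up for the illustration.

```python
import re

# Regex A from the question, with a terminator after the non-greedy text
# group: either a blank line or the end of the input.
RE_ITEM = re.compile(
    r'(?P<index>\d+)\s*'
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
    r'(?P<end>\d{2}:\d{2}:\d{2},\d{3})[ \t]*\n'
    r'(?P<text>.*?)(?:\n\s*\n|\s*\Z)',
    re.DOTALL)

text = (
    "1\n00:00:12,815 --> 00:00:14,509\nChlapi, jak to jde s\n"
    "těma pracovníma světlama?.\n\n"
    "2\n00:00:14,815 --> 00:00:16,498\nTrochu je zesilujeme.\n\n"
    "3\n00:00:16,934 --> 00:00:17,814\nJo, sleduj.\n"
)

# One (index, start, end, text) tuple per subtitle entry.
items = [m.group('index', 'start', 'end', 'text')
         for m in RE_ITEM.finditer(text)]

for item in items:
    print(item)
```

Because the terminator is part of the match, each search resumes at the next index line, and multi-line subtitle text stays in one group.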
+0

+1, clean. To allow for whitespace on the blank line, you could use r'....(?P<text>.*?)\n\s*\n' instead. – 2010-04-11 11:46:21

1

Here's some code I had lying around for parsing SRT files:

from __future__ import division 

import datetime 

class Srt_entry(object): 
    def __init__(self, lines): 
     def parsetime(string): 
      hours, minutes, seconds = string.split(u':') 
      hours = int(hours) 
      minutes = int(minutes) 
      seconds = float(u'.'.join(seconds.split(u','))) 
      return datetime.timedelta(0, seconds, 0, 0, minutes, hours) 
     self.index = int(lines[0]) 
     start, arrow, end = lines[1].split() 
     self.start = parsetime(start) 
     if arrow != u"-->": 
      raise ValueError 
     self.end = parsetime(end) 
     self.lines = lines[2:] 
     if not self.lines[-1]: 
      del self.lines[-1] 
    def __unicode__(self): 
     def delta_to_string(d): 
      hours = (d.days * 24) \ 
        + (d.seconds // (60 * 60)) 
      minutes = (d.seconds // 60) % 60 
      seconds = d.seconds % 60 + d.microseconds/1000000 
      return u','.join((u"%02d:%02d:%06.3f" 
           % (hours, minutes, seconds)).split(u'.')) 
     return (unicode(self.index) + u'\n' 
       + delta_to_string(self.start) 
       + ' --> ' 
       + delta_to_string(self.end) + u'\n' 
       + u''.join(self.lines)) 


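# `options.decode` is not defined in this snippet; it is assumed to hold the
# file's encoding name (for example from an option parser), or None.
srt_file = open("foo.srt")
entries = []
entry = []
for line in srt_file:
    if options.decode:
        line = line.decode(options.decode)
    if line == u'\n':
        if entry:  # ignore repeated blank lines
            entries.append(Srt_entry(entry))
        entry = []
    else:
        entry.append(line)
if entry:  # last entry when the file does not end with a blank line
    entries.append(Srt_entry(entry))
srt_file.close()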
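
The code above targets Python 2 (`unicode`, `str.decode`). On Python 3 the same line-by-line approach might look like the sketch below. It is not the answerer's code: `parse_time`, `build_entry`, `parse_entries` and `demo_lines` are hypothetical names, and entries come back as plain tuples rather than `Srt_entry` objects.

```python
import datetime


def parse_time(stamp):
    """Convert 'HH:MM:SS,mmm' to a datetime.timedelta."""
    hours, minutes, seconds = stamp.split(':')
    secs, millis = seconds.split(',')
    return datetime.timedelta(hours=int(hours), minutes=int(minutes),
                              seconds=int(secs), milliseconds=int(millis))


def build_entry(block):
    """Turn the lines of one SRT block into (index, start, end, text)."""
    start, arrow, end = block[1].split()
    if arrow != '-->':
        raise ValueError('malformed timing line: %r' % block[1])
    return int(block[0]), parse_time(start), parse_time(end), '\n'.join(block[2:])


def parse_entries(lines):
    """Yield one tuple per entry from an iterable of lines (e.g. an open file)."""
    block = []
    for line in lines:
        line = line.rstrip('\r\n')
        if line.strip():
            block.append(line)
        elif block:  # a blank line closes the current block
            yield build_entry(block)
            block = []
    if block:  # the input did not end with a blank line
        yield build_entry(block)


demo_lines = [
    "1\n", "00:00:12,815 --> 00:00:14,509\n",
    "Chlapi, jak to jde s\n", "těma pracovníma světlama?.\n", "\n",
    "2\n", "00:00:14,815 --> 00:00:16,498\n", "Trochu je zesilujeme.\n", "\n",
    "3\n", "00:00:16,934 --> 00:00:17,814\n", "Jo, sleduj.\n",
]
entries = list(parse_entries(demo_lines))
for entry in entries:
    print(entry)
```

With a real file, pass the file object opened with an explicit encoding, e.g. `open("foo.srt", encoding="utf-8")`, in place of `demo_lines`.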
16

Why not use pysrt?

+0

I don't find it very well documented. – 2010-04-11 12:26:34

+0

Better to link to the latest version: http://pypi.python.org/pypi/pysrt – daphshez 2011-08-11 05:49:21

+0

Thanks, @Daphna! – 2011-08-11 06:03:25

1
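import re

# Split the file into blocks on blank lines, then match each block on its own.
splits = [s.strip() for s in re.split(r'\n\s*\n', text) if s.strip()]
regex = re.compile(r'''(?P<index>\d+).*?(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3})\s*.*?\s*(?P<text>.*)''', re.DOTALL)
for s in splits:
    r = regex.search(s)
    if r:  # skip blocks that are not valid SRT entries
        print(r.groups())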
1

Here's a snippet I wrote that converts an SRT file into dictionaries:

import re 
def srt_time_to_seconds(time): 
    split_time=time.split(',') 
    major, minor = (split_time[0].split(':'), split_time[1]) 
    # hours -> 3600 s, minutes -> 60 s, plus seconds and milliseconds
    return int(major[0])*3600 + int(major[1])*60 + int(major[2]) + float(minor)/1000

def srt_to_dict(srtText): 
    subs=[] 
    for s in re.sub('\r\n', '\n', srtText).split('\n\n'): 
     st = s.split('\n') 
     if len(st)>=3: 
      split = st[1].split(' --> ') 
      subs.append({'start': srt_time_to_seconds(split[0].strip()), 
         'end': srt_time_to_seconds(split[1].strip()), 
         'text': '<br />'.join(j for j in st[2:len(st)]) 
         }) 
    return subs 

Usage:

# Assumes the snippet above is saved as srt_to_dict.py.
from srt_to_dict import srt_to_dict
with open('test.srt', "r") as f:
    srtText = f.read()
print(srt_to_dict(srtText))
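
The timestamp arithmetic is easy to check by hand: `HH:MM:SS,mmm` is hours × 3600 + minutes × 60 + seconds + milliseconds / 1000. A minimal standalone sketch of that conversion, where `timestamp_to_seconds` is a hypothetical name and not part of the snippet above:

```python
def timestamp_to_seconds(stamp):
    """Convert an SRT timestamp 'HH:MM:SS,mmm' to seconds as a float."""
    clock, millis = stamp.split(',')
    hours, minutes, seconds = (int(part) for part in clock.split(':'))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000.0


# 1 h + 2 min + 3.5 s = 3600 + 120 + 3.5
print(timestamp_to_seconds('01:02:03,500'))  # 3723.5
```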
2

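I became quite frustrated with the SRT libraries available for Python (often because they were heavyweight and eschewed language-standard types in favour of custom classes), so I've spent the last year or so working on my own srt library. You can find it at https://github.com/cdown/srt.

I tried to keep it simple and light on classes (except for the core Subtitle class, which more or less just stores the SRT block data). It can read and write SRT files, and turn noncompliant SRT files into compliant ones.

Here's a usage example with your sample input: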

>>> import srt, pprint 
>>> gen = srt.parse('''\ 
... 1 
... 00:00:12,815 --> 00:00:14,509 
... Chlapi, jak to jde s 
... těma pracovníma světlama?. 
... 
... 2 
... 00:00:14,815 --> 00:00:16,498 
... Trochu je zesilujeme. 
... 
... 3 
... 00:00:16,934 --> 00:00:17,814 
... Jo, sleduj. 
... 
... ''') 
>>> pprint.pprint(list(gen)) 
[Subtitle(start=datetime.timedelta(0, 12, 815000), end=datetime.timedelta(0, 14, 509000), index=1, proprietary='', content='Chlapi, jak to jde s\ntěma pracovníma světlama?.'), 
Subtitle(start=datetime.timedelta(0, 14, 815000), end=datetime.timedelta(0, 16, 498000), index=2, proprietary='', content='Trochu je zesilujeme.'), 
Subtitle(start=datetime.timedelta(0, 16, 934000), end=datetime.timedelta(0, 17, 814000), index=3, proprietary='', content='Jo, sleduj.')] 
+0

This is perfect! – Mogsdad 2016-05-06 01:39:12