2012-12-05 79 views
13

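Splitting a string with indices in Python

I am looking for a Pythonic way to split a sentence into words while also storing the index information of every word in the sentence. For example: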

a = "This is a sentence" 
b = a.split() # ["This", "is", "a", "sentence"] 

Now I also want to store the index information of all the words:

c = a.splitWithIndices() #[(0,3), (5,6), (8,8), (10,17)] 

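What is the best way to implement splitWithIndices()? Does Python have a library method I can use for this? Any method that helps me compute the word indices would be great.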

'a.index(x)' returns the index of 'x'. That could be used here. – Whymarrh
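A minimal sketch of how that suggestion could work, assuming the words are separated by whitespace. A bare 'a.index(x)' always reports the first occurrence, so a repeated word would get the same indices twice; passing a start offset to 'str.find' is one way around that. The name split_with_find is made up for illustration and is not a built-in method.

```python
def split_with_find(s):
    """Return inclusive (start, end) index pairs for each whitespace-separated word."""
    spans = []
    pos = 0  # resume searching here so repeated words are matched in order
    for word in s.split():
        start = s.find(word, pos)
        end = start + len(word) - 1  # inclusive end, as asked for in the question
        spans.append((start, end))
        pos = end + 1
    return spans

print(split_with_find("This is a sentence"))  # [(0, 3), (5, 6), (8, 8), (10, 17)]
print(split_with_find("a b a"))               # [(0, 0), (2, 2), (4, 4)]
```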

Answers

8

I think it is more natural to return the start and end of the corresponding slice, e.g. (0, 4) instead of (0, 3).

>>> from itertools import groupby
>>> def splitWithIndices(s, c=' '):
...     p = 0
...     for k, g in groupby(s, lambda x: x == c):
...         q = p + sum(1 for i in g)
...         if not k:
...             yield p, q  # or p, q-1 if you are really sure you want that
...         p = q
...
>>> a = "This is a sentence" 
>>> list(splitWithIndices(a)) 
[(0, 4), (5, 7), (8, 9), (10, 18)] 

>>> a[0:4] 
'This' 
>>> a[5:7] 
'is' 
>>> a[8:9] 
'a' 
>>> a[10:18] 
'sentence' 
17

Here is a method using regular expressions:

>>> import re 
>>> a = "This is a sentence" 
>>> matches = [(m.group(0), (m.start(), m.end()-1)) for m in re.finditer(r'\S+', a)] 
>>> matches 
[('This', (0, 3)), ('is', (5, 6)), ('a', (8, 8)), ('sentence', (10, 17))] 
>>> b, c = zip(*matches) 
>>> b 
('This', 'is', 'a', 'sentence') 
>>> c 
((0, 3), (5, 6), (8, 8), (10, 17)) 

As a one-liner:

b, c = zip(*[(m.group(0), (m.start(), m.end()-1)) for m in re.finditer(r'\S+', a)]) 

If you only want the indices:

c = [(m.start(), m.end()-1) for m in re.finditer(r'\S+', a)] 
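If half-open slice bounds are preferred over inclusive end indices, the same regex approach can probably be shortened with 'm.span()', which returns '(m.start(), m.end())' directly. This is only a sketch; the function name word_spans is an assumption made for illustration.

```python
import re

def word_spans(s):
    """Return half-open (start, end) spans for each run of non-whitespace characters."""
    return [m.span() for m in re.finditer(r'\S+', s)]

a = "This is a sentence"
spans = word_spans(a)
print(spans)                         # [(0, 4), (5, 7), (8, 9), (10, 18)]
print([a[i:j] for i, j in spans])    # ['This', 'is', 'a', 'sentence']
```

With these spans, 'a[start:end]' gives back each word without any off-by-one adjustment.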

@f-j What does '*matches' mean here? Thanks. – zfz

This is called [unpacking argument lists](http://docs.python.org/2/tutorial/controlflow.html#unpacking-argument-lists), or the splat operator. Basically, 'foo(*[a,b])' is equivalent to 'foo(a,b)'. –
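To make the unpacking concrete, here is a small sketch of what 'zip(*matches)' does with a list of (word, span) pairs. The variable names pairs, words and spans are chosen for illustration only.

```python
pairs = [('This', (0, 3)), ('is', (5, 6)), ('a', (8, 8))]

# zip(*pairs) is the same call as zip(pairs[0], pairs[1], pairs[2]):
# each pair becomes one positional argument, and zip regroups them by position.
words, spans = zip(*pairs)
explicit = tuple(zip(pairs[0], pairs[1], pairs[2]))

print(words)                        # ('This', 'is', 'a')
print(spans)                        # ((0, 3), (5, 6), (8, 8))
print((words, spans) == explicit)   # True
```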