2017-05-04 89 views
3

How to find the positions of a list of substrings in a string?

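Given a string:

"The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."

And a list of substrings:

['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

Desired output: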

>>> s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday." 
>>> tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.'] 
>>> find_offsets(tokens, s) 
[(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23), (24, 34), 
     (34, 35), (36, 43), (44, 46), (47, 52), (52, 54), (55, 60), (61, 67), 
     (68, 72), (73, 75), (76, 83), (84, 89), (90, 98), (99, 103), (104, 109), 
     (110, 119), (120, 122), (123, 131), (131, 132)] 

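To explain the output: each tuple is a pair of (start, end) indices into the string s. The first substring, "The", for example, is recovered by the slice s[0:3].

So if we iterate over all the integer tuples in the desired output (called out here), we get the list of substrings back: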

>>> [s[start:end] for start, end in out] 
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.'] 

I have tried:

def find_offset(tokens, s):
    index = 0
    offsets = []
    for token in tokens:
        start = s[index:].index(token) + index
        index = start + len(token)
        offsets.append((start, index))
    return offsets

Is there another way to find the positions of a list of substrings in a string?

Answers

1

If we know nothing about the substrings, there is no way around rescanning the whole text for each of them.
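As a rough illustration of that general case, here is a minimal sketch that scans the whole text once per substring and collects every occurrence. The name `all_occurrences` is hypothetical and not part of the question; the sketch assumes overlapping matches should be reported and that empty substrings are skipped.

```python
def all_occurrences(text, substrings):
    """Map each substring to every (start, end) span where it occurs in text."""
    positions = {}
    for sub in substrings:
        spans = []
        # An empty substring matches everywhere, so skip it (an assumption).
        start = text.find(sub) if sub else -1
        while start != -1:
            spans.append((start, start + len(sub)))
            # Resume one character later so overlapping matches are kept.
            start = text.find(sub, start + 1)
        positions[sub] = spans
    return positions


print(all_occurrences('The plane, plane', ['plane', ',', 'x']))
```

Each substring triggers its own pass over the text, which is the cost this answer is pointing at.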

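If, as the data suggests, we know that these are consecutive fragments of the text, given in text order, it is easy to scan only the rest of the text after each match. There is no point in slicing the text every time, though; passing a start position to .index avoids the copy.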

def spans(text, fragments):
    result = []
    point = 0  # Where we are in the text.
    for fragment in fragments:
        found_start = text.index(fragment, point)
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result

Test:

>>> spans('foo in bar', ['foo', 'in', 'bar']) 
[(0, 3), (4, 6), (7, 10)] 

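This assumes that every fragment is present in the text, in the right place. Your output format gives no example of how a mismatch should be reported. Using .find instead of .index can help, though only partially.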
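A minimal sketch of the .find variant, assuming a missing fragment should be reported as None (the question does not specify a mismatch format, so both that choice and the name `spans_or_none` are hypothetical):

```python
def spans_or_none(text, fragments):
    """Like spans(), but report None for a fragment that is not found."""
    result = []
    point = 0  # Where we are in the text.
    for fragment in fragments:
        found_start = text.find(fragment, point)
        if found_start == -1:
            # Not found after the current position: record the miss
            # and leave the scan position unchanged.
            result.append(None)
            continue
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result


print(spans_or_none('foo in bar', ['foo', 'xx', 'bar']))
# [(0, 3), None, (7, 10)]
print(spans_or_none('foo in bar', ['bar', 'foo']))
# [(7, 10), None]
```

The second call shows why the help is only partial: a fragment given out of order still matches wherever it next appears, so the scan jumps ahead and the fragments it skipped are then reported as missing.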

4

Solution one:

# Use a list comprehension and str.index (s is the string, t the token list).
[(s.index(e), s.index(e) + len(e)) for e in t]

Solution two, which corrects the problem with the first solution:

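def find_offsets(tokens, s):
    tid = [list(e) for e in tokens]
    i = 0  # Current position in s.
    for id_token, token in enumerate(tid):
        # Advance to the next character equal to the token's first character.
        # Note: only the first character is compared, not the whole token.
        while token[0] != s[i]:
            i += 1
        tid[id_token] = (i, i + len(token))
        i += len(token)

    return tid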


find_offsets(tokens, s) 
Out[201]: 
[(0, 3), 
(4, 9), 
(9, 10), 
(11, 16), 
(17, 20), 
(21, 23), 
(24, 34), 
(34, 35), 
(36, 43), 
(44, 46), 
(47, 52), 
(52, 54), 
(55, 60), 
(61, 67), 
(68, 72), 
(73, 75), 
(76, 83), 
(84, 89), 
(90, 98), 
(99, 103), 
(104, 109), 
(110, 119), 
(120, 122), 
(123, 131), 
(131, 132)] 

#another test 
s = 'The plane, plane' 
t = ['The', 'plane', ',', 'plane'] 
find_offsets(t,s) 
Out[212]: [(0, 3), (4, 9), (9, 10), (11, 16)] 
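One caveat about find_offsets above: it compares only the first character of each token against the string. With s = 'a pet plane' and tokens ['a', 'plane'] it returns (2, 7) for 'plane', which is the slice 'pet p'. A minimal sketch of a stricter variant that matches the whole token with str.startswith follows; the name `find_offsets_strict` and the choice to raise ValueError on a missing token are assumptions, not part of the original answer.

```python
def find_offsets_strict(tokens, s):
    """Return (start, end) spans, matching each whole token in order."""
    offsets = []
    i = 0  # Current position in s.
    for token in tokens:
        search_from = i
        # Advance until the entire token matches at position i.
        while not s.startswith(token, i):
            if i >= len(s):
                raise ValueError(
                    f"token {token!r} not found at or after index {search_from}"
                )
            i += 1
        offsets.append((i, i + len(token)))
        i += len(token)
    return offsets


print(find_offsets_strict(['a', 'plane'], 'a pet plane'))
# [(0, 1), (6, 11)]
```

This behaves like calling s.index(token, i) for each token, written out as an explicit loop to stay close to the structure of the answer.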
+1

Nicely short, and cheerfully inefficient, calling `.index()` twice. – 9000

+0

Also, this won't work properly if there are repeated words. `.index()` always returns only the first instance =( – alvas

+0

Try `s = 'The plane, plane'; t = ['The', 'plane', ',', 'plane']` – alvas

1
import re 

s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday." 
tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.'] 


for token in tokens: 
    pattern = re.compile(re.escape(token)) 
    print(pattern.search(s).span()) 

RESULT

(0, 3) 
(4, 9) 
(9, 10) 
(11, 16) 
(17, 20) 
(21, 23) 
(24, 34) 
(9, 10) 
(36, 43) 
(44, 46) 
(47, 52) 
(52, 54) 
(55, 60) 
(61, 67) 
(68, 72) 
(73, 75) 
(76, 83) 
(84, 89) 
(90, 98) 
(99, 103) 
(104, 109) 
(110, 119) 
(120, 122) 
(123, 131) 
(131, 132) 
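Note that the second comma is reported as (9, 10) rather than (34, 35), because every search starts from the beginning of the string. A minimal sketch that carries the position forward through the `pos` argument of the compiled pattern's `search` method could look like this; the name `regex_offsets` and the ValueError on a missing token are assumptions for illustration.

```python
import re


def regex_offsets(tokens, s):
    """Return the span of each token, searching left to right through s."""
    offsets = []
    pos = 0  # Search resumes where the previous match ended.
    for token in tokens:
        pattern = re.compile(re.escape(token))
        match = pattern.search(s, pos)
        if match is None:
            raise ValueError(f"token {token!r} not found at or after index {pos}")
        offsets.append(match.span())
        pos = match.end()
    return offsets


print(regex_offsets(['The', 'plane', ',', 'plane'], 'The plane, plane'))
# [(0, 3), (4, 9), (9, 10), (11, 16)]
```

Since every token is escaped to a literal, the regular expression adds nothing over str.index with a start position here; it only becomes useful if tokens need to match as patterns.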