查找多個常見起始字符串

我有一個字符串列表，其中一個或多個字符串子集具有共同的起始字符串。我想要一個將輸入的字符串原始列表作爲輸入並返回所有常用起始字符串的列表的函數。在我的特殊情況下，我也知道每個公共前綴必須以給定的分隔符結束。下面是輸入數據的我說的是該類型的示例（忽略任何顏色高亮）：查找多個常見起始字符串

Population of metro area/Portland 
Population of city/Portland 
Population of metro area/San Francisco 
Population of city/San Francisco 
Population of metro area/Seattle 
Population of city/Seattle

在這裏，定界符是/和公用開始字符串是Population of metro area和Population of city。也許分隔符最終並不重要，但我已經強調，我不想只返回一個結果，即通用的共同起始字符串Population of;我也不想要普通的子串Population of metro area/S和Population of city/S。

該算法的最終用途是將字符串按其通用前綴分組。舉例來說，上面的列表可以改製爲消除冗餘信息，像這樣一個層次結構：

Population of metro area 
    Portland 
    San Francisco 
    Seattle 
Population of city 
    Portland 
    San Francisco 
    Seattle

我在任何語言中使用Python，但僞代碼就可以了。

編輯正如湯姆·安德森指出，可以很容易地減少給原來的問題簡單地分割字符串，並使用散列組由前綴。我原本以爲這個問題可能會更復雜，因爲在實踐中我有時會遇到帶有嵌入分隔符的前綴，但我意識到這也可以通過簡單地做一個僅限於分割一次的正確分割來解決。

來源

2011-10-21 Abiel

您如何處理「San Franc伊斯科「和」聖安東尼奧「在你的分組？ – retracile

是不是隻是在字符串上循環，在分隔符上分割它們，然後將第二部分按前半部分分組？像這樣：

def groupByPrefix(strings): 
    stringsByPrefix = {} 
    for string in strings: 
      prefix, suffix = map(str.strip, string.split("/", 1)) 
      group = stringsByPrefix.setdefault(prefix, []) 
      group.append(suffix) 
    return stringsByPrefix

一般來說，如果你正在尋找串prefices，該解決方案將是字符串whop爲trie。任何具有多個子項的分支節點都是最大公共前綴。但是你的需要比這個更受限制。

來源

2011-10-21 16:29:57

這是爲什麼通過'itertools.groupby'？ –

男人，誰接受了'itertools.groupby'？ –

d = collections.defaultdict(list) 

for place, name in ((i.strip() for i in line.split('/')) 
        for line in text.splitlines()): 
    d[place].append(name)

所以d將是一個字典，如：

{'Population of city': 
     ['Portland', 
     'San Francisco', 
     'Seattle'], 
'Population of metro area': 
     ['Portland', 
     'San Francisco', 
     'Seattle']}

您可以通過line.split('/')代替(i.strip() for i in line.split('/')如果你知道有你的周圍沒有文字多餘的空格。

來源

2011-10-21 16:36:48 JBernardo

這不是很一般，但可能你需要什麼：

def commons(strings): 
    return set(s.split('/')[0] for s in strings)

，並避免了對分組數據回去：

def group(strings): 
    groups = {} 
    for s in strings: 
     prefix, remainder = s.split('/', 1) 
     groups.setdefault(prefix, []).append(remainder) 
    return groups

來源

2011-10-21 16:39:10 retracile

使用csv.reader和itertools.groupby，對待'/'作爲分隔符，並由第一列組成：

for key, group in groupby(sorted(reader(inp, delimiter='/')), key=lambda x: x[0]): 
    print key 
    for line in group: 
     print "\t", line[1]

來源

2011-10-21 17:06:35

查找多個常見起始字符串

回答

相關問題