Mark the dynamic substring in a list of strings

Suppose I have these two sets of strings:

file=sheet-2016-12-08.xlsx 
file=sheet-2016-11-21.xlsx 
file=sheet-2016-11-12.xlsx 
file=sheet-2016-11-08.xlsx 
file=sheet-2016-10-22.xlsx 
file=sheet-2016-09-29.xlsx 
file=sheet-2016-09-05.xlsx 
file=sheet-2016-09-04.xlsx 

size=1024KB 
size=22KB 
size=980KB 
size=15019KB 
size=202KB

I need to run a function on each of these two sets separately and get the following output:

file=sheet-2016-*.xlsx 

size=*KB

The dataset can be any set of strings. It does not have to match the format above. Here is another example:

id.4030.paid 
id.1280.paid 
id.88.paid

for which the expected output is:

id.*.paid

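Basically, I need a function that analyzes a set of strings and replaces the substring they do not have in common with an asterisk (*).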

Source

2017-08-25 HyderA

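You can use os.path.commonprefix to compute the common prefix. It is meant for computing the shared directory in a list of file paths, but it compares the strings character by character, so it can be used in a generic context.

Then reverse the strings, apply the common prefix again, and reverse the result to get the common suffix (adapted from https://gist.github.com/willwest/ca5d050fdf15232a9e67):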

dataset = """id.4030.paid 
id.1280.paid 
id.88.paid""".splitlines() 

import os 


# Return the longest common suffix in a list of strings 
def longest_common_suffix(list_of_strings): 
    reversed_strings = [s[::-1] for s in list_of_strings] 
    return os.path.commonprefix(reversed_strings)[::-1] 

common_prefix = os.path.commonprefix(dataset) 
common_suffix = longest_common_suffix(dataset) 

print("{}*{}".format(common_prefix,common_suffix))

Result:

id.*.paid
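To check the same prefix/suffix approach against the two sets from the question, here is a minimal sketch. The helper name naive_pattern is hypothetical and only bundles the two steps shown above; it assumes the strings are not all identical.

```python
import os


def naive_pattern(strings):
    # Hypothetical helper: common prefix + "*" + common suffix.
    # Common prefix, compared character by character.
    prefix = os.path.commonprefix(strings)
    # Common suffix: common prefix of the reversed strings, reversed back.
    suffix = os.path.commonprefix([s[::-1] for s in strings])[::-1]
    return "{}*{}".format(prefix, suffix)


files = [
    "file=sheet-2016-12-08.xlsx",
    "file=sheet-2016-11-21.xlsx",
    "file=sheet-2016-11-12.xlsx",
    "file=sheet-2016-11-08.xlsx",
    "file=sheet-2016-10-22.xlsx",
    "file=sheet-2016-09-29.xlsx",
    "file=sheet-2016-09-05.xlsx",
    "file=sheet-2016-09-04.xlsx",
]
sizes = ["size=1024KB", "size=22KB", "size=980KB", "size=15019KB", "size=202KB"]

print(naive_pattern(files))  # file=sheet-2016-*.xlsx
print(naive_pattern(sizes))  # size=*KB
```

Both outputs agree with what the question asks for, because in these sets the varying part never shares characters with the fixed prefix or suffix.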

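Edit: as wim noted:

When all strings are equal, the result should be the string itself rather than prefix*suffix: check first whether all strings are identical.
When the common prefix and the common suffix overlap (share letters), the computation also goes wrong: the common suffix should be computed on the strings minus the common prefix.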
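Both problems can be reproduced with a small sketch of the first version. The helper name naive_pattern is hypothetical, and reading the asterisk as a glob wildcard (checked here with fnmatch.fnmatchcase) is an assumption made for illustration.

```python
import os
from fnmatch import fnmatchcase


def naive_pattern(strings):
    # Hypothetical helper reproducing the first version of the answer:
    # prefix and suffix are computed independently on the full strings.
    prefix = os.path.commonprefix(strings)
    suffix = os.path.commonprefix([s[::-1] for s in strings])[::-1]
    return "{}*{}".format(prefix, suffix)


# Case 1: all strings equal, so prefix and suffix are both the whole string.
print(naive_pattern(["id.88.paid", "id.88.paid"]))  # id.88.paid*id.88.paid

# Case 2: prefix "aB" and suffix "Bc" share the middle "B" of "aBc".
overlapping = ["aBBc", "aBc"]
pattern = naive_pattern(overlapping)
print(pattern)  # aB*Bc

# The resulting pattern no longer matches one of its own inputs.
print(fnmatchcase("aBc", pattern))  # False
```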

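So we need an all-in-one method that first tests the list to make sure at least 2 strings differ (condensing the prefix/suffix formula in the process), and that computes the common suffix on slices with the common prefix removed: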

def compute_generic_string(dataset):
    # edge case where all strings are the same
    if len(set(dataset)) == 1:
        return dataset[0]

    commonprefix = os.path.commonprefix(dataset)
    # common suffix of what remains once the common prefix is stripped
    remainders = [s[len(commonprefix):] for s in dataset]
    commonsuffix = os.path.commonprefix([r[::-1] for r in remainders])[::-1]

    return "{}*{}".format(commonprefix, commonsuffix)

Now let's test it:

for dataset in [['id.4030.paid','id.1280.paid','id.88.paid'],['aBBc', 'aBc'],[]]: 
    print(compute_generic_string(dataset))

Result:

id.*.paid 
aB*c 
*

(When the dataset is empty, the code returns *; maybe that should be handled as another edge case.)
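One possible way to treat the empty dataset as its own edge case is sketched below. The function name compute_generic_string_strict and the choice of raising ValueError are assumptions, not part of the original answer; returning an empty string or None would be just as defensible.

```python
import os


def compute_generic_string_strict(dataset):
    # Hypothetical variant: reject an empty dataset instead of returning "*".
    if not dataset:
        raise ValueError("dataset must contain at least one string")

    # Edge case where all strings are the same.
    if len(set(dataset)) == 1:
        return dataset[0]

    prefix = os.path.commonprefix(dataset)
    # Common suffix of what remains after the common prefix is stripped.
    remainders = [s[len(prefix):] for s in dataset]
    suffix = os.path.commonprefix([r[::-1] for r in remainders])[::-1]
    return "{}*{}".format(prefix, suffix)


print(compute_generic_string_strict(["aBBc", "aBc"]))  # aB*c

try:
    compute_generic_string_strict([])
except ValueError as exc:
    print("rejected:", exc)
```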

Source

2017-08-25 22:35:44

Dang, 'os.path.commonprefix'! How long has that been around? – wim

Upvote for commonprefix... I didn't know it existed. – Solaxun

Quite a nice one, plus one. –

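from os.path import commonprefix

def commonsuffix(m):
    # common prefix of the reversed strings, reversed back
    return commonprefix([s[::-1] for s in m])[::-1]

def inverse_glob(strs):
    start = commonprefix(strs)
    n = len(start)
    # search for the suffix only in what follows the common prefix
    ends = [s[n:] for s in strs]
    end = commonsuffix(ends)
    if start and not any(ends):
        # every string equals the common prefix: no wildcard needed
        return start
    else:
        return start + '*' + end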

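This problem is trickier than it appears on the surface.

As currently specified, the problem is still not well constrained, i.e. there is no unique solution. For the input ['spamAndEggs', 'spamAndHamAndEggs'], both spam*AndEggs and spamAnd*Eggs are valid answers. For the input ['aXXXXz', 'aXXXz'] there are four possible solutions. In the code given above, we prefer the longest possible prefix in order to make the solution unique.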
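The ambiguity can be made concrete with a short sketch that lists every split between prefix and suffix. The function name candidate_patterns is hypothetical, and so is the working definition of a "solution" used here: a pattern prefix*suffix that keeps as many literal characters as possible while still matching every input. It assumes a non-empty list of distinct strings.

```python
from os.path import commonprefix


def candidate_patterns(strs):
    # Hypothetical helper: all "prefix*suffix" patterns with the maximum
    # number of literal characters that still match every input string.
    prefix = commonprefix(strs)
    suffix = commonprefix([s[::-1] for s in strs])[::-1]
    shortest = min(len(s) for s in strs)
    # Prefix and suffix must not overlap inside the shortest string.
    best = min(shortest, len(prefix) + len(suffix))
    patterns = []
    for p in range(len(prefix) + 1):
        s = best - p  # suffix length left once p characters go to the prefix
        if 0 <= s <= len(suffix):
            patterns.append(prefix[:p] + "*" + suffix[len(suffix) - s:])
    return patterns


print(candidate_patterns(["aXXXXz", "aXXXz"]))
# ['a*XXXz', 'aX*XXz', 'aXX*Xz', 'aXXX*z']

spam = candidate_patterns(["spamAndEggs", "spamAndHamAndEggs"])
print("spam*AndEggs" in spam, "spamAnd*Eggs" in spam)  # True True
```

Under this definition the last entry of each list is the one with the longest prefix, which is the choice made by inverse_glob.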

Credit to JFF's answer for pointing out the existence of os.path.commonprefix.

Inverse glob - reverse engineer a wildcard string from file names is a related and harder generalization of this problem.

Source

2017-08-25 22:41:38 wim

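Thanks for your comments and for helping make my solution better. Some may object that your solution is a copy of mine, but without your comments I could not have arrived at a working one. –

FWIW my [original solution](https://stackoverflow.com/revisions/45890262/1) had the same bug as yours. I deleted it when I saw your implementation, which was better. – wim

We can say we beat that one together :) –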
