Grouping items by matching sets

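I want to parse a large number of configuration files and group the results into separate groups based on their content. I just don't know how to approach this. For example, say I have the following data in 4 files: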

 
config1.txt 
ntp 1.1.1.1 
ntp 2.2.2.2 

config2.txt 
ntp 1.1.1.1 

config3.txt 
ntp 2.2.2.2 
ntp 1.1.1.1 

config4.txt 
ntp 2.2.2.2

 
The results would be: 
Sets of unique data 3: 
Set 1 (1.1.1.1, 2.2.2.2): config1.txt, config3.txt 
Set 2 (1.1.1.1): config2.txt 
Set 3 (2.2.2.2): config4.txt

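I know how to glob the directory of files, loop over the glob results, open each file one at a time, and use a regular expression to match each line. The part I don't understand is how to store these results and compare each file against a set of results, even when the entries are out of order but match entry-wise. Any help would be appreciated.

Thanks!

Source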

2011-09-21 Ethan Whitt

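"I know how to glob the directory of files, loop over the glob results, open each file one at a time, and use a regular expression to match each line": show us your code and we'll happily tell you how to do the rest. Hint: use a dictionary. – agf

I would approach it like this:

First, build a dictionary like this: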

{(1.1.1.1) : (file1, file2, file3), (2.2.2.2) : (file1, file3, file4) }

Then iterate over the files to build the sets:

{(file1) : ((1.1.1.1), (2.2.2.2)), etc }

Then compare the values of the sets:

if val(file1) == val(file3): 
    Set1 = {((1.1.1.1), (2.2.2.2)) : (file1, file3), etc }

This may not be the fastest or most elegant solution, but it should work.
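A minimal sketch of the three steps above, assuming the entries have already been parsed into an in-memory dict. The names `entries_by_file` and `group_by_entries` are hypothetical, and the sample data simply mirrors the question:

```python
from collections import defaultdict

# Hypothetical, already-parsed data mirroring the question's four files.
entries_by_file = {
    "config1.txt": ["1.1.1.1", "2.2.2.2"],
    "config2.txt": ["1.1.1.1"],
    "config3.txt": ["2.2.2.2", "1.1.1.1"],
    "config4.txt": ["2.2.2.2"],
}

def group_by_entries(entries_by_file):
    # Step 1: entry -> files that contain it.
    files_by_entry = defaultdict(set)
    for name, entries in entries_by_file.items():
        for entry in entries:
            files_by_entry[entry].add(name)

    # Step 2: file -> set of its entries, rebuilt from the first mapping.
    sets_by_file = defaultdict(set)
    for entry, names in files_by_entry.items():
        for name in names:
            sets_by_file[name].add(entry)

    # Step 3: compare the sets; files with equal sets share a group.
    groups = []  # list of (entry_set, [file names])
    for name in sorted(sets_by_file):
        for entry_set, names in groups:
            if entry_set == sets_by_file[name]:
                names.append(name)
                break
        else:
            # No existing group matched, so start a new one.
            groups.append((sets_by_file[name], [name]))
    return groups

groups = group_by_entries(entries_by_file)
```

Because the per-file sets are rebuilt from the first mapping, a file with no entries never shows up in any group in this sketch.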

Source

2011-09-21 08:40:24 Glaslos

from collections import defaultdict 

# Load the data.
paths = ["config1.txt", "config2.txt", "config3.txt", "config4.txt"]
files = {}

for path in paths:
    with open(path) as file:
        data = []
        for line in file:
            ...  # Get data from the line and append it to data
    files[path] = frozenset(data)

#Example data. 
files = { 
    "config1.txt": frozenset(["1.1.1.1", "2.2.2.2"]), 
    "config2.txt": frozenset(["1.1.1.1"]), 
    "config3.txt": frozenset(["2.2.2.2", "1.1.1.1"]), 
    "config4.txt": frozenset(["2.2.2.2"]), 
} 

sets = defaultdict(list) 

for key, value in files.items(): 
    sets[value].append(key)

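Note that you need to use frozensets because they are immutable and can therefore be used as dictionary keys. Since the contents won't change after loading, that works out fine.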
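A short illustration of that point, using only built-in types: a plain set is rejected as a dictionary key, while frozensets built in a different order compare and hash as equal, so they land on the same key. The variable names are made up for this example.

```python
a = frozenset(["1.1.1.1", "2.2.2.2"])
b = frozenset(["2.2.2.2", "1.1.1.1"])  # same entries, different order

# Equal frozensets hash equally, so both refer to the same dictionary key.
same_key = (a == b) and (hash(a) == hash(b))

lookup = {a: ["config1.txt"]}
lookup[b].append("config3.txt")  # found via b, stored under a

# A mutable set cannot be used as a key at all.
try:
    {set(["1.1.1.1"]): []}
    set_is_hashable = True
except TypeError:
    set_is_hashable = False
```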

Source

2011-09-21 08:46:35

Lean and mean, I like it. I think it's O(N * M), where N is the number of files and M is the average number of configuration entries per file. –

filenames = [r'config1.txt',
             r'config2.txt',
             r'config3.txt',
             r'config4.txt']
results = {}
for filename in filenames:
    with open(filename, 'r') as f:
        contents = (line.split()[1] for line in f)
        key = frozenset(contents)
        results.setdefault(key, []).append(filename)
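To print the groups in the format shown in the question, something like the following could work. It is only a sketch: `results` is filled in by hand with the example data rather than read from disk, `format_groups` is a hypothetical helper, and the sorting is added just to make the output order predictable.

```python
# Hand-written stand-in for the `results` dict built above.
results = {
    frozenset(["1.1.1.1", "2.2.2.2"]): ["config1.txt", "config3.txt"],
    frozenset(["1.1.1.1"]): ["config2.txt"],
    frozenset(["2.2.2.2"]): ["config4.txt"],
}

def format_groups(results):
    lines = ["Sets of unique data %d:" % len(results)]
    # Order the groups by their sorted file names so numbering is stable.
    ordered = sorted(results.items(), key=lambda kv: sorted(kv[1]))
    for number, (key, filenames) in enumerate(ordered, 1):
        lines.append("Set %d (%s): %s" % (
            number, ", ".join(sorted(key)), ", ".join(sorted(filenames))))
    return lines

report = format_groups(results)
print("\n".join(report))
```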

Source

2011-09-21 08:47:25

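I prefer defaultdict(list) over dict.setdefault. – rocksportrocker

I probably should have done that too, but I have a habit of trying to get by with as few imports as possible, and it's a hard one for me to break. –

Yes, imports are a problem.. – rocksportrocker

You need a dictionary that maps file contents to file names. So you have to read each file, sort the entries, build a tuple from them, and use that tuple as the key.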

If a file can contain duplicate entries: read the contents into a set first.
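A sketch of that idea, assuming each line has the form `ntp <address>` as in the question. The name `group_by_sorted_tuple` is hypothetical, the function takes already-read lines so the example runs without files on disk, and the duplicate line in config3.txt is added here only to exercise the set step:

```python
def group_by_sorted_tuple(lines_by_file):
    groups = {}
    for name, lines in lines_by_file.items():
        # set() drops duplicate entries; sorted() makes the order canonical.
        entries = set(line.split()[1] for line in lines if line.strip())
        key = tuple(sorted(entries))
        groups.setdefault(key, []).append(name)
    return groups

lines_by_file = {
    "config1.txt": ["ntp 1.1.1.1", "ntp 2.2.2.2"],
    "config2.txt": ["ntp 1.1.1.1"],
    "config3.txt": ["ntp 2.2.2.2", "ntp 1.1.1.1", "ntp 1.1.1.1"],
    "config4.txt": ["ntp 2.2.2.2"],
}
groups = group_by_sorted_tuple(lines_by_file)
```

A sorted tuple and a frozenset serve the same purpose as keys here; the tuple additionally gives a fixed order when the key is printed.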

Source

2011-09-21 08:47:54 rocksportrocker

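This approach is more verbose than the others, but depending on several factors (see the note at the end) it may be more efficient. Unless you are processing a large number of files with a large number of configuration items, I wouldn't even consider using it over some of the other suggestions, but if performance becomes an issue, this algorithm may help.

Start with a dictionary from configuration strings to sets of files (call it c2f) and a dictionary from files to sets of configuration strings (f2c). Both can be built as you glob the files.

To be clear, c2f is a dictionary whose keys are strings and whose values are sets of files. f2c is a dictionary whose keys are files and whose values are sets of strings.

Loop over the file keys of f2c and pick one data item, then use c2f to find all the files that contain that item. Those are the only files you need to compare.

Here's working code: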

# This structure simulates the file system and its contents.
cfg_data = {
    "config1.txt": ["1.1.1.1", "2.2.2.2"],
    "config2.txt": ["1.1.1.1"],
    "config3.txt": ["2.2.2.2", "1.1.1.1"],
    "config4.txt": ["2.2.2.2"]
}

# Build the dictionaries (this is O(n) over the lines of configuration data).
f2c = dict()
c2f = dict()

for file, data in cfg_data.items():
    data_set = set()
    for item in data:
        data_set.add(item)
        if item not in c2f:
            c2f[item] = set()

        c2f[item].add(file)
    f2c[file] = data_set

# Build the results as a list of (data, files) pairs.
results = []

# Track the processed files.
processed = set()

for file, data in f2c.items():
    if file in processed:
        continue

    equivalence_list = []

    # Get one item from data, preferably the one used by the smallest set of
    # files.
    item = None
    item_files = 0
    for i in data:
        if item is None:
            item = i
            item_files = len(c2f[item])
        elif len(c2f[i]) < item_files:
            item = i
            item_files = len(c2f[i])

    # All files with the same data as this file must contain the chosen item,
    # so just look at those files.
    for other_file in c2f[item]:
        other_data = f2c[other_file]
        if other_data == data:
            equivalence_list.append(other_file)
            # No need to visit these files again.
            processed.add(other_file)

    results.append((data, equivalence_list))

# Display the results.
for data, files in results:
    print(data, ':', files)

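A note on computational complexity: this is technically O((K log N) * (L log M)), where N is the number of files, K (<= N) is the number of groups of files with the same content, and L (<= M) is the average number of files that have to be compared pairwise for each processed file. This should be efficient if K << N and L << M.

Source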

2011-09-21 08:54:42
