2017-01-17 36 views

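Can I group files by date and ID and compute a difference between them?

I have a bunch of files in a directory, 698 to be exact. Each file name contains a date, a unique ID, and a name. These are my imports, followed by some example file names: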

import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import csv
import os
import re
from collections import defaultdict  # needed for the grouping code below

20151231_7801_Test_Maps.txt 
20151231_7801_Test_Items.txt 
20151231_7802_Test_Maps.txt 
20151231_7802_Test_Items.txt 

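I want to group them by date and identifier, open each file (Maps as well as Items), and run a difference analysis on certain IDs inside the files. How would I do this?

So far I have the following code, but I don't know how to iterate over each group and open every file in it: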

groups = defaultdict(list)
for filename in os.listdir(r'F:\Desktop'):
    date = filename[:8]
    # characters 9-12 hold the ID in names like 20151231_7801_Test_Maps.txt
    identifier = filename[9:13]
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)

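My output prints some of the groups correctly, but not all of them. For example:

('20151231', '7801') ['20151231_7801_Test_Maps.txt', '20151231_7801_Test_Items.txt']

Some groups print only one file, even though there are two files for that date and identifier.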
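One way to make the grouping less sensitive to exact character positions is to parse each name with a regular expression instead of fixed slices. The sketch below assumes names of the form `<8-digit date>_<4-digit ID>_<rest>.txt`, as in the examples above. `group_by_date_and_id` and `incomplete_groups` are hypothetical helper names, and nothing here has been run against the real directory. Listing the groups that do not contain two files may help show which names end up somewhere unexpected.

```python
import re
from collections import defaultdict

# Assumed layout: 20151231_7801_Test_Maps.txt -> date, 4-digit ID, rest
NAME_PATTERN = re.compile(r'^(\d{8})_(\d{4})_(.+)\.txt$')


def group_by_date_and_id(filenames):
    """Map (date, identifier) -> sorted list of matching file names."""
    groups = defaultdict(list)
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # skip names that do not follow the assumed layout
        date, identifier, _rest = match.groups()
        groups[date, identifier].append(name)
    return {key: sorted(names) for key, names in groups.items()}


def incomplete_groups(groups, expected=2):
    """Return the groups that do not contain the expected number of files."""
    return {key: names for key, names in groups.items() if len(names) != expected}


# Sample names; in practice these would come from os.listdir(...)
names = [
    '20151231_7801_Test_Maps.txt',
    '20151231_7801_Test_Items.txt',
    '20151231_7802_Test_Maps.txt',
    'notes.txt',
]
groups = group_by_date_and_id(names)
print(groups[('20151231', '7801')])
print(incomplete_groups(groups))
```

Sorting each group also makes the order of the files predictable, which `os.listdir` on its own does not guarantee.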

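That isn't my main concern, though. Once the files are split into groups, I'd like to load each file in a group into a DataFrame, like this: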

for key in groups:
    # file1 and file2 stand for the two files in the current group
    maps = pd.read_csv(file1, sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(file2, sep='\t', usecols=['ID'], skipfooter=0, engine='python')

    # check the IDs in the two files and look for differences
    set(maps.ID).difference(items.ID)

Could someone please help with grouping the files by date and ID, and with iterating over the groups to open the files? Thanks!

Answers


Building on Shijo's answer, I found a nice way to do this.

groups = defaultdict(list)
output = []

# Group the file names by (date, identifier)
for filename in os.listdir(pathloc):
    date = filename[:8]
    identifier = filename[14:18]  # slice holding the ID; adjust to the file-name layout
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)

for key, fnames in groups.items():
    filedicts = {}
    print(list(fnames))
    maps = pd.read_csv(pathloc + fnames[1], sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(pathloc + fnames[0], sep='\t', usecols=['ID'], skipfooter=0, engine='python')

    # IDs that appear in one file but not in the other
    diffs = set(maps.ID).symmetric_difference(items.ID)

    filedicts['FileIDKey'] = list(key)
    filedicts['Missing_IDs'] = list(diffs)
    filedicts['FileNames'] = fnames

    output.append(filedicts)

This then lets me load the master list of dictionaries into a DataFrame:

new = pd.DataFrame(output) 
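`symmetric_difference` reports every ID that appears in only one of the two files, without saying which file lacks it. If that direction matters for reporting, a small sketch along these lines could keep the two sides apart. `split_differences` is a hypothetical helper, and the ID values are made up for illustration.

```python
def split_differences(maps_ids, items_ids):
    """Return the IDs found in only one of the two collections, per side."""
    maps_set, items_set = set(maps_ids), set(items_ids)
    return {
        'only_in_maps': sorted(maps_set - items_set),
        'only_in_items': sorted(items_set - maps_set),
    }


# Made-up IDs standing in for maps.ID and items.ID
result = split_differences(['A1', 'B2', 'C3'], ['B2', 'C3', 'D4'])
print(result)

# Taken together, the two sides equal the symmetric difference
both = set(result['only_in_maps']) | set(result['only_in_items'])
print(both == set(['A1', 'B2', 'C3']).symmetric_difference(['B2', 'C3', 'D4']))
```

The two lists could be stored under separate keys of the per-group dictionary in place of the single `Missing_IDs` entry.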

I took some help from https://stackoverflow.com/a/20228113/6626530 and came up with this:

import os
from collections import defaultdict

import pandas as pd

difference = pd.DataFrame(columns=('Filename1', 'Filename2', 'DiffID1', 'DiffID2'))

pathloc = 'C:\\Users\\shmathew\\Desktop\\Sample\\abc\\'
groups = defaultdict(list)
for filename in os.listdir(pathloc):
    date = filename[:8]
    # characters 9-12 hold the ID in names like 20151231_7801_Test_Maps.txt
    identifier = filename[9:13]
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)

for key, filenames in groups.items():
    maps = pd.read_csv(pathloc + filenames[1], sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(pathloc + filenames[0], sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    # Stack both ID columns; an ID that occurs only once is missing from one file
    df = pd.concat([maps, items])
    df = df.reset_index(drop=True)
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

    ids = df.reindex(idx)
    row = list(filenames)
    row.extend(list(ids['ID']))

    print(row)
    # DataFrame.append returned a new frame rather than changing `difference`
    # in place, so `difference` stays empty; the method was removed in
    # pandas 2.0, where this line raises AttributeError
    difference.append(row)
print(difference)

Output

['20151231_7802_Test_Items.txt', '20151231_7802_Test_Maps.txt', '00432931830TRNY1 ', '00432xx0TRNY1 '] 
['20151231_7801_Test_Items.txt', '20151231_7801_Test_Maps.txt'] 
Empty DataFrame 
Columns: [Filename1, Filename2, DiffID1, DiffID2] 
Index: [] 

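Thanks! This works well. I was wondering whether there is a way to put this into a DataFrame with a column named Difference, with the file name/ID next to each record? (It would make filtering for reporting purposes easier.) – staten12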


Updated the code, but I couldn't get them into a DataFrame. – Shijo
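One way to get what the comment above asks for might be to collect one record per differing ID and build the frame once at the end, rather than appending inside the loop. The sketch below is an illustration under assumptions, not the answerer's method: `difference_records` is a hypothetical helper, the column names `Date`, `FileID`, `Difference`, and `FoundIn` are invented, and the ID sets are passed in directly instead of being read from the files.

```python
def difference_records(key, ids_by_file):
    """Build one record per ID that is missing from at least one file.

    key         -- (date, identifier) tuple for the group
    ids_by_file -- dict mapping each file name in the group to its set of IDs
    """
    date, identifier = key
    records = []
    for file_name in sorted(ids_by_file):
        ids = ids_by_file[file_name]
        # ID sets of the other files in the same group
        others = [other for name, other in ids_by_file.items() if name != file_name]
        for value in sorted(ids):
            if any(value not in other for other in others):
                records.append({
                    'Date': date,
                    'FileID': identifier,
                    'Difference': value,
                    'FoundIn': file_name,
                })
    return records


# Made-up ID sets standing in for the ID columns of the two files
records = difference_records(
    ('20151231', '7802'),
    {
        '20151231_7802_Test_Maps.txt': {'X1', 'X2', 'X3'},
        '20151231_7802_Test_Items.txt': {'X2', 'X3', 'X4'},
    },
)
for record in records:
    print(record)
```

Extending one shared list with the records of every group and then calling `pd.DataFrame(all_records)` should give a frame with one row per differing ID, which can be filtered by `Date`, `FileID`, or `FoundIn`.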
