2017-06-04 21 views
1

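Python - Filtering a column from a .dat file and returning given values from other columns

I'm new to Python and have been working with a .dat file I created (150 lines) containing student ID number, grade, age, class_code, area_code, and so on. What I want to do with the data is not only filter by a certain column (by grade, age, etc.), but also build a list from a different column of the matching rows (the student ID). I have managed to isolate the column I need to search for a specific value, but I can't figure out how to build the list of values I need returned.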

So, here is a sample of the data (six rows):

1/A/15/13/43214 
2/I/15/21/58322 
3/C/17/89/68470 
4/I/18/6/57362 
5/I/14/4/00000 
6/A/16/23/34567 

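I need a list of the first column (student ID) based on filtering the second column (grade)... (and eventually the third column, fourth column, etc., but if I see how it looks for just the second, I think I can figure out the others). Also note: I'm not using headers in the .dat file.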

I figured out how to isolate/view the second column.

import numpy 

data = numpy.genfromtxt('/testdata.dat', delimiter='/', dtype='unicode') 

grades = data[:,1] 
print (grades) 

This prints:

['A' 'I' 'C' 'I' 'I' 'A'] 

But now, how can I pull just the first-column values that correspond to the A's, C's, and I's into separate lists?

So I'd like to see a list of the column 1 values, as integers separated by commas, for each of the A's, C's, and I's:

list from A = [1, 6] 
list from C = [3] 
list from I = [2, 4, 5] 

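Again, if I can see how it's done with just the second column and just one value (say A), I think I can figure out how to do it for the B's, C's, D's, etc., and for the other columns. I just need to see one example of how the syntax is applied, and then I'd like to work through the rest myself.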

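Also, I've been using numpy, but I've read about pandas and csv too, and I think those libraries might work as well. But like I said, I've been using numpy for the .dat file. I don't know whether another library would be easier to use?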

Answers

1

Pandas solution:

import pandas as pd 

df = pd.read_csv('data.txt', header=None, sep='/') 
dfs = {k:v for k,v in df.groupby(1)} 

So we have a dictionary of DataFrames:

In [59]: dfs.keys() 
Out[59]: dict_keys(['I', 'C', 'A']) 

In [60]: dfs['I'] 
Out[60]: 
   0  1   2   3      4
1  2  I  15  21  58322
3  4  I  18   6  57362
4  5  I  14   4      0

In [61]: dfs['C'] 
Out[61]: 
   0  1   2   3      4
2  3  C  17  89  68470

In [62]: dfs['A'] 
Out[62]: 
   0  1   2   3      4
0  1  A  15  13  43214
5  6  A  16  23  34567

In case you want the first-column values as lists, one per group:

In [67]: dfs['I'].iloc[:, 0].tolist() 
Out[67]: [2, 4, 5] 

In [68]: dfs['C'].iloc[:, 0].tolist() 
Out[68]: [3] 

In [69]: dfs['A'].iloc[:, 0].tolist() 
Out[69]: [1, 6] 
1

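You can loop over the grades and build a boolean mask to select the rows of the array that match a particular grade. This could probably use some refinement.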

import numpy as np 

grades = np.genfromtxt('data.txt', delimiter='/', skip_header=0, dtype='unicode') 


res = {} 
for grade in set(grades[:, 1].tolist()): 
    res[grade] = grades[grades[:, 1]==grade][:,0].tolist() 

print(res)
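
For the single-value case the question asks about (say, only the A's), the same boolean-mask idea can be sketched without the loop. This is only an illustration: the sample rows are inlined and split by hand as a stand-in for `genfromtxt`, and the names `SAMPLE`, `mask`, and `a_ids` are made up here.

```python
import numpy as np

# Sample rows from the question, inlined so the sketch runs without a file.
SAMPLE = """1/A/15/13/43214
2/I/15/21/58322
3/C/17/89/68470
4/I/18/6/57362
5/I/14/4/00000
6/A/16/23/34567"""

# Stand-in for np.genfromtxt(..., delimiter='/', dtype='unicode'):
# a 2-D array of strings, one row per student.
data = np.array([line.split("/") for line in SAMPLE.splitlines()])

mask = data[:, 1] == "A"                      # True where the grade column is 'A'
a_ids = data[mask, 0].astype(int).tolist()    # IDs of those rows, as integers
print(a_ids)
```

Swapping `"A"` for another grade, or `1` for another column index, gives the other lists.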
+0

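So I've been playing with the different solutions posted so far. I like your solution. It shows res as a set of lists. I've tried looking, and I'm still searching, but is there a way to separate the lists from the set? So I could basically have a list of the 'A' grades from res, a list of the 'C' grades from res, and so on? All I've found is adding lists to sets, or removing lists from lists, or subsets of lists. But I can't seem to find anything about a set of multiple lists. – chitown88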
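
For what it's worth, `res` in the answer above is a dictionary keyed by grade rather than a set, so each list can be pulled out by its key. A small sketch, with `res` written out by hand as the loop would build it from the six sample rows (the names `a_list`, `c_list`, `d_list`, and `a_ids` are just illustrative):

```python
# What the loop above produces for the sample data; genfromtxt was called
# with dtype='unicode', so the IDs are strings.
res = {"A": ["1", "6"], "C": ["3"], "I": ["2", "4", "5"]}

a_list = res["A"]            # the list for grade A
c_list = res["C"]            # the list for grade C
d_list = res.get("D", [])    # a grade with no students falls back to an empty list

# Convert to integers if numeric IDs are wanted.
a_ids = [int(s) for s in a_list]

# Walk over every grade and its list.
for grade, ids in sorted(res.items()):
    print(grade, ids)
```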

1

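You don't actually need any additional modules for such a simple task. A pure-Python solution would read the file line by line and "parse" each line with str.split(), which gives you your lists, and you can then filter on pretty much any parameter. Something like: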

students = {} # store for our students by grade 
with open("testdata.dat", "r") as f: # open the file 
    for line in f: # read the file line by line 
        row = line.strip().split("/") # split the line into individual columns 
        # you can now directly filter your row, or you can store the row in a list for later 
        # let's split them by grade: 
        grade = row[1] # second column of our row is the grade 
        # create/append the sublist in our `students` dict keyed by the grade 
        students.setdefault(grade, []).append(row) 
# now your students dict contains all students split by grade, e.g.: 
a_students = students["A"] 
# [['1', 'A', '15', '13', '43214'], ['6', 'A', '16', '23', '34567']] 

# if you want only to collect the A-grade student IDs, you can get a list of them as: 
student_ids = [entry[0] for entry in students["A"]] 
# ['1', '6'] 
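
Since the question also mentions the csv module and asks for integer IDs, here is a rough sketch of the same grouping done with `csv.reader` and a `defaultdict`. It assumes the first column always holds an integer. The helper name `ids_by_grade` and the inlined `SAMPLE` are illustrative only, and `io.StringIO` stands in for the opened file.

```python
import csv
import io
from collections import defaultdict

# Sample rows from the question, inlined so the sketch runs without a file.
SAMPLE = """1/A/15/13/43214
2/I/15/21/58322
3/C/17/89/68470
4/I/18/6/57362
5/I/14/4/00000
6/A/16/23/34567
"""

def ids_by_grade(lines):
    """Map each grade (second column) to a list of integer IDs (first column)."""
    grouped = defaultdict(list)
    for row in csv.reader(lines, delimiter="/"):
        if not row:                 # csv.reader yields [] for blank lines
            continue
        grouped[row[1]].append(int(row[0]))
    return dict(grouped)

# With a real file this would be: open("testdata.dat", newline="")
result = ids_by_grade(io.StringIO(SAMPLE))
print(result["A"])
```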

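But let's go back a few steps: if you want a more general solution, you should just store your rows in a list and then create a function that filters them by the parameters you pass in, so: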

# define a filter function 
# filters should contain a list of filters whereas a filter would be defined as: 
# [position, [values]] 
# and you can define as many as you want 
def filter_sublists(source, filters=None): 
    result = [] # store for our result 
    filters = filters or [] # in case no filters are passed 
    for element in source: # go through every element of our source data 
        try: 
            if all(element[f[0]] in f[1] for f in filters): # check if all our filters match 
                result.append(element) # add the element 
        except IndexError: # invalid filter position or data position, ignore 
            pass 
    return result # return the result 

# now we can use it to filter our data, first lets load our data: 

with open("testdata.dat", "r") as f: # open the file 
    students = [line.strip().split("/") for line in f] # store all our students as a list 

# now we have all the data in the `students` list and we can filter it by any element 
a_students = filter_sublists(students, [[1, ["A"]]]) 
# [['1', 'A', '15', '13', '43214'], ['6', 'A', '16', '23', '34567']] 

# or again, if you just need the IDs: 
a_student_ids = [entry[0] for entry in filter_sublists(students, [[1, ["A"]]])] 
# ['1', '6'] 

# but you can filter by any parameter, for example: 
age_15_students = filter_sublists(students, [[2, ["15"]]]) 
# [['1', 'A', '15', '13', '43214'], ['2', 'I', '15', '21', '58322']] 

# or you can get all I-grade students aged 14 or 15: 
i_students = filter_sublists(students, [[1, ["I"]], [2, ["14", "15"]]]) 
# [['2', 'I', '15', '21', '58322'], ['5', 'I', '14', '4', '00000']] 
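
The membership filters above compare strings, so a range such as "16 or older" would need every age spelled out. One possible variation, sketched here under the assumption that the ID, age, and class code are always integers, converts each row once and filters with ordinary predicates. The names `Student`, `parse_row`, and `select` are hypothetical and not part of the answer above.

```python
from typing import NamedTuple

class Student(NamedTuple):
    student_id: int
    grade: str
    age: int
    class_code: int
    area_code: str   # kept as text so '00000' keeps its leading zeros

def parse_row(line):
    """Turn one 'id/grade/age/class_code/area_code' line into a Student."""
    sid, grade, age, class_code, area_code = line.strip().split("/")
    return Student(int(sid), grade, int(age), int(class_code), area_code)

def select(students, *predicates):
    """Keep the students for which every predicate returns True."""
    return [s for s in students if all(p(s) for p in predicates)]

# Sample rows from the question, inlined so the sketch runs without a file.
SAMPLE = [
    "1/A/15/13/43214", "2/I/15/21/58322", "3/C/17/89/68470",
    "4/I/18/6/57362", "5/I/14/4/00000", "6/A/16/23/34567",
]
students = [parse_row(line) for line in SAMPLE]

# IDs of everyone aged 16 or older
older_ids = [s.student_id for s in select(students, lambda s: s.age >= 16)]

# I-grade students aged 15 or younger
young_i = select(students, lambda s: s.grade == "I", lambda s: s.age <= 15)
print(older_ids, [s.student_id for s in young_i])
```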