2015-11-12 48 views
-8

Help!!! Search and write output with Python

A list of 150 text files,

One text file with query texts: ( 
    SRR1005851 
    SRR1299210 
    SRR1021605 
    SRR1299782 
    SRR1299369 
    SRR1006158 
    ...etc). 

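I want to search for these query texts in each of the 150 text files in the list.
If, for example, SRR1005851 is found in at least 120 files, SRR1005851 will be appended to the output file.
The search will iterate over all query texts and all 150 files.

Summary: I am looking for the query texts that are found in at least 90% of the 150 files.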
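The requirement above comes down to counting, for each query ID, how many files contain it, and comparing that count with a threshold. Here is a minimal sketch with in-memory sets standing in for the files. The function name `ids_in_enough_files` and the toy data are hypothetical, and it assumes each file can be reduced to a set of IDs.

```python
import math

def ids_in_enough_files(query_ids, files_contents, min_fraction=0.9):
    """Return the query IDs present in at least min_fraction of the files.

    files_contents is a list of sets, one set of IDs per file (an assumption).
    """
    # round() guards against floating-point noise before taking the ceiling
    needed = math.ceil(round(len(files_contents) * min_fraction, 9))
    kept = []
    for query in query_ids:
        # number of files in which this query ID occurs
        hits = sum(1 for ids in files_contents if query in ids)
        if hits >= needed:
            kept.append(query)
    return kept

# Toy data: 10 "files", so an ID must appear in at least 9 of them
files = [{"SRR1005851", "SRR1299210"} for _ in range(9)] + [{"SRR1299210"}]
print(ids_in_enough_files(["SRR1005851", "SRR1299210", "SRR1021605"], files))
```

With 150 files and a 90% cut, the threshold computed this way is 135 files; the 120 files mentioned in the example would correspond to `min_fraction=0.8`.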

+2

Well, what have you tried? Show us your code and tell us where you are stuck, and we can help. –

+0

What have you done so far? What specific problem did you run into while coding this? – NSNoob

+0

I wrote the code below. I have an idea of what is needed, but I am not sure how to make it work. Please help. `count = 0 with open("expressed.txt", "w") as result: with open("C:/Users/ifeanyi/Desktop/modify/Bmori_id.txt", "r") as query_file: for match in query_file: for name in glob.glob("*.txt"): with open(name, "r") as compare: for line in compare: if line match: count = +1 result.append(count)` – Mikko

Answers

0

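I don't think I fully understand your question. Posting your code and a sample file would be very helpful.

This code counts all entries across all files, then identifies the unique entries of each file. After that, it counts how many files each entry occurs in. Finally, it selects only the entries that appear in at least 90% of all files.

Also, this code could be shorter, but for readability I created many variables with long, meaningful names.

Please read the comments ;)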

import os 
from collections import Counter 
from sys import argv 

# adjust your cut point 
PERCENT_CUT = 0.9 

# here we are going to save each file's entries, so we can sum them later 
files_dict = {} 

# total files seems to be the number you'll need to check against count 
total_files = 0

# raw total entries, even duplicates
total_entries = 0

unique_entries = 0

# first argument is script name, so have the second one be the folder to search 
search_dir = argv[1] 

# list everything under search dir - ideally only your input files 
# CHECK HOW TO READ ONLY SPECIFIC FILE types if you have something inside the same folder 
files_list = os.listdir(search_dir) 

total_files = len(files_list) 

print('Files READ:') 

# iterate over each file found at given folder 
for file_name in files_list: 
    print(" "+file_name) 

    # os.path.join adds the path separator if search_dir lacks a trailing one
    file_object = open(os.path.join(search_dir, file_name), 'r')

    # returns a list of entries with 'newline' stripped 
    # (a list, not map(): in Python 3 map() returns an iterator and len() would fail)
    file_entries = [line.strip("\r\n") for line in file_object]

    # gotta count'em all 
    total_entries += len(file_entries) 

    # set doesn't allow duplicate entries 
    entries_set = set(file_entries) 

    #creates a dict from the set, set each key's value to 1. 
    file_entries_dict = dict.fromkeys(entries_set, 1) 

    # entries dict is now used differently, each key will hold a COUNTER
    files_dict[file_name] = Counter(file_entries_dict) 

    file_object.close()


print("\n\nALL ENTRIES COUNT: "+str(total_entries)) 

# now we create a dict that will hold each unique key's count so we can sum all dicts read from files 
entries_dict = Counter({}) 

for file_dict_key, file_dict_value in files_dict.items(): 
    print(str(file_dict_key)+" - "+str(file_dict_value)) 
    entries_dict += file_dict_value 

print("\nUNIQUE ENTRIES COUNT: "+str(len(entries_dict.keys()))) 

# print(entries_dict) 

# 90% from your question 
cut_line = total_files * PERCENT_CUT 
print("\nNeeds at least "+str(int(cut_line))+" entries to be listed below") 
# output dict is the final dict, where we put entries that were present in >= 90% of the files.
output_dict = {}
# this is PYTHON 3 - CHECK YOUR VERSION as older versions might use iteritems() instead of items() in the line below
for entry, count in entries_dict.items():
    if count >= cut_line:
        output_dict[entry] = count

print(output_dict) 
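
The script above reports every entry that passes the cut, but it neither reads the query file nor writes an output file. Here is a rough sketch of how those two steps could be added. It assumes each data file holds one ID per line and that only `*.txt` files in the folder should be searched. The function name `write_common_ids` and its parameters are hypothetical, not part of the original answer.

```python
import glob
import math
import os
from collections import Counter

def write_common_ids(query_path, data_dir, out_path, min_fraction=0.9):
    """Write the query IDs found in at least min_fraction of the *.txt files."""
    with open(query_path) as handle:
        queries = [line.strip() for line in handle if line.strip()]

    # Search only *.txt files in data_dir; skip the query and output files
    skip = {os.path.abspath(query_path), os.path.abspath(out_path)}
    paths = [p for p in sorted(glob.glob(os.path.join(data_dir, "*.txt")))
             if os.path.abspath(p) not in skip]

    files_with_id = Counter()
    for path in paths:
        with open(path) as handle:
            # assumption: one ID per line, matched as a whole line
            ids_in_file = {line.strip() for line in handle}
        # a set means each file counts at most once per ID
        files_with_id.update(ids_in_file.intersection(queries))

    # at least one file is always required, so an empty folder keeps nothing
    needed = max(1, math.ceil(round(len(paths) * min_fraction, 9)))
    kept = [q for q in queries if files_with_id[q] >= needed]
    with open(out_path, "w") as result:
        result.write("\n".join(kept) + ("\n" if kept else ""))
    return kept
```

A call such as `write_common_ids("Bmori_id.txt", ".", "expressed.txt")` would then write one matching ID per line; the file names here are taken from the asker's comment, and the paths would need adjusting to the real layout.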
+0

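Thank you very much, this is exactly what I wanted. I just made a few adjustments for my files and it worked like magic. Thanks a lot, bro. Thanks, stackoverflow ... – Mikko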

+0

Awesome, bro. Please mark the answer as accepted and upvote it, if you like. – guilhermo
