Search and output with Python


A list of 150 text files,

One text file with query texts: ( 
    SRR1005851 
    SRR1299210 
    SRR1021605 
    SRR1299782 
    SRR1299369 
    SRR1006158 
    ...etc).

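I want to search each of the 150 text files in the list for these query texts.
If, for example, SRR1005851 is found in at least 120 files, SRR1005851 will be appended to the output file.
The search will iterate over all query texts and all 150 files.

Summary: I am looking for query texts that occur in at least 90% of the 150 files.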


2015-11-12 Mikko

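So, what have you tried? Show us your code and tell us where you are stuck, and we can help. –

What have you done so far? What specific problem did you run into while coding this? – NSNoob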

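I wrote the code below. I have an idea of what is needed, but I'm not sure how to make it work. Please help: count = 0 with open("expressed.txt", "w") as result: with open("C:/Users/ifeanyi/Desktop/modify/Bmori_id.txt", "r") as query_file: for match in query_file: for name in glob.glob("*.txt"): with open(name, "r") as compare: for line in compare: if line match: count =+ 1 result.append(count) – Mikko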

I don't think I fully understand your question. Posting your code and a sample file would be very helpful.

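The script below counts all entries across all files, then identifies the unique entries of each file. It then counts, for each entry, how many files it occurs in, and finally keeps only the entries that appear in at least 90% of the files.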
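The counting idea can be tried in isolation on in-memory data. This is only a minimal sketch: the three "files" and their names (`a.txt`, `b.txt`, `c.txt`) are hypothetical stand-ins, and it assumes each line holds exactly one ID.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for three input files (one ID per line).
files = {
    "a.txt": ["SRR1005851", "SRR1299210", "SRR1005851"],
    "b.txt": ["SRR1005851", "SRR1021605"],
    "c.txt": ["SRR1005851", "SRR1299210"],
}

# Count each ID once per file, so the total is "number of files containing it".
files_containing = Counter()
for lines in files.values():
    files_containing.update(set(lines))

# 90% of 3 files is 2.7, so here an ID must be present in all 3 files.
threshold = 0.9 * len(files)
selected = sorted(e for e, n in files_containing.items() if n >= threshold)
print(selected)
```

Turning each file into a set first is what keeps a duplicated line (as in `a.txt`) from being counted twice.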

Also, the script below could be shorter, but for readability I created many variables with long, meaningful names.

Please read the comments ;)

import os 
from collections import Counter 
from sys import argv 

# adjust your cut point 
PERCENT_CUT = 0.9 

# here we are going to save each file's entries, so we can sum them later 
files_dict = {} 

# total files seems to be the number you'll need to check against count
total_files = 0

# raw total entries, even duplicates
total_entries = 0

unique_entries = 0

# first argument is script name, so have the second one be the folder to search 
search_dir = argv[1] 

# list everything under search dir - ideally only your input files 
# CHECK HOW TO READ ONLY SPECIFIC FILE types if you have something inside the same folder 
files_list = os.listdir(search_dir) 

total_files = len(files_list) 

print('Files READ:') 

# iterate over each file found at given folder 
for file_name in files_list: 
    print(" "+file_name) 

    # os.path.join adds the path separator if search_dir lacks a trailing one
    file_object = open(os.path.join(search_dir, file_name), 'r')

    # returns a list of entries with 'newline' stripped
    # (a list, not map(): in Python 3 map() is lazy and has no len())
    file_entries = [line.strip("\r\n") for line in file_object.readlines()]

    # gotta count'em all 
    total_entries += len(file_entries) 

    # set doesn't allow duplicate entries 
    entries_set = set(file_entries) 

    # creates a dict from the set, setting each key's value to 1
    file_entries_dict = dict.fromkeys(entries_set, 1)

    # entries dict is now used differently: each key will hold a COUNTER
    files_dict[file_name] = Counter(file_entries_dict)

    file_object.close()


print("\n\nALL ENTRIES COUNT: "+str(total_entries)) 

# now we create a dict that will hold each unique key's count so we can sum all dicts read from files 
entries_dict = Counter({}) 

for file_dict_key, file_dict_value in files_dict.items(): 
    print(str(file_dict_key)+" - "+str(file_dict_value)) 
    entries_dict += file_dict_value 

print("\nUNIQUE ENTRIES COUNT: "+str(len(entries_dict.keys()))) 

# print(entries_dict) 

# 90% from your question
cut_line = total_files * PERCENT_CUT
print("\nNeeds at least "+str(int(cut_line))+" entries to be listed below")
# output dict is the final dict, where we put entries that were present in at least 90% of the files
output_dict = {}
# this is PYTHON 3 - CHECK YOUR VERSION as older versions might use iteritems() instead of items() in the line below
for entry, count in entries_dict.items():
    # >= so that "at least 90%" includes exactly 90%
    if count >= cut_line:
        output_dict[entry] = count

print(output_dict)
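The question also mentions a separate query file and an output file, and the script's comments point out that only specific file types should be read. A minimal sketch of one way to cover those points follows. The names `ids_in_most_files` and `write_ids` and the parameters `fraction` and `pattern` are hypothetical, and the sketch assumes every line in the data files is exactly one ID (whole-line match), which the thread does not confirm.

```python
import glob
import os
from collections import Counter


def ids_in_most_files(search_dir, query_path, fraction=0.9, pattern="*.txt"):
    """Return query IDs found in at least `fraction` of the matching files."""
    # Read the query IDs, skipping blank lines.
    with open(query_path) as handle:
        queries = {line.strip() for line in handle if line.strip()}

    # Only pick up files matching the pattern; exclude the query file itself.
    paths = [
        p for p in glob.glob(os.path.join(search_dir, pattern))
        if os.path.abspath(p) != os.path.abspath(query_path)
    ]

    files_containing = Counter()
    for path in paths:
        with open(path) as handle:
            entries = {line.strip() for line in handle}
        # Each file contributes at most 1 per query ID.
        files_containing.update(entries & queries)

    threshold = fraction * len(paths)
    # With no input files at all, nothing is selected.
    return sorted(q for q in queries if paths and files_containing[q] >= threshold)


def write_ids(ids, output_path):
    """Write one ID per line to the output file."""
    with open(output_path, "w") as out:
        for entry in ids:
            out.write(entry + "\n")
```

It could be called as `write_ids(ids_in_most_files("data", "Bmori_id.txt"), "expressed.txt")`, where `data` is a placeholder folder name and the two file names are the ones from the asker's comment. If an ID can appear inside a longer line, the set intersection would have to be replaced by a substring test.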


2015-12-08 02:31:04 guilhermo

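Thank you very much, this is exactly what I wanted. I just made a few adjustments for my files and it worked like magic. Thanks a lot, bro. Thanks, stackoverflow ... – Mikko

Awesome, bro. Please mark the answer as accepted and upvote it, if you like. – guilhermo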
