2011-11-03 73 views
1

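Python: how to decrement and update the values of a field in delimited text

I have a text file (a Bowtie alignment file) that looks like this: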

 
read_1 + 345995|PACid:16033981 599 AGTAGTAATCAGTCACCCGCAAGGTAGACAAGG qqqqqqqqqqqqqqqqqqqqq!!qqqqqqqqqq 0 
read_2 + 949205|PACid:16054220 338 TACCAGCACTAATGCACCGGATCCCATCAGATC qqqqqqqqqqqqqqqqqqqqqqqqqqqqqq!!q 0 31:A>T 
read_3 + 932004|PACid:16034380 1226 GGCACCTTATGAGAAATCAAAGTTTTTGGGTTC qqqqqqqqqqqqqqq!!qqqqqqqqqqqqq!!q 3 

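I want to subtract one from column 4 (the position) and print every line with the updated value.

I can read the file, split the fields on tabs, and identify column 4 as data[3], but after that I am stuck on subtracting one from each value in column 4 and printing all the fields of every line with the updated column 4 value.

How can I do this in Python?

I tried something like this: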

import sys

in_file = open(sys.argv[1], 'r')
out_file = open(sys.argv[2], 'w')
for line in in_file:
    data = line.rstrip().split('\t')
    position = int(float(data[3]) - 1)

but I don't know how to go on and print each line with the updated position.

+1

Which part of the problem are you stuck on? (Reading the file? Identifying the fourth column? Subtracting? Printing?) – Johnsyweb

+1

Hey! I recognize that as part of my DNA sequence. Where did you get it from? Damn the internet and its lack of privacy! :-) – paxdiablo

+1

As a side note, is it necessary to use Python? This is easy to do with awk, for example `awk 'BEGIN {OFS="\t"} NF > 0 {$4 -= 1; print}' out.txt` –

Answers

1

Use the csv module, telling it that your field delimiter is a tab:

from csv import reader
from io import StringIO

# Sample data for testing; the fields are tab-separated, written here as \t.
indata = StringIO(u"""read_1\t+\t345995|PACid:16033981\t599\tAGTAGTAATCAGTCACCCGCAAGGTAGACAAGG\tqqqqqqqqqqqqqqqqqqqqq!!qqqqqqqqqq\t0
read_2\t+\t949205|PACid:16054220\t338\tTACCAGCACTAATGCACCGGATCCCATCAGATC\tqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq!!q\t0\t31:A>T
read_3\t+\t932004|PACid:16034380\t1226\tGGCACCTTATGAGAAATCAAAGTTTTTGGGTTC\tqqqqqqqqqqqqqqq!!qqqqqqqqqqqqq!!q\t3
""")

# That StringIO stuff is just for testing; with a real file you should do
#     with open('your_file_name', 'r') as indata:
# before the 'for' loop, and then indent the rest one level.

for line in reader(indata, delimiter='\t'):
    # Only update lines that actually have a fourth column.
    if len(line) > 3:
        line[3] = str(int(line[3]) - 1)
    print('\t'.join(line))

Then just convert the position to a number, subtract one, convert it back to a string, and print the line.
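The same steps can also be written against real file objects rather than a test string. The following is only a sketch under stated assumptions: the function name `decrement_column` and its `column` and `amount` parameters are made up for illustration, and `quoting=csv.QUOTE_NONE` is an addition that the answer above does not use. It is included on the assumption that quality strings may contain a double quote, which the csv module would otherwise treat as a quoting character.

```python
import csv
import io


def decrement_column(infile, outfile, column=3, amount=1):
    """Copy tab-separated rows from infile to outfile, subtracting
    `amount` from the integer found in `column` (zero-based)."""
    # QUOTE_NONE makes the reader treat '"' as ordinary data (an assumption
    # about the input, not something stated in the answer).
    rows = csv.reader(infile, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in rows:
        # Blank or short lines have no such column and are copied unchanged.
        if len(row) > column:
            row[column] = str(int(row[column]) - amount)
        outfile.write('\t'.join(row) + '\n')


if __name__ == '__main__':
    # Hypothetical one-line sample; with real data, pass opened files instead.
    sample = io.StringIO('read_1\t+\tref\t599\tACGT\tqqqq\t0\n')
    result = io.StringIO()
    decrement_column(sample, result)
    print(result.getvalue(), end='')
```

With real files, the two arguments would be the objects returned by `open(input_path)` and `open(output_path, 'w')`. The rows are written with `'\t'.join` rather than `csv.writer` so that no field is ever quoted or escaped on output.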

+0

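Thanks, but I get an error with the above code: `line[3] = str(int(line[3]) - 1)` IndexError: list index out of range. I need to print all the fields from the original file, with column 4 updated. I tried to do something like this: `in_file = open(sys.argv[1],'r')` `out_file = open(sys.argv[2], 'w')` `for line in in_file:` `data = line.rstrip().split('\t')` `position = int(float(data[3]) -1)`, but I'm not sure how to go on and print the lines with the updated position. – psaima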

+0

@psaima Then you can add an `if len(line) > 3:` test inside the `for` loop to filter out the bad lines; I've edited it in. – agf

+0

@psaima Basically your existing code is almost right: just change `position = int(float(data[3]) -1)` to `data[3] = str(int(data[3]) - 1)`, and then you can `print('\t'.join(data))` the way I do. – agf
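Put together, the change suggested in the last comment might look like the script below. This is a sketch, not the code of anyone in the thread: the helper names `shift_position` and `main` are hypothetical, and it strips only the trailing newline (`rstrip('\n')` instead of the question's `rstrip()`) on the assumption that a trailing tab marking an empty last field should be kept.

```python
import sys


def shift_position(line, column=3):
    """Return `line` with the integer in `column` (zero-based) reduced by one."""
    # Strip only the newline so a trailing empty field is preserved.
    data = line.rstrip('\n').split('\t')
    # Lines without a fourth column (for example blank lines) pass through.
    if len(data) > column:
        data[column] = str(int(data[column]) - 1)
    return '\t'.join(data)


def main(in_path, out_path):
    # Read the input line by line and write every field back out.
    with open(in_path) as in_file, open(out_path, 'w') as out_file:
        for line in in_file:
            out_file.write(shift_position(line) + '\n')


# Runs only when called as: python script.py input.txt output.txt
if __name__ == '__main__' and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Writing to `out_file` instead of printing uses the second command-line argument that the question's code already opens.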