2012-08-02 69 views
1

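Python: writing a .csv file with rows and columns transposed

I have a long piece of code that reads several different files and, at the end, writes everything out to separate .csv files.

This is all of my code: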

import csv 
import os.path 
#open files + readlines 
with open("C:/Users/Ivan Wong/Desktop/Placement/Lists of targets/Mouse/UCSC to Ensembl.csv", "r") as f: 
    reader = csv.reader(f, delimiter = ',') 
    #find files with the name in 1st row 
    for row in reader: 
     graph_filename = os.path.join("C:/Python27/Scripts/My scripts/Selenoprotein/NMD targets",row[0]+"_nt_counts.txt.png") 
     if os.path.exists(graph_filename): 
      y = row[0]+'_nt_counts.txt' 
      r = open('C:/Users/Ivan Wong/Desktop/Placement/fp_mesc_nochx/'+y, 'r') 
      k = r.readlines() 
      r.close()
      del k[:1] 
      k = map(lambda s: s.strip(), k) 
      interger = map(int, k) 
      import itertools 
      #adding the numbers for every 3 rows 
      def grouper(n, iterable, fillvalue=None): 
       "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx" 
       args = [iter(iterable)] * n 
       return itertools.izip_longest(*args, fillvalue=fillvalue) 
      result = map(sum, grouper(3, interger, 0))  
      e = row[1] 
      cDNA = open('C:/Users/Ivan Wong/Desktop/Placement/Downloaded seq/Mouse/cDNA.txt', 'r') 
      seq = cDNA.readlines() 
      # get all lines that have a gene name 
      lineNum = 0; 
      lineGenes = [] 
      for line in seq: 
       lineNum = lineNum +1 
       if '>' in line: 
        lineGenes.append(str(lineNum)) 
       if '>'+e in line: 
        lineBegin = lineNum 

      cDNA.close()

      # which gene is this 
      index1 = lineGenes.index(str(lineBegin)) 
      lineEnd = lineGenes[index1+1]   
# linebegin and lineEnd now give you, where to look for your sequence, all that 
# you have to do is to read the lines between lineBegin and lineEnd in the file 
# and make it into a single string.    
      lineEnd = lineGenes[index1+1] 
      Lastline = int(lineEnd) -1 

# in your code you have already made a list with all the lines (q), first delete 
# \n and other symbols, then combine all lines into a big string of nucleotides (like this)  
      qq = seq[lineBegin:Lastline] 
      qq = map(lambda s: s.strip(), qq) 
      string = '' 
      for i in range(len(qq)): 
       string = string + qq[i] 
# now you want to get a list of triplets, again you can use the for loop: 
# first get the length of the string 
      lenString = len(string); 
# this is your list codons 
      listCodon = [] 
      for i in range(0,lenString/3): 
       listCodon.append(string[0+i*3:3+i*3]) 
      with open(e+'.csv','wb') as outfile: 
       outfile.writelines(str(result)+'\n'+str(listCodon)) 

My problem is that the generated file looks like this:

0  0  0   
'GCA' 'CTT' 'GGT' 

I would like it to look like this:

0 GCA  
0 CTT  
0 GGT 

What can I change in my code to achieve this?

Printing result:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 2, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 3, 3, 0, 3, 1, 2, 1, 2, 1, 0, 1, 0, 1, 2, 1, 0, 5, 0, 0, 0, 0, 6, 0, 1, 0, 0, 2, 0, 1, 0, 0, 1, 1, 0, 1, 6, 34, 35, 32, 1, 1, 0, 4, 1, 0, 1, 0, 0, 0, 0, 1, 6, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 

Printing listCodon:

['gtt', 'gaa', 'aca', 'gag', 'aca', 'tgt', 'tct', 'gga', 'gat', 'gag', 'ctg', 'tgg', 'gca', 'gaa', 'gga', 'cag', 'gcc', 'taa', 'gca', 'cag', 'gca', 'gca', 'gag', 'ctt', 'tga', 'tct', 'ctt', 'ggt', 'gat', 'cgg', 'tgg', 'ggg', 'atc', 'cgg', 'tgg', 'cct', 'agc', 'ttg', 'tgc', 'caa', 'gga', 'agc', 'tgc', 'tca', 'gct', 'ggg', 'aaa', 'gaa', 'ggt', 'ggc', 'tgt', 'ggc', 'tga', 'cta', 'tgt', 'gga', 'acc', 'ttc', 'tcc', 'ccg', 'agg', 'cac', 'caa', 'gtg', 'ggg', 'cct', 'tgg', 'tgg', 'cac', 'ctg', 'tgt', 'caa', 'cgt', 'ggg', 'ttg', 'cat', 'acc', 'caa', 'gaa', 'gct', 'gat', 'gca', 'tca', 'ggc', 'tgc', 'act', 'gct', 'ggg', 'ggg', 'cat', 'gat', 'cag', 'aga', 'tgc', 'tca', 'cca', 'cta', 'tgg', 'ctg', 'gga', 'ggt', 'ggc', 'cca', 'gcc', 'tgt', 'cca', 'aca', 'caa', 'ctg', 'gtg', 'aga', 'gag', 'aag', 'ccc', 'ttg', 'ccc', 'tct', 'gca', 'ggt', 'ccc', 'att', 'gaa', 'agg', 'aga', 'ggt', 'ttg', 'ctc', 'tct', 'gcc', 'act', 'cat', 'ctg', 'taa', 'ccg', 'tga', 'gct', 'ttt', 'cca', 'ccc', 'ggc', 'ctc', 'ctc', 'ttt', 'gat', 'ccc', 'aga', 'ata', 'atg', 'act', 'ctg', 'aga', 'ctt', 'ctt', 'atg', 'tat', 'gaa', 'taa', 'atg', 'cct', 'ggg', 'cca', 'aaa', 'acc'] 

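[Image, left: what Marek's code helped me to achieve] [Image, right: what I hope for]

The picture on the left is what Marek's code helped me achieve; I would like to improve on it so that the output is arranged as in the picture on the right.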
+0

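What exactly do `result` and `listCodon` contain? I'm asking because right now your code snippet implies that both are simple strings and that the `str()` calls are useless. Otherwise there would be no way to get this output, unless they are custom classes that overload `__str__()`, in which case we would definitely need to see those definitions. – 2012-08-02 09:47:43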

+0

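@Tim Pietzcker `result` and `listCodon` come from code I did not post. They have already been processed in Python: `result` is basically numeric data (for example, the 0s I showed above), and `listCodon` holds three-letter strings (GCA, CTT, etc.). – ivanhoifung 2012-08-02 09:50:36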

+0

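So you have also left the loop out of your sample? You need to post more code; we can't guess what your program is doing. Please post what we need to reproduce the result (preferably no more than that). Then we can think about fixing it. – 2012-08-02 09:52:17

Answers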

4

You can use zip() to pair up two iterables element by element. So, if you have

result = [0, 0, 0, 0, 0] 
listCodons = ['gtt', 'gaa', 'aca', 'gag', 'aca'] 

then you can do

>>> list(zip(result, listCodons)) 
[(0, 'gtt'), (0, 'gaa'), (0, 'aca'), (0, 'gag'), (0, 'aca')] 

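or, for your example:

# Python 3: newline='' lets the csv module control line endings.
# On Python 2 (as in the question), open the file with 'wb' instead.
with open(e+'.csv', 'w', newline='') as outfile: 
    out = csv.writer(outfile) 
    out.writerows(zip(result, listCodon)) 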
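
As a self-contained sketch of the same idea (assuming Python 3; the helper name `write_pairs` and the output file name are made up for illustration and are not part of the original code), the following writes one count and one codon per row. Note that `zip()` stops at the shorter input, so if the two lists differ in length the extra items are silently dropped; `itertools.zip_longest` is shown as one possible way to keep them.

```python
import csv
import os
import tempfile
from itertools import zip_longest


def write_pairs(path, counts, codons, keep_unpaired=False):
    """Write one (count, codon) pair per CSV row.

    write_pairs is a hypothetical helper, not part of the original code.
    """
    if keep_unpaired:
        # Pad the shorter list with empty strings instead of truncating.
        rows = zip_longest(counts, codons, fillvalue='')
    else:
        # zip() stops as soon as the shorter input is exhausted.
        rows = zip(counts, codons)
    # newline='' lets the csv module control line endings (Python 3).
    with open(path, 'w', newline='') as outfile:
        csv.writer(outfile).writerows(rows)


# Small stand-ins for the lists printed in the question.
result = [0, 0, 3, 1]
listCodon = ['gtt', 'gaa', 'aca']

# Illustrative output location only.
demo_path = os.path.join(tempfile.gettempdir(), 'codon_demo.csv')
write_pairs(demo_path, result, listCodon, keep_unpaired=True)

# Read the file back to check what was written.
with open(demo_path, newline='') as infile:
    written_rows = list(csv.reader(infile))
print(written_rows)
```

With `keep_unpaired=False` the last count (which has no matching codon) would simply not appear in the file, which may or may not be what you want.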
+0

zip(result,listCodons) actually returns a list, so there is no need to wrap it in the `list` constructor. – Marek 2012-08-02 10:35:53

+1

@MarekWawrzyczek: Not in Python 3. – 2012-08-02 10:36:18

+0

This works too! Thanks, Tim – ivanhoifung 2012-08-02 10:42:43

1

Try this:

proper_result = '\n'.join([ '%s %s' % (nr, codon) for nr, codon in zip(result, listCodon) ]) 

Edit (codons split into separate columns):

proper_result = '\n'.join(' '.join([str(nr)] + list(codon)) for nr, codon in zip(result, listCodon)) 

Edit (comma-separated values):

proper_result = '\n'.join('%s, %s' % (nr, codon) for nr, codon in zip(result, listCodon)) 
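
To get that string into the file, it can be written out directly in place of the original `writelines` call. A minimal sketch, assuming Python 3 and a made-up output file name (the short lists stand in for the real data):

```python
import os
import tempfile

# Small stand-ins for the lists printed in the question.
result = [0, 0, 3]
listCodon = ['gtt', 'gaa', 'aca']

# One "count, codon" line per pair, as in the comma-separated variant above.
proper_result = '\n'.join('%s, %s' % (nr, codon)
                          for nr, codon in zip(result, listCodon))

# The file name is illustrative only.
out_path = os.path.join(tempfile.gettempdir(), 'codon_joined_demo.csv')
with open(out_path, 'w') as outfile:
    # Add a trailing newline so the last row is terminated too.
    outfile.write(proper_result + '\n')

print(proper_result)
```

Be aware that many CSV readers treat the space after the comma as part of the second field; dropping the space from the format string avoids that.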
+0

SyntaxError: invalid syntax... why? – ivanhoifung 2012-08-02 10:01:45

+0

I wrote the answer first and tested it afterwards; it is corrected now. – Marek 2012-08-02 10:12:40

+0

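Thanks! Now everything is in one column: instead of two columns, the values are squeezed into one. That is fine, but do you think you could separate them? – ivanhoifung 2012-08-02 10:13:06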