2015-10-18 47 views
2

我有一段數據,爲了分析它,我必須先組織這些數據。如何防止通過for循環覆蓋字典中的數據

數據(樣本):

3.30.67.10 [ '2i69A', '1sfkA', '1sfkB', '1sfkH', '2hcnA', '2hcsA', '2hfzA', '2of6A' ,2qeqA','2qeqB','2wa1A','2wa1B','2wa2A','2wa2B','4r05A','4r8rA','4r8sA','1pjwA','2m0sA','4uifA' '3c1A','3j4A','3u1jB','3u1jA','3u1jB','3u1jA','3u1jB','1ztxE','3c5xC','4ctjA','4ctkA',' ,'3l6pA','3lkwA','4al8C','4gsxA','4gt0A','4oigA','3j8dB','1p58A','2m9pA','2m9qA','3uzvA','1uzgA',' 3p8A',3p97A,3uzeC,3uzeD,3vttA,2bhrA,2bmfA,2fomA,2fomB,4m9fA,4m9iA,4m9kA,4m9mA, ,'2jlqA','2jlrA','2jlsA','2jluA','2jlvA','2jlwA','2jlxA','2jlyA','2jlzA','2whxA','2whxC' 2wzqA','2wzqC','3uypB','1yzoA','1k4rA ','4alaC','1z66A','2gg1A','3ixyA','2hg0A','2v6iA','2v6jA','4r8tA','1yksA','1ymfA','3evaA','3evbA', '3evcA','3evdA','3eveA','3evfA','1p58D','2b6bA','1s6nA','119kA','1oamA','1anA','1ok8A','1okeA','1r6aA '','1r6rA','1thdA','2p1dA','3j8dG','3j8dH','3zkoA','4tplA','4uifB','4ut6A','4ut9A','4utaA','4utbA', '4utcA','3g7tA','3j05A','4cauA','4cbfA','4cbfB','3j35A','1tc7C','2fp7A','2fp7B','2g05D','2ggvA','2ggvB '','2joA','2ijoB','2p5pA','2yolA','3e90A','3e90B','3e90C','4r8tB','4c2iA','2oxtA','3egpA','3ircA', '4ffyA','4ffzA','4l5fE','2jqmA','2jv6A','4am0A','4am0Q','4am0R','4fg0A','4bz1A','4bz2A','2p3qA','2xbmA '','3c5xA','2oy0A','3i50E','3iywA','3j0bA','3lkzA','4o6cA','4o6dA','4oieA','4oiiA','3c6dA','2r6pA', '2p3A','2p40A','2p41A','1urzA','2px2A','2px4A','2px5A','2px8A','2pxaA','2pxcA','2v8oA','2wv9A '','3c6dD','3c6eC','3j42D','2z83A','3p54A','4hdgA','4hdhA','4k6mA','4mtpA','4mtpD','2jsfA' ,'2r69A','3c6rA','3c6rD','3ixyD','3iyaA','3iyaD','3ixxA','2r29A','3evgA','1df9A','2qidA','3j27A',' 3j2bB','4uihA','3uzqB','1befA','1tg8A','tgeA','3ixxD','5a1zA','1n6gA','1na4A','1svbA' '4azxA','4azxD','4b03A','4b03D','4c2iB','4cctA','4cctD','2h0pA','3uajB','3uc0A','3uc0B','3we1A',' 2j7uA,2j7wA,3j6sA,3j6sB,3j6tA,3j6tB,3j6uA,3j6uB,3vwsA,4c11A,4hhjA,4v0qA,4v0rA, ]

正如你可以看到一些數據的前4位數字相似,如「1sfk」。如果他們共享前4位,這意味着它們屬於相同的結構,並且我需要爲每個完整蛋白質代碼(5位,如1sfkA或1sfkB)(在PDBSum數據庫中找到)存儲唯一的UniProt代碼數字代碼。

對於我創造了這個和平的代碼:

for domain in dDomainSeqSum.keys():# CHANGE TO COMPRESS FILE 
     dDomainSeqSumSWS[domain]={} 
     for pdb in dDomainSeqSum[domain]:#add sws of a pdb in a variable and later add that variable to the domain thing 
      pdb1 = list(pdb)#split is not working 
      pdb2 = pdb1[0]+pdb1[1]+pdb1[2]+pdb1[3] 
      dDomainSeqSumSWS[domain][pdb2]=[] 
      for i in range(len(PDBSum)): #make pdb3 search and then compare to the pdb stored 
       if pdb in PDBSum[i]: 
        if "SWS_ID" in PDBSum[i]: 
         line = PDBSum[i].split() 
         if pdb2 not in dDomainSeqSumSWS: 
          dDomainSeqSumSWS[domain][pdb2]=[line[2]] 
         else: 
          dDomainSeqSumSWS[domain][pdb2].append(line[2]) 

同時運行的代碼之後,這是我得到的結果:

{「67年3月30日。10 ':{' 4c2i ':[' G3F5K5 '],' 2p3l ':[' Q9WLZ5 '],' 4uta ':[' Q68Y26 '],' 4utc ':[' Q68Y26 '],' 4utb ':[' Q68Y26 '],' 1urz ':[' Q80E47 '],' 3l6p ':[' P17763 '],' 1tge ':[' P27914 '],' 3evb ':[' P03314 '],' 2vbc ':[' Q2TN89 '],' 3eva ':[' P03314 '],' 3evf ':[' P03314 '],' 3evg ':[' P29991 '],' 3evd ':[' P03314 '],' 3eve ':[' P03314 '],' 2p1d ':[' P12823 '],' 3j42 ':[' Q3BCY5 '],' 2jlx ':[' Q2YHF0 '],' 2jly ':[' Q2YHF0 '],' 2jlz ':[' Q2YHF0 '],' 2jlu ':[' Q2YHF0 '],' 2jlv ':[' Q2YHF0 '],' 2jlw ':[' Q2YHF0 '],' 1oke ':[' P12823 '],' 2jlq ':[' Q2YHF0 '],' 2jlr ':[' Q2YHF0 '],' 2jls ':[' Q2YHF0 '],' 2wv9 ':[' P05769 '],' 2z83 ':[' P27395 '],' 4hdh ':[' P27395 '],' 2hcn ':[' P14335 '],' 2oxt ':[' A0EKU1 '],' 1tg8 ':[' P27914 '],' 4hdg ':[' P27395 '],' 4ut9 ':[' Q68Y26 '],' 3e90 ':[' P06935 '],' 4am0 ':[' Q58HT7 '],' 4ut6 ':[' Q68Y26 '],' 1ok8 ':[' P12823 '],' 4ffy ':[' Q9J7C6 '],' 4ffz ':[' Q88640 '],' 4b03 ':[' G3F5K5 '],' 2m9p ':[' P14337 '],' 2m9q ':[' P14337 '],' 4fg0 ':[' P09732 '],' 4azx ':[' G3F5K5 '],' 2hcs ':[' P14335 '],' 4hhj ':[' Q6DLV0 '],' 4mtp ':[' P27395 '],' 3j8d ':[' P128 23 '],' 3uc0 ':[' P09866 '],' 4l5f ':[' Q8BE40 '],' 4m9t ':[' Q91H74 '],' 4m9k ':[' Q91H74 '],' 4m9i ':[' Q91H74 '],' 2of6 ':[' P14335 '],' 2px5 ':[' P05769 '],' 4m9m ':[' Q91H74 '],' 4m9f ':[' Q91H74 '],' 3j0b ':[' Q9Q6P4 '],' 5a1z ':[' G9FRP5 '],' 4r8r ':[' C1KBQ3 '],' 4r8s ':[' C1KBQ3 '],' 1l9k ':[' P12823 '],' 1svb ':[' P14336 '],' 4r8t ':[' O90417 '],' 2hfz ':[' P14335 '],' 2v6j ':[' Q32ZD5 '],' 3zko ':[' P12823 '],' 2ggv ':[' P06935 '],' 2v6i ':[' Q32ZD5 '],' 3u1j ':[' Q5UB51 '],' 3u1i ':[' Q5UB51 '],' 4oig ':[' P17763 '],' 4ala ':[' Q7TGD1 '],' 3P97 ':[' P27915 '],' 3p8z ':[' P27915 '],' 2pxc ':[' P05769 '],' 4gsx ':[' P17763 '],' 2pxa ':[' P05769 '],' 4oii ':[' Q9Q6P4 '],' 1bef ':[' Q9Q4T1 '],' 3evc ':[' P03314 '],' 3j05 ':[' Q689G3 '],' 3egp ':[' Q9J7C6 '],' 2yol ':[' P06935 '],' 2v8o ':[' P05769 '],' 4r05 ':[' C1KBQ3 '],' 1n6g ':[' P14336 '],' 3lkz ':[' Q9Q6P4 '],' 4cau ':[' Q689G3 '],' 2px2 ':[' P05769 '],' 2gg1 ':[' P29837 '],' 4al8 ':[' P17763 '],' 2px4 ':[' P05769 '],' 3lkw ':[' P17763 '],' 2r69 ':[' P18356 '],' 2r6p ':[' Q66394 '],' 3j6s ':[' Q6DLV0 '],' 3j6u ':[' Q6DL V0 '],' 1sfk ':[' P14335 '],' 1z66 ':[' P29837 '],' 3uaj ':[' P09866 '],' 3iyw ':[' Q9Q6P4 '],' 3j35 ':[' E7FLK7 '],' 4k6m ':[' P27395 '],' 2fom ':[' Q91H74 '],' 3vws ':[' Q6DLV0 '],' 3vtt ':[' P27915 '],' 3iya ':[' P18356 '],' 2p5p ':[' P06935 '],' 2hg0 ':[' Q91R00 '],' 2jqm ':[' Q6DV88 '],' 2p41 ':[' Q9WLZ5 '],' 4v0r ':[' Q6DLV0 '],' 4tpl ':[' Q5SBG8 '],' 1yks ':[' P03314 '],' 4bz1 ':[' Q7TGC7 '],' 4bz2 ':[' Q7TGC7 '],' 1thd ':[' P12823 '],' 2m0s ':[' Q9YKL3 '],' 4cbf ':[' E0WXI2 '],' 3ixx ':[' Q3I100 '],' 3ixy ':[' P18356 '],' 2px8 ':[' P05769 '],' 1ztx ':[' Q91KZ4 '],' 2fp7 ':[' P06935 '],' 4uif ':[' E0WXJ3 '],' 4uih ':[' P14340 '],' 3uzq ':[' P27909 '],' 4c11 ':[' Q6DLV0 '],' 1p58 ':[' Q9WDA7 '],' 4cct ':[' G3F5K5 '],' 2r29 ':[' P29991 '],' 2p40 ':[' Q9WLZ5 '],' 1na4 ':[' P14336 '],' 1ymf ':[' P03314 '],' 3uzv ':[' P07564 '],' 1r6r ':[' P12823 '],' 3c5x ':[' Q6H1E5 '],' 2xbm ':[' C0LMU5 '],' 3g7t ':[' Q689G3 '],' 2g05 ':[' P06935 '],' 1r6a ':[' P12823 '],' 3uze ':[' P27915 '],' 2whx ':[' Q2YHF0 '],' 3p54 ':[' P27395 '],' 1k4r ':[' C3V005 '],' 3i50 ':[' Q9Q6P4 '],' 3c6d ':[' Q3BC Y5 '],' 3c6e ':[' Q3BCY5 '],' 4o6c ':[' Q9Q6P4 '],' 4o6b ':[' P29990 '],' 4o6d ':[' Q9Q6P4 '],' 2ijo ':[' P06935 '],' 2wa2 ':[' Q8QL64 '],' 1tc7 ':[' P06935 '],' 3j27 ':[' P14340 '],' 2wa1 ':[' Q8QL64 '],' 3gcz ':[' Q7T918 '],' 2p3q ':[' Q20IJ2 '],' 2jsf ':[' P18356 '],' 3we1 ':[' P09866 '],' 1df9 ':[' P14340 '],' 4gt0 ':[' P17763 '],' 3c6r ':[' P18356 '],' 3j2p ':[' P14340 '],' 3irc ':[' Q9J7C6 '],' 2oy0 ':[' Q9Q6P4 '],' 3uyp ':[' Q2YHF0 '],' 2qeq ':[' P14335 '],' 2jv6 ':[' Q6DV88 '],' 2qid ':[' P14340 '],' 1oan ':[' P12823 '],' 1oam ':[' P12823 '],' 2b6b ':[' Q9WDA7 '],' 2bmf ':[' Q91H74 '],' 2i69 ':[' Q80QJ9 '],' 2j7w ':[' Q6DLV0 '],' 4v0q ':[' Q6DLV0 '],' 1yzo ':[' P29838 '],' 1s6n ':[' Q913C7 '],' 4oie ':[' Q9Q6P4 '],' 2bhr ':[' Q91H74 '],' 3j6t ':[' Q6DLV0 '],' 2p3o ':[' Q9WLZ5 '],' 4ctk ':[' A9LIE0 '],' 4ctj ':[' A9LIE0 '],' 2j7u ':[' Q6DLV0 '],' 1pjw ':[' Q9J0X3 '],' 1uzg ':[' P27915 '],' 2h0p ':[' P09866 '],' 2wzq ':[' Q2YHF0「]}}

正如你所看到的,1sfk被覆蓋,它應該有3個個人的UniProt代碼

+0

請創建一個**最小**工作示例,顯示您的問題。 – Jasper

回答

3

有你有問題(如對方的回答也預示)兩個地方 -

  1. 首先是你寫dDomainSeqSumSWS[domain][pdb2]爲空列表 - dDomainSeqSumSWS[domain][pdb2]=[]

  2. 二是在條件 - if pdb2 not in dDomainSeqSumSWS: - 這將永遠是False,因爲pdb2dDomainSeqSumSWS[domain]字典不dDomainSeqSumSWS字典的關鍵。

你其實並不需要或者上面的東西,而不是你應該看看dict.setdefault,這是本作。實施例 -

for domain in dDomainSeqSum.keys():# CHANGE TO COMPRESS FILE 
    dDomainSeqSumSWS[domain]={} 
    for pdb in dDomainSeqSum[domain]:#add sws of a pdb in a variable and later add that variable to the domain thing 
     pdb2 = pdb[:4] #you do not need to convert to list for indexing and you can slice the first four characters off. 
     dDomainSeqSumSWS[domain][pdb2]=[] 
     for i in range(len(PDBSum)): #make pdb3 search and then compare to the pdb stored 
      if pdb in PDBSum[i]: 
       if "SWS_ID" in PDBSum[i]: 
        line = PDBSum[i].split() 
        dDomainSeqSumSWS[domain].setdefault(pdb2,[]).append(line[2]) 

dict.setdefault需要key作爲第一個參數和默認值作爲第二個參數,並將該值,如果密鑰不存在在字典中,並返回該值。否則,如果密鑰存在於字典中,它只會返回該值的值。

另外,我改變了,你轉換pdb到不需要索引編list()(你可以索引串)的線,你可以使用切片從字符串取前四個字符。

+0

謝謝,我曾嘗試過使用dic.setdefault,但我無法正確使用它。您的洞察力幫助了我很多,並幫助我使代碼更簡單。我非常感謝您在糾正和提高我的計劃方面的幫助。謝謝 –

3
dDomainSeqSumSWS[domain][pdb2]=[] 

有你覆蓋以前的列表。你應該檢查PDB2關鍵在dDomainSeqSumSWS [域]字典已經存在。

+0

謝謝,我完全忘了它。你的建議可以幫助我很多! –