2013-11-21 18 views
-1

Get random regions from single sites

I have a file (FileA) containing n genomic regions in the following format:

Chromosome Start End Length Number 
chr1  100 400 300 6... 

I have another, much larger file (FileB, control data) that contains single sites in the following format:

Chromosome Site  
chr1 105 
chr1 110... 

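From FileB I want to select random entries based on the first dataset. That is, for each region in the first file, I want to obtain a random region from the second dataset with the same length and the same number of sites, but at a random position.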

For example:

Chromosome Start End Length Number 
chr5  350 650 300 6... 

What I have so far is:

import random

List = []
NewList = []
LineCount = 0

# Copy the header of FileB to the output and keep every site line in memory
for Line in FileB:
    if LineCount == 0:
        OutFile.write(Line)
    else:
        List.append(Line)
    LineCount += 1


for Line in FileA:
    Chr, Start, End, Len, Entries = Line.strip("\n").split("\t")[:5]
    RandomStart = random.sample(List, 1)
    ## here I need a way to keep adding sequential lines to NewList until the last site minus the first site is close to Len
    ## then I need to convert NewList into the format Chr, Start, End, Length, Number, write it out, and clear NewList
+1

Can you post the code you have tried so far? – mdml

+0

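If the second dataset contains only sites (no length and number), how do you find regions with the same length and number in it? Or do you want to take a region from the first set, pick a random site from the second set, and turn it into a region that starts at that site and has the length and number of the first region? – Hyperboreus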

+0

What is a "position", and where is that information in the dataset? – duhaime

Answers

0

I actually solved this problem myself and am posting the main part of my code:

import random

def get_regions(i, Chr, Start, End, Len):
    # Walk forward from line i while the look-ahead span (End - Start)
    # stays positive and no larger than Len + 15.
    n = EndN = 0
    while 0 < (End - Start) <= int(Len) + 15:
        End = int(Dict[i + 1].split("\t")[2])
        EndN = int(Dict[i].split("\t")[2])
        i += 1
        n += 1
    # Write the region if its length is within 15 of Len;
    # otherwise retry from a new random line.
    if int(Len) - 15 <= (EndN - Start) <= int(Len) + 15:
        OutFile.write(Chr + '\t' + str(Start) + '\t' + str(EndN) + '\t' + str(n) + '\t' + str(int(EndN) - int(Start)) + '\n')
    else:
        Chr, Start, End, i = get_random(Keys)

def get_random(Keys):
    # Pick a random line of FileB as the candidate start of a region.
    i = random.sample(Keys, 1)[0]
    Chr = Dict[i].split('\t')[0]
    Start = int(Dict[i].split('\t')[1])
    End = int(Dict[i + 10].split('\t')[2])
    get_regions(i, Chr, Start, End, Len)
    return Chr, Start, End, i


InFile = open(FileB, 'r')
OutFile = open(OutFile, 'w')
Dict = {}
LineCount = 0

# Store every non-header line of FileB under a 0-based index
for Line in InFile:
    if LineCount > 0:
        Dict[LineCount - 1] = Line
    LineCount += 1


LineCount = 0
DiffFile = open(FileA, "r")
for Line in DiffFile:
    if LineCount == 0:
        Header = Line
        OutFile.write(Header)
    else:
        Entries, Len = Line.strip("\n").split("\t")[3:5]
        # list() because random.sample needs a sequence on recent Python 3 versions
        Keys = list(Dict.keys())
        Chr, Start, End, i = get_random(Keys)
    LineCount += 1
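
The same retry idea can also be written as a plain loop instead of two functions that call each other, which avoids deep recursion when many random starts fail. This is only a sketch under assumptions not in the code above: the function name `random_region`, its parameters, and the `(chrom, pos)` tuple layout are hypothetical, the sites are assumed to be sorted by chromosome and then position, and the tolerance of 15 is taken from the code above. Like that code, it matches on length only and reports the number of sites it found.

```python
import random

def random_region(sites, target_len, tol=15, max_tries=1000, rng=random):
    """Return (chrom, start, end, length, n_sites), or None if nothing fits.

    sites: list of (chrom, pos) tuples sorted by chromosome, then position
    (a hypothetical in-memory form of FileB).
    """
    for _ in range(max_tries):
        # Pick a random site as the candidate start of the region.
        i = rng.randrange(len(sites))
        chrom, start = sites[i]
        j = i
        # Extend while the next site is on the same chromosome and
        # still within target_len + tol of the start.
        while (j + 1 < len(sites)
               and sites[j + 1][0] == chrom
               and sites[j + 1][1] - start <= target_len + tol):
            j += 1
        end = sites[j][1]
        length = end - start
        # Accept only if the span is within tol of the requested length.
        if target_len - tol <= length <= target_len + tol:
            return chrom, start, end, length, j - i + 1
    return None
```

Calling `random_region(sites, 300)` once per line of FileA would give one candidate region per input region; a `None` result means no suitable window was found within `max_tries` attempts.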
1

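If you want to find all regions in B that have the same length and number as a region in A (and assuming A and B are TSV files), you could do something like this: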

# pathToFileA and pathToFileB are placeholders for the paths to the two TSV files
fileA = open(pathToFileA).read()
fileB = open(pathToFileB).read()

out = open("foundMatches.tsv", "w")

splitA = fileA.split("\n")
splitB = fileB.split("\n")

for genomicRegionA in splitA:
    splitRegionsA = genomicRegionA.split("\t")
    if len(splitRegionsA) < 5:
        continue  # skip blank or incomplete lines
    chromosomeA = splitRegionsA[0]
    startA = splitRegionsA[1]
    endA = splitRegionsA[2]
    lengthA = splitRegionsA[3]
    numberA = splitRegionsA[4]

    for genomicRegionB in splitB:
        splitRegionsB = genomicRegionB.split("\t")
        if len(splitRegionsB) < 5:
            continue  # skip blank or incomplete lines
        chromosomeB = splitRegionsB[0]
        startB = splitRegionsB[1]
        endB = splitRegionsB[2]
        lengthB = splitRegionsB[3]
        numberB = splitRegionsB[4]

        # Keep every pair of regions whose length and number both match
        if lengthA == lengthB:
            if numberA == numberB:
                out.write(str(chromosomeA) + "\t" + str(startA) + "\t" + str(endA) + "\t" + str(lengthA) + "\t" + str(numberA) + "\t" + str(chromosomeB) + "\t" + str(startB) + "\t" + str(endB) + "\t" + str(lengthB) + "\t" + str(numberB) + "\n")

out.close()

You can then pick a random sample from the output file. (If your dataset is large, you will want something more elegant.)
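
For a large B, one possible way to avoid comparing every region in A with every region in B is to group the B regions by (length, number) once and then draw a random match per A region. The sketch below assumes, as this answer does, that both files have the five columns Chromosome, Start, End, Length, Number and have already been split into lists of strings; the names `index_regions` and `pick_matches` are made up for illustration.

```python
import random
from collections import defaultdict

def index_regions(rows):
    """Group rows by their (length, number) columns (indexes 3 and 4)."""
    index = defaultdict(list)
    for row in rows:
        index[(row[3], row[4])].append(row)
    return index

def pick_matches(regions_a, regions_b, rng=random):
    """For each region in A, pick one random region in B with the same
    length and number, or None if B has no such region."""
    index = index_regions(regions_b)
    picks = []
    for region in regions_a:
        candidates = index.get((region[3], region[4]))
        picks.append(rng.choice(candidates) if candidates else None)
    return picks
```

Building the index takes one pass over B, and each lookup is a dictionary access, so the work grows with the size of A plus the size of B rather than with their product.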