2017-06-15 115 views
2

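Multiple inputs and outputs in a single rule Snakemake file

I am getting started with Snakemake and I have a very basic question that I could not find answered in the Snakemake tutorial.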

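I want to create a Snakefile with a single rule that downloads multiple files one by one under Linux. As I understand it, 'expand' cannot be used in the output because the files need to be downloaded one at a time, and wildcards cannot be used because this is the target rule.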

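The only approach I can think of is something like the following, which does not work properly. I cannot figure out how to use {output} to send the downloaded items to a specific directory under specific names, such as "downloaded_files.dwn", so that they can be used in later steps: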

links=[link1,link2,link3,....] 
rule download:
    output:
        "outdir/{downloaded_file}.dwn"
    params:
        shellCallFile='callscript',
    run:
        callString=''
        for item in links:
            callString+='wget str(item) -O '+{output}+'\n'
        call('echo "' + callString + '\n" >> ' + params.shellCallFile, shell=True)
        call(callString, shell=True)

I would appreciate any hint on how this should be solved and on which part of Snakemake I have misunderstood.

+1

If you run snakemake without the '-j' option, only one rule instance will run at any given time. Do the files need to be downloaded in a precise order? – bli

+0

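Also, it is common to have a first "all" rule that has only an input section, and you can use expand there. This rule will drive the rest of the workflow. – bli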

+0

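Is there a pattern in the link names that could be used to determine the names of the downloaded files? Keep in mind that Snakemake is designed to exploit regularities in file names. – bli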

Answer

3

Here is a commented example that may help you solve your problem:

# Create some way of associating output files with links 
# The output file names will be built from the keys: "chain_{key}.gz" 
# One could probably directly use output file names as keys 
links = { 
    "1" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz", 
    "2" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz", 
    "3" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"} 


rule download:
    output:
        # We inform snakemake that this rule will generate
        # the following list of files:
        # ["outdir/chain_1.gz", "outdir/chain_2.gz", "outdir/chain_3.gz"]
        # Note that we don't need to use {output} in the "run" or "shell" part.
        # This list will be used if we later add rules
        # that use the files generated by the present rule.
        expand("outdir/chain_{n}.gz", n=links.keys())
    run:
        # The sort is there to ensure the files are in the 1, 2, 3 order.
        # We could use an OrderedDict if we wanted an arbitrary order.
        for link_num in sorted(links.keys()):
            shell("wget {link} -O outdir/chain_{n}.gz".format(link=links[link_num], n=link_num))
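
To check which file names and commands the rule above amounts to, here is a plain-Python sketch that runs outside Snakemake. It only imitates, with `str.format`, what `expand` yields for this single-wildcard pattern; `build_outputs` and `build_commands` are hypothetical helper names (not part of Snakemake), and nothing is actually downloaded.

```python
# Same association between keys and links as in the Snakefile above.
links = {
    "1": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "2": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "3": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz",
}


def build_outputs(pattern, keys):
    """Hypothetical helper: fill the {n} placeholder once per key."""
    return [pattern.format(n=key) for key in keys]


def build_commands(link_map):
    """Hypothetical helper: one wget command per key, in sorted key order."""
    return [
        "wget {link} -O outdir/chain_{n}.gz".format(link=link_map[n], n=n)
        for n in sorted(link_map)
    ]


outputs = build_outputs("outdir/chain_{n}.gz", sorted(links))
commands = build_commands(links)

# Print the commands instead of executing them.
for command in commands:
    print(command)
```

Printing the commands first is a cheap way to verify the key-to-file mapping before letting the rule run `wget` for real.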

Here is another way of doing it, using arbitrary names for the downloaded files and making use of output (albeit somewhat artificially):

links = [ 
    ("foo_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz"), 
    ("bar_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz"), 
    ("baz_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz")] 


rule download:
    output:
        # We inform snakemake that this rule will generate
        # the following list of files:
        # ["outdir/foo_chain.gz", "outdir/bar_chain.gz", "outdir/baz_chain.gz"]
        ["outdir/{f}".format(f=filename) for (filename, _) in links]
    run:
        for i in range(len(links)):
            # output is a list, so we can access its items by index
            shell("wget {link} -O {chain_file}".format(
                link=links[i][1], chain_file=output[i]))
        # using a direct loop over the pairs (filename, link)
        # could be considered "cleaner"
        # for (filename, link) in links:
        #     shell("wget {link} -O outdir/{filename}".format(
        #         link=link, filename=filename))

Here is an example in which the three downloads can run in parallel, using snakemake -j 3:

# To use os.path.join, 
# which is more robust than manually writing the separator. 
import os 

# Association between output files and source links 
links = { 
    "foo_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz", 
    "bar_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz", 
    "baz_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"} 


# Make this association accessible via a function of wildcards 
def chainfile2link(wildcards): 
    return links[wildcards.chainfile] 


# First rule will drive the rest of the workflow 
rule all:
    input:
        # expand generates the list of the final files we want
        expand(os.path.join("outdir", "{chainfile}"), chainfile=links.keys())


rule download:
    output:
        # We inform snakemake what this rule will generate
        os.path.join("outdir", "{chainfile}")
    params:
        # using a function of wildcards in params
        link = chainfile2link,
    shell:
        """
        wget {params.link} -O {output}
        """
+0

Thanks, bli, for the excellent solution. One more question: can this rule also be modified to download the links in parallel? – user3015703

+1

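To run in parallel, you can move the 'expand' into the 'input' of an 'all' rule, remove the 'for' loop from the 'run' section, and use '-j'. The 'all' rule will cause the 'download' rule to run once for each wanted file. I will add another example, but you may want to try it yourself in the meantime. – bli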

+1

@user3015703 I have added an example for parallel downloads. – bli
