Snakemake - Override LSF (bsub) cluster config in

2019-06-05 06:06发布

问题:

Is it possible to define default settings for memory and resources in cluster config file, and then override in rule specific manner, when needed? Is resources field in rules directly tied to cluster config file? Or is it just a fancy way for params field for readability purposes?

In the example below, how do I use default cluster configs for rule a, but use custom changes (memory=40000 and rusage=15000) in rule b?

cluster.json:

{
    "__default__":
    {
        "memory": 20000,
        "resources": "\"rusage[mem=8000] span[hosts=1]\"",
        "output": "logs/cluster/{rule}.{wildcards}.out",
        "error": "logs/cluster/{rule}.{wildcards}.err"
    },
}

Snakefile:

rule all:
    'a_out.txt', 'b_out.txt'

rule a:
    input:
        'a.txt'
    output:
        'a_out.txt'
    shell:
        'touch {output}'

rule b:
    input:
        'b.txt'
    output:
        'b_out.txt'
    shell:
        'touch {output}'

Command for execution:

 snakemake --cluster-config cluster.json 
           --cluster "bsub -M {cluster.memory} -R {cluster.resources} -o logs.txt" 
           -j 50

I understand that it is possible to define rule specific resources requirements in cluster config file, but I would prefer to define them directly in Snakefile, if possible.

Or else, if there is a better way of implementing this, please let me know.

回答1:

You can directly add resources to each of your rules :

rule all:
    'a_out.txt' , 'b_out.txt'

rule a:
    input:
        'a.txt'
    output:
        'a_out.txt'
    resources:
        mem_mb=40000
    shell:
        'touch {output}'
rule b:
    input:
        'b.txt'
    output:
        'b_out.txt'
    resources:
        mem_mb=20000
    shell:
        'touch {output}'

And then, you should remove the resources parameter from your .json, so that the command line would not override the snakefile:

new.cluster.json:

{
    "__default__":
    {
        "output": "logs/cluster/{rule}.{wildcards}.out",
        "error": "logs/cluster/{rule}.{wildcards}.err"
    },
}


回答2:

In new.cluster.json you can actually define resources for specific rules. So in your case you would do the following

{
    "__default__":
    {
        "memory": 20000,
        "resources": "\"rusage[mem=8000] span[hosts=1]\"",
        "output": "logs/cluster/{rule}.{wildcards}.out",
        "error": "logs/cluster/{rule}.{wildcards}.err"
    },
    "b":
    {
        "memory": 40000,
        "resources": "\"rusage[mem=15000] span[hosts=1]\"",
        "output": "logs/cluster/{rule}.{wildcards}.out",
        "error": "logs/cluster/{rule}.{wildcards}.err"
    },
}

Then in the Snakefile you can refer to these resources by importing new.cluster.json and referring to it in your rule

import json

with open('new.cluster.json') as fh:
    cluster_config = json.load(fh)

rule all:
    'a_out.txt' , 'b_out.txt'

rule a:
    input:
        'a.txt'
    output:
        'a_out.txt'
    shell:
        'touch {output}'
rule b:
    input:
        'b.txt'
    output:
        'b_out.txt'
    resources:
        mem_mb=cluster_config["b"]["memory"]
    shell:
        'touch {output}'

If you take a look through this repository, you can see how I use these cluster configs in the wild.