Snakemake: rule for using many inputs for one outp

2019-07-05 19:00发布

问题:

I have a working pipeline I'm using for downloading, aligning and performing variant calling on public sequencing data. The problem is that it can currently only work on a per-sample basis (i.e sample as each individual sequencing experiment). It doesn't work if I want to perform variant calling on a group of experiments (such as biological and/or technical replicates of a sample). I've tried to solving it, but I couldn't get it work.

Here's a simplification of the alignment rule:

rule alignment:
    input:
        rules.download.output.fastq
    output:
        '{group}/alignment/{sample}.bam'
    shell:
        "bash scripts/02_alignment.sh {wildcards.group} {wildcards.samples}"

And the same for the variant calling:

rule variant_calling:
    input:
        rules.alignment.output
    output:
        '{group}/variants/{sample}.vcf.gz'
    shell:
        "bash scripts/03_variant_calling.sh {wildcards.sample} {wildcards.group}"

This works just fine, as there is a single .vcf file generated for each aligned .bam file. But what I would like to do is to generate a single .vcf file from an arbitrary number of .bam files. I have a pandas dataframe containing all the sample names and their corresponding group. I would essentially like to change the output of the second rule to be '{group}/variants/{group}.vcf', but everything I've done has failed in some way.

The idea I had was to provide the rule with all the per-group aligned .bam files as input and then just give the script it runs the directory where they all lie. The problem is that I can't find a way to give an input in this per-group manner: either I give per-sample (as the working pipeline) or I give ALL the .bam files for EACH group variant calling, regardless of what group they actually belong to. I can't use just wildcards, as the {sample} wildcard is not present in the last output. I also tried using functions as input, but that lead to the same problems as above.

The crux of the matter seem to be the layers of grouping: if I wanted to perform variant calling on all aligned .bam files in the dataset as a whole, that would probably work fine, give my above mentioned problems. The problem comes with sub-groups for the dataset as a whole:

  sample1      sample2             sample1      sample2      sample3
     |            |                   |            |            |
     |            |                   |            |            |
     --------------                   ---------------------------
            |                                      |
            |                                      |
          group1                                 group2

Any ideas on how to solve this?

回答1:

You have to use some kind of structure to keep your samples in groups:

GROUPS = {
    "group1":["sample1","sample2"],
    "group2":["sample1","sample2","sample3"]
}

then something like this:

rule all:
    input:
         expand("{group}/variants/{group}.vcf.gz", group=list(GROUPS.keys()))

rule alignment:
    input:
        rules.download.output.fastq
    output:
        '{group}/alignment/{sample}.bam'
    shell:
        "bash scripts/02_alignment.sh {wildcards.group} {wildcards.samples}"

rule variant_calling:
    input:
        lambda wildcards: expand("{group}/alignment/{sample}.bam", group=wildcards.group, sample=GROUPS[wildcards.group]
    output:
        '{group}/variants/{group}.vcf.gz'
    shell:
        "bash scripts/03_variant_calling.sh {input} {output}"

of course there are missing rules that you didn't show but I think you'll get the point !

The shell command from rule variant_calling might be hard to handle but you can alway define a directory as params like:

params: groupAlignDir = "{group}/alignment"

and use that in the shell:

"bash scripts/03_variant_calling.sh {params.groupAlignDir} {output}"

and get all the bam file in the directory in your script "variant_calling.sh"