I have a working pipeline I'm using for downloading, aligning and performing variant calling on public sequencing data. The problem is that it can currently only work on a per-sample basis (i.e sample as each individual sequencing experiment). It doesn't work if I want to perform variant calling on a group of experiments (such as biological and/or technical replicates of a sample). I've tried to solving it, but I couldn't get it work.
Here's a simplification of the alignment rule:
rule alignment:
input:
rules.download.output.fastq
output:
'{group}/alignment/{sample}.bam'
shell:
"bash scripts/02_alignment.sh {wildcards.group} {wildcards.samples}"
And the same for the variant calling:
rule variant_calling:
input:
rules.alignment.output
output:
'{group}/variants/{sample}.vcf.gz'
shell:
"bash scripts/03_variant_calling.sh {wildcards.sample} {wildcards.group}"
This works just fine, as there is a single .vcf
file generated for each aligned .bam
file. But what I would like to do is to generate a single .vcf
file from an arbitrary number of .bam
files. I have a pandas
dataframe containing all the sample
names and their corresponding group
. I would essentially like to change the output
of the second rule to be '{group}/variants/{group}.vcf'
, but everything I've done has failed in some way.
The idea I had was to provide the rule with all the per-group aligned .bam
files as input and then just give the script it runs the directory where they all lie. The problem is that I can't find a way to give an input in this per-group manner: either I give per-sample (as the working pipeline) or I give ALL the .bam
files for EACH group variant calling, regardless of what group they actually belong to. I can't use just wildcards, as the {sample}
wildcard is not present in the last output. I also tried using functions as input, but that lead to the same problems as above.
The crux of the matter seem to be the layers of grouping: if I wanted to perform variant calling on all aligned .bam
files in the dataset as a whole, that would probably work fine, give my above mentioned problems. The problem comes with sub-groups for the dataset as a whole:
sample1 sample2 sample1 sample2 sample3
| | | | |
| | | | |
-------------- ---------------------------
| |
| |
group1 group2
Any ideas on how to solve this?