Serial program runs slower with multiple instances

I have a fortran code that I am using to calculate some quantities related to the work that I do. The code itself involves several nested loops, and requires very little disk I/O. Whenever the code is modified, I run it against a suite of several input files (just to make sure it's working properly).

To make a long story short, the most recent update has increased the run time of the program by about a factor of four, and running each input file serially with one CPU takes about 45 minutes (a long time to wait, just to see whether anything was broken). Consequently, I'd like to run each of the input files in parallel across the 4 cpus on the system. I've been attempting to implement the parallelism via a bash script.

The interesting thing I have noted is that, when only one instance of the program is running on the machine, it takes about three and a half minutes to crank through one of the input files. When four instances of the program are running, it takes more like eleven and a half minute to crank through one input file (bringing my total run time down from about 45 minutes to 36 minutes - an improvement, yes, but not quite what I had hoped for).

I've tried implementing the parallelism using gnu parallel, xargs, wait, and even just starting four instances of the program in the background from the command line. Regardless of how the instances are started, I see the same slow down. Consequently, I'm pretty sure this isn't an artifact of the shell scripting, but something going on with the program itself.

I have tried rebuilding the program with debugging symbols turned off, and also using static linking. Neither of these had any noticeable impact. I'm currently building the program with the following options:

$ gfortran -Wall -g -O3 -fbacktrace -ffpe-trap=invalid,zero,overflow,underflow,denormal -fbounds-check -finit-real=nan -finit-integer=nan -o [program name] {sources}

Any help or guidance would be much appreciated!

On modern CPUs you cannot expect a linear speedup. There are several reasons:

Hyperthreading GNU/Linux will see hyperthreading as a core eventhough it is not a real core. It is more like 30% of a core.
Shared caches If your cores share the same cache and a single instance of your program uses the full shared cache, then you will get more cache misses if you run more instances.
Memory bandwidth A similar case as the shared cache is the shared memory bandwidth. If a single thread uses the full memory bandwidth, then running more jobs in parallel may congest the bandwidth. This can partly be solved by running on a NUMA where each CPU has some RAM that is "closer" than other RAM.
Turbo mode Many CPUs can run a single thread at a higher clock rate than multiple threads. This is due to heat.

All of these will exhibit the same symptom: Running a single thread will be faster than each of the multiple threads, but the total throughput of the multiple threads will be bigger than the single thread.

Though I must admit your case sounds extreme: With 4 cores I would have expected a speedup of at least 2.

How to identify the reason

Hyperthreading Use taskset to select which cores to run on. If you use 2 of the 4 cores is there any difference if you use #1+2 or #1+3?
Turbo mode Use cpufreq-set to force a low frequency. Is the speed now the same if you run 1 or 2 jobs in parallel?
Shared cache Not sure how to do this, but if it is somehow possible to disable the cache, then comparing 1 job to 2 jobs run at the same low frequency should give an indication.