MPI_Allreduce mixes elements in the sum

Posted 2019-07-07 15:08

Question:

I am parallelising a Fortran code which works with no problems in a non-MPI version. Below is an excerpt of the code.

Every processor does the following:

  • For a certain number of particles, it evolves certain quantities in the loop "do 203"; over a given interval divided into Nint subintervals (j=1,Nint), every processor produces an element of the vectors Nx1(j), Nx2(j).
  • Then, the vectors Nx1(j), Nx2(j) are sent to the root (mype = 0), which in every subinterval (j=1,Nint) sums the contributions from every processor: Nx1(j) from processor 1 + Nx1(j) from processor 2, and so on. The root sums for every value of j (every subinterval) and produces Nx5(j), Nx6(j).

Another issue is that if I deallocate the variables, the code hangs after the end of the calculation without completing the execution, but I don't know whether this is related to the MPI_Allreduce issue.

    include "mpif.h"
    ...
    integer*4 ....
    ...
    real*8 
    ...
    call MPI_INIT(mpierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, npe, mpierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, mype, mpierr)

!       Allocate variables
    allocate(Nx1(Nint),Nx5(Nint))
    ...

!       Parameters
    ...

    call MPI_Barrier (MPI_COMM_WORLD, mpierr)

!   Loop on particles

    do 100 npartj=1,npart_local

     call init_random_seed() 
     call random_number (rand)

    ...
!   Initial condition
    ... 
    do 203 i=1,1000000  ! loop for time evolution of single particle

        if(ufinp.gt.p1.and.ufinp.le.p2)then 
         do j=1,Nint  ! spatial position at any momentum
          ls(j) = lb+(j-1)*Delta/Nint !Left side of sub-interval across shock
          rs(j) = ls(j)+Delta/Nint
          if(y(1).gt.ls(j).and.y(1).lt.rs(j))then !position-ordered
            Nx1(j)=Nx1(j)+1 
          endif 
         enddo
        endif
       if(ufinp.gt.p2.and.ufinp.le.p3)then 
        do j=1,Nint  ! spatial position at any momentum
          ls(j) = lb+(j-1)*Delta/Nint !Left side of sub-interval across shock
          rs(j) = ls(j)+Delta/Nint
          if(y(1).gt.ls(j).and.y(1).lt.rs(j))then !position-ordered
            Nx2(j)=Nx2(j)+1 
          endif 
        enddo
       endif
203  continue 
100    continue     
    call MPI_Barrier (MPI_COMM_WORLD, mpierr)

    print*,"To be summed"
    do j=1,Nint
       call MPI_ALLREDUCE (Nx1(j),Nx5(j),npe,mpi_integer,mpi_sum,
     &      MPI_COMM_WORLD, mpierr)
           call MPI_ALLREDUCE (Nx2(j),Nx6(j),npe,mpi_integer,mpi_sum,
     &          MPI_COMM_WORLD, mpierr)
     enddo 

    if(mype.eq.0)then
     do j=1,Nint
       write(1,107)ls(j),Nx5(j),Nx6(j)
     enddo 
107  format(3(F13.2,2X,i6,2X,i6))   
    endif 
    call MPI_Barrier (MPI_COMM_WORLD, mpierr)
    print*,"Now deallocate"
!   deallocate(Nx1)  !inserting the de-allocate
!   deallocate(Nx2)

    close(1)

    call MPI_Finalize(mpierr)

    end


!  Subroutines
    ...

Answer 1:

Then, the vectors Nx1(j), Nx2(j) are sent to the root (mype = 0), which in every subinterval (j=1,Nint) sums the contributions from every processor: Nx1(j) from processor 1 + Nx1(j) from processor 2, and so on. The root sums for every value of j (every subinterval) and produces Nx5(j), Nx6(j).

This is not what an allreduce does. A reduction means the summation is done in parallel across all processes; an allreduce means that all processes get the result of that sum.
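To make the distinction concrete, here is a minimal runnable sketch (the buffer names contrib and total are illustrative, not taken from your code): MPI_Reduce leaves the element-wise sum only on the root rank, which is what your description above asks for, while MPI_Allreduce leaves the same sum on every rank.

    program reduce_vs_allreduce
      include "mpif.h"
      integer, parameter :: n = 4
      integer :: contrib(n), total(n)
      integer :: mype, mpierr

      call MPI_INIT(mpierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mype, mpierr)

      contrib = mype + 1   ! each rank's local contribution

      ! Element-wise sum, result ends up only on rank 0 (the root):
      call MPI_REDUCE(contrib, total, n, MPI_INTEGER, MPI_SUM, 0, &
                      MPI_COMM_WORLD, mpierr)

      ! Element-wise sum, result ends up on every rank:
      call MPI_ALLREDUCE(contrib, total, n, MPI_INTEGER, MPI_SUM, &
                         MPI_COMM_WORLD, mpierr)

      if (mype.eq.0) print *, total   ! each entry equals the sum over all ranks

      call MPI_Finalize(mpierr)
    end program reduce_vs_allreduce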

Your MPI_Allreduces:

   call MPI_ALLREDUCE (Nx1(j),Nx5(j),npe,mpi_integer,mpi_sum, &
     &                 MPI_COMM_WORLD, mpierr)
   call MPI_ALLREDUCE (Nx2(j),Nx6(j),npe,mpi_integer,mpi_sum, &
     &                 MPI_COMM_WORLD, mpierr)

It actually looks like the count should be 1 here. This is because count states how many elements each process contributes to the (element-wise) reduction, not how many contributions there will be in total across processes.
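So if you did keep the loop, each call would, for instance, reduce a single integer with a count of 1 (using the same variables as in your code):

   do j=1,Nint
     ! one element per call: count = 1, summed across all ranks
     call MPI_ALLREDUCE (Nx1(j), Nx5(j), 1, mpi_integer, mpi_sum, &
       &                 MPI_COMM_WORLD, mpierr)
     call MPI_ALLREDUCE (Nx2(j), Nx6(j), 1, mpi_integer, mpi_sum, &
       &                 MPI_COMM_WORLD, mpierr)
   enddo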

However, you do not need that loop at all, because an allreduce can handle multiple elements at once: the reduction is applied element-wise over the whole buffer. Thus, instead of the loop with your allreduces, I believe you actually want something like:

   integer :: Nx1(nint)
   integer :: Nx2(nint)
   integer :: Nx5(nint)
   integer :: Nx6(nint)

   call MPI_ALLREDUCE (Nx1, Nx5, nint, mpi_integer, mpi_sum, &
     &                 MPI_COMM_WORLD, mpierr)
   call MPI_ALLREDUCE (Nx2, Nx6, nint, mpi_integer, mpi_sum, &
     &                 MPI_COMM_WORLD, mpierr)

Nx5 will then contain the element-wise sum of Nx1 across all processes, and Nx6 the sum of Nx2. The information in your question is a little sparse, so I am not quite sure whether this is what you are looking for.
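And since in the code you posted only mype.eq.0 ever writes the results, an MPI_Reduce with root 0 would also be enough; only rank 0 would then hold the totals, for example:

   ! Only rank 0 needs the totals here, so a reduce to root 0 suffices:
   call MPI_REDUCE (Nx1, Nx5, nint, mpi_integer, mpi_sum, 0, &
     &              MPI_COMM_WORLD, mpierr)
   call MPI_REDUCE (Nx2, Nx6, nint, mpi_integer, mpi_sum, 0, &
     &              MPI_COMM_WORLD, mpierr)

On the other ranks, Nx5 and Nx6 are then left untouched, which is fine since they are never read there.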



Tags: arrays mpi