I am trying to find all the primes not greater than n
using the Sieve of Eratosthenes algorithm. I have the following code, with the sieve implemented once with a vector and once with a C array, and I have found that the C array version is almost always faster.
Using vector:
#include <vector>
#include <cstring>
#include <cmath>
using namespace std;

int countPrimes_vector(int n) {
    int res = 0;
    vector<char> bitmap(n);
    memset(&bitmap[0], '1', bitmap.size() * sizeof(bitmap[0]));
    // vector<bool> bitmap(n, true); // Using this one is even slower!
    for (int i = 2; i < n; ++i) {
        if (bitmap[i] == '1') ++res;
        if (sqrt(n) > i)
        {
            for (int j = i * i; j < n; j += i) bitmap[j] = '0';
        }
    }
    return res;
}
Using C array:
int countPrimes_array(int n) {
    int res = 0;
    bool* bitmap = new bool[n];
    memset(bitmap, true, sizeof(bool) * n);
    for (int i = 2; i < n; ++i) {
        if (bitmap[i]) ++res;
        if (sqrt(n) > i)
        {
            for (int j = i * i; j < n; j += i) bitmap[j] = false;
        }
    }
    delete[] bitmap;
    return res;
}
The test code:
#include <ctime>
#include <iostream>

clock_t t;
t = clock();
int a;
for (int i = 0; i < 10; ++i) a = countPrimes_vector(8000000);
t = clock() - t;
cout << "time for vector = " << t << endl;

t = clock();
int b;
for (int i = 0; i < 10; ++i) b = countPrimes_array(8000000);
t = clock() - t;
cout << "time for array = " << t << endl;
The output:
time for vector = 32460000
time for array = 29840000
I have tested many times, and the C array is always faster. What is the reason behind this?
I have often heard that the performance of vector and C array is the same, and that vector should always be used because it is a standard container. Is this statement true, or at least generally true? In what cases should a C array be preferred?
EDIT:
As the comments below suggest, after turning on optimization with -O2 or -O3 (originally the code was compiled with just g++ test.cpp), the time difference between vector and C array disappears; on some occasions vector is even faster than the C array.
Your comparison contains inconsistencies which would explain the differences (the vector version stores '0'/'1' characters while the array version stores bools), and another factor is compiling without sufficient optimization. Some implementations add a lot of extra code in debug builds of the STL; for instance, MSVC performs bounds checking on vector element accesses, which produces a significant slowdown in debug builds.
The following code shows a MUCH closer performance between the two, and the difference is probably just a lack of samples (ideone has a timeout limit of 5s).
#include <vector>
#include <cmath>
#include <cstring>
int countPrimes_vector(int n) {
    int res = 0;
    std::vector<bool> bitmap(n, true);
    for (int i = 2; i < n; ++i) {
        if (bitmap[i])
            ++res;
        if (sqrt(n) > i)
        {
            for (int j = i * i; j < n; j += i) bitmap[j] = false;
        }
    }
    return res;
}
int countPrimes_carray(int n) {
    int res = 0;
    bool* bitmap = new bool[n];
    memset(bitmap, true, sizeof(bool) * n);
    for (int i = 2; i < n; ++i) {
        if (bitmap[i]) ++res;
        if (sqrt(n) > i)
        {
            for (int j = i * i; j < n; j += i) bitmap[j] = false;
        }
    }
    delete[] bitmap;
    return res;
}
#include <chrono>
#include <iostream>
using namespace std;
void test(const char* description, int (*fn)(int))
{
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::milliseconds;

    auto start = clock::now();
    int a;
    for (int i = 0; i < 9; ++i)
        a = fn(8000000); // call the function under test, not a hard-coded one
    auto end = clock::now();

    auto diff = std::chrono::duration_cast<ms>(end - start);
    std::cout << "time for " << description << " = " << diff.count() << "ms\n";
}
int main()
{
    test("carray", countPrimes_carray);
    test("vector", countPrimes_vector);
}
Live demo: http://ideone.com/0Y9gQx
time for carray = 2251ms
time for vector = 2254ms
On some runs the carray was 1-2 ms slower; again, that comes down to insufficient samples on a shared resource.
--- EDIT ---
In your main comments you ask "why optimization can make a difference".
std::vector<bool> v = { 1, 2, 3 };
bool b[] = { 1, 2, 3 };
We have two "array"s of 3 elements, so consider the following:
v[10]; // illegal!
b[10]; // illegal!
Debug versions of the STL can often catch this at run time (and, in some scenarios, at compile time). With the raw array, the out-of-bounds access may simply read bad data or crash.
Additionally, the STL is implemented using many small member-function calls to things like size(), and because vector is a class, [] is actually routed through a function call (operator[]).
The compiler can eliminate many of these, but that's optimization. If you don't optimize, then something like
std::vector<int> v;
v[10];
does something roughly like:
int* data() { return M_.data_; }

v.operator[](size_t idx = 10) {
    if (idx >= this->size()) {
        raise exception("invalid [] access");
    }
    return *(data() + idx);
}
Even though data() is an "inlinable" function, the unoptimized build leaves the call in place to make debugging easier. When you build with optimization, the compiler recognizes that the implementations of these functions are so trivial that it can substitute them directly into the call sites, and it quickly simplifies all of the above into a plain array-access-like operation.
For example, in the above case, it may first reduce operator[] to
v.operator[](size_t idx = 10) {
    if (idx >= this->size()) {
        raise exception("invalid [] access");
    }
    return *(M_.data_ + idx);
}
And since compiling without debugging probably removes the bounds check, it becomes
v.operator[](size_t idx = 10) {
    return *(M_.data_ + idx);
}
so now the inliner can reduce
x = v[1];
to
x = *(v.M_.data_ + 1); // comparable to v.M_.data_[1];
There is still a tiny penalty. The C array involves the data block in memory plus a single local variable, which fits in a register, pointing to that block; your accesses are directly relative to it.
With a vector, though, you have the vector object itself, which holds a pointer to the data plus size and capacity variables:
vector<T> // pseudo code
{
    T* ptr;
    size_t size;
    size_t capacity;
};
If you were counting machine instructions, the vector has three variables to initialize and manage.
When you write
x = v[1];
given the above approximation of vector, you are saying something along the lines of:
T* ptr = v.data();
x = ptr[1];
but when building with optimization the compiler is usually smart enough to recognize that it can do the first line once, before the loop, though this tends to cost a register:
T* ptr = v.data(); // in debug, a function call; otherwise inlined
for ... {
    x = ptr[1];
}
So you're probably looking at a handful of extra machine instructions per iteration of your test loop, or, on a modern processor, maybe a nanosecond or two of extra wall time.