Why is strchr twice as fast as my SIMD code?

I am learning SIMD and was curious whether I could beat strchr at finding a character. strchr appears to use the same intrinsics, but I assume it also checks for the null terminator, whereas I know the character is in the array and plan to skip that check.

My code is:

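#include <immintrin.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t N = 1e9;
    bool found = false; // Not really used ...
    size_t char_index1 = 0;
    size_t char_index2 = 0;
    char *str = malloc(N);
    memset(str, 'a', N);

    __m256i char_match;
    __m256i str_simd;
    __m256i result;
    __m256i *pSrc1;

    int simd_mask;

    // Place the single target character halfway through the buffer.
    str[(size_t)5e8] = 'b';

    char_match = _mm256_set1_epi8('b'); // broadcast 'b' to all 32 byte lanes
    result = _mm256_set1_epi32(0);

    simd_mask = 0;

    pSrc1 = (__m256i *)str;

    // Scan 32 bytes per iteration. There is no bounds or null check:
    // the loop relies on 'b' being present in the buffer.
    while (1) {
        str_simd = _mm256_lddqu_si256(pSrc1);             // unaligned 32-byte load
        result = _mm256_cmpeq_epi8(str_simd, char_match); // 0xFF in each matching lane
        simd_mask = _mm256_movemask_epi8(result);         // one mask bit per byte lane
        if (simd_mask != 0) {
            break;
        }
        pSrc1++;
    }

    free(str);
}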
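The loop above stops at the first 32-byte block that contains a match but does not yet turn the mask into a byte index. A minimal sketch of that last step follows, assuming GCC or Clang (for `__builtin_ctz`, `__builtin_cpu_supports`, and the `target` attribute). The names `find_known_char` and `find_known_char_avx2` are hypothetical and not taken from the gist, and the same precondition as in the question applies: the character is known to be present.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: AVX2 scan for a byte that is known to be present.
 * Assumption: the whole 32-byte block containing the match is readable. */
__attribute__((target("avx2")))
static size_t find_known_char_avx2(const char *str, char c)
{
    const __m256i needle = _mm256_set1_epi8(c);
    size_t offset = 0;
    for (;;) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(str + offset));
        unsigned mask =
            (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle));
        if (mask != 0) {
            /* Bit i of the mask corresponds to byte i of the chunk, so the
             * number of trailing zero bits is the offset of the first match. */
            return offset + (size_t)__builtin_ctz(mask);
        }
        offset += 32;
    }
}
#endif

/* Hypothetical wrapper: uses AVX2 when the CPU reports it, otherwise a
 * plain byte loop. Neither path checks bounds or a null terminator. */
size_t find_known_char(const char *str, char c)
{
    assert(str != NULL);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return find_known_char_avx2(str, c);
#endif
    size_t i = 0;
    while (str[i] != c)
        i++;
    return i;
}
```

The `target("avx2")` attribute is only there so the sketch compiles without `-mavx2`; with the flags from the question it could be dropped.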

Full (not yet finished) code at: https://gist.github.com/JimHokanson/433e185ba53b41e49ce3ac804568ac1e

strchr is twice as fast as this code (using gcc and Xcode). I'm hoping to understand why.

Update: compiling using: gcc -std=c11 -mavx2 -mlzcnt

I had not set an optimization flag in the compiler. With -O3, the SIMD code takes only 75% of the time of strchr.

Update: I should also clarify that this is not a final working version of the code. Additional checks still need to be put in place, and there are (I think) further ways to optimize the calls. At this point, though, the code is at least in the ballpark of strchr. As pointed out in the question comments, this version could read past the end of the buffer into an unmapped page and fault. Finally, this is mostly a SIMD learning opportunity for me; memchr is probably your best bet, although I suspect you might be able to beat memchr slightly if you have a sentinel buffer.
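To make the sentinel-buffer idea concrete, here is a rough sketch under stated assumptions: the caller owns a buffer with at least `len + SENTINEL_PAD` writable bytes, and the search is allowed to overwrite that padding with the character being searched for. The unchecked loop then always stops inside the allocation, and a result equal to `len` means "not found". The names `find_with_sentinel`, `scan_avx2`, and `SENTINEL_PAD` are hypothetical, and GCC or Clang is assumed for the builtins.

```c
#include <stddef.h>
#include <string.h>

#define SENTINEL_PAD 32 /* one full AVX2 vector of padding */

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Unchecked AVX2 scan: relies on c occurring somewhere in the buffer. */
__attribute__((target("avx2")))
static size_t scan_avx2(const char *buf, char c)
{
    const __m256i needle = _mm256_set1_epi8(c);
    size_t offset = 0;
    for (;;) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + offset));
        unsigned mask =
            (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle));
        if (mask != 0)
            return offset + (size_t)__builtin_ctz(mask);
        offset += 32;
    }
}
#endif

/* Hypothetical helper. Assumption: buf has len + SENTINEL_PAD writable bytes.
 * Returns the index of the first c in buf[0..len), or len if there is none. */
size_t find_with_sentinel(char *buf, size_t len, char c)
{
    /* Fill the padding with the needle. Any 32-byte load that starts before
     * or at len stays inside the allocation, and the scan is guaranteed to
     * hit a match no later than index len. */
    memset(buf + len, c, SENTINEL_PAD);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return scan_avx2(buf, c);
#endif
    size_t i = 0;
    while (buf[i] != c) /* scalar fallback, also stopped by the sentinel */
        i++;
    return i;
}
```

Whether this actually beats memchr would need measuring. The sketch only shows how the padding removes both the length check and the risk of reading into an unmapped page.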