GCC, -flto, -fno-builtin and custom function imple

I'm observing unexpected behaviour (at least I cant find explanation for it) with GCC flag -flto and jemalloc/tcmalloc. Once -flto is used and I link with above libraries malloc/calloc and friends are not replaced by je/tc malloc implementation, the glibc implementation is called. Once I remove -flto flag, everything works as expected. I tried to use -fno-builtin/-fno-builtin-* with -flto but still, it doesnt pick the je/tc malloc implementation.

How the -flto machinery works? Why the binary doesnt pick new implementation? How it even links with -fno-builtin when it should fail on unresolved external for, say, printf?

EDIT001:
GCC 7.3
Sample code

int main()
{
    auto p = malloc(1024);
    free(p);
    return 0;
}

Compilation:

/usr/bin/c++ -O2 -g -DNDEBUG -flto -std=gnu++14 -o CMakeFiles/flto.dir/main.cpp.o -c /home/user/Development/CPPJunk/flto/main.cpp

Linkage:

/usr/bin/c++ -O2 -g -DNDEBUG -flto CMakeFiles/flto.dir/main.cpp.o -o flto -L/home/user/Development/jemalloc -Wl,-rpath,/home/user/Development/jemalloc -ljemalloc

EDIT002:
More suitable sample code

#include <cstdlib>

int main()
{
    auto p = malloc(1024);
    if (p) {
        free(p);
    }

    auto p1 = new int;
    if (p1) {
        delete p1;
    }

    auto p2 = new int[32];
    if (p2) {
        delete[] p2;
    }
    return 0;
}

First, your sample code is wrong. Read carefully the C11 standard n1570. When you want to use the standard malloc, you should #include <stdlib.h>.

In C++11 (read n3337) malloc is frowned upon and should not be used (prefer new). If you still want to use std::malloc in C++ you should #include <cstdlib> (which, in GCC, is internally including <stdlib.h>)

Then your sample code is almost C code (once you replace auto with void*), not C++. It could be optimized (once you include <stdlib.h>), even without -flto but with just -O3, according to the as-if rule, to an empty main. (I've even wrote a public report, bismon-chariot-doc.pdf, which has a section §1.4.2 explaining in several pages how that optimization happens).

To optimize around malloc and free, GCC uses some __attribute__(malloc) function attribute in the declaration (inside <stdlib.h>) of malloc.

How the -flto machinery works?

LTO is explained in GCC internals §25.

It works by using some internal (GIMPLE-like and/or SSA-like) representation of the code both at "compile" and at "link" time (actually, the linking step becomes another compilation with whole-program optimization, so your code gets "compiled" twice in practice).

LTO always should (in practice) be used with some optimization flag (e.g. -O2 or even -O3) both at compile and at link time. So you should compile and link with g++ -flto -O2 (it has no practical sense to use -flto without at least -O2 and the exact same optimization flags should be used at compile and at link time).

More precisely -flto also embeds in the object files some internal (GIMPLE-like) representation of the source code, and that is also used "at link time" (notably for optimization and inlining happening again when "linking" your entire program, re-using its GIMPLE). Actually GCC contains some LTO front-end and compiler called lto1 (in addition of the C++ front-end and compiler called cc1plus) and lto1 is (when you link with g++ -flto -O2) used at link time to reprocess these GIMPLE representations.

Probably, libjemalloc has its own headers, and might have inline (or inlinable) functions. Then you also need to use -flto -O2 when compiling that library from its source code (so that its Gimple is stored in the library)

At last, the fact that the usual malloc gets called is independent of -flto. It is a linker issue, not a compiler one. You could try to link -ljemalloc statically (and then you'll better build that library also with gcc -flto -O2; if you don't build it like that you won't get LTO optimizations across malloc calls).

You could pass also -v to your compilation and linking commands to understand what g++ is doing. You could even pass -Wl,--verbose to ask the ld (started by g++) to be verbose.

Notice that LTO (and the internal representations that it is using) is compiler and version specific. The internal (Gimple & SSA) representation is slightly different between GCC 7 & GCC 8 (and in Clang it is very different, so of course incompatible). The dynamic linker ld-linux(8) does not know about LTO.

PS. You could install the libjemalloc-dev package and add #include <jemalloc/jemalloc.h> in your code. See also jemalloc(3) man page. Probably libjemalloc could be configured or patched to define some je_malloc symbol as a replacement for malloc. Then it would be simpler (for LTO) to use je_malloc in your code (to avoid conflict between several malloc ELF symbols). To learn more about symbols in shared libraries, read Drepper's How to Write Shared Libraries paper. And of course you should expect LTO to change the behavior of linking!