Haskell Thrift library 300x slower than C++ in performance test

Posted 2019-03-07 23:24

I'm building an application with two components: a server written in Haskell and a client written in Qt (C++). I'm using Thrift for communication between them, and I wonder why it is so slow.

I made a performance test, and here are the results on my machine.

Results

C++ server and C++ client:

Sending 100 pings                     -    13.37 ms
Transferring 1000000 size vector      -   433.58 ms
Received: 3906.25 kB
Transferring 100000 items from server -  1090.19 ms
Transferring 100000 items to server   -   631.98 ms

Haskell server and C++ client:

Sending 100 pings                     -  3959.97 ms
Transferring 1000000 size vector      - 12481.40 ms
Received: 3906.25 kB
Transferring 100000 items from server - 26066.80 ms
Transferring 100000 items to server   -  1805.44 ms

Why is Haskell so slow in this test? How can I improve its performance?

Here are the files:

Files

performance.thrift

namespace hs test
namespace cpp test

struct Item {
    1: optional string    name
    2: optional list<i32> coordinates
}

struct ItemPack {
    1: optional list<Item>     items
    2: optional map<i32, Item> mappers
}


service ItemStore {
    void ping()
    ItemPack getItems(1:string name, 2: i32 count) 
    bool     setItems(1: ItemPack items)

    list<i32> getVector(1: i32 count)
}

Main.hs

{-# LANGUAGE ScopedTypeVariables #-}   
module Main where

import           Data.Int  
import           Data.Maybe (fromJust) 
import qualified Data.Vector as Vector
import qualified Data.HashMap.Strict  as HashMap
import           Network

-- Thrift libraries
import           Thrift.Server

-- Generated Thrift modules
import Performance_Types
import ItemStore_Iface
import ItemStore


i32toi :: Int32 -> Int
i32toi = fromIntegral

itoi32 :: Int -> Int32
itoi32 = fromIntegral

port :: PortNumber
port = 9090

data ItemHandler = ItemHandler

instance ItemStore_Iface ItemHandler where
    ping _                   = return () --putStrLn "ping"
    getItems _ mtname mtsize = do 
        let size = i32toi $ fromJust mtsize
            item i = Item mtname (Just $ Vector.fromList $ map itoi32 [i..100])
            items = map item [0..(size-1)]
            itemsv = Vector.fromList items 
            mappers = zip (map itoi32 [0..(size-1)]) items 
            mappersh = HashMap.fromList mappers
            itemPack = ItemPack (Just itemsv) (Just mappersh)
        putStrLn "getItems"
        return itemPack

    setItems _ _             = do putStrLn "setItems"
                                  return True

    getVector _ mtsize       = do putStrLn "getVector"
                                  let size = i32toi $ fromJust mtsize
                                  return $ Vector.generate size itoi32

main :: IO ()
main = do
    _ <- runBasicServer ItemHandler process port 
    putStrLn "Server stopped"

ItemStore_client.cpp

#include <iostream>
#include <chrono>
#include "gen-cpp/ItemStore.h"

#include <transport/TSocket.h>
#include <transport/TBufferTransports.h>
#include <protocol/TBinaryProtocol.h>

using namespace apache::thrift;
using namespace apache::thrift::protocol;
using namespace apache::thrift::transport;

using namespace test;
using namespace std;

#define TIME_INIT  std::chrono::steady_clock::time_point start, stop; \
                   std::chrono::steady_clock::duration duration;
#define TIME_START start = std::chrono::steady_clock::now(); 
#define TIME_END   duration = std::chrono::steady_clock::now() - start; \
                   std::cout << chrono::duration <double, std::milli> (duration).count() << " ms" << std::endl;

int main(int argc, char **argv) {

    boost::shared_ptr<TSocket> socket(new TSocket("localhost", 9090));
    boost::shared_ptr<TTransport> transport(new TBufferedTransport(socket));
    boost::shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));

    ItemStoreClient server(protocol);
    transport->open();

    TIME_INIT

    long pings = 100;
    cout << "Sending " << pings << " pings" << endl;
    TIME_START
    for(auto i = 0 ; i< pings ; ++i)
        server.ping();
    TIME_END


    long vectorSize = 1000000;

    cout << "Transfering " << vectorSize << " size vector" << endl;
    std::vector<int> v;
    TIME_START
    server.getVector(v, vectorSize);
    TIME_END
    cout << "Recieved: " << v.size()*sizeof(int) / 1024.0 << " kB" << endl;


    long itemsSize = 100000;

    cout << "Transfering " << itemsSize << " items from server" << endl;
    ItemPack items;
    TIME_START
    server.getItems(items, "test", itemsSize);
    TIME_END


    cout << "Transfering " << itemsSize << " items to server" << endl;
    TIME_START
    server.setItems(items);
    TIME_END

    transport->close();

    return 0;
}

ItemStore_server.cpp

#include "gen-cpp/ItemStore.h"
#include <thrift/protocol/TBinaryProtocol.h>
#include <thrift/server/TSimpleServer.h>
#include <thrift/transport/TServerSocket.h>
#include <thrift/transport/TBufferTransports.h>

#include <map>
#include <vector>

using namespace ::apache::thrift;
using namespace ::apache::thrift::protocol;
using namespace ::apache::thrift::transport;
using namespace ::apache::thrift::server;


using namespace test;
using boost::shared_ptr;

class ItemStoreHandler : virtual public ItemStoreIf {
  public:
    ItemStoreHandler() {
    }

    void ping() {
        // printf("ping\n");
    }

    void getItems(ItemPack& _return, const std::string& name, const int32_t count) {

        std::vector <Item> items;
        std::map<int, Item> mappers;

        for(auto i = 0 ; i < count ; ++i){
            std::vector<int> coordinates;
            for(auto c = i ; c< 100 ; ++c)
                coordinates.push_back(c);

            Item item;
            item.__set_name(name);
            item.__set_coordinates(coordinates);

            items.push_back(item);
            mappers[i] = item;
        }

        _return.__set_items(items);
        _return.__set_mappers(mappers);
        printf("getItems\n");
    }

    bool setItems(const ItemPack& items) {
        printf("setItems\n");
        return true;
    }

    void getVector(std::vector<int32_t> & _return, const int32_t count) {
        for(auto i = 0 ; i < count ; ++i)
            _return.push_back(i);
        printf("getVector\n");
    }
};

int main(int argc, char **argv) {
    int port = 9090;
    shared_ptr<ItemStoreHandler> handler(new ItemStoreHandler());
    shared_ptr<TProcessor> processor(new ItemStoreProcessor(handler));
    shared_ptr<TServerTransport> serverTransport(new TServerSocket(port));
    shared_ptr<TTransportFactory> transportFactory(new TBufferedTransportFactory());
    shared_ptr<TProtocolFactory> protocolFactory(new TBinaryProtocolFactory());

    TSimpleServer server(processor, serverTransport, transportFactory, protocolFactory);
    server.serve();
    return 0;
}

Makefile

GEN_SRC := gen-cpp/ItemStore.cpp gen-cpp/performance_constants.cpp gen-cpp/performance_types.cpp
GEN_OBJ := $(patsubst %.cpp,%.o, $(GEN_SRC))

THRIFT_DIR := /usr/local/include/thrift
BOOST_DIR := /usr/local/include

INC := -I$(THRIFT_DIR) -I$(BOOST_DIR)

.PHONY: all clean

all:   ItemStore_server ItemStore_client

%.o: %.cpp
    $(CXX) --std=c++11 -Wall -DHAVE_INTTYPES_H -DHAVE_NETINET_IN_H $(INC) -c $< -o $@

ItemStore_server: ItemStore_server.o $(GEN_OBJ) 
    $(CXX) $^ -o $@ -L/usr/local/lib -lthrift -DHAVE_INTTYPES_H -DHAVE_NETINET_IN_H

ItemStore_client: ItemStore_client.o $(GEN_OBJ)
    $(CXX) $^ -o $@ -L/usr/local/lib -lthrift -DHAVE_INTTYPES_H -DHAVE_NETINET_IN_H

clean:
    $(RM) *.o ItemStore_server ItemStore_client

Compile and run

I generate the files (using Thrift 0.9) with:

$ thrift --gen cpp performance.thrift
$ thrift --gen hs performance.thrift

Compile with

$ make
$ ghc Main.hs gen-hs/ItemStore_Client.hs gen-hs/ItemStore.hs gen-hs/ItemStore_Iface.hs gen-hs/Performance_Consts.hs gen-hs/Performance_Types.hs -Wall -O2

Run Haskell test:

$ ./Main& 
$ ./ItemStore_client

Run C++ test:

$ ./ItemStore_server&
$ ./ItemStore_client

Remember to kill the server after each test.

Update

I edited the getVector method to use Vector.generate instead of Vector.fromList, but it had no effect.
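
For reference, the change looks like this; the "before" line is a reconstruction, since only the updated version appears in Main.hs above:

getVector _ mtsize       = do putStrLn "getVector"
                              let size = i32toi $ fromJust mtsize
                              -- before (reconstructed): build a list, then pack it into a vector
                              -- return $ Vector.fromList $ map itoi32 [0 .. size - 1]
                              -- after (as in Main.hs above): build the vector directly
                              return $ Vector.generate size itoi32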

Update 2

Following @MdxBhmt's suggestion, I tested the getItems function as follows:

getItems _ mtname mtsize = do let size = i32toi $! fromJust mtsize
                                  item i = Item mtname (Just $!  Vector.enumFromN (i::Int32) (100- (fromIntegral i)))
                                  itemsv = Vector.map item  $ Vector.enumFromN 0  (size-1)
                                  itemPack = ItemPack (Just itemsv) Nothing 
                              putStrLn "getItems"
                              return itemPack

which is strict and improves the Vector generation, compared with the alternative based on my original implementation:

getItems _ mtname mtsize = do let size = i32toi $ fromJust mtsize
                                  item i = Item mtname (Just $ Vector.fromList $ map itoi32 [i..100])
                                  items = map item [0..(size-1)]
                                  itemsv = Vector.fromList items 
                                  itemPack = ItemPack (Just itemsv) Nothing
                              putStrLn "getItems"
                              return itemPack

Notice that no HashMap is sent. The first version takes 12338.2 ms and the second 11698.7 ms; no speedup :(

Update 3

I reported an issue on the Thrift JIRA.

Update 4 by abhinav

This is completely unscientific, but using GHC 7.8.3 with Thrift 0.9.2 and @MdxBhmt's version of getItems, the discrepancy is significantly reduced.

C++ server and C++ client:

Sending 100 pings:                     8.56 ms
Transferring 1000000 size vector:      137.97 ms
Received:                              3906.25 kB
Transferring 100000 items from server: 467.78 ms
Transferring 100000 items to server:   207.59 ms

Haskell server and C++ client:

Sending 100 pings:                     24.95 ms
Received:                              3906.25 kB
Transferring 1000000 size vector:      378.60 ms
Transferring 100000 items from server: 233.74 ms
Transferring 100000 items to server:   913.07 ms

Multiple executions were performed, restarting the server each time. The results are reproducible.

Note that the source code from the original question (with @MdxBhmt's getItems implementation) will not compile as-is. The following changes will have to be made:

getItems _ mtname mtsize = do let size = i32toi $! fromJust mtsize
                                  item i = Item mtname (Just $!  Vector.enumFromN (i::Int32) (100- (fromIntegral i)))
                                  itemsv = Vector.map item  $ Vector.enumFromN 0  (size-1)
                                  itemPack = ItemPack (Just itemsv) Nothing 
                              putStrLn "getItems"
                              return itemPack

getVector _ mtsize       = do putStrLn "getVector"
                              let size = i32toi $ fromJust mtsize
                              return $ Vector.generate size itoi32

5 Answers
地球回转人心会变 · 2019-03-07 23:46

This is fairly consistent with what user13251 says: the Haskell implementation of Thrift performs a large number of small reads.

E.g., in Thrift.Protocol.Binary:

readI32 p = do
    bs <- tReadAll (getTransport p) 4
    return $ Data.Binary.decode bs

Let's ignore the other odd bits and focus on that for now. It says: "to read a 32-bit int, read 4 bytes from the transport, then decode this lazy bytestring."

The transport method reads exactly 4 bytes using the lazy bytestring hGet. hGet does the following: allocate a 4-byte buffer, then use hGetBuf to fill it. hGetBuf may use an internal buffer, depending on how the Handle was initialized.

So there might be some buffering. Even so, this means Thrift for Haskell performs the read/decode cycle for each integer individually, allocating a small memory buffer each time. Ouch!

I don't really see a way to fix this without the Thrift library being modified to perform larger bytestring reads.
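
To make the contrast concrete, here is a minimal sketch of the bulk-read idea. readI32Block is a hypothetical helper, not something the library provides; it assumes only the Thrift.Transport API quoted above (tReadAll) and the binary package:

-- Sketch only: batch n Int32 reads into a single tReadAll call and decode
-- them together, instead of one read/decode round trip per element.
import           Control.Monad    (replicateM)
import           Data.Binary.Get  (getWord32be, runGet)
import           Data.Int         (Int32)
import           Thrift.Transport

readI32Block :: Transport t => t -> Int -> IO [Int32]
readI32Block t n = do
    bs <- tReadAll t (4 * n)   -- one transport read for all n integers
    return $ runGet (replicateM n (fmap fromIntegral getWord32be)) bs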

Then there are other oddities in the Thrift implementation, such as using a type class for what is really a structure of methods. Type classes can look like, act like, and sometimes even be implemented as a structure of methods, but they should not be treated as such; see the "Existential Typeclass" antipattern.

One odd part of the test implementation:

  • Generating a list of Ints, only to immediately convert them to Int32s, only to immediately pack them into a Vector of Int32s. Generating the vector directly would be sufficient and faster; a short sketch follows.
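
A minimal illustration of that point (viaList and direct are hypothetical names; both build the [i .. 100] coordinates used in the test):

import           Data.Int    (Int32)
import qualified Data.Vector as Vector

viaList :: Int -> Vector.Vector Int32
viaList i = Vector.fromList $ map fromIntegral [i .. 100]   -- list, map, then pack

direct :: Int -> Vector.Vector Int32
direct i = Vector.enumFromN (fromIntegral i) (101 - i)      -- built in one step (for i <= 100)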

Though, I suspect, this is not the primary source of performance issues.

祖国的老花朵 · 2019-03-07 23:47

You should take a look at Haskell profiling methods to find out what resources your program uses and allocates, and where.

The chapter on profiling in Real World Haskell is a good starting point.
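
For example, a profiling build of the server from the question could look like this (standard GHC flags; it assumes the profiling versions of the thrift and vector libraries are installed):

$ ghc Main.hs gen-hs/*.hs -Wall -O2 -prof -fprof-auto -rtsopts
$ ./Main +RTS -p -RTS &

The time and allocation report is written to Main.prof when the server process exits, so kill the server after the client run finishes.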

\"骚年 ilove
4楼-- · 2019-03-07 23:52

Everyone is pointing out that the culprit is the Thrift library, but I'll focus on your code (and where I can help you gain some speed).

Using a simplified version of your code, where you calculate itemsv:

testfunc mtsize =  itemsv
  where size = i32toi $ fromJust mtsize
        -- Nothing stands in for the name field this simplified example drops
        item i = Item Nothing (Just $ Vector.fromList $ map itoi32 [i..100])
        items = map item [0..(size-1)]
        itemsv = Vector.fromList items 

First, a lot of intermediate data is created in item i. Due to laziness, those small, fast-to-compute vectors become delayed thunks of data, when we could have them right away.

Adding two carefully placed $!, which force strict evaluation:

 item i = Item Nothing (Just $! Vector.fromList $! map itoi32 [i..100])

will give you a 25% decrease in runtime (for sizes 1e5 and 1e6).

But there is a more problematic pattern here: you generate a list only to convert it into a vector, instead of building the vector directly.

Look at the last two lines: you create a list, map a function over it, and then transform it into a vector.

Well, vectors are very similar to lists, so you can do the same thing directly: generate a vector and Vector.map over it, and you're done. There is no more need to convert a list into a vector, and mapping over a vector is usually faster than mapping over a list!

So you can get rid of items and rewrite itemsv as follows:

  itemsv = Vector.map item  $ Vector.enumFromN 0  (size-1)

Reapplying the same logic to item i, we eliminate all lists.

testfunc3 mtsize = itemsv
   where 
      size = i32toi $! fromJust mtsize
      item i = Item Nothing (Just $! Vector.enumFromN (i :: Int32) (100 - fromIntegral i))
      itemsv = Vector.map item  $ Vector.enumFromN 0  (size-1)

This gives a 50% decrease over the initial runtime.

啃猪蹄的小仙女 · 2019-03-07 23:56

The Haskell implementation of the basic Thrift server you're using is threaded internally, but you didn't compile it to use multiple cores.

To repeat the test using multiple cores, change the command line for compiling the Haskell program to include -rtsopts and -threaded, then run the final binary like ./Main +RTS -N4 &, where 4 is the number of cores to use.
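
Concretely, starting from the compile line in the question (the +RTS/-RTS wrapper is how RTS flags such as -N are passed to a binary built with -rtsopts):

$ ghc Main.hs gen-hs/ItemStore_Client.hs gen-hs/ItemStore.hs gen-hs/ItemStore_Iface.hs gen-hs/Performance_Consts.hs gen-hs/Performance_Types.hs -Wall -O2 -threaded -rtsopts
$ ./Main +RTS -N4 -RTS &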

虎瘦雄心在 · 2019-03-07 23:57

I don't see any reference to buffering in the Haskell server. In C++, if you don't buffer, you incur one system call for every vector/list element. I suspect the same thing is happening in the Haskell server.

I don't see a buffered transport in Haskell directly. As an experiment, you may want to change both the client and server to use a framed transport. Haskell does have a framed transport, and it is buffered. Note that this will change the wire layout.
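
A minimal sketch of the Haskell side of that experiment, written as if added to Main.hs (so ItemHandler, port, and the generated process are already in scope). It assumes your thrift version ships Thrift.Transport.Framed with an openFramedTransport function; verify that against the installed package, and remember the C++ client must switch to TFramedTransport as well, since the wire layout changes:

-- Sketch only: accept one connection and serve it over a framed transport.
-- ASSUMPTION: Thrift.Transport.Framed / openFramedTransport exist in the
-- installed thrift package.
import Thrift.Protocol.Binary  (BinaryProtocol (..))
import Thrift.Transport.Framed (openFramedTransport)

serveOneFramedClient :: IO ()
serveOneFramedClient = do
    sock <- listenOn (PortNumber port)        -- listenOn/accept come from import Network
    (h, _host, _port) <- accept sock
    framed <- openFramedTransport h           -- reads and writes whole frames
    let proto = BinaryProtocol framed
    _ <- process ItemHandler (proto, proto)   -- handles one request; loop on the Bool for a real server
    return ()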

As a separate experiment, you may want to turn buffering off for C++ and see if the performance numbers are comparable.
