During trying to achieve the performance with Hyperledger Fabric which IBM team reported in their article Hyperledger Fabric: A Distributed Operating System for Permissioned Blockchains, I faced some problems and errors. I collected all useful information and want to share it with the HF community. Also, I have a couple of questions to the Fabric developers about its performance.
Target description
Hyperledger Fabric v1.1.0 network deployed using Cello on four c5.9xlarge (36vCPU) aws instances:
{
fabric001: {
cas: [],
peers: ["anchor@peer1st.main"],
orderers: ["orderer1st.orderer"],
zookeepers: ["zookeeper1st"],
kafkas: ["kafka1st"]
},
fabric002: {
cas: [],
peers: ["worker@peer2nd.main"],
orderers: ["orderer2nd.orderer"],
zookeepers: ["zookeeper2nd"],
kafkas: ["kafka2nd"]
},
fabric003: {
cas: [],
peers: ["worker@peer3rd.main"],
orderers: ["orderer3rd.orderer"],
zookeepers: ["zookeeper3rd"],
kafkas: ["kafka3rd"]
},
fabric004: {
cas: ["ca1st.main"],
peers: [],
orderers: ["orderer4th.orderer"],
zookeepers: ["zookeeper4th"],
kafkas: ["kafka4th"]
}
}
TLS is disabled.
Fabric channel configuration (all others parameters are the default):
BatchTimeout: 1s
BatchSize:
MaxMessageCount: 500
AbsoluteMaxBytes: 200 MB
PreferredMaxBytes: 50 MB
I performed tests for both CouchDB and LevelDB as a state database. I use official Fabcar chaincode (Golang implementation) for my tests. I created simple nodejs app which interacts with the Fabric network using SDK and exposes HTTP API for load tests. This app is stateless and can be easily scaled. For load testing, I'm using tool YandexTank. I've performed two kinds of tests with high load: query (requests via peer001 to the Fabric state when blockchain is empty) and insert (transactions within the blockchain).
Results
CouchDB as a state database
Query results: https://overload.yandex.net/101153. At ~1100 rps latency starts to increase. But Fabric instance is not loaded and have a lot of free resources. On the figure below you can see CPU and Memory usage by the Fabric network containers on the instance fabric001 during the test. 100% CPU usage means one full vCPU load. Also peer001 prints a lot of similar error logs (not full output, just tiny part, I can share it with you if needed): https://gist.github.com/krabradosty/9780cacc92fcdeaa0c36377a91727ade
Insert results: https://overload.yandex.net/101217. At ~600 rps latency degradation is very fast. Before is slowly, but anyway, exist. CPU and Memory usage of the fabric003 containers on the figure below: A lot of error logs from the peer (again, not full output): https://gist.github.com/krabradosty/3810151b8e101d8279cc705aef22863e
Based on this I can conclude that Fabric Peer has problems with the CouchDB connection under the load.
My questions: Does Fabric comminity know about this bug? Do you have plans how to solve it?
LevelDB as a state database
- Query results: https://overload.yandex.net/102035. CPU and Memory usage of the fabric001 containers on the figure below: There are no any errors from the blockchain, I just see latency degradation.
- Insert results: https://overload.yandex.net/102040. CPU and Memory usage of the fabric001 containers on the figure below: Aggressive latency degradation starts at ~850 rps. No errors from the blockchain.
My questions: What is the cause of this latency degradation? Why I can't achieve 3500 rps performance that IBM report in their article? What plans does Fabric community have on improving the performance?