Distributed logs in Cassandra

Published 2019-09-01 04:07

I am looking for a way to store application logs in Cassandra.

I have a three-node setup (Node 1, Node 2, and Node 3). My web application runs as a cluster across all three nodes behind a load balancer, so logs are generated on every node.

Cassandra also runs on all three nodes, and logs from all three web application instances are written into the Cassandra cluster, which is partitioned by day.

Problems with this approach:
1) I am using my web application itself to write the data to Cassandra.
2) For each daily partition, the amount of data is very high.

So is there a better approach for this?

Is this a good design approach?

1 Answer

冷血范 · 2019-09-01 04:12

The choice of storing logs in Cassandra is debatable: analyzing that data becomes difficult, though it is doable. ELK (Elasticsearch-Logstash-Kibana) or Splunk are more popular choices for log analysis because of their native full-text search support and dashboards.

Having said that, let's look at the problems at hand.

1) I am using my web application itself to write the data to Cassandra.

The suggestions that come to my mind here are:

  • Are the writes being done asynchronously? Recommended.
  • What consistency level is used for these writes? The higher the consistency, the slower the web app gets, since it waits that much longer on C* (assuming synchronous writes). Remember that C* can still have RF = 3 while you write at consistency level ONE.
  • What happens if the C* cluster goes down? Does the web app go down along with it?
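The decoupling suggested above can be sketched with a bounded in-memory buffer drained by a background worker, so request handlers never block on C* and survive a cluster outage. This is a minimal stand-in: `sink` here is any callable, where real code would call the driver (e.g. an async write at consistency ONE), and the drop-on-full policy is one assumed choice among several (spooling to local disk is another).

```python
import queue
import threading

class AsyncLogWriter:
    """Buffers log entries and ships them to storage off the request path.

    `sink` is a stand-in for the real Cassandra write call; here it is
    any callable taking one log entry.
    """

    def __init__(self, sink, max_buffer=10_000):
        self._queue = queue.Queue(maxsize=max_buffer)
        self._sink = sink
        self.dropped = 0  # entries discarded because the buffer was full
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, entry):
        """Called from request handlers; never blocks on the log store."""
        try:
            self._queue.put_nowait(entry)
        except queue.Full:
            self.dropped += 1  # alternative: spool to local disk

    def _drain(self):
        while True:
            entry = self._queue.get()
            if entry is None:  # shutdown sentinel
                break
            try:
                self._sink(entry)  # real code: C* write, retry with backoff
            except Exception:
                self.dropped += 1  # store is down; app keeps serving

    def close(self):
        self._queue.put(None)
        self._worker.join()
```

The key property is that a slow or dead C* cluster only costs you log entries (a deliberate trade-off), never web-app availability.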

2) For each daily partition, the amount of data is very high.

  • There are two problems here: fat partitions, and the same node being hit for the entire day (resulting in hot spots). The workload isn't distributed across the cluster.
  • The partition can be shrunk to an hour instead of a day. But that only reduces the footprint from a day to an hour; it is still a hot spot for that hour.
  • You could add a second-level partition component to get a uniform distribution of data across nodes and avoid huge partitions (how fine depends on how chatty the app is). But this is where the merits of C* for log monitoring become questionable.
  • What are all the queries that C* would need to answer? How would you aggregate across the second-level partitions to answer the questions that arise during typical log analysis?
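The second-level partitioning above can be sketched as a composite partition key of (day, hour, bucket), where the bucket is a hash of some request attribute modulo a fixed count. The function name, the choice of hashing the source node, and the bucket count of 16 are all illustrative assumptions, not a prescription:

```python
import hashlib
from datetime import datetime

N_BUCKETS = 16  # assumed; tune so each (day, hour, bucket) partition stays small

def partition_key(source_node: str, ts: datetime, n_buckets: int = N_BUCKETS):
    """Compute a composite partition key (day, hour, bucket).

    Bucketing by a hash of the originating node (or any other request
    attribute) spreads one hour's writes across n_buckets partitions,
    and therefore across replicas, instead of one hot partition.
    """
    digest = hashlib.md5(source_node.encode()).digest()
    bucket = digest[0] % n_buckets  # stable: same node -> same bucket
    return (ts.strftime("%Y-%m-%d"), ts.hour, bucket)
```

Note the trade-off: writes are now spread out, but any query for "all logs in hour X" must touch every bucket.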

Revisit the design starting from the full list of log-analysis questions (queries) that this C* database would have to answer. The table design should then fall out naturally.
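The cost of bucketing shows up exactly here, on the query side: one logical question ("all logs for hour X") fans out into one read per bucket, merged client-side. A sketch under the bucketed-key assumption above, with `fetch_bucket` standing in for a single-partition C* SELECT that returns rows already sorted by their clustering column (which C* guarantees):

```python
import heapq

def read_hour(fetch_bucket, day, hour, n_buckets=16):
    """Fan one logical query out over all bucket partitions and merge.

    `fetch_bucket(day, hour, bucket)` is a stand-in for one C* SELECT
    on a single (day, hour, bucket) partition; each result must be
    sorted by timestamp, so a streaming heap-merge keeps global order.
    """
    per_bucket = (fetch_bucket(day, hour, b) for b in range(n_buckets))
    return list(heapq.merge(*per_bucket))
```

If most of your log-analysis queries need this kind of client-side fan-out and aggregation, that is a signal the workload wants a search-oriented store (ELK/Splunk) rather than C*.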
