June 21, 2020
All technology companies collect raw data, whether as part of the business or to be used as metrics for monitoring. However, data collection is not an easy task. DevOps engineers have to pay attention to many variables that may interfere with the company’s processing pipeline.
Questions include, but are not limited to:
Whenever you are building your company’s data collection pipeline, it is imperative to define which variables are important and how they should be treated.
To collect and deliver data as seamlessly as possible, you will probably be using one of the four most popular data collection services available in this ecosystem.
To make things easier for you, we’ve made a comparison of all four services in order to determine whether they satisfy some of the questions above. Our main focus is to benchmark their performance, including their CPU usage, for the same amount of data and time frame.
We’ve compared the four most popular data collection services, namely:
- Logstash
- Filebeat
- Fluent Bit
- Fluentd
image by Trink.io
All of these platforms are commonly known as "log shippers", "log forwarders", or "log aggregators", and all of them are open source. They also share the same plug-in architecture: Input > Filter > Output
Note: If you want to set up your own test environment, make sure to benchmark each service yourself before relying on our numbers.
We’ve tested each data collection service on a separate Amazon EC2 c5.xlarge instance running the Amazon Linux 2 AMI. We recommend using a different instance for each service.
Install and configure every collection service on its designated server. Installation is pretty straightforward, involving just a few command-line steps, which can be found in each service’s technical documentation. Make sure to visit each website to get the latest version of the service.
Pay attention to the resources available on the given server. We’ve used a c5.xlarge instance type, which provides four vCPUs, and ran nothing else on our instances. In a few of the tests, the software’s CPU usage climbed to 250% (meaning it occupied two and a half of the four available vCPUs).
If you run other services on the same server and don't want them impaired, consider limiting the CPU usage of the collection software (note that this can affect the performance of the software itself).
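On a systemd-based distribution such as Amazon Linux 2, one way to cap a service's CPU usage is a CPUQuota drop-in. The unit name, file path, and quota below are examples, not part of our benchmark setup:

```ini
# /etc/systemd/system/fluentd.service.d/limits.conf (example path)
[Service]
# cap the service at 1.5 of the 4 vCPUs
CPUQuota=150%
```

After adding the drop-in, run `systemctl daemon-reload` and restart the service for the limit to take effect.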
Although it ships empty, the Logstash configuration is easy to set up. The syntax is very straightforward and the software works almost right out of the box. The configuration we’ve used can be found in this article's appendix.
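For illustration only (this is not our appendix configuration; the paths and hosts are placeholders), a minimal Logstash pipeline that tails files on disk and ships them to Elasticsearch looks like this:

```conf
input {
  file {
    path => "/var/log/test/*.log"
    start_position => "beginning"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
```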
Filebeat's configuration has a very simple syntax, too. Better still, it ships with many commented-out examples, so you can quickly uncomment your desired features and change the parameters. We'd say it is the easiest of the four services to configure.
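As a sketch (paths and hosts are placeholders, not our benchmark configuration), a minimal filebeat.yml tailing files and shipping to Elasticsearch is just a few lines:

```yaml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/test/*.log

output.elasticsearch:
  hosts: ["localhost:9200"]
```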
In our eyes, Fluent Bit takes second place for ease of configuration. Its default configuration also includes some examples, so we had it running within just a couple of minutes.
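An equivalent Fluent Bit setup is similarly compact. The snippet below is a hedged sketch with placeholder paths and hosts, not our exact appendix configuration:

```conf
[INPUT]
    Name   tail
    Path   /var/log/test/*.log

[OUTPUT]
    Name   es
    Match  *
    Host   localhost
    Port   9200
```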
Although this service is highly popular, configuring Fluentd is a bit cumbersome. The syntax is not very complicated, but it is definitely not easy either: you need to learn its basics, and you'll find yourself going back to the technical documentation often to understand all the constraints. Some of the configuration sections are also not the most intuitive or straightforward.
Moreover, Fluentd’s default settings are not optimized. For example, the tail input's enable_watch_timer setting defaults to a CPU-hungry polling timer, used because it works on all platforms (instead of the well-known inotify mechanism, which is not supported on macOS).
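For illustration, a tail source that switches off the watch timer in favor of inotify-based watching could look like the following (path, pos_file, and tag are placeholders):

```conf
<source>
  @type tail
  path /var/log/test/*.log
  pos_file /var/log/td-agent/test.pos
  tag test.logs
  # disable the CPU-based polling timer; rely on inotify instead
  enable_watch_timer false
  <parse>
    @type none
  </parse>
</source>
```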
Another example is that Fluentd runs a single worker process, which can utilize only one CPU at a time (a consequence of Ruby's global interpreter lock). In order to utilize more CPUs, you have to add more workers. Unfortunately, the tail input plugin, which is very common, only supports one worker.
Splitting the tailed path between a few workers can do the trick functionally, but it adds significant CPU overhead per worker, which makes this service hard to compare with the others.
One last challenge we’ve faced with Fluentd is that its internal event queue fills up very quickly, raising many errors and inflating processing memory roughly 25-fold. This is due to the default number of flush threads (flush_thread_count), which is 1; we’ve changed it to 8. This configuration can be found in the appendix.
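A buffer section along these lines raises the flush thread count. The match pattern and output settings are placeholders for illustration, not our appendix configuration:

```conf
<match test.**>
  @type elasticsearch
  host localhost
  port 9200
  <buffer>
    # default is 1; more threads drain the queue faster
    flush_thread_count 8
  </buffer>
</match>
```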
As mentioned above, we want to compare the services’ performance of collecting and shipping the same amount of data and within the same time frame.
We consider memory a less interesting factor because it is much cheaper than CPU. For example, the highest memory peak in our tests was 400 MB, which is insignificant to overall performance.
We’ve measured CPU usage within the same time frame, namely, averaged over ten minutes and/or until the data was fully shipped.
For every service, we’ve conducted tests varying three factors:
- the rate at which data is written (speed rate)
- the number of files to track
- whether a filter is applied
The amount of data is the same across all services. We’ve performed tests by writing data to files on disk, per second, at three different speed rates. Each test runs for ten minutes, so the duration multiplied by the speed rate gives the total amount of data.
The common input case for all of these services is tailing existing files or a directory on disk. We’ve tested three different values for the number of files to track.
We’ve tested each platform once without a filter, to show its performance without any further processing, and once with a simple filter that loads a script (written in a language the service supports) and does two things. We’ve included the filter tests because running extra code increases CPU usage, demonstrating how filters may affect overall performance.
When applying a filter, each service uses a different language for its scripting filter plugin. To test the services with filters, we’ve used Ruby for Logstash, JavaScript for Filebeat, Lua for Fluent Bit, and Ruby for Fluentd.
We’ve run every permutation of the four data collection services across the three above-mentioned factors: the same amount of data and the same time frame, at three different speed rates, with various numbers of files to track, and with or without an additional filter.
The results can be found below, shown first without a filter and then with one.
A note about Fluentd results:
As mentioned in the Configuration section above, Fluentd runs only one worker when using the tail plugin, which means it can utilize at most one CPU. The other data collection services can utilize more than one CPU, which is why we see CPU usage such as 150%, meaning one and a half CPUs fully occupied.
Trying to add more workers, by splitting the path between a couple of input plugins, results in much higher CPU usage (probably due to some overhead for every worker).
To normalize results, we’ve done the following:
Because Fluentd is limited to one CPU, it takes more time to collect and ship the data. Thus, we measure the time it takes Fluentd to process all the data and scale its CPU usage accordingly. For example, if data arrives at 10 mbps for ten minutes but Fluentd needs twelve minutes to ship it, we report Fluentd as needing 120% CPU to process the same data.
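This normalization is a one-line calculation; the sketch below uses our own function name, which is not part of any benchmark tooling:

```python
def normalized_cpu(measured_cpu_pct: float,
                   expected_minutes: float,
                   actual_minutes: float) -> float:
    """Scale measured CPU usage by the extra time a CPU-bound
    service needed to ship the same amount of data."""
    return measured_cpu_pct * (actual_minutes / expected_minutes)

# Fluentd pinned at 100% of one core finishes in 12 minutes
# instead of the expected 10, so we report it as 120% CPU.
print(normalized_cpu(100.0, 10.0, 12.0))  # → 120.0
```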
Since Fluent Bit’s performance is much better than Fluentd’s, yet it lacks many log-aggregation features, it is possible to install Fluent Bit on every node and use it as a log forwarder to Fluentd (one per cluster). This architecture keeps the CPU footprint low across the board (the same pattern is common with Filebeat forwarding to Logstash); on the other hand, it complicates the architecture.
Despite Fluent Bit’s leading performance, showing the lowest CPU footprint of all four services, it still ends up using almost 50% of one CPU in a 20 mbps environment with simple filters applied and 500 files to track.
This is where we step in. At Trink.io, we make it possible to get rid of the downsides commonly experienced with existing data collection services.
Not only does our platform enable simple client deployment with barely any footprint on your nodes (the client only ships data and never needs to read it), it also provides a much more convenient and configurable experience: you control your entire architecture through our centralized dashboard.
Want to try it out for free? Sign up right away and get started.