Logging and the Homelab

2023-03-19

"So what you're saying is that your backups have been failing for months, and you have no logs to tell you why?" is a sentence I don't want to be hearing anytime soon.

As a completely unrelated fact, I recently added restic to make backups for the photos and documents I've stored on my homelab (a small Synology NAS running docker). Since then I've noticed that my motivation to set up monitoring for the workloads I have running there has mysteriously increased.

What I'm currently doing is either ssh-ing onto the server and running docker logs for the container I'm interested in, or I'm using the quite bare-bones web interface nomad comes with. It's not the most streamlined experience. Another problem is that logs are lost after the container is removed, and I'd very much like to understand why my backup job fails in case it does. Of course, logging is just one part of a proper observability setup, but it's also the lowest-hanging fruit for me to spend a weekend configuring before life sweeps me away again.

I've worked with a couple of different logging setups at work, but what I'm looking for here is slightly different. We're only talking about a handful of services running so the amount of logs per second is: probably just that, maybe a log per second. I also only have one node, and even if I were to add a Raspberry Pi or two later I don't need high scalability here. My needs are small and it would be easy to go overkill. Instead, I want my lean needs to be met with equally lean resource usage; I would be happy with something in the 100-150MiB range. Maintainability is also a big concern for me: I might not want to look at this for a couple of months, and when I return it should be easy for me to get back into it. So I'd like containers, but the fewer needed the better. A handful of lines of configuration to read and understand would be good, and it should be possible to keep them in a git repo so I don't have to fiddle around in a GUI too much if it needs to be dropped and re-created.

Before we get started you might be thinking: "Hey, if you enjoy playing with your Linux machine and want it to be lean, why don't you just use journald/syslog/pipe to a file?" You'd get log persistence and you can use good old grep all day. Well, I want a nice and quick web interface to look at what's happening, from wherever I happen to be. And I don't know of an application that exposes journald (or the others) directly like that; could be an interesting little project though. The second reason is that I'm running on a Synology NAS, which is a Linux system, but it doesn't use systemd (and therefore journald). Also, no cloud-hosted services: while they work great and have generous free tiers, we still want to have a bit of fun ourselves.

The Logging Stack

Let's take a quick refresher on the logging stack.

I'm going to have some Applications running on the server, mostly long-running services but also the backup batch job. They are all running as docker containers printing their logs to stdout/stderr, so an easy solution would involve just grabbing the logs from the docker socket. If you feel a little bit uncomfortable giving the containers full access to the docker socket, there are other log drivers that can push them to different places. But I'm fine with this for my homelab.
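To make "grabbing the logs from the docker socket" a bit more concrete: for containers started without a TTY, the Docker Engine API returns stdout and stderr multiplexed into one byte stream, where each chunk is prefixed by an 8-byte header (one byte for the stream type, three zero bytes, then the payload size as a big-endian 32-bit integer). The sketch below only decodes such a stream. The function names are my own for illustration, and the log shippers discussed later do this for you.

```python
import struct
from typing import Iterator, Tuple

# Stream type byte used in the frame header of a multiplexed Docker stream.
STREAMS = {0: "stdin", 1: "stdout", 2: "stderr"}


def demux_docker_logs(raw: bytes) -> Iterator[Tuple[str, bytes]]:
    """Split a multiplexed Docker log stream into (stream name, payload) pairs."""
    offset = 0
    while offset + 8 <= len(raw):
        # Header: 1 byte stream type, 3 padding bytes, 4 bytes big-endian size.
        stream_type, size = struct.unpack_from(">BxxxI", raw, offset)
        offset += 8
        payload = raw[offset:offset + size]
        if len(payload) < size:
            break  # truncated frame: wait for more data instead of guessing
        offset += size
        yield STREAMS.get(stream_type, "unknown"), payload


def make_frame(stream_type: int, payload: bytes) -> bytes:
    """Hypothetical helper that builds one frame, only used to try the decoder."""
    return struct.pack(">BxxxI", stream_type, len(payload)) + payload


if __name__ == "__main__":
    data = make_frame(1, b"backup started\n") + make_frame(2, b"repository locked\n")
    for stream, chunk in demux_docker_logs(data):
        print(stream, chunk.decode().rstrip())
```

Keeping stdout and stderr apart is handy later on, since a line on stderr is a decent first guess for "something went wrong" when the application doesn't log a level.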

Now it's Database Time, because once we have some logs we need to store, index and query them. We're going to talk about a few different ones below. If you want to keep it simple, just storing the logs as files (on object storage like s3 if you're feeling fancy) and using ripgrep would also work. I want a nice web interface though, and filtering by services and time ranges. Even though the server is a NAS with plenty of storage, it would also be nice if the logs are stored compressed. So what we're going to have is a service with an API for storing logs and querying them, with a search index to make it all fast (which will cost us a bit of memory usage though).
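For comparison, here is roughly what the "keep it simple" variant could look like: one gzip-compressed file with a JSON object per line, scanned from start to end on every query. This is a sketch under my own assumptions (the field names ts and service and the file layout are made up for illustration), and it shows what we would be giving up, since there is no index and every query reads the whole file.

```python
import gzip
import json
from datetime import datetime
from typing import Dict, Iterator, Optional


def query_logs(path: str, service: Optional[str],
               start: datetime, end: datetime) -> Iterator[Dict[str, str]]:
    """Yield records from a gzip'd JSON-lines file for one service and time range.

    Assumes every line is a JSON object with an ISO 8601 "ts" field and a
    "service" field. Pass service=None to match all services.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            timestamp = datetime.fromisoformat(record["ts"])
            if start <= timestamp < end and service in (None, record["service"]):
                yield record
```

A full scan like this is perfectly fine at one log line per second. The search index in the tools below is what keeps queries fast as the data grows, at the cost of the memory usage we're about to measure.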

The next piece is the one that gets the logs from docker into the database, a Log Shipper/Forwarder. Taking the logs from any number of sources to their destination, optionally transforming them along the way. The transformation might be handy as I'm running a nice little mix of open-source tools that also have a nice mix of logging formats I don't control.
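As an illustration of the kind of transformation I have in mind, the sketch below normalizes three common shapes (JSON, logfmt-style key=value pairs, and plain text) into one small record. It is not how any of the shippers below implement it, and the field names (msg, message, level) are assumptions about what the applications emit.

```python
import json
import shlex
from typing import Dict


def parse_fields(line: str) -> Dict[str, str]:
    """Return the fields of a JSON or logfmt line, or {} for plain text."""
    if line.startswith("{"):
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            return {}
        if isinstance(parsed, dict):
            return {str(key): str(value) for key, value in parsed.items()}
        return {}
    try:
        tokens = shlex.split(line)  # keeps quoted values such as msg="a b" together
    except ValueError:
        return {}  # unbalanced quotes, e.g. an apostrophe in plain text
    pairs = [token.partition("=") for token in tokens]
    # Only treat the line as logfmt if every token looks like key=value.
    if tokens and all(key and sep for key, sep, _ in pairs):
        return {key: value for key, _, value in pairs}
    return {}


def normalize(line: str, service: str) -> Dict[str, str]:
    """Map one raw log line to a record with service, level and message."""
    line = line.strip()
    fields = parse_fields(line)
    message = fields.get("msg") or fields.get("message") or line
    level = (fields.get("level") or "info").lower()
    return {"service": service, "level": level, "message": message}


if __name__ == "__main__":
    print(normalize('{"level": "ERROR", "msg": "snapshot failed"}', "restic"))
    print(normalize('level=warn msg="disk almost full"', "nas"))
    print(normalize("Backup finished in 3s", "restic"))
```

Anything that doesn't parse falls back to the raw line at level info, so a new and unexpected format degrades to "still searchable" instead of being dropped.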

And finally, we need some way to query and analyze the logs. The database usually speaks HTTP/REST/JSON, and while I'm happy to speak those languages I'd prefer a Dashboard filled with scrolling log lines and colorful buttons. A lot of tools could technically fill this job, and the overlap between tools typically seen as Business Intelligence tools (say Metabase or Apache Superset for things you can run at home) and Kibana/Grafana is quite high. I'm looking for something that can do the basics and looks okay doing it. Seeing the logs stream in makes for fast feedback loops while setting new things up, so that would be a bonus.

The Alternatives and the Test

With that, we're ready to look at a few different constellations of these tools. I spent some time reading and researching, thinking over what I used and liked and what would be interesting to play with. We have a couple of stacks coming up, and as I said in the beginning resource usage (CPU, Memory) is my biggest concern, less so query performance or storage used.

To test them I started all of the logging stacks, waited for them to reach a steady state, and hit the play button to start the party: spin up 5 containers that each log 1 line per second for 10 minutes, then replace them with 5 new ones that each log 10 lines per second. This is at the upper end of what I'm expecting for my Homelab, but it should give us some insight into how they scale both up and down.
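A load generator for this kind of test can be very small. The sketch below is not the script from the companion repo, just a minimal stand-in I'm assuming for illustration: it writes logfmt-style lines to stdout at a fixed rate, which is all a container needs to do for Docker to pick them up.

```python
import sys
import time
from typing import Callable, TextIO


def emit_logs(rate_per_second: float, duration_seconds: float,
              out: TextIO = sys.stdout,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Write roughly rate_per_second lines per second for duration_seconds.

    Returns the number of lines written. The sleep function is injectable so
    the loop can be exercised without actually waiting.
    """
    interval = 1.0 / rate_per_second
    total = int(rate_per_second * duration_seconds)
    for sequence in range(total):
        out.write(f'level=info seq={sequence} msg="test log line"\n')
        out.flush()  # make sure the line reaches the log driver right away
        sleep(interval)
    return total


if __name__ == "__main__":
    # First phase of the test for one container: 1 line per second for 10 minutes.
    emit_logs(rate_per_second=1, duration_seconds=600)
```

Since it sleeps a fixed interval after every write, the real rate ends up slightly below the target. That's good enough for eyeballing memory graphs, less so for a precise ingestion benchmark.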

If you want to follow along, I have a companion repo to this text called logger-tests with everything configured as I had it.

Alternative 1: Elastic

The Elastic Stack or ELK is the Cadillac of Logging: it's big, you have nice memories of the old ones, it has a lot of features on its dashboard, and don't expect to get the real deal from Amazon. I didn't expect it to come away as the winner for me, but I wanted to have something to compare to. It consists of Elasticsearch as the database, Logstash for log forwarding (though there is some variation with fluentd, Beats, and others), and Kibana for dashboards and log streaming. I've used it at several workplaces before, and it's a pretty natural choice for a logging stack. All three of them also contain a mountain of features, and it took me a little while to re-orient myself in the newest Kibana version after opening and clicking through it. After some clicks, we do find a page streaming in all the logs nicely. You can try a demo online.

The setup was not very complicated but a bit annoying, and it was the stack that took the longest to configure. This was mostly due to me wanting to turn off distribution and all the X-Pack security features that I didn't want to deal with in my small single-node setup. For the test, I replaced Logstash with the more lightweight Filebeat (also part of the Elastic Stack), but we're still left with some heavy Java services. Nothing against the language, especially the last couple of versions (pattern matching yes!), and you can tweak the GC behavior in minute detail. But out of the box, it eats memory: when I first started the Elastic containers they chugged through memory and quickly reached 3GB. After setting the docker memory limits to 512MB, which causes ES to re-evaluate if it needs a heap that big, I could improve this to a bit over 1GB total.

We are still way past my memory budget, and without forcing a memory limit I'd barely have room to run anything else on the server. So while we have some very fancy technology running, this is not what I'm going to be running at home.

Alternative 2: Loki

Another tool familiar in the observability space is Grafana, which started as a fork of Kibana 3 around 2014. As far as I know, it doesn't share any of that DNA anymore and was rewritten in Go. Grafana Labs grew up around Grafana, and now offers a full logging stack which we are going to try out: Loki as our database, Promtail to ship the logs, and finally Grafana to show them to us. I was looking into running Grafana anyway for metrics and alerting, so if this stack works out I can also re-use that instance. The interface manages to balance pretty well on the line of being full of features and still being easy to navigate, and with the Loki integration we have nicely streaming logs. With the components written in Go, even though it's another GC'd language, I'm expecting the memory usage to be a fair bit below Elastic.

The setup pretty much just consists of starting the containers, and the tools play well together. While the Loki configuration from the quick-start tutorial splits it into multiple containers for reading and writing, I ran it all in one. I also reduced embedded_cache.max_size_mb to 10MB after running the tests for a bit. That was pretty much the extent of my configuration though.

And when it comes to the testing we also see much better performance, with Loki staying somewhere around 180MiB and the whole stack consuming 380-ish MiB. We still have more alternatives to look at but I could imagine using this.

Alternative 3: Quickwit and Vector

There are a couple of Rust-based upstarts in the search game now (Meilisearch and Typesense are two others), but I settled for trying out Quickwit. Mostly because I've already used Tantivy, the underlying search library, in another side project, and out of them it seemed like the one that had focused the most on the log searching use case. Quickwit also comes with a built-in search interface out of the box! We can't see the logs streaming in, but it has the essentials, and querying and filtering is easy and fast. It does also expose an Elasticsearch-compatible querying API, so we could add Grafana on top if we wanted to. And for log shipping, I wanted to try out Vector, which the docs also helpfully had a guide for. As both Quickwit and Vector are built in Rust I'm assuming we're going to get good memory usage.

The setup was also very simple, you do have to specify a document schema up front but it was pretty easy and fields can be a JSON type so it's not as inflexible as it sounds. I'm storing my logs on a local disk, but its architecture splits storage and compute so you could also have them on s3-like object storage (something Loki also does, ES can do that too as far as I know but have fun configuring it). Vector was also nice to configure, and the transformation language seems quite powerful.

For the test itself, we see pretty good performance. Interestingly Loki seems to be a bit better regarding memory. In the battle of log shippers, where we are primarily looking at Vector and Promtail, it's pretty much a wash, so I'd pick whichever would be easiest to configure. Quickwit is a bit less known and without the clear performance crown it would be harder to choose, but it's also a young project so I'm looking forward to seeing it evolve.

Alternative 4: Postgres

So I need a database to store my logs in? I already have a database I like very much and its name is Postgres. Now you might be scratching your head, but recall that I said the amount of logs I expect is tiny and Postgres can scale farther than you might assume. I have a Postgres instance already running on the server, so it would reduce complexity a bit and I could subtract the additional memory usage from it. We're going to add the TimescaleDB extension which adds more support for handling time-series data for easy partitioning and compression, and even though it's not a problem it will increase the insert performance too. This setup also uses Vector as the log shipper, but there's a slight problem as it doesn't have a Postgres sink. We can solve this by using PostgREST, which given some tables will expose a REST API for them, and this we can use with Vector. There are a lot of tools that can connect to Postgres and give us a query interface, but let's use Grafana again.
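The PostgREST side of this is pleasantly boring: inserting rows is an HTTP POST of JSON to a path named after the table, and a JSON array in the body inserts several rows at once. The sketch below only builds such a request, which is roughly what an HTTP sink has to produce. The table name logs, the column names, and the base URL are assumptions for illustration, not my actual schema.

```python
import json
import urllib.request
from typing import Dict, List


def build_insert_request(base_url: str, table: str,
                         rows: List[Dict[str, str]]) -> urllib.request.Request:
    """Build (but do not send) a bulk-insert request for a PostgREST table."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{table}",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Ask PostgREST not to echo the inserted rows back to us.
            "Prefer": "return=minimal",
        },
    )


if __name__ == "__main__":
    # Hypothetical table and columns; adjust to whatever the migration creates.
    request = build_insert_request(
        "http://localhost:3000",
        "logs",
        [{"ts": "2023-03-19T10:00:00Z", "service": "restic", "message": "snapshot saved"}],
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Batching several log lines into one array keeps the number of HTTP requests (and Postgres transactions) down, which matters more for this setup than for the purpose-built log databases.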

Configuration was a little bit more complicated than our last alternatives as you might expect, not too bad though. And it has more running parts, even if Postgres is already running. Other admin we need to take care of is a migration for the tables. Concerning tables and databases, I take backups of all the Postgres databases, and it would be a little extra step to exclude the logging database from that. Just a line or two to do though.

I started by using the timescaledb image but noticed it idled at a much higher memory usage than the plain postgres one. Comparing the two postgresql.conf files I saw they had tweaked it a bit to increase some buffer sizes, probably a really smart idea considering what people use Timescale for. But I'm happier with the default defaults so I changed them back for less idle use.

As we come to the test and performance, we see that Postgres starts with good idle performance. After having run for a while we become quite competitive with Loki on the database side but have the additional PostgREST container to worry about. And while I'm already running a Postgres instance, and can "subtract" the 30MiB or so it currently consumes, the total memory usage is still higher. The promise of lower complexity goes unfulfilled as well in my opinion, but I could see this setup working for some people.

Alternative 5: Log-Store

Log-Store is another wildcard in my comparison here. I was almost done with writing all of this together when it grabbed my attention inside a timely Hacker News thread. It is written in Rust, comes as a single binary, and supports log ingestion via an HTTP/JSON API as well as via syslog. This is kind of interesting as there is a docker syslog log-driver and we could potentially skip the log shipper with this setup. Log-Store also comes bundled with a web interface for querying, building dashboards, and adding some alerting. A demo page is also available to check out.

Configuration was very easy, there are almost no options available outside of specifying the ip/port to listen to. What is nice is that we can get by with just adding one additional container since we can use the syslog driver, and it comes bundled with a webpage. However, as far as I could see, it is not open-source, is very new, and does not have an ecosystem around it. Perhaps not that important for my homelab, but I'd want to avoid my tooling disappearing in a couple of months/years when I revisit the setup.

Since Log-Store is written in Rust, I'm again hoping for excellent memory usage and Log-Store delivers in spades! Out of all the setups, it uses the least by far, and CPU usage follows suit. It idles around 80MiB, slowly climbs as the rate of logs increases, and finishes around 190MiB.

But while testing I also got reminded that containers expose how much memory the actual host has, and that an application running inside one might use that number to decide how much memory to allocate (which is also why the free command doesn't care about your memory limits).
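To see the two numbers side by side: inside a container, /proc/meminfo typically still reports the total memory of the host (or of the Docker VM on a Mac), while the limit set with docker's memory option lives in the cgroup filesystem. The sketch below parses both and picks the smaller one. It assumes cgroup v2, where the limit is in /sys/fs/cgroup/memory.max and the literal value max means "no limit". I don't know how any of the tools tested here actually size themselves.

```python
from typing import Optional


def parse_meminfo_total(meminfo_text: str) -> int:
    """Return MemTotal from the contents of /proc/meminfo, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The value is reported in kB, which here means units of 1024 bytes.
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def parse_cgroup_limit(memory_max_text: str) -> Optional[int]:
    """Return the cgroup v2 memory limit in bytes, or None if unlimited."""
    value = memory_max_text.strip()
    return None if value == "max" else int(value)


def effective_memory(meminfo_text: str, memory_max_text: str) -> int:
    """Memory a process should plan for: the smaller of host total and limit."""
    host_total = parse_meminfo_total(meminfo_text)
    limit = parse_cgroup_limit(memory_max_text)
    return host_total if limit is None else min(host_total, limit)


if __name__ == "__main__":
    with open("/proc/meminfo") as meminfo, open("/sys/fs/cgroup/memory.max") as limit:
        print(effective_memory(meminfo.read(), limit.read()))
```

A program that only looks at the first number will size its caches for the whole machine, which would explain idle usage that depends on the host it runs on.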

The graph below and the idle memory usage around 80MiB were what I observed on my main Linux desktop at home, equipped with 32GiB of RAM. Sometime after the main test I was experimenting a bit on my MacBook, where I have 4GiB allocated to docker. To my surprise, I saw an idle usage of 7-8MiB there instead. In absolute terms that is not much of a difference, but relatively speaking the desktop uses about 10x more. This was a bit annoying to realize after I'd already created most of the graphs, but it's a learning moment and we can use that knowledge to see how the different alternatives scale down on lighter hardware when comparing them all.

Finale

I've now presented all of the tools for this little comparison, and while you might already have a feeling for the outcome it's time to compare them a bit more directly to each other.

As I wrote above but you might have skimmed, the test had two phases. First we are idle, then we start logging at 5 logs/s for 10 minutes, after which we turn up the heat to 50 logs/s for another 10-minute round.

Even when trying to constrain Elastic a bit by setting memory limits, it easily takes the prize for the most memory-hungry stack, even if it luckily is very stable over the entire test. Out of the bunch it also uses the most CPU, including some interesting spikes that persisted even after I let all of the services idle for a bit to make sure any indexing jobs would have time to run. Quickwit comes second, and while the setup was very easy and it is quite competitive with the others, it is also a lesser-known alternative with a smaller ecosystem. The Loki setup goes head to head with our franken-postgres alternative, though Loki comes out looking a bit better due to it barely increasing over time. I had fun setting up the Postgres alternative; however, my plan of amortizing the memory usage against my already running instance doesn't make sense. Even removing those 30MiB or so it's still pretty hungry, and we see memory usage increase over time, which is a bit concerning. Loki+Grafana also gives me a much better experience overall since they are more tightly integrated. But there could be ways to make the Postgres setup more competitive by simplifying: a better mechanism for shipping the logs, with something more purpose-built than the generic Vector into PostgREST, could make it easier and leaner.

Finally, we have Log-Store, and it wins the memory usage game. It's able to do this since it has fewer running parts, we can skip both a log shipper and another dashboarding tool. It might be interesting to compare it to using Loki not with Promtail but using the docker log-driver plugin to natively ship logs to Loki. It is also a lesser-known alternative though, and I had moments when the dashboard didn't work for me, showing "0 out of 20,000 logs" even with all of the filters removed.

When we're looking at the memory-constrained host, we see that Log-Store adjusts and starts idling at a very svelte 4MiB. When we start ingesting we see the memory usage creeping up to handle it, but it's still performing well below the rest. The rest of the pack also manages a bit better, and except for Elastic they are hovering around the 200-256MiB area. Good news for when I spin up one of the stacks on my server.

And for the answer to what we've been chasing the whole time, which one will I choose to run on my server? While I've been reiterating over and over how much I'm interested in the memory usage, at the end of the day I'm willing to pay a bit of memory for an easier setup and tighter integration. This means I'm going for Loki+Grafana, switching Promtail for Vector. I get a very flexible setup to which I can easily add alerting and dashboards, and in the future also show traces (for my own services).

I had fun writing and putting together this little comparison. It is not the most exhaustive, and there are several other important factors that I'm omitting completely. Maybe in a follow-up, we could take a look at storage use, query performance and maybe load-testing ingestion (I'm thinking it should be pretty easy using k6). It was also pretty fun setting up the test harness, even though writing PromQL queries is always a humbling experience. I had to re-do all of the measurements after realizing my query was measuring memory allocations (so all my pretty graphs were going up up up the whole time), not memory usage...

For now, I'm satisfied that I have a choice that will last a couple of years, and it was interesting to see how some of the bigger logging stacks I've been using can scale down to a smaller environment.

Thank you for joining me this far on the adventure. I'm trying to write down and share a bit more of the things I'm thinking during 2023, and I hope you enjoyed this time. Send an email to hi at this domain /PV