From 775b51c2757993377ab0990ee4222a0a53e48ca4 Mon Sep 17 00:00:00 2001
From: Artemis Tosini
Date: Wed, 9 Jun 2021 23:29:55 +0000
Subject: [PATCH] Fixup todos, still have a lot of editing to do

---
 _drafts/journald-1.md | 43 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 5 deletions(-)

diff --git a/_drafts/journald-1.md b/_drafts/journald-1.md
index 14d8f0e..0dc2341 100644
--- a/_drafts/journald-1.md
+++ b/_drafts/journald-1.md
@@ -6,6 +6,7 @@ date: 2021-06-06
 On many Linux systems, systemd-journald runs as a daemon at boot and collects your logs. You can access them through journalctl, but it turns out journald is a lot more complicated than just sending something to a text file.
+I'll look at two main things here: what kind of information is included in a journald entry and how these entries get from programs to journald.
 ## Anatomy of an Entry
 While you'll mostly see entries as a terse error message on one line, every time you send a message journald collects and stores a lot more information.
@@ -97,8 +98,8 @@ Now for the actual fields:
 - **__MONOTONIC_TIMESTAMP** is the monotonic timestamp from when the message is written into the journal file. Journald uses CLOCK_MONOTONIC but there's not much of a specification to tell people implementing it which one to use.
 ### Reader context
-Fields starting with two underscores are generated by `journalctl` while reading.
-
+Fields starting with two underscores are generated by `journalctl` while reading. In addition to the two `_TIMESTAMP` fields mentioned above, journalctl will
+generate `__CURSOR`. This is defined as an opaque string (meaning its format can change and you shouldn't need to figure out what it means) referencing the position in the file. I haven't been particularly interested in the reader part so I haven't looked into this.
 ### The actual message
 Finally there's the untrusted message sent by the process.
 - **MESSAGE** is the only required field and is what shows up in `journalctl` when you don't use `-o export`.
@@ -118,7 +119,7 @@ Linux has the concept of a "namespace" mostly seen with containers which allows
 ## Transports
 There's still one field I haven't described: **_TRANSPORT**. This requires a little more context.
-Journald can get messages from one of 6 separate sources: **journal** (using the native journald protocol), **stdout** (a process's standard output or error redirected to systemd), **syslog** (the pre-systemd Unix logging system used if you need to be compatible with BSD or non-systemd distributions), **kernel** (kernel messages you can get through the `dmesg` command), **audit** (logs the kernel generates about programs' activities), and **driver** (error messages from within journald). Each has their own peculiarities from both the journald side and the client side but I'll mostly be talking about journal and stdout.
+Journald can get messages from one of 6 separate sources: **journal** (using the native journald protocol), **stdout** (a process's standard output or error redirected to systemd), **syslog** (the pre-systemd Unix logging system used if you need to be compatible with BSD or non-systemd distributions), **kernel** (kernel messages you can get through the `dmesg` command), **audit** (logs the kernel generates about programs' activities), and **driver** (error messages from within journald). Each has its own peculiarities from both the journald side and the client side but I'll only be talking about how journal and stdout work.
 ### Native (journal)
 If you want to send arbitrary fields you'll want to use the native transport. It's conceptually the simplest (connect to journald's socket and send messages) but has some strange idiosyncrasies.
@@ -129,16 +130,48 @@ auxiliary data like references to files through the socket.
 This is also a datagram socket, meaning it's message-based.
 Like UDP, you send individual messages and are responsible for splitting up your data into chunks. However, unlike UDP, Unix datagram sockets are reliable, in-order, and have large maximum message sizes.
 When you want to add an entry to the log, you can connect to the socket then send a message formatted as newline separated `FIELD=value`. The `MESSAGE` field is required and your entry will be ignored if you forget it.
+
+For small messages, you can send the message directly using the `write` or `sendmsg` syscalls (requests to the kernel). However, for larger messages you must send a reference to a file.
+Let's see what this looks like from `logger --journald=large_message`. I've recorded this using [strace](https://jvns.ca/blog/2015/04/14/strace-zine/) then extracted the relevant parts.
+
+```c
+socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 4
+getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
+setsockopt(4, SOL_SOCKET, SO_SNDBUF, [8388608], 4) = 0
+getsockopt(4, SOL_SOCKET, SO_SNDBUF, [425984], [4]) = 0
+setsockopt(4, SOL_SOCKET, SO_SNDBUFFORCE, [8388608], 4) = -1 EPERM (Operation not permitted)
+sendmsg(4, {msg_name={sa_family=AF_UNIX, sun_path="/run/systemd/journal/socket"}, msg_namelen=30, msg_iov=[{iov_base="MESSAGE=mO4NvlMGp/1VB/gEcYLWk5ed"..., iov_len=5592416}, {iov_base="\n", iov_len=1}, {iov_base="SYSLOG_IDENTIFIER=", iov_len=18}, {iov_base="logger", iov_len=6}, {iov_base="\n", iov_len=1}], msg_iovlen=5, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = -1 EMSGSIZE (Message too long)
+prctl(PR_GET_NAME, "logger") = 0
+memfd_create("sd-logger", MFD_CLOEXEC|MFD_ALLOW_SEALING) = 5
+writev(5, [{iov_base="MESSAGE=mO4NvlMGp/1VB/gEcYLWk5ed"..., iov_len=5592416}, {iov_base="\n", iov_len=1}, {iov_base="SYSLOG_IDENTIFIER=", iov_len=18}, {iov_base="logger", iov_len=6}, {iov_base="\n", iov_len=1}], 5) = 5592442
+fcntl(5, F_ADD_SEALS, F_SEAL_SEAL|F_SEAL_SHRINK|F_SEAL_GROW|F_SEAL_WRITE) = 0
+sendmsg(4, {msg_name={sa_family=AF_UNIX, sun_path="/run/systemd/journal/socket"}, msg_namelen=30, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[5]}], msg_controllen=24, msg_flags=0}, MSG_NOSIGNAL) = 0
+```
+More clearly, this program will:
+- Create a socket
+- Attempt to expand the send buffer to 8 MiB in case the program sends a lot of output. The kernel only lets it go up to 416 KiB.
+- Try to send the full message directly to `/run/systemd/journal/socket`. The kernel returns an error since the message is longer than the send buffer.
+- Get the current process name, then use that for the name of a virtual, in-memory file
+- Copy the message into the virtual file
+- Seal the message, making it read-only
+- Send permission to access the file to `/run/systemd/journal/socket`
+
+The key part of this is sending permission to access the file to journald. Linux lets you do this by sending a special control message with "ancillary data".
+This specifically uses the [SCM_RIGHTS](https://blog.cloudflare.com/know-your-scm_rights/) type where you give the kernel a file descriptor (ID for an open file in the current process)
+and tell it to make a copy and give it to the process that receives the message. The receiving process can then ask the kernel for this ancillary data and in it will be a new file descriptor
+(likely a different number than what was sent) pointing to the same file. In this case, journald will read the file then parse it as if it were sent directly, using the same format.
+
 This protocol is simple enough that you can send log messages from your terminal using netcat, a tool for sending and receiving data from sockets:
 `echo -e "MESSAGE=owo\nOWO=uwu" | nc -Uu /run/systemd/journal/socket` will create a new entry with `MESSAGE=owo` and `OWO=uwu`. You can view this output with `journalctl -xeo export`. (-e on echo allows us to create a new line with `\n` and -Uu on netcat tells it that we're using a unix datagram socket).
 Note that netcat won't quit but the message will still send. nc is also not suitable for messages over 16384 bytes.
+
 ### Service output (stdout)
 Unfortunately, it would be a lot of work and cause security issues for systemd to read every process's output, reformat it for the native protocol, then send it to journald.
 Therefore, the journald authors added another method: stdout. When you execute a service with systemd, the process's stdout and stderr will point to a socket connected to journald. You can also use the `systemd-cat` program to do this.
 #### Into the Syscalls
-When you run `systemd-cat echo` it performs these system calls (i.e. requests to the kernel). I've extracted the relevant part I recorded using [strace](https://jvns.ca/blog/2015/04/14/strace-zine/)
+When you run `systemd-cat echo` it performs these system calls (i.e. requests to the kernel). As above, I've extracted this with strace:
 ```c
 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
 connect(3, {sa_family=AF_UNIX, sun_path="/run/systemd/journal/stdout"}, 30) = 0
@@ -158,7 +191,7 @@ execve("/run/current-system/sw/bin/echo", ["echo"], 0x7fff9dc55070 /* 75 vars */
 This does a few things:
 - Connect to `/run/systemd/journal/stdout`
 - Make the connection to the journal socket write-only, since there's no need to read responses from journald and it could confuse programs
-- Attempt to expand the send buffer to 8 MiB in case the program sends a lot of output. The kernel only lets it go up to 416KiB.
+- As above, attempt to expand the send buffer to 8 MiB
 - Send some setup information to journald
 - Create a copy of the original stderr in order to print error messages if it has problems calling your program
 - Check which descriptor flags are set for standard input. In this case there are none