228 lines
24 KiB
Markdown
228 lines
24 KiB
Markdown
---
|
|
layout: post
|
|
title: Jumping into journald
|
|
date: 2021-06-29
|
|
---
|
|
|
|
On many Linux systems, systemd-journald runs as a daemon at boot and collects your logs. You can access
|
|
them through journalctl but it turns out journald is a lot more complicated then just sending something to a text file.
|
|
I'll look at two main things here: What kind of information is included in a journald entry and how these entries get from programs to journald.
|
|
|
|
## Anatomy of an Entry
|
|
While you'll mostly see entries as a terse error message on one line, every time you send a message journald collects and stores a lot more information.
|
|
For example, here's an error from my audio server, [pipewire](https://pipewire.org/). Note that some fields are reordered from the raw journalctl output
|
|
```shell
|
|
artemis@starlight ~> journalctl --user -xeu pipewire.service -o export
|
|
...
|
|
|
|
_BOOT_ID=eba612c42f634b58b0484c42756ba712
|
|
_UID=1000
|
|
_GID=1000
|
|
_CAP_EFFECTIVE=0
|
|
_MACHINE_ID=b3bee68a0f884e6c982529efec408a61
|
|
_HOSTNAME=starlight
|
|
_TRANSPORT=journal
|
|
_SELINUX_CONTEXT=kernel
|
|
_AUDIT_SESSION=2
|
|
_AUDIT_LOGINUID=1000
|
|
_SYSTEMD_OWNER_UID=1000
|
|
_SYSTEMD_CGROUP=/user.slice/user-1000.slice/user@1000.service/session.slice/pipewire.service
|
|
_SYSTEMD_UNIT=user@1000.service
|
|
_SYSTEMD_SLICE=user-1000.slice
|
|
_SYSTEMD_USER_SLICE=session.slice
|
|
_SYSTEMD_USER_UNIT=pipewire.service
|
|
_PID=4046
|
|
_COMM=pipewire
|
|
_EXE=/nix/store/4qp4npwqabf3mnsy230w3z1nqdjl1gxr-pipewire-0.3.26/bin/pipewire
|
|
_CMDLINE=/nix/store/4qp4npwqabf3mnsy230w3z1nqdjl1gxr-pipewire-0.3.26/bin/pipewire
|
|
_SYSTEMD_INVOCATION_ID=a9477826e73747ac810f958f894c302e
|
|
_SOURCE_REALTIME_TIMESTAMP=1622336991484040
|
|
__CURSOR=s=ac8b24ddf2634038b49168edc5d6e544;i=b4233a3;b=4950abcba7ec46629fce878fb239a6e6;m=1408502e8be;t=5c381c41568ad;x=2af33cc0b675a511
|
|
__REALTIME_TIMESTAMP=1622336991488173
|
|
__MONOTONIC_TIMESTAMP=1376621095102
|
|
PRIORITY=4
|
|
SYSLOG_IDENTIFIER=pipewire
|
|
CODE_FILE=../src/pipewire/impl-node.c
|
|
CODE_LINE=957
|
|
CODE_FUNC=dump_states
|
|
MESSAGE=(PipeWire ALSA [.electron-wrapped]-110) client too slow! rate:256/48000 pos:4451017216 status:triggered
|
|
```
|
|
|
|
An entry consists of freeform variables with binary (though generally ASCII/US English) values. Values starting with an underscore
|
|
are "trusted" and generated by journald while others are sent by the process along with the primary message. This helps provide
|
|
context about what exact process failed and what state it was in during that failure. Unfortunately the [official descriptions](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html) of what these fields mean can be a bit obtuse.
|
|
|
|
While working on my prototype for a system-journald replacement, [rjournald](https://github.com/artemist/rjournald) I've discovered what many of these mean through context or reading the systemd code. You can categorize these into one of a few types
|
|
### System context
|
|
These fields help you figure out if the error is coming from this computer, OS install, or boot.
|
|
- **_BOOT_ID** is a unique ID (UUID in this case) generated at every startup. The kernel creates it and you can access the current one at `/proc/sys/kernel/random/boot_id`. I've found it useful to help figure out if I've rebooted my system since an error occurred.
|
|
- **_MACHINE_ID** is a unique ID to the system which you can find in `/etc/machine-id`. This should be set on the first boot of your system by systemd and helps you figure out if the logs could be from before a system was reinstalled.
|
|
- **_HOSTNAME** is the name of the system. You probably set this in install. There's a few places to get this but journald uses `/proc/sys/kernel/hostname` (what the kernel thinks your hostname is). You can also get the hostname from systemd-hostnamed (which lets you set non-ASCII hostnames for some programs) or `/etc/hostname` (which is where systemd will read your hostname and tell it to the kernel), but these might be different.
|
|
|
|
### Process permissions
|
|
These help you understand what kind of access a process has. You might get errors if a process has insufficient permission or runs as the wrong user
|
|
- **_UID** tells you what user executed the process, as seen by journald. You can get this for your user with the command `id`.
|
|
- **_GID** tells you which group the process was using, as seen by journald. While a user can have several groups, a process executes under one primary group ID
|
|
- **_CAP_EFFECTIVE** provides what [capabilities](https://linux.die.net/man/7/capabilities) a process can use. Capabilities give fine-grained privileged access to processes without requiring them to be the root user. For example, binding to port 80 or 443 requires the CAP_NET_BIND_SERVICE capability. If _CAP_EFFECTIVE=0 then you know you've missed that capability.
|
|
- **_SELINUX_CONTEXT** is an additional set of permissions when using the [SELinux LSM](https://www.redhat.com/en/topics/linux/what-is-selinux) (Linux Security Module). I don't use SELinux on this system so it just shows up as "kernel", meaning SELinux will not limit permissions. Fedora, CentOS, and RHEL use SELinux by default.
|
|
|
|
### systemd context
|
|
If you're using journald you're almost certainly using systemd to start all your processes. Systemd organizes processes into "units", such as OS services and user sessions, and "slices", a set of similar units.
|
|
System units are started and organized by PID 1, the first program executed by the Linux kernel when you startup, while user units are organized by another instance of systemd running as your user, started by PID 1 when you login.
|
|
These are represented to the rest of the OS as a hierarchical set of "cgroups" which allow systemd to set maximum resources for units and slices (e.g. make it so users can only use 1GB of RAM at a time).
|
|
- **_SYSTEMD_CGROUP** is the full cgroup path of the process. As it's hierarchical setting a limit for e.g. `user.slice` would make it so all users together cannot go above the limit. While this is the only option you really need (systemd-journald parses the slices and units from the path it retrieves from `/proc/[pid]/cgroup`) others are included for convenience .
|
|
- **_SYSTEMD_SLICE** is the deepest slice that the PID 1 systemd assigns to the process. In this example journald chooses to say `user-1000.slice` even though the process is also in `user.slice`. Note that there is a special case for systemd: its full cgroup path is `/init.scope` but journald claims it's in the special slice `-.slice`.
|
|
- **_SYSTEMD_UNIT** is the unit that PID 1 systemd assigns to this process. For a system service this would probably be something like `dbus.service`.
|
|
- **_SYSTEMD_USER_SLICE** and **_SYSTEMD_USER_UNIT** are similar to the same fields without USER, but assigned by the user systemd
|
|
|
|
### Process context
|
|
These fields give you some information about how the process was executed. The program will pass its PID to journald in a transport-specific way, then journald will retrieve the other values from `/proc/[pid]/[field name]`.
|
|
|
|
- **_PID** is the process ID (a numeric ID from 1 to 4194304) as seen by journald. PIDs are not unique over a system boot but should not be reused at the same time.
|
|
- **_EXE** is the location of the executable. This is the result of canonicalizing the symlinks at executable start (i.e. if originally a → b → c, you execute a, then a → b → d, then `_EXE` will still contain a). On your system this will probably end up being something in `/usr/bin` but I use [NixOS](https://nixos.org/) which uses extremely long executable path names.
|
|
- **_CMDLINE** is the full command with arguments as you might see in `char **argv`. Note that a program can change this. The most high-profile example I've seen of this is nginx, where you will see logs from `nginx: worker process`.
|
|
- **_COMM** is the command name. This will normally be the final part of the path in `_EXE` but can be different, especially when running programs like [busybox](https://www.busybox.net/) where multiple programs are in file. This is also the pthread name of the sending thread. For example when logging from Firefox this might be `Web Content`.
|
|
|
|
### Time
|
|
Time, as it turns out, is extremely complicated. You'll get 3 separate time fields. Two of them are the "wall clock" time in Unix time (nominally microseconds since midnight at the beginning of 1 January 1970 UTC, though leap seconds make this a bit more complicated). Unfortunately, wall clock time can jump forwards or backwards if your computer's clock is too slow or fast, respectively. Therefore, systemd also includes the "monotonic time", a number of seconds since some point in the past. This is guaranteed to always move forward so this is what you'll want to discern ordering.
|
|
|
|
Unfortunately wall clock time is also more complicated than you might expect. Linux has 4 separate monotonic timers:
|
|
- **CLOCK_MONOTONIC_RAW** counts the amount of time that Linux has spent not asleep since last boot.
|
|
- **CLOCK_MONOTONIC** is similar but will also include incremental adjustments from your time synchronization daemon noticing a fast or slow clock and reporting it to the kernel with [adjtime](https://linux.die.net/man/3/adjtime). The adjustments happen over a long period by speeding up or slowing the clock.
|
|
- **CLOCK_BOOTTIME** is similar to CLOCK_MONOTONIC but also counts time where they system is asleep.
|
|
- **CLOCK_MONOTONIC_COARSE** is similar to CLOCK_MONOTONIC but requires slightly less work in the Kernel and is less accurate. This isn't particularly useful for our use case
|
|
|
|
Now for the actual fields:
|
|
- **_SOURCE_REALTIME_TIMESTAMP** is the UTC timestamp collected as soon as journald receives the message
|
|
- **__REALTIME_TIMESTAMP** is the UTC timestamp collected when message is written into the journal file. It takes a bit of time for journald to read all the context data out of /proc so this is a bit later.
|
|
- **__MONOTONIC_TIMESTAMP** is the monotonic timestamp from when the message is written into the journal file. Journald uses CLOCK_MONOTONIC but there's not much of a specification to tell people implementing it which one to use.
|
|
|
|
### Reader context
|
|
Fields starting with two underscores are generated by `journalctl` while reading. In addition to the two `_TIMESTAMP` fields mentioned above, journalctl will
|
|
generate `__CURSOR`. This is defined as an opaque string (meaning its format can change and you shouldn't need to figure out what it means) referencing the position in the file. I haven't been particularly interested in the reader part so I haven't looked into this.
|
|
### The actual message
|
|
Finally there's the untrusted message sent by the process.
|
|
- **MESSAGE** is the only required field and is what shows up in `journalctl` when you don't use `-o export`.
|
|
- Some programs also send where the error comes from with **CODE_FILE**, **CODE_LINE**, and **CODE_FUNC**
|
|
- **PRIORITY** is how severe the issue is with 0 being the most important and 7 being the least. In this example, 4 means "warning"
|
|
- **SYSLOG_IDENTIFIER** is the program identifier and is what you would get as the program source if you were using syslogd (as you would before systemd)
|
|
|
|
## A Note on Namespaces
|
|
Linux has the concept of a "namespace" mostly seen with containers which allows different processes to see the system differently. A few types make things interesting when logging to a journald outside the namespace (e.g. if you pass through the host journald socket to a container). Running journald in a container then logging from outside is possible but that will cause some problems when journald doesn't have access to process data and I won't talk about it.
|
|
|
|
- **PID** namespaces allow different processes to see a different list of processes. For example, I'm currently running a container to run games on my laptop. If I list processes in the container I only see 34 while there are 472 running on my system overall. Additionally, within a PID namespace PIDs will be remapped. My games container is running systemd at PID 1 but that same process appears to the rest of my system as PID 2454372. The Linux kernel remaps PIDs to make sense in the receiver's PID namespace when sending the sender's credentials, so journald will record the PID as seen from the host if you pass it through to a container.
|
|
- **User** namespaces remap user and group IDs. This can be useful so that Linux doesn't assume everything running as root in your container has full root permissions on the host with respect to e.g. loading kernel modules. When you make a user namespace you specify a UID map for reading and setting UIDs. For example, I use UID 1000000 instead of UID 0 in my games container. Journald will see the PID and GID as seen from the host namespace
|
|
- **Mount** namespaces can cause some problems when you're looking through the log. If the executable was a symlink then it is canonicalized in the mount namespace before it becomes `_EXE`. This means you can end up with an executable path that doesn't exist in your current namespace. If a program changes namespaces, like `unshare` will, then `_EXE` will be from the namespace where it starts.
|
|
- **UTS** namespaces allow programs to see a different hostname. Journald deals with these by not caring. Journald sets the `_HOSTNAME` field by asking for the hostname from the OS once then caching it. Messages from containers will show up in the log as using the hostname where journald receives it.
|
|
- **Time** namespaces allow programs to have a different time offset for both realtime and monotonic timers. Journald ignores these and just reads the time from its own namespace, which will probably be the main system time.
|
|
|
|
|
|
## Transports
|
|
There's still one field I haven't described: **_TRANSPORT**. This requires a little more context.
|
|
Journald can get messages from one of 6 separate sources: **journal** (using the native journald protocol), **stdout** (a process's standard output or error redirected to systemd), **syslog** (the pre-systemd Unix logging system used if you need to be compatible with BSD or non-systemd distributions), **kernel** (kernel messages you can get through the `dmesg` command), **audit** (logs the kernel generates about programs' activities), and **driver** (error messages from within journald). Each has their own peculiarities from both the journald side and the client side but I'll only be talking about how journal and stdout work.
|
|
|
|
### Native (journal)
|
|
If you want to send arbitrary fields you'll want to use the native transport. It's conceptually the simplest (connect to journald's socket and send messages) but has some strange idiosynchrasies.
|
|
|
|
In order to send a message you must connect to journald's Unix datagram socket at `/run/systemd/journal/socket`. This is a Unix socket, meaning it's represented by a location on the filesystem instead of a port.
|
|
Limiting the socket to one system gives you a few advantages: You can use standard filesystem permissions, you're less likely to have port collisions, the kernel can provide context about the program sending the message, and programs can send
|
|
auxiliary data like references to files through the socket.
|
|
This is also a datagram socket, meaning it's message-based. Like UDP, you send individual messages and are responsible for splitting up your data into chunks. However, unlike UDP, unix datagram sockets are reliable, in-order, and have large maximum message sizes.
|
|
|
|
When you want to add an entry to the log, you can connect to the socket then send a message formatted as newline separated `FIELD=value`. The `MESSAGE` field is required and your your entry will be ignored if you forget it.
|
|
|
|
For small messages, you can send the message directly using the `write` or `sendmsg` syscalls (requests to the kernel). However, for larger messages you must send a reference to a file.
|
|
Let's see what this looks like from `logger --journald=large_message`. I've recorded this using [strace](https://jvns.ca/blog/2015/04/14/strace-zine/) then extracted the relevant parts
|
|
|
|
```c
|
|
socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 4
|
|
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
|
|
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [8388608], 4) = 0
|
|
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [425984], [4]) = 0
|
|
setsockopt(4, SOL_SOCKET, SO_SNDBUFFORCE, [8388608], 4) = -1 EPERM (Operation not permitted)
|
|
sendmsg(4, {msg_name={sa_family=AF_UNIX, sun_path="/run/systemd/journal/socket"}, msg_namelen=30, msg_iov=[{iov_base="MESSAGE=mO4NvlMGp/1VB/gEcY LWk5ed"..., iov_len=5592416}, {iov_base="\n", iov_len=1}, {iov_base="SYSLOG_IDENTIFIER=", iov_len=18}, {iov_base="logger", iov_len=6}, {iov_base ="\n", iov_len=1}], msg_iovlen=5, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = -1 EMSGSIZE (Message too long)
|
|
prctl(PR_GET_NAME, "logger") = 0
|
|
memfd_create("sd-logger", MFD_CLOEXEC|MFD_ALLOW_SEALING) = 5
|
|
writev(5, [{iov_base="MESSAGE=mO4NvlMGp/1VB/gEcYLWk5ed"..., iov_len=5592416}, {iov_base="\n", iov_len=1}, {iov_base="SYSLOG_IDENTIFIER=", iov_le n=18}, {iov_base="logger", iov_len=6}, {iov_base="\n", iov_len=1}], 5) = 5592442
|
|
fcntl(5, F_ADD_SEALS, F_SEAL_SEAL|F_SEAL_SHRINK|F_SEAL_GROW|F_SEAL_WRITE) = 0
|
|
sendmsg(4, {msg_name={sa_family=AF_UNIX, sun_path="/run/systemd/journal/socket"}, msg_namelen=30, msg_iov=NULL, msg_iovlen=0, msg_control=[{cmsg _len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[5]}], msg_controllen=24, msg_flags=0}, MSG_NOSIGNAL) = 0
|
|
```
|
|
More clearly, this program will:
|
|
- Create a socket
|
|
- Attempt to expand the send buffer to 8 MiB in case the program sends a lot of output. The kernel only lets it go up to 416KiB.
|
|
- Try to send the full message directly to `/run/systemd/journal/socket`. The kernel returns an error since the message is longer than the send buffer.
|
|
- Get the current process name, then use that for the name of a virtual, in-memory file
|
|
- Copy the message into the virtual file
|
|
- Seal the message, making it read-only
|
|
- Send permission to access the file to `/run/systemd/journal/socket`
|
|
|
|
The key part of this is sending permission to access the file to journald. Linux lets you do this by sending a special control message with "ancillary data".
|
|
This specifically uses the [SCM_RIGHTS](https://blog.cloudflare.com/know-your-scm_rights/) type where you give the kernel a file descriptor (ID for an open file in the current process)
|
|
and tell it to make a copy and give it to the process that receives the message. The receiving process can then ask the kernel for this ancillary data and in it will be a new file descriptor
|
|
(likely a different number than what was sent) pointing to the same file. On this case, journald will read the file then parse it as if it were sent directly, using the same format
|
|
|
|
This protocol is simple enough that you can send log messages from your terminal using netcat, a tool for sending and receiving data from sockets: `echo -e "MESSAGE=owo\nOWO=uwu" | nc -Uu /run/systemd/journal/socket` will create a new entry with `MESSAGE=owo` and `OWO=uwu`.
|
|
You can view this output with `journalctl -xeo export`. (-e on echo allows us to create a new line with `\n` and -Uu on netcat tells it that we're using a unix datagram socket). Note that netcat won't quit but the message will still send. nc is also not suitable for messages over 16384 bytes.
|
|
|
|
|
|
|
|
### Service output (stdout)
|
|
Unfortunately, it would be a lot of work and cause security issues for systemd systemd to read every process's output, reformat it for the native protocol, then send it to journald. Therefore, the journald authors added another method: stdout.
|
|
When you execute a service with systemd, the process's stdout and stderr will point to a socket connected to journald. You can also use the `systemd-cat` program to do this.
|
|
|
|
#### Into the Syscalls
|
|
When you run `systemd-cat echo` it performs these system calls (i.e. requests to the kernel). As above, I've extracted this with strace
|
|
```c
|
|
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
|
|
connect(3, {sa_family=AF_UNIX, sun_path="/run/systemd/journal/stdout"}, 30) = 0
|
|
shutdown(3, SHUT_RD) = 0
|
|
getsockopt(3, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
|
|
setsockopt(3, SOL_SOCKET, SO_SNDBUF, [8388608], 4) = 0
|
|
getsockopt(3, SOL_SOCKET, SO_SNDBUF, [425984], [4]) = 0
|
|
setsockopt(3, SOL_SOCKET, SO_SNDBUFFORCE, [8388608], 4) = -1 EPERM (Operation not permitted)
|
|
write(3, "\n\n6\n1\n0\n0\n0\n", 12) = 12
|
|
fcntl(2, F_DUPFD_CLOEXEC, 3) = 4
|
|
fcntl(0, F_GETFD) = 0
|
|
dup2(3, 1) = 1
|
|
dup2(3, 2) = 2
|
|
close(3) = 0
|
|
execve("/run/current-system/sw/bin/echo", ["echo"], 0x7fff9dc55070 /* 75 vars */) = -1 ENOENT (No such file or directory)
|
|
```
|
|
This does a few things:
|
|
- Connect to `/run/systemd/journal/stdout`
|
|
- Make the connection to the journal socket write-only, since there's no need to read responses from journald and it could confuse programs
|
|
- As above, attempt to expand the send buffer to 8 MiB
|
|
- Send some setup information to journald
|
|
- Create a copy of the original stderr in order to print error messages if it has problems calling your program
|
|
- Check which descriptor flags are set for standard input. In this case there's none
|
|
- Copy the socket to journald to file descriptors 1 and 2, stdout and stderr respectively
|
|
- Close the original file descriptor so that the other program is free to use number 3
|
|
- Call the program you specified
|
|
|
|
The main part is copying the socket file descriptor to stdout and stderr. This means that any time the program writes something to its output
|
|
it will actually send a message to journald
|
|
|
|
#### But What is That Setup Information Anyway?
|
|
There are a few options you can configure when setting up a stdout journald stream. The journald authors decided the best way to do this was
|
|
to send some newline separated options. Once the last option is sent (7 newlines) then journald assumes everything is a log message.
|
|
They also designed their [parsing function](https://github.com/systemd/systemd-stable/blob/37c4cfde0ce613f0f00544d3f4e2e72bf93d9c76/src/journal/journald-stream.c#L359) to support sending the header as multiple messages.
|
|
In order the option sent are:
|
|
- Stream Identifier (empty in this case) is a string that identifies what service the message comes from. It will show up as SYSLOG_IDENTIFIER in the journald output.
|
|
- Unit ID (empty in this case) overrides which systemd unit the message is from. It's ignored if you aren't root when you send this.
|
|
- Default Priority (6 in this case) is the priority to use when it is not otherwise specified. In this case 6 means "Informational" or that it doesn't require any action.
|
|
- Level Prefix (1 in this case) is whether to parse syslog priority prefixes. For example, if a line starts with `<3>` then journald will log it as having priority 3, overriding the default.
|
|
- Forward to Syslog (0 in this case) is whether to send a copy of the message to syslog in case you have another logging daemon to pick it up
|
|
- Forward to kmsg (0 in this case) is whether to send a copy to the kernel message buffer. You can read this through dmesg
|
|
- Forward to Console (0 in this case) is whether to send a copy of the message to your console, if you're in a Linux TTY. I haven't gotten this to work
|
|
|
|
#### Trying it out
|
|
You can also try out this submission method using netcat.
|
|
A basic example:
|
|
`echo -ne "\n\n6\n1\n0\n0\n0\n<4> hewwo\n" | nc -U /run/systemd/journal/stdout`
|
|
This opens a stream with no stream identifier or unit ID, a default priority of 6 (info), parsing level prefixes enabled, and no forwarding enabled. Then, we send the message "hewwo" with a prefix setting the priority to 4. Note that we use `-U` but not `-u` on netcat since this is a Unix stream socket (analogous to TCP) and not a Unix datagram socket (analogous to UDP).
|
|
If you run this you'll get a message in your logs! You can quickly view it with `journalctl -xe`. Note that you won't get the normal extra fields as netcat exits too quickly for journald to grab the data.
|
|
|
|
## Linux, ..., why?
|
|
Anything related to Linux quickly turns into a huge rabbit hole. I could certainly write pages and pages about small things I've barely mentioned here, but I think this is a good place to stop.
|
|
This is mostly from my own experimentation. If you have more information and noticed an error, please contact me. I'd be happy to correct anything.
|