layout | title | date |
---|---|---|
post | Jumping into journald | 2021-06-02 |
On many Linux systems, systemd-journald runs as a daemon at boot and collects your logs. You can access them through journalctl, but it turns out journald is a lot more complicated than just sending something to a text file.
Anatomy of an Entry
While you'll mostly see entries as a terse error message on one line, every time you send a message journald collects and stores a lot more information. For example, here's an error from my audio server, pipewire. Note that some fields are reordered from the raw journalctl output.
artemis@starlight ~> journalctl --user -xeu pipewire.service -o export
...
_BOOT_ID=eba612c42f634b58b0484c42756ba712
_UID=1000
_GID=1000
_CAP_EFFECTIVE=0
_MACHINE_ID=b3bee68a0f884e6c982529efec408a61
_HOSTNAME=starlight
_TRANSPORT=journal
_SELINUX_CONTEXT=kernel
_AUDIT_SESSION=2
_AUDIT_LOGINUID=1000
_SYSTEMD_OWNER_UID=1000
_SYSTEMD_CGROUP=/user.slice/user-1000.slice/user@1000.service/session.slice/pipewire.service
_SYSTEMD_UNIT=user@1000.service
_SYSTEMD_SLICE=user-1000.slice
_SYSTEMD_USER_SLICE=session.slice
_SYSTEMD_USER_UNIT=pipewire.service
_PID=4046
_COMM=pipewire
_EXE=/nix/store/4qp4npwqabf3mnsy230w3z1nqdjl1gxr-pipewire-0.3.26/bin/pipewire
_CMDLINE=/nix/store/4qp4npwqabf3mnsy230w3z1nqdjl1gxr-pipewire-0.3.26/bin/pipewire
_SYSTEMD_INVOCATION_ID=a9477826e73747ac810f958f894c302e
_SOURCE_REALTIME_TIMESTAMP=1622336991484040
__CURSOR=s=ac8b24ddf2634038b49168edc5d6e544;i=b4233a3;b=4950abcba7ec46629fce878fb239a6e6;m=1408502e8be;t=5c381c41568ad;x=2af33cc0b675a511
__REALTIME_TIMESTAMP=1622336991488173
__MONOTONIC_TIMESTAMP=1376621095102
PRIORITY=4
SYSLOG_IDENTIFIER=pipewire
CODE_FILE=../src/pipewire/impl-node.c
CODE_LINE=957
CODE_FUNC=dump_states
MESSAGE=(PipeWire ALSA [.electron-wrapped]-110) client too slow! rate:256/48000 pos:4451017216 status:triggered
An entry consists of freeform fields with binary (though generally ASCII/US English) values. Fields starting with an underscore are "trusted" and generated by journald, while the others are sent by the process along with the primary message. This helps provide context about exactly which process failed and what state it was in during that failure. Unfortunately, the official descriptions of what these fields mean can be a bit obtuse.
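As a minimal sketch of that split, the snippet below parses a text-only entry in the export format and sorts the field names by their underscore prefix. The function names are made up for illustration, and it assumes every field is a plain `NAME=value` line; the export format also has a binary framing for values containing newlines, which is skipped here.

```python
def parse_export_entry(text):
    """Parse one text-only `journalctl -o export` entry into a dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # blank line, "..." or a binary-framed field
        fields[name] = value
    return fields


def classify(name):
    """Guess who produced a field from its underscore prefix."""
    if name.startswith("__"):
        return "reader"   # added while reading the journal
    if name.startswith("_"):
        return "trusted"  # collected by journald itself
    return "user"         # sent by the logging process


sample = """\
_UID=1000
_COMM=pipewire
__MONOTONIC_TIMESTAMP=1376621095102
PRIORITY=4
MESSAGE=client too slow! rate:256/48000
"""

entry = parse_export_entry(sample)
for field in entry:
    print(f"{classify(field):8} {field}")
```

Splitting on the first `=` only matters for fields like MESSAGE, whose value may itself contain an equals sign.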
While working on rjournald, my prototype for a systemd-journald replacement, I've discovered what many of these mean through context or by reading the systemd code. You can sort them into a few categories.
System context
These fields help you figure out if the error is coming from this computer, OS install, or boot.
- _BOOT_ID is a unique ID (a UUID in this case) generated at every startup. The kernel creates it and you can access the current one at /proc/sys/kernel/random/boot_id. I've found it useful to help figure out if I've rebooted my system since an error occurred.
- _MACHINE_ID is an ID unique to the system, which you can find in /etc/machine-id. This should be set on the first boot of your system by systemd and helps you figure out if the logs could be from before a system was reinstalled.
- _HOSTNAME is the name of the system. You probably set this during install. There are a few places to get this but journald uses /proc/sys/kernel/hostname (what the kernel thinks your hostname is). You can also get the hostname from systemd-hostnamed (which lets you set non-ASCII hostnames for some programs) or /etc/hostname (which is where systemd will read your hostname and tell it to the kernel), but these might be different.
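The three files named above can be read directly. This is only a sketch of where the values come from, not how journald is implemented; the helper names are hypothetical, and missing files (for example on a non-systemd machine) come back as `None`.

```python
from pathlib import Path


def read_first_line(path):
    """Return the stripped contents of a small file, or None if unreadable."""
    try:
        return Path(path).read_text().strip() or None
    except OSError:
        return None


def system_context():
    boot_id = read_first_line("/proc/sys/kernel/random/boot_id")
    return {
        # The kernel prints the UUID with dashes; the journal entry above
        # shows it without them, so strip them for comparison.
        "_BOOT_ID": boot_id.replace("-", "") if boot_id else None,
        "_MACHINE_ID": read_first_line("/etc/machine-id"),
        "_HOSTNAME": read_first_line("/proc/sys/kernel/hostname"),
    }


if __name__ == "__main__":
    for name, value in system_context().items():
        print(f"{name}={value}")
```

Comparing the printed _BOOT_ID with the one in an old entry is a quick way to tell whether the machine has rebooted since that entry was logged.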
Process permissions
These help you understand what kind of access a process has. You might get errors if a process has insufficient permission or runs as the wrong user.
- _UID tells you what user executed the process, as seen by journald. You can get this for your user with the command id.
- _GID tells you which group the process was using, as seen by journald. While a user can have several groups, a process executes under one primary group ID.
- _CAP_EFFECTIVE provides what capabilities a process can use. Capabilities give fine-grained privileged access to processes without requiring them to be the root user. For example, binding to port 80 or 443 requires the CAP_NET_BIND_SERVICE capability. If _CAP_EFFECTIVE=0 then you know the process is missing that capability (and every other one).
- _SELINUX_CONTEXT is an additional set of permissions when using the SELinux LSM (Linux Security Module). I don't use SELinux on this system so it just shows up as "kernel", meaning SELinux will not limit permissions. Fedora, CentOS, and RHEL use SELinux by default.
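To check a specific capability, you can treat _CAP_EFFECTIVE as a bitmask. The sketch below assumes the value is written in hexadecimal (as in /proc/[pid]/status) and only lists a handful of capability bit numbers from the kernel's capability header; the table and function name are illustrative, not a complete list.

```python
# Bit positions for a few capabilities; not a complete table.
CAP_BITS = {
    "CAP_CHOWN": 0,
    "CAP_KILL": 5,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_NET_ADMIN": 12,
    "CAP_NET_RAW": 13,
    "CAP_SYS_ADMIN": 21,
}


def has_capability(cap_effective, name):
    """Return True if the hex bitmask in _CAP_EFFECTIVE includes `name`."""
    mask = int(cap_effective, 16)
    return bool((mask >> CAP_BITS[name]) & 1)


# The pipewire entry above has no capabilities at all.
print(has_capability("0", "CAP_NET_BIND_SERVICE"))
# A process holding only CAP_NET_BIND_SERVICE has bit 10 set (0x400).
print(has_capability("400", "CAP_NET_BIND_SERVICE"))
```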
systemd context
If you're using journald you're almost certainly using systemd to start all your processes. Systemd organizes processes into "units", such as OS services and user sessions, and "slices", a set of similar units. System units are started and organized by PID 1, the first program executed by the Linux kernel when you start up, while user units are organized by another instance of systemd running as your user, started by PID 1 when you log in. These are represented to the rest of the OS as a hierarchical set of "cgroups", which allow systemd to set maximum resources for units and slices (e.g. make it so users can only use 1GB of RAM at a time).
- _SYSTEMD_CGROUP is the full cgroup path of the process. As it's hierarchical, setting a limit for e.g. user.slice would make it so all users together cannot go above the limit. While this is the only field you really need (systemd-journald parses the slices and units from the path it retrieves from /proc/[pid]/cgroup), the others are included for convenience.
- _SYSTEMD_SLICE is the deepest slice that the PID 1 systemd assigns to the process. In this example journald chooses to say user-1000.slice even though the process is also in user.slice. Note that there is a special case for systemd itself: its full cgroup path is /init.scope but journald claims it's in the special slice -.slice.
- _SYSTEMD_UNIT is the unit that PID 1 systemd assigns to this process. For a system service this would probably be something like dbus.service.
- _SYSTEMD_USER_SLICE and _SYSTEMD_USER_UNIT are similar to the same fields without USER, but assigned by the user systemd.
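To make the relationship between the path and the convenience fields concrete, here is a deliberately simplified sketch that derives them from a cgroup path. It is not systemd's actual algorithm (which handles escaping, scopes, and other cases), and the function name is hypothetical; it just walks the path components in order.

```python
def split_cgroup(path):
    """Derive slice/unit fields from a cgroup path (simplified)."""
    parts = [p for p in path.split("/") if p]
    result = {
        "_SYSTEMD_SLICE": "-.slice",  # root slice when no slice is in the path
        "_SYSTEMD_UNIT": None,
        "_SYSTEMD_USER_SLICE": None,
        "_SYSTEMD_USER_UNIT": None,
    }
    i = 0
    # Leading *.slice components: keep the deepest one.
    while i < len(parts) and parts[i].endswith(".slice"):
        result["_SYSTEMD_SLICE"] = parts[i]
        i += 1
    if i < len(parts):
        result["_SYSTEMD_UNIT"] = parts[i]
        i += 1
    # Only a user manager (user@UID.service) has user slices and units below it.
    unit = result["_SYSTEMD_UNIT"] or ""
    if unit.startswith("user@"):
        while i < len(parts) and parts[i].endswith(".slice"):
            result["_SYSTEMD_USER_SLICE"] = parts[i]
            i += 1
        if i < len(parts):
            result["_SYSTEMD_USER_UNIT"] = parts[i]
    return result


example = "/user.slice/user-1000.slice/user@1000.service/session.slice/pipewire.service"
for name, value in split_cgroup(example).items():
    print(f"{name}={value}")
```

Run on the pipewire entry's path, this reproduces the four values shown in the export output; on /init.scope it falls back to -.slice as described above.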
Process context
- _PID is the process ID (a numeric ID from 1 to 4194304) as seen by journald. PIDs are not unique over a system boot but should not be reused at the same time.
- _EXE is the location of the executable. This is the result of canonicalizing the symlinks at executable start (i.e. if originally a → b → c, you execute a, and the links later change to a → b → d, then _EXE will still contain c). On your system this will probably end up being something in /usr/bin, but I use NixOS, which uses extremely long executable path names.
- _CMDLINE is the full command with arguments as you might see in char **argv. Note that a program can change this. The most high-profile example I've seen of this is nginx, where you will see logs from nginx: worker process.
- _COMM is the command name. This will normally be the final part of the path in _EXE but can be different, especially when running programs like busybox where multiple programs are in one file. This is also the pthread name of the sending thread. For example, when logging from Firefox this might be Web Content.
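The same three values are visible for any process under /proc. The sketch below reads them for a given PID; the function name and the `proc_root` parameter (handy for testing against a fake directory) are hypothetical, and this only shows where the data lives, not how journald itself collects it.

```python
import os


def process_context(pid, proc_root="/proc"):
    """Read exe, cmdline and comm for `pid` from a /proc-style tree."""
    base = os.path.join(proc_root, str(pid))
    # cmdline is argv joined with NUL bytes; a process may overwrite it.
    with open(os.path.join(base, "cmdline"), "rb") as f:
        argv = [a.decode(errors="replace") for a in f.read().split(b"\0") if a]
    with open(os.path.join(base, "comm")) as f:
        comm = f.read().rstrip("\n")
    return {
        "_PID": pid,
        # exe is a symlink to the already-resolved executable path.
        "_EXE": os.readlink(os.path.join(base, "exe")),
        "_CMDLINE": " ".join(argv),
        "_COMM": comm,
    }


if __name__ == "__main__" and os.path.isdir("/proc/self"):
    for name, value in process_context(os.getpid()).items():
        print(f"{name}={value}")
```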
Time
Time, as it turns out, is extremely complicated. You'll get 3 separate time fields. Two of them are the "wall clock" time in unix time (nominally microseconds since midnight at the beginning of 1 January 1970 UTC, though leap seconds make this a bit more complicated). Unfortunately, wall clock time can jump forwards or backwards if your computer's clock is too slow or fast, respectively. Therefore, systemd also includes the "monotonic time", a number of microseconds since some point in the past. This is guaranteed to always move forward, so this is what you'll want to use to determine ordering.
Unfortunately, monotonic time is also more complicated than you might expect. Linux has 4 separate monotonic timers:
- CLOCK_MONOTONIC_RAW counts the amount of time that Linux has spent not asleep since last boot.
- CLOCK_MONOTONIC is similar but will also include incremental adjustments from your time synchronization daemon noticing a fast or slow clock and reporting it to the kernel with adjtime. The adjustments happen over a long period by speeding up or slowing the clock.
- CLOCK_BOOTTIME is similar to CLOCK_MONOTONIC but also counts time where the system is asleep.
- CLOCK_MONOTONIC_COARSE is similar to CLOCK_MONOTONIC but requires slightly less work in the kernel and is less accurate. This isn't particularly useful for our use case.
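You can compare these clocks yourself with Python's `time` module. The sketch below samples each one in microseconds, the unit the journal uses; it assumes Linux, skips any clock the running Python or kernel doesn't expose, and leaves out CLOCK_MONOTONIC_COARSE because the `time` module has no constant for it.

```python
import time

CLOCKS = [
    "CLOCK_REALTIME",
    "CLOCK_MONOTONIC",
    "CLOCK_MONOTONIC_RAW",
    "CLOCK_BOOTTIME",
]


def read_clocks():
    """Return {clock name: microseconds} for every available clock."""
    readings = {}
    for name in CLOCKS:
        clock_id = getattr(time, name, None)
        if clock_id is None:
            continue  # not exposed on this platform
        try:
            readings[name] = time.clock_gettime_ns(clock_id) // 1000
        except OSError:
            continue  # kernel does not support this clock
    return readings


for name, micros in read_clocks().items():
    print(f"{name:20} {micros}")
```

On a laptop that has been suspended since boot, CLOCK_BOOTTIME should come out noticeably larger than CLOCK_MONOTONIC.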
Now for the actual fields:
- _SOURCE_REALTIME_TIMESTAMP is the UTC timestamp collected as soon as journald receives the message.
- __REALTIME_TIMESTAMP is the UTC timestamp collected when the message is written into the journal file. It takes a bit of time for journald to read all the context data out of /proc so this is a bit later.
- __MONOTONIC_TIMESTAMP is the monotonic timestamp from when the message is written into the journal file. Journald uses CLOCK_MONOTONIC but there's not much of a specification to tell people implementing it which one to use.
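As a worked example, the two wall clock values from the pipewire entry can be turned into dates and subtracted to see how long journald took between receiving and writing the message. The helper name is made up; integer arithmetic is used so no microseconds are lost to floating point.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_micros(value):
    """Convert a journal timestamp (microseconds since the epoch) to UTC."""
    return EPOCH + timedelta(microseconds=int(value))


received = from_micros("1622336991484040")  # _SOURCE_REALTIME_TIMESTAMP
written = from_micros("1622336991488173")   # __REALTIME_TIMESTAMP

print(received.isoformat())  # 2021-05-30T01:09:51.484040+00:00
print(written - received)    # 0:00:00.004133
```

So in this case journald spent about 4 ms gathering context before the entry hit the file.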
Reader context
Fields starting with two underscores are generated by journalctl while reading.
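One way to see that these fields are derived from data already stored with the entry is to pick apart __CURSOR. The sketch below assumes the cursor layout used by current systemd, where the `m=` and `t=` parts are the monotonic and realtime timestamps in hexadecimal; treat the key meanings and the function name as assumptions rather than a stable format.

```python
def parse_cursor(cursor):
    """Split a journal cursor string into its (assumed) components."""
    parts = dict(item.split("=", 1) for item in cursor.split(";"))
    return {
        "seqnum_id": parts["s"],
        "seqnum": int(parts["i"], 16),
        "boot_id": parts["b"],
        "monotonic_us": int(parts["m"], 16),
        "realtime_us": int(parts["t"], 16),
        "xor_hash": parts["x"],
    }


cursor = (
    "s=ac8b24ddf2634038b49168edc5d6e544;i=b4233a3;"
    "b=4950abcba7ec46629fce878fb239a6e6;m=1408502e8be;"
    "t=5c381c41568ad;x=2af33cc0b675a511"
)
info = parse_cursor(cursor)
print(info["monotonic_us"])  # 1376621095102, same as __MONOTONIC_TIMESTAMP
print(info["realtime_us"])   # 1622336991488173, same as __REALTIME_TIMESTAMP
```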
The actual message
Finally there's the untrusted message sent by the process.
- MESSAGE is the only required field and is what shows up in journalctl when you don't use -o export.
- Some programs also send where the error comes from with CODE_FILE, CODE_LINE, and CODE_FUNC.
- PRIORITY is how severe the issue is, with 0 being the most important and 7 being the least. In this example, 4 means "warning".
- SYSLOG_IDENTIFIER is the program identifier and is what you would get as the program source if you were using syslogd (as you would before systemd).
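The numeric priorities follow the classic syslog severity levels. Here is a small sketch that turns the untrusted fields of an entry into a one-line summary; the output layout is invented for illustration (it is not journalctl's format), and treating a missing PRIORITY as 6 ("info") is an assumption.

```python
PRIORITY_NAMES = [
    "emerg", "alert", "crit", "err",
    "warning", "notice", "info", "debug",
]


def summarize(entry):
    """Build a one-line summary from the process-supplied fields."""
    priority = PRIORITY_NAMES[int(entry.get("PRIORITY", 6))]
    ident = entry.get("SYSLOG_IDENTIFIER", "unknown")
    line = f"[{priority}] {ident}: {entry['MESSAGE']}"
    if "CODE_FILE" in entry:
        where = f"{entry['CODE_FILE']}:{entry.get('CODE_LINE', '?')}"
        if "CODE_FUNC" in entry:
            where += f" in {entry['CODE_FUNC']}"
        line += f" ({where})"
    return line


print(summarize({
    "PRIORITY": "4",
    "SYSLOG_IDENTIFIER": "pipewire",
    "CODE_FILE": "../src/pipewire/impl-node.c",
    "CODE_LINE": "957",
    "CODE_FUNC": "dump_states",
    "MESSAGE": "client too slow!",
}))
```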
A Note on Namespaces
Linux has the concept of a "namespace", mostly seen with containers, which allows different processes to see the system differently. A few types make things interesting when logging to a journald outside the namespace (e.g. if you pass through the host journald socket to a container).
- PID namespaces allow different processes to see a different list of processes. For example, I'm currently running a container to run games on my laptop. If I list processes in the container I only see 34 while there are 472 running on my system overall. Additionally, within a PID namespace PIDs will be remapped. My games container is running systemd at PID 1 but that same process appears to the rest of my system as PID 2454372. The Linux kernel remaps PIDs to make sense in the receiver's PID namespace when sending the sender's credentials, so journald will record the PID as seen from the host if you pass it through to a container.
- User namespaces remap user and group IDs. This can be useful so that Linux doesn't assume everything running as root in your container has full root permissions on the host with respect to e.g. loading kernel modules. When you make a user namespace you specify a UID map for reading and setting UIDs. For example, I use UID 1000000 instead of UID 0 in my games container. Journald will see the UID and GID as seen from the host namespace.
- Mount
- UTS namespaces allow programs to see a different hostname. Journald deals with these by not caring. Journald sets the _HOSTNAME field by asking for the hostname from the OS once then caching it. Messages from containers will show up in the log as using the hostname where journald receives it.
- Time
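The UID remapping can be sketched with the format of /proc/[pid]/uid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The map text below is a hypothetical example matching the games container described above (the range length of 65536 is an assumption), and the function names are made up.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, count) tuples."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(x) for x in line.split())
            ranges.append((inside, outside, count))
    return ranges


def to_host_id(ranges, inside_id):
    """Translate an ID inside the namespace to the ID the host sees."""
    for inside, outside, count in ranges:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None  # not mapped in this namespace


container_map = parse_id_map("0 1000000 65536\n")
print(to_host_id(container_map, 0))     # 1000000: container root on the host
print(to_host_id(container_map, 1000))  # 1001000
```

A host journald would record the right-hand values in _UID and _GID, since that is how the sender looks from its own namespace.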
Transports
There's still one field I haven't described: _TRANSPORT. This requires a little more context.
Journald can get messages from one of 6 separate sources: journal (using the native journald protocol), stdout (a process's standard output or error redirected to systemd), syslog (the legacy Linux logging system), kernel (kernel messages you can get through the dmesg command), audit (logs the kernel generates about programs' activities), and driver (error messages from within journald). Each has its own peculiarities from both the journald side and the client side, but I'll mostly be talking about journal, stdout, and syslog.
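For a feel of the journal transport from the client side, here is a rough sketch of encoding and sending an entry over the native protocol. It assumes the commonly documented wire format (newline-terminated `NAME=value` pairs, with a length-prefixed form for values containing newlines) and the usual socket path /run/systemd/journal/socket; the function names are hypothetical and large-message handling (passing a memfd) is left out.

```python
import socket
import struct

JOURNAL_SOCKET = "/run/systemd/journal/socket"  # assumed default path


def encode_field(name, value):
    """Serialize one field for the native protocol."""
    data = value.encode() if isinstance(value, str) else bytes(value)
    key = name.encode("ascii")
    if b"\n" in data:
        # Values with newlines: name, newline, 64-bit little-endian
        # length, raw data, newline.
        return key + b"\n" + struct.pack("<Q", len(data)) + data + b"\n"
    return key + b"=" + data + b"\n"


def encode_entry(fields):
    return b"".join(encode_field(k, v) for k, v in fields.items())


def send_entry(fields, path=JOURNAL_SOCKET):
    """Send one entry as a single datagram to journald."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_entry(fields), path)


print(encode_entry({"MESSAGE": "hello from a sketch", "PRIORITY": "6"}))
```

Calling `send_entry({"MESSAGE": "hello", "PRIORITY": "6"})` on a systemd machine should produce an entry with _TRANSPORT=journal; any field names starting with an underscore would be ignored, since those are reserved for journald.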
Native (journal)
Service output (stdout)
Legacy (syslog)
Overview
Anything related to Linux quickly turns into a huge rabbit hole. I could certainly write articles on many of the topics covered here. This is mostly from my own experimentation. If you have more information or noticed an error, please contact me. I'd be happy to correct anything.