
    Troubleshoot Linux Using dmesg

    By Dana · Server Monitoring & Linux

    When something misbehaves on a Linux box — a disk goes read-only, a service gets killed for no obvious reason, the NIC resets mid-traffic, a process segfaults, a USB device won't enumerate — the kernel almost always wrote why into its ring buffer before anything in userspace noticed. dmesg is the tool that reads that buffer. It's the first place a competent sysadmin looks when a Linux machine surprises them, and it's the difference between "the box is being weird" and "the SATA controller threw a UNC error at LBA 8423000 forty seconds before the service died."

    The literal command is dmesg. The useful skill is knowing what to filter for, what each common message class actually means, and how to correlate a dmesg line with the userspace event that brought you here. This guide walks through how to read dmesg productively, the recipes for the half-dozen problem classes that account for most of what you'll see, and how to handle the cases where the kernel knows what went wrong but the ring buffer has already overwritten it.

    Everything below assumes a modern Linux distro (kernel 4.x or newer, systemd-based) — AlmaLinux/Rocky 8+, RHEL 8+, Debian 10+, Ubuntu 20.04+, current Arch/Fedora. Older systems work mostly the same; the journald integration is the main thing that's evolved.


    What dmesg actually is

    The name is usually expanded as "diagnostic messages" (some sources say "display message"). What it prints is the kernel's in-memory printk ring buffer. Every printk() call inside the kernel, including those from kernel modules, drivers, and subsystems, writes into this buffer. dmesg is just a thin reader for it.

    Three properties of that buffer matter for troubleshooting:

    • It's circular. The buffer is fixed size (commonly somewhere between 128 KiB and a few MiB, set at build time by CONFIG_LOG_BUF_SHIFT or at boot by log_buf_len=). Once full, new messages overwrite the oldest ones. On a busy or chatty system, an interesting message from six hours ago may simply be gone.
    • It survives across runlevels but not reboots. Reboot the machine and the buffer is empty. To investigate a crash after a reboot, you need a persistent capture (journald, rsyslog, or a kdump crash log) — dmesg itself is gone.
    • It's everything the kernel said, in order. Boot-time hardware probing, driver loads, every later subsystem event, every error, every warning, every OOM-killer decision — all in one stream. That's its strength (correlation across subsystems is trivial) and its weakness (signal-to-noise can be brutal without filters).

    This is also why dmesg is fundamentally different from journalctl, /var/log/syslog, or /var/log/messages. Those are userspace logs collected from running daemons and from forwarded kernel messages. dmesg is the kernel's own perspective, read directly from kernel memory. They overlap (journald captures kernel messages too — see below) but they're not the same source of truth.


    dmesg vs journalctl vs /var/log

    A quick mental model so you reach for the right tool:

    Tool | Source | Survives reboot? | Good for
    dmesg | Kernel ring buffer (live, in kernel memory) | No | Anything the kernel said: hardware, drivers, OOM, MCE, network stack
    journalctl -k | journald's captured copy of kernel messages | Yes, if persistent journal is enabled | The same as dmesg, but historical
    journalctl -u <unit> | journald, per-unit | Yes | Why a specific service died
    /var/log/messages or /var/log/syslog | rsyslog/syslog-ng, if installed | Yes | Legacy aggregation; many distros still tee here
    /var/log/kern.log | rsyslog-filtered kernel messages | Yes | The historical "kernel-only" view on Debian/Ubuntu

    Practically: dmesg for "what is the kernel saying right now", journalctl -k --since "1 hour ago" for "what did the kernel say earlier today", journalctl -u <unit> for "why did this service die". Most modern incidents are some interleaving of those three, and you'll often have two terminals open running them in parallel.

    journalctl -k on its own shows the current boot even with a volatile journal. If journalctl -k -b -1 cannot find an earlier boot on a machine that has rebooted, the persistent journal isn't enabled. mkdir -p /var/log/journal && systemctl restart systemd-journald turns it on. Without it, every reboot wipes the kernel log history, which is exactly when you most want to read it.


    Reading dmesg output

    Run dmesg on a modern system and you'll get something like:

    [    0.000000] Linux version 6.1.0-21-amd64 (debian-kernel@lists.debian.org) ...
    [    0.123456] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian
    [    1.234567] usb 1-1: new high-speed USB device number 2 using xhci_hcd
    [ 1283.456789] EXT4-fs (sda1): mounted filesystem with ordered data mode
    [ 8421.987654] sd 0:0:0:0: [sda] tag#16 UNC error
    [ 9001.111111] Out of memory: Killed process 12345 (mysqld) total-vm:...
    

    The bracketed number is seconds since boot — useful for ordering, useless for "when was this in wall-clock time" if your incident was at 03:14 yesterday. Two flags fix that:

    # Wall-clock timestamps
    dmesg -T
    
    # ISO-8601 with timezone (saner for grep, sort, sharing in tickets)
    dmesg --time-format=iso
    

    -T is what most people remember and it's fine for interactive use. --time-format=iso is what you want in any output you're going to paste into a ticket or feed into another tool — the format is unambiguous and sortable.
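    For intuition about what -T does, the conversion is just boot time plus the bracketed offset. The following is a minimal sketch, assuming a POSIX awk; stamp_to_epoch is a hypothetical helper name, not part of dmesg.

```shell
# stamp_to_epoch BOOT_EPOCH
# Reads raw dmesg lines on stdin and prints "<unix time> <message>",
# where unix time = boot time + seconds since boot (truncated).
# Hypothetical helper for illustration only.
stamp_to_epoch() {
  awk -v boot="$1" '
    match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
      secs = substr($0, 2, RLENGTH - 2)        # text between the brackets
      printf "%d %s\n", boot + secs, substr($0, RLENGTH + 2)
    }'
}

# Usage (commented out so sourcing this block has no side effects):
#   boot=$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) ))
#   sudo dmesg | stamp_to_epoch "$boot"
```

    The resulting number can be fed to GNU date (date -d @EPOCH) to get a readable time. In practice dmesg -T is the simpler route; the sketch only shows the arithmetic.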

    Each line also has an implicit log level (emerg, alert, crit, err, warn, notice, info, debug) and a syslog-style facility (kern, user, daemon, and so on; it is not a kernel subsystem). They're not printed by default. To see them:

    dmesg -x
    
    kern  :info  : [    1.234567] usb 1-1: new high-speed USB device ...
    kern  :err   : [ 8421.987654] sd 0:0:0:0: [sda] tag#16 UNC error
    kern  :err   : [ 9001.111111] Out of memory: Killed process ...
    

    The level is the first knob you'll filter on.
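    Before filtering, it can help to see how the buffer breaks down by level. This is a rough sketch that tallies dmesg -x output, assuming the "facility :level : message" layout shown above; count_levels is a hypothetical name.

```shell
# count_levels: tally `dmesg -x` lines by log level.
# Assumes each line starts with "facility :level :" as in the sample above.
count_levels() {
  awk -F: '
    /^[a-z0-9]+ *:[a-z]+ *:/ {
      level = $2
      gsub(/ /, "", level)       # strip the column padding
      n[level]++
    }
    END { for (l in n) print l, n[l] }' | sort
}

# Usage (commented out): sudo dmesg -x | count_levels
```

    A box where err or crit counts are in the hundreds is telling you something before you read a single line.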


    Filtering: the four flags that matter

    dmesg has a lot of flags. In practice, four cover 95% of troubleshooting.

    -l, --level — by severity

    # Just errors and worse
    dmesg --level=err,crit,alert,emerg
    
    # Warnings too
    dmesg --level=warn,err,crit,alert,emerg
    

    This is the first thing to run when you arrive at a misbehaving box. Cutting out info-level chatter (the bulk of every dmesg dump) takes you from "wall of text" to "the dozen lines that actually flag a problem" in one command.

    -f, --facility — by kernel subsystem

    # Only messages from a specific facility
    dmesg -f kern              # kernel core
    dmesg -f user              # user-space (rare)
    dmesg -f daemon            # daemon-tagged messages
    

    For kernel troubleshooting, -f kern is almost always what you want — most other facilities are empty.

    -w, --follow — live tail

    sudo dmesg -wT
    

    This is tail -f for the kernel ring buffer. Run it in a side terminal when you're actively reproducing a bug — plug in a USB device, mount a disk, trigger the failing operation — and watch what the kernel says in real time. Combining with -T gives wall-clock timestamps so you can correlate with whatever you typed.

    -H, --human — paginated, colorized, friendly

    sudo dmesg -H
    

    Roughly equivalent to dmesg --color --reltime piped through a pager. Errors are red, the buffer scrolls in a pager, and timestamps are shown as a date followed by deltas between consecutive messages (such as +0.000123). Nice for interactive browsing; useless for piping anywhere.

    A combination I run on every box I troubleshoot: sudo dmesg -T --level=err,warn,crit,alert,emerg | tail -100. That's the last 100 anything-worse-than-info kernel messages, with wall-clock timestamps. Almost any kernel-level problem will surface in that view.


    A note on permissions

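    On many current distros (Debian and recent Ubuntu releases, for example), the kernel.dmesg_restrict sysctl defaults to 1, so dmesg requires root, because the ring buffer can leak addresses useful for kernel exploits. Other distros may leave it at 0; sysctl kernel.dmesg_restrict shows your setting. Where it is restricted, plain dmesg as a regular user returns: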

    dmesg: read kernel buffer failed: Operation not permitted
    

    sudo dmesg works. If you want to allow regular users to read it (for example, on a personal machine), sudo sysctl kernel.dmesg_restrict=0 flips the toggle; persist it with echo 'kernel.dmesg_restrict=0' | sudo tee /etc/sysctl.d/99-dmesg.conf. On servers and any multi-user box, leave it restricted.
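    To check the current setting from a script, something like the following would do. It is a small sketch that reads the sysctl's /proc file; dmesg_access is a hypothetical helper, and the optional file argument exists only so it can be tried against a test file.

```shell
# dmesg_access [FILE]: report whether unprivileged dmesg reads are restricted.
# FILE defaults to the real sysctl file; hypothetical helper for illustration.
dmesg_access() {
  file=${1:-/proc/sys/kernel/dmesg_restrict}
  case "$(cat "$file" 2>/dev/null)" in
    0) echo "open: any user can read the ring buffer" ;;
    1) echo "restricted: root (or CAP_SYSLOG) required" ;;
    *) echo "unknown: could not read $file" ;;
  esac
}

# Usage (commented out): dmesg_access
```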


    What you'll actually find: the common message classes

    The vast majority of "real" dmesg findings fall into one of seven categories. Recognizing each on sight saves the loop of "weird kernel line → grep → confused → ask the internet."

    OOM killer

    When the kernel runs out of memory and chooses a victim, it logs a substantial block. It opens with the task whose allocation triggered the OOM killer and closes with the kill itself:

    [ 9001.111] mysqld invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
    [ 9001.112] Mem-Info:
    ...
    [ 9001.113] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0
    [ 9001.114] Out of memory: Killed process 12345 (mysqld) total-vm:8390124kB, ...
    

    Between those lines sit a per-zone memory breakdown and the process table at the time of the kill. The last line names the killed process. The block above it tells you what else was using memory, which is often more useful, because the OOM killer's choice of victim is not always the actual culprit; it just picked whoever had the highest oom_score. If mysqld got killed but the real RSS hog was a runaway Python job that escaped its cgroup, you'll see it in the process list.

    Grep recipe (the context you want is before the kill line, so use -B):

    sudo dmesg -T | grep -i -B 50 "out of memory"
    

    If you see OOM-killer activity on a server you didn't expect to be memory-constrained, server-level memory monitoring with alerts on commit and swap usage is the next step — the kill itself is the symptom; the slow climb that preceded it is the signal you want to alert on, ideally before the kill.
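    To see at a glance who keeps getting killed, the kill lines can be tallied per process name. This is a minimal sketch assuming the "Killed process PID (name)" wording shown above; oom_victims is a hypothetical helper name.

```shell
# oom_victims: count OOM kills per process name from dmesg text on stdin.
# Assumes the "Killed process <pid> (<name>)" format used by current kernels.
oom_victims() {
  sed -n -E 's/.*Killed process [0-9]+ \(([^)]*)\).*/\1/p' |
    awk '{ n[$0]++ } END { for (name in n) print n[name], name }' |
    sort -rn
}

# Usage (commented out):
#   sudo dmesg | oom_victims
#   sudo journalctl -k --since "7 days ago" | oom_victims
```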

    Block layer and filesystem errors

    Disks that are dying, or about to, often talk to you through dmesg before SMART alerts or filesystem checkers do. The classic patterns:

    [ 8421.987] sd 0:0:0:0: [sda] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [ 8421.988] sd 0:0:0:0: [sda] tag#16 Sense Key : Medium Error [current]
    [ 8421.989] sd 0:0:0:0: [sda] tag#16 Add. Sense: Unrecovered read error - auto reallocate failed
    [ 8421.990] sd 0:0:0:0: [sda] tag#16 CDB: Read(10) 28 00 00 80 b3 80 00 00 08 00
    [ 8421.991] blk_update_request: I/O error, dev sda, sector 8434048 op 0x0:(READ) ...
    [ 8421.992] EXT4-fs warning (device sda1): ext4_end_bio:341: I/O error 10 writing to inode ...
    [ 8421.993] EXT4-fs error (device sda1): __ext4_iget:5167: inode #524289: comm postgres: ...
    [ 8421.994] EXT4-fs (sda1): Remounting filesystem read-only
    

    That whole cascade is one event: a bad sector on sda, the SCSI layer reporting it, the block layer giving up, and ext4 deciding to go read-only rather than risk corruption. The last line — "Remounting filesystem read-only" — is the moment your application started failing all writes. The first line is the moment the underlying hardware actually broke.

    Grep recipes:

    # Block-layer I/O errors
    sudo dmesg -T | grep -iE "i/o error|blk_update_request|sense key"
    
    # Filesystem-layer errors
    sudo dmesg -T | grep -iE "ext4|xfs|btrfs|zfs" | grep -iE "error|warning"
    

    If you see these on a production server, treat it as "the disk is dying, plan replacement today." It almost never gets better on its own.
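    When deciding which disk to pull, a per-device summary of the block-layer errors is handy. The sketch below assumes the "I/O error, dev X, sector N" wording from the sample output; io_errors_by_dev is a hypothetical name.

```shell
# io_errors_by_dev: for each device, print "<dev> <error count> <first sector seen>".
# Assumes the block-layer wording "I/O error, dev <dev>, sector <n>".
io_errors_by_dev() {
  sed -n -E 's/.*I\/O error, dev ([^,]+), sector ([0-9]+).*/\1 \2/p' |
    awk '{ count[$1]++; if (!($1 in first)) first[$1] = $2 }
         END { for (d in count) print d, count[d], first[d] }' |
    sort
}

# Usage (commented out): sudo dmesg | io_errors_by_dev
```

    Many errors clustered around neighbouring sectors on one device point at a failing region of that disk; errors spread across several devices on the same controller suggest looking at the cable or HBA instead.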

    Network and NIC issues

    Driver problems, link flaps, and offload bugs all surface here:

    [12345.678] e1000e 0000:00:1f.6: eth0: Detected Hardware Unit Hang
    [12345.789] r8169 0000:03:00.0: eth0: rtl_chipcmd_cond loop too long!
    [12346.001] e1000e 0000:00:1f.6: eth0: Reset adapter unexpectedly
    [12346.234] mlx5_core 0000:01:00.0: mlx5_health_check: device's health compromised
    

    Random connection drops on a server that "should be stable" are very often this kind of NIC reset, not a switch or upstream problem. The fix is usually a firmware update, an offload flag toggle (ethtool -K eth0 tso off gso off), or — for some Realtek hardware — accepting that this is just what that driver does and replacing the NIC.

    Grep recipe:

    sudo dmesg -T | grep -iE "eth[0-9]|en[ops][0-9]|wlp|link|nic" | grep -iE "down|reset|hang|error"
    

    Hardware errors (MCE)

    Machine Check Exceptions are the CPU and memory subsystem telling you something physically wrong happened — a corrected (or uncorrected) ECC error, a CPU pipeline glitch, a cache fault. They're rare but always worth attention:

    [34567.890] mce: [Hardware Error]: Machine check events logged
    [34567.891] [Hardware Error]: CPU 3: Machine Check: 0 Bank 7: ee2000000010005a
    [34567.892] [Hardware Error]: TSC 1ec4c6e34a48f4 ADDR 1bf3e8240 MISC 86
    [34567.893] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 ...
    

    A single corrected error (CE) on a server-grade ECC system isn't an emergency, but a steady stream of them from the same DIMM is a "replace before it goes uncorrected" signal. Uncorrected errors (UE) typically panic the kernel and reboot the box; you'll find them in the post-reboot journal, not live dmesg.
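    Spotting "a steady stream from the same DIMM" is easier with a tally. The following is a rough sketch that assumes the EDAC line shape from the sample above ("EDAC MCn: COUNT CE ... on LOCATION"); real EDAC drivers vary in wording, and edac_ce_by_dimm is a hypothetical name.

```shell
# edac_ce_by_dimm: sum corrected-error counts per DIMM location.
# Assumes lines shaped like: "EDAC MC0: 1 CE memory read error on <location> ..."
edac_ce_by_dimm() {
  awk '
    /EDAC MC[0-9]+: [0-9]+ CE / {
      cnt = 0; loc = "unknown"
      for (i = 2; i <= NF; i++) {
        if ($i == "CE") cnt = $(i - 1)                  # number just before "CE"
        if ($i == "on" && i < NF) { loc = $(i + 1); break }
      }
      total[loc] += cnt
    }
    END { for (l in total) print total[l], l }' | sort -rn
}

# Usage (commented out): sudo journalctl -k --since "30 days ago" | edac_ce_by_dimm
```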

    Grep recipe:

    sudo dmesg -T | grep -iE "mce|machine check|hardware error|edac"
    

    USB and hotplug

    Every USB device plug, unplug, and reset is logged. This is the easiest dmesg pattern to read because it's so verbose:

    [45678.123] usb 1-2: new high-speed USB device number 4 using xhci_hcd
    [45678.234] usb 1-2: New USB device found, idVendor=1d6b, idProduct=0002
    [45678.345] usb 1-2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
    [45678.456] usb 1-2: Product: USB 2.0 Hub
    [45678.567] hub 1-2:1.0: USB hub found
    [45678.678] hub 1-2:1.0: 4 ports detected
    

    If a device "doesn't work when plugged in," running sudo dmesg -wT and then plugging it in shows you exactly which subsystem rejected it and why — wrong driver, insufficient power, descriptor errors, etc.

    Process crashes (segfaults, traps)

    When a userspace process dies on a signal, the kernel logs it:

    [56789.012] traps: myapp[8432] general protection fault ip:7f5d3a1c1234 sp:7ffce5a8 error:0
    [56789.013] myapp[8432]: segfault at 0 ip 00007f5d3a1c1234 sp 00007ffce5a8 error 4 in libfoo.so.1.2.3[7f5d3a1c0000+5000]
    

    The ip: is the instruction pointer at the crash, and the library name tells you which shared object the fault was inside. Combined with addr2line or a coredump, that's enough to find the actual line of code. For "this binary crashes occasionally but I can't reproduce it," dmesg | grep segfault over the last week often reveals a clear pattern (same library, same offset, only on certain machines).
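    The offset inside the library is the instruction pointer minus the mapping base printed in the brackets. Here is a small sketch of that subtraction, assuming the "in lib[base+size]" format shown above and a shell with 64-bit arithmetic such as bash; segv_offset is a hypothetical helper name.

```shell
# segv_offset: from segfault lines on stdin, print "<library> <offset into mapping>".
# Assumes the "... ip <hex> ... in <lib>[<base>+<size>]" format shown above.
segv_offset() {
  sed -n -E 's/.* ip ([0-9a-f]+) .* in ([^[]+)\[([0-9a-f]+)\+[0-9a-f]+\].*/\1 \2 \3/p' |
    while read -r ip lib base; do
      # offset = instruction pointer - start of the mapping
      printf '%s 0x%x\n' "$lib" "$(( 0x$ip - 0x$base ))"
    done
}

# Usage (commented out): sudo dmesg | grep segfault | segv_offset
```

    For the sample line this gives libfoo.so.1.2.3 0x1234. That offset is relative to the start of the mapping, so it may need adjusting before it is handed to addr2line if the faulting mapping is not the first segment of the file.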

    TCP, conntrack, and netfilter

    Network-stack tuning issues show up here:

    [67890.123] nf_conntrack: nf_conntrack: table full, dropping packet
    [67890.234] TCP: out of memory -- consider tuning tcp_mem
    [67890.345] net_ratelimit: 137 callbacks suppressed
    

    nf_conntrack: table full is the classic — your conntrack table is sized for, say, 65k connections, and a traffic burst exceeded it. New connections silently drop until entries expire. The fix is sysctl net.netfilter.nf_conntrack_max=... (and persistent in /etc/sysctl.d/), but the dmesg line is what tells you that's the problem in the first place. Without it, you'd be debugging mysterious connection failures with no kernel-level signal.
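    To see how close the table is to that limit before it overflows, compare the live entry count with the maximum. This is a minimal sketch; conntrack_pct is a hypothetical helper, and the /proc paths in the usage comment exist only when the nf_conntrack module is loaded.

```shell
# conntrack_pct COUNT MAX: print conntrack table usage as an integer percentage.
# Hypothetical helper; integer division, so the result is rounded down.
conntrack_pct() {
  count=$1
  max=$2
  echo $(( count * 100 / max ))
}

# Usage (commented out):
#   conntrack_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                 "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

    Alerting when this crosses a threshold you choose (80, say) gives warning before the "table full" line ever appears.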


    A complete worked example: investigating a hung service

    You get an alert that myapp on web03 is unreachable. SSH still works. systemctl status myapp shows it failed 12 minutes ago. The systemd journal for the unit just says "main process exited, code=killed, status=9/KILL". That's signal 9 — SIGKILL — which myapp didn't send to itself. Something else killed it. Time for dmesg:

    # Pull the last hour of kernel errors and warnings with wall-clock timestamps
    sudo dmesg -T --level=err,warn,crit,alert,emerg --since="1 hour ago" 2>/dev/null \
      || sudo dmesg -T --level=err,warn,crit,alert,emerg | tail -200
    

    (--since is supported on recent util-linux; the fallback tail -200 works everywhere.)

    Scrolling through, you find:

    [Wed May 21 13:42:11 2026] myapp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
    [Wed May 21 13:42:11 2026] CPU: 2 PID: 8765 Comm: myapp Not tainted 6.1.0-21-amd64 #1
    [Wed May 21 13:42:11 2026] Call Trace:
    ...
    [Wed May 21 13:42:11 2026] Mem-Info:
    [Wed May 21 13:42:11 2026]   active_anon:412345 inactive_anon:0 isolated_anon:0
    [Wed May 21 13:42:11 2026]   ... (full breakdown) ...
    [Wed May 21 13:42:11 2026] [  8765]  1000  8765  2098156   412034  3678208     1234       0 myapp
    [Wed May 21 13:42:11 2026] [  9123]  1000  9123   524288   125678  1058816      512       0 python3
    [Wed May 21 13:42:11 2026] Out of memory: Killed process 8765 (myapp) total-vm:8392624kB, anon-rss:1648136kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:14352kB oom_score_adj:0
    

    Now you have the answer: myapp allocated until the system ran out of memory and was killed by the OOM killer at 13:42:11. The kill was 9 minutes before the alert fired, which fits — it took that long for the load balancer to notice and the alert pipeline to escalate.
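    One detail that trips people up when reading the block: the per-process table reports total_vm and rss in pages, while the final "Killed process" line reports kB. A quick sketch of the conversion, assuming 4 KiB pages (getconf PAGESIZE tells you the real value); pages_to_kb is a hypothetical name.

```shell
# pages_to_kb PAGES [PAGE_KB]: convert a page count from the OOM task table to kB.
# PAGE_KB defaults to 4, i.e. 4 KiB pages (an assumption; check `getconf PAGESIZE`).
pages_to_kb() {
  page_kb=${2:-4}
  echo $(( $1 * page_kb ))
}

# Usage with the myapp row from the table above:
#   pages_to_kb 2098156   # total_vm column
#   pages_to_kb 412034    # rss column
```

    For the myapp row, 2098156 pages works out to 8392624 kB (the total-vm figure on the kill line) and 412034 pages to 1648136 kB (the anon-rss figure), which confirms that the table row and the kill line describe the same process.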

    A few minutes later you grep further back to check whether this is a recurrence (note the -i: the kernel writes "Killed" with a capital K):

    sudo dmesg -T | grep -i -B 2 -A 1 "killed process"
    

    Three OOM-kills in the last 24 hours, all myapp. So this isn't a one-off — there's a leak, or the working set has grown past the box's memory, or someone removed a memory limit. Now you have something concrete to dig into. Without dmesg, you'd have spent the next hour staring at "killed, status=9/KILL" with no idea who held the gun.

    That's the loop: alert → quick dmesg -T --level=err,warn,... → recognize the message class → correlate with the userspace timeline → identify the root cause. On a healthy box, the kernel says almost nothing; when it does say something, it usually says why.


    Ring buffer size and persistence

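    The ring buffer size is fixed at build or boot time, commonly somewhere between 128 KiB and a few MiB depending on the distro's CONFIG_LOG_BUF_SHIFT. On a chatty system that may hold a day of messages or less; on a quiet system it can hold months. To check how much is currently in yours: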

    sudo dmesg | wc -c          # current bytes in the buffer
    cat /proc/sys/kernel/printk # current console log level + defaults
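    Neither command above prints the configured capacity. The build-time size is 2 to the power of CONFIG_LOG_BUF_SHIFT bytes, which a one-line helper can compute. This is a sketch; log_buf_bytes is a hypothetical name, and the config file path in the usage comment is where many distros ship the kernel config, not a guarantee.

```shell
# log_buf_bytes SHIFT: ring buffer size in bytes for a given CONFIG_LOG_BUF_SHIFT.
log_buf_bytes() {
  echo $(( 1 << $1 ))
}

# Usage (commented out):
#   grep CONFIG_LOG_BUF_SHIFT "/boot/config-$(uname -r)"   # e.g. CONFIG_LOG_BUF_SHIFT=17
#   log_buf_bytes 17
```

    A shift of 17 is 131072 bytes (128 KiB); 20 is 1 MiB. A log_buf_len= boot parameter overrides the build-time value.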
    

    To enlarge it persistently (boot-time only — it's a kernel build/boot parameter):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX="... log_buf_len=16M"
    # then update-grub (Debian/Ubuntu) or grub2-mkconfig -o /boot/grub2/grub.cfg (RHEL family) and reboot
    

    The more important persistence story is enabling the persistent journal so kernel messages survive across reboots:

    sudo mkdir -p /var/log/journal
    sudo systemd-tmpfiles --create --prefix /var/log/journal
    sudo systemctl restart systemd-journald
    
    # Verify
    journalctl --disk-usage
    # Archived and active journals take up 128.0M in the file system.
    
    # Now this works for historical kernel logs:
    journalctl -k --since "yesterday"
    

    Once that's on, every kernel message dmesg shows in real time is also captured to disk by journald. After a reboot — including a panic-induced reboot — journalctl -k -b -1 gives you the kernel log from the previous boot. That's the single most valuable post-incident artifact on a Linux server, and it's off by default on a surprising number of installs.
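    A quick way to audit a fleet for this is to test for the directory itself. The sketch below assumes journald's default Storage=auto, where the existence of /var/log/journal decides whether logs persist; journal_storage is a hypothetical helper name.

```shell
# journal_storage [DIR]: report whether journald would persist logs.
# Assumes Storage=auto in journald.conf, where the directory's existence decides.
journal_storage() {
  dir=${1:-/var/log/journal}
  if [ -d "$dir" ]; then
    echo "persistent: $dir exists"
  else
    echo "volatile: $dir missing"
  fi
}

# Usage (commented out): journal_storage
```

    If journald.conf sets Storage= explicitly, that setting wins and this check is not conclusive.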


    Common mistakes

    The recurring ways people miss what dmesg is telling them.

    Running dmesg without -T

    Bracketed seconds-since-boot is the default and it's terrible for incident work. If the line says [ 8421.987654], you can't tell whether that was 30 seconds ago or 30 hours ago without uptime math. Make dmesg -T your reflex. Better yet, add an alias: alias dmesg='dmesg -T'.

    Forgetting the buffer wraps

    On a chatty machine — a hypervisor with lots of guest churn, a server doing constant USB hotplug for a tape autoloader, a box with a NIC that resets every few seconds — interesting old messages get overwritten quickly. If you're investigating something from "earlier today" and the relevant lines aren't in dmesg, check journalctl -k --since ... before concluding nothing was logged.

    Reading dmesg only when alerts fire

    A clean dmesg --level=err,warn,... should usually be quiet. If yours has been steadily logging a non-fatal warning every hour for three months, treat that as a yellow flag, not as "background noise." Some of the most expensive incidents started as a warning the on-call team got used to ignoring. Periodic dmesg | grep -iE 'error|warn' is worth scripting into a daily report.
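    For such a report, collapsing repeats is more useful than a raw grep, because the recurring warning is the one you care about. This is a minimal sketch that strips the leading timestamp and counts identical messages; repeat_summary is a hypothetical name.

```shell
# repeat_summary: strip the leading [timestamp] and count identical messages,
# most frequent first. Works on plain or -T output.
repeat_summary() {
  awk '{ sub(/^\[[^]]*\] */, ""); n[$0]++ }
       END { for (m in n) printf "%d %s\n", n[m], m }' | sort -rn
}

# Usage (commented out):
#   sudo dmesg -T --level=err,warn,crit,alert,emerg | repeat_summary | head -20
```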

    Confusing kernel timestamps with userspace timestamps

    When correlating a dmesg event with a userspace log line, remember that they're stamped from different clocks. dmesg -T derives wall-clock time from the kernel's seconds-since-boot stamp plus the boot time. That is close enough on a server that has run continuously, but it can drift, and the dmesg man page warns that it is inaccurate after a suspend/resume cycle. For sub-second correlation across kernel and journald, journalctl -k is more reliable because it uses the journal's own timestamp, which is the same clock as everything else in the journal.

    Filtering by facility on stuff that doesn't use it

    dmesg -f kern is fine. dmesg -f user almost always returns nothing — userspace messages don't normally go through the kernel printk buffer. People sometimes try -f to filter by subsystem (network, scsi, etc.) and get an empty result because the facility model is much coarser than they think. For subsystem-specific filtering, grep the output: dmesg -T | grep -i scsi.

    Ignoring the surrounding lines

    A single dmesg line out of context is often meaningless. Out of memory: Killed process 8765 (myapp) is the headline, but the 50 lines that precede it — the per-process memory breakdown — tell you why the OOM happened. Always grep -B 30 -A 30 (or grep -A 30 ... | less) when you find a notable line; the context is where the answer usually is.

    Running on a fresh-reboot box looking for pre-reboot evidence

    dmesg after a reboot shows only the current boot. If the box crashed and rebooted itself, dmesg will tell you nothing about why. journalctl -k -b -1 (previous boot) is what you want, and only works if persistent journal is enabled — which is exactly why enabling it is one of the first things to do on any new server.


    Troubleshooting recipes

    A handful of one-liners for the cases that come up most often.

    • "Did the OOM killer fire?"

      sudo dmesg -T | grep -iE "oom|out of memory|killed process"
      # If persistent journal:
      sudo journalctl -k --since "7 days ago" | grep -iE "oom|killed process"
      
    • "Is the disk having physical issues?"

      sudo dmesg -T | grep -iE "i/o error|sense key|unrecovered|medium error|sector"
      

      Pair with sudo smartctl -a /dev/sda for the SMART-side picture and disk health monitoring for ongoing visibility.

    • "Is the NIC misbehaving?"

      sudo dmesg -T | grep -iE "eth|enp|wlan|link" | grep -iE "down|reset|hang|err"
      
    • "Did a process segfault?"

      sudo dmesg -T | grep -iE "segfault|traps|general protection|invalid opcode"
      

      The library name in brackets at the end of a segfault line is usually the actionable clue.

    • "Is the network stack dropping things?"

      sudo dmesg -T | grep -iE "conntrack|tcp.*out of memory|net_ratelimit"
      
    • "Anything serious recently?" (the most recent 50 serious lines; add --since "1 hour ago" on recent util-linux to bound it by time)

      sudo dmesg -T --level=err,crit,alert,emerg | tail -50
      
    • "Live-watch what happens when I plug this in / mount this / start this service":

      sudo dmesg -wT
      # In another terminal: trigger the operation
      
    • "Clear the buffer so I can see only what happens next":

      sudo dmesg -C
      # Then trigger the action you're investigating
      sudo dmesg -T
      

      Only useful interactively, and you lose history — don't do this on a production box unless you've captured the existing buffer first (sudo dmesg -T > /tmp/dmesg-before-clear.log).

    • "Show me kernel logs from the previous boot" (post-crash forensics):

      sudo journalctl -k -b -1
      # Or list all available boots:
      sudo journalctl --list-boots
      

    Where dmesg fits in a monitoring pipeline

    dmesg is interactive — it's what you run when you're at a terminal investigating a specific machine. For continuous coverage, the kernel messages need to flow into an aggregation pipeline that alerts on the patterns that matter:

    • Enable the persistent journal (as above) so post-reboot forensics are possible.
    • Ship /var/log/journal/ to a central log collector (Vector, Fluent Bit, Loki, Elastic, whatever you have). The kernel messages ride along with everything else.
    • Alert on the patterns that historically lead to incidents on your stack: oom-killer, I/O error, EXT4-fs error, nf_conntrack: table full, MCE, repeated NIC resets. These are low-volume on a healthy system — false-positive risk is minimal.
    • Pair with server-level metrics monitoring (memory, disk, network) so you have the gradient leading up to the event, not just the moment of failure. The kernel message tells you what broke; the metric tells you what was happening when it broke.
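    The pattern-alerting step in the list above can be prototyped with a single grep before it is moved into a real log pipeline. This is a sketch only; kernel_alerts is a hypothetical name, and the pattern list is simply the examples named in this section, to be tuned for your own stack.

```shell
# kernel_alerts: pass through only kernel lines that match known-bad patterns.
# Exit status is 0 if anything matched, so it can gate a notification.
kernel_alerts() {
  grep -E 'oom-killer|I/O error|EXT4-fs error|nf_conntrack: table full|\[Hardware Error\]|Reset adapter'
}

# Usage (commented out; the mail command is a placeholder for your own notifier):
#   journalctl -k --since "10 minutes ago" | kernel_alerts && echo "kernel alert on $(hostname)"
```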

    The core of dmesg works much as it did 20 years ago. What's changed is the surrounding infrastructure (journald, log shipping, alerting on patterns), which turns "the kernel logged this error 3 hours ago and nobody noticed" into "we paged on the I/O-error pattern, started the disk swap, and replaced it before the filesystem went read-only."


    Summary

    To troubleshoot Linux effectively with dmesg:

    1. dmesg reads the kernel ring buffer — kernel-only, in-memory, lost on reboot. Different from journalctl and /var/log, but often the most direct answer.
    2. Always use dmesg -T. Boot-seconds timestamps are useless for incident work. Make it a reflex or an alias.
    3. Filter by level first: dmesg -T --level=err,warn,crit,alert,emerg | tail -100 is the universal "is anything broken" view.
    4. Recognize the common message classes on sight: OOM killer, block-layer I/O errors, NIC resets, MCE, segfaults, conntrack-table-full. Each has a distinctive signature and a known fix path.
    5. Read the surrounding lines. A single line is usually a headline; the answer is in the 20–50 lines around it. Use grep -B 30 -A 30 or pipe to less.
    6. Enable the persistent journal (mkdir -p /var/log/journal && systemctl restart systemd-journald) so journalctl -k -b -1 works after a crash. This single change pays for itself the first time a box reboots unexpectedly.
    7. Use dmesg -wT to live-tail when reproducing a bug — plug in the device, trigger the operation, watch the kernel react in real time.
    8. Ship kernel logs to your aggregation pipeline and alert on known-bad patterns. dmesg is the interactive tool; alerting on its output is what makes the patterns visible before they become incidents.

    The command itself is simple. The skill is recognizing what each kind of message means and knowing the right filter for the question you're asking — and the rest is just patience. The kernel almost always wrote down why something went wrong; dmesg is how you read what it said.