"How do I tune the Linux kernel?" is one of those questions that hides a much harder one: for what workload, on what hardware, against what bottleneck? The literal mechanics — sysctl -w, drop a file into /etc/sysctl.d/, run sysctl -p — are easy and take five minutes to learn. The hard parts are knowing which knobs are worth touching on a modern kernel, why they matter, how to measure that the change helped, and how to back it out when it didn't. This guide covers the mechanics, then walks through the parameters that are actually load-bearing on real servers in 2026, what they mean, sensible starting values for common workloads, and the mistakes that make "tuning" a net loss.
Two things to internalize before you change anything:
- The kernel's defaults have improved enormously. Many "must-tune" parameters from old guides (tcp_tw_recycle, tcp_fin_timeout=15, aggressive vm.swappiness=10 for every server) are either gone, harmful, or already the modern default. Check the running default before assuming you need to override it.
- Tuning without measurement is theater. A change that might help, shipped without a benchmark before and after, is just configuration drift. Measure first, change one thing, measure again. If you can't quantify the improvement, the change shouldn't ship.
The examples below assume a recent Linux distro (kernel 5.15+ — Ubuntu 22.04/24.04, AlmaLinux 9, Rocky 9, Debian 12). Older kernels have additional pitfalls noted inline.
What kernel parameters actually are
Linux exposes a few thousand runtime tunables through three related interfaces:
- /proc/sys/: virtual files. Read with cat, write with echo > file. Each file is one parameter.
- sysctl: the userspace command that reads and writes those same files using dotted names. net.ipv4.tcp_syncookies corresponds to /proc/sys/net/ipv4/tcp_syncookies.
- /sys/: a separate filesystem for hardware and driver knobs (block scheduler, CPU governor, IRQ affinity). Not controlled by sysctl; edited directly, or persistently via tuned or udev rules.
The first two are the same data, two interfaces. The third is a different beast — when people say "kernel parameters" they usually mean sysctl, and that's what most of this guide is about. A short section near the end covers the /sys/ knobs that genuinely matter.
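The mapping between the two views is mechanical, which a small helper can make concrete. This is a minimal sketch: sysctl_path is a hypothetical name, and it assumes no path component contains a literal dot (true for most keys, but not for interface names such as eth0.100).

```shell
# sysctl_path NAME: print the /proc/sys file behind a dotted sysctl name.
# Assumption: dots only ever separate path components.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

sysctl_path net.ipv4.tcp_syncookies
# prints /proc/sys/net/ipv4/tcp_syncookies
```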
Three more interfaces frequently get conflated with kernel tuning but aren't:
- ulimit / /etc/security/limits.conf: per-process resource limits enforced by PAM. Not kernel parameters, but tied to them (fs.file-max is a kernel-wide ceiling, ulimit -n is the per-process one).
- systemd unit Limit* directives: the systemd equivalent of limits.conf for services started by systemd. LimitNOFILE=, LimitNPROC=, etc. These override limits.conf for systemd-managed services.
- Boot-time kernel parameters (/etc/default/grub GRUB_CMDLINE_LINUX): things like transparent_hugepage=, mitigations=, intel_iommu=. Different mechanism, same family of decisions.
Knowing which interface owns a setting saves you from "I added nofile=65535 to sysctl.conf and it didn't help" — because that's a limits.conf value.
Reading and writing parameters: the mechanics
Look at a current value
# By dotted name
sysctl net.ipv4.tcp_syncookies
# net.ipv4.tcp_syncookies = 1
# Or read the proc file directly
cat /proc/sys/net/ipv4/tcp_syncookies
# 1
# Dump everything (huge — pipe to grep)
sysctl -a 2>/dev/null | grep tcp_
sysctl -a prints warnings on stderr for parameters it cannot read (permission denied when run unprivileged, for example); the 2>/dev/null is conventional. sysctl -A is only an alias of -a in procps-ng and does not filter the output.
Change a value at runtime (does NOT survive reboot)
# Either of these works — pick one and be consistent
sudo sysctl -w net.ipv4.tcp_syncookies=1
echo 1 | sudo tee /proc/sys/net/ipv4/tcp_syncookies
Runtime changes are immediate and can be reverted by rewriting the previous value. This is the right place to test a change before persisting it. Note: not every parameter accepts a runtime change — a small set (mostly module-load-time) require a reboot or a module reload.
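Reverting only works if the previous value was recorded. One way to make that habitual is a helper that writes the new value and hands back the old one. This is a sketch under assumptions: swap_value is a hypothetical name, it takes the /proc/sys file path rather than the dotted name, and it has to run from a root shell.

```shell
# swap_value FILE NEW: write NEW into FILE and print the value it replaced.
# FILE is meant to be a path under /proc/sys.
swap_value() {
    _prev=$(cat "$1") || return 1            # remember the current value
    printf '%s\n' "$2" > "$1" || return 1    # apply the new one
    printf '%s\n' "$_prev"                   # hand the old one to the caller
}
```

In a test session that could look like old=$(swap_value /proc/sys/net/core/somaxconn 16384), followed later by swap_value /proc/sys/net/core/somaxconn "$old" >/dev/null to back the change out.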
Persist a value across reboots
The modern path is a file in /etc/sysctl.d/, not editing /etc/sysctl.conf. The conventions:
- One file per topic or role: 99-server-tuning.conf, 90-network.conf.
- Filename starts with a 2-digit prefix that controls load order: files load in lexical order, so higher numbers win when the same key appears in multiple files. 99- is the conventional "my customizations" range.
- The system loads files from /etc/sysctl.d/, /run/sysctl.d/, /usr/lib/sysctl.d/ (in that precedence order; /etc wins).
sudo tee /etc/sysctl.d/99-tuning.conf >/dev/null <<'EOF'
# Network
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.ip_local_port_range = 10240 65535
# Memory
vm.swappiness = 10
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
EOF
# Apply without reboot
sudo sysctl --system # loads everything from all the d/ paths
# Or just the file you wrote
sudo sysctl -p /etc/sysctl.d/99-tuning.conf
The difference between sysctl --system and sysctl -p:
- sysctl -p (no argument) reads only /etc/sysctl.conf and ignores /etc/sysctl.d/. This trips people up: they edit /etc/sysctl.conf, run sysctl -p, and assume /etc/sysctl.d/ got applied. It didn't.
- sysctl -p <file> reads exactly that file.
- sysctl --system reads every file from every sysctl.d directory in the proper order. Use this when applying a change you've persisted to a d/ file.
Verify a value took effect
sysctl net.core.somaxconn
# net.core.somaxconn = 4096
If sysctl --system reported a value but reading it back shows the old one, something is overriding you — usually a higher-numbered file in /etc/sysctl.d/ or a runtime override from a service. sudo sysctl --system 2>&1 | grep -i error surfaces parse errors and unknown-parameter warnings.
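When a value refuses to stick, it helps to list every file that assigns the key. The function below is a rough sketch, not a reimplementation of sysctl's loader: sysctl_setters is a hypothetical name, it looks at a single directory only, and it relies on the shell's glob ordering as a stand-in for the real load order.

```shell
# sysctl_setters KEY [DIR]: list the files in DIR (default /etc/sysctl.d)
# that assign KEY, in the order the shell glob sorts them. The last file
# printed is the one whose value survives within that directory.
sysctl_setters() {
    # Escape the dots so "vm.swappiness" is matched literally by grep.
    pattern=$(printf '%s' "$1" | sed 's/\./\\./g')
    for f in "${2:-/etc/sysctl.d}"/*.conf; do
        [ -f "$f" ] || continue
        if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running sysctl_setters vm.swappiness | tail -n 1 then names the file to look at first. Files in /run/sysctl.d/ and /usr/lib/sysctl.d/ would need the same check.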
Network parameters worth understanding
This is where most "performance tuning" guides spend their words, and where the most cargo-culted defaults live. The list below is opinionated: each entry is something that actually matters on a server under real load.
net.core.somaxconn — accept queue depth
The maximum number of pending connections in a listening socket's accept queue. The kernel default is 4096 on recent kernels (it was 128 for years, which was the basis of every "raise somaxconn" article ever written).
net.core.somaxconn = 4096
When it matters. Reverse proxies, busy web servers, anything with high connection burstiness. For a listening socket, ss -lnt reports the current accept-queue depth in Recv-Q and the configured backlog in Send-Q. If Recv-Q sits near Send-Q, the accept queue is filling and new connections are being dropped before the application ever sees them. The application also has to pass a matching backlog to listen(): min(somaxconn, listen_backlog) is the effective cap. nginx defaults to 511; raise both, not just one.
When it doesn't. If your accept queue never fills, raising this does nothing.
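The min(somaxconn, listen_backlog) rule is easy to sanity-check with numbers from this section. A small illustration, where effective_backlog is a hypothetical helper name:

```shell
# effective_backlog SOMAXCONN APP_BACKLOG: the accept-queue cap the kernel
# actually applies, i.e. the smaller of the two values.
effective_backlog() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

effective_backlog 4096 511     # prints 511: nginx's default backlog is the cap
effective_backlog 4096 16384   # prints 4096: now somaxconn is the cap
```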
net.ipv4.tcp_max_syn_backlog — half-open connection queue
Pending connections that have received SYN but not yet completed the handshake.
net.ipv4.tcp_max_syn_backlog = 4096
Pair with somaxconn. Under SYN flood or simply heavy connection churn, the SYN queue fills and the kernel starts using SYN cookies (which tcp_syncookies=1 enables — already the default).
net.ipv4.ip_local_port_range — ephemeral port pool
The port range used for outgoing connections (the source port). Default is typically 32768 60999 — about 28k ports.
net.ipv4.ip_local_port_range = 10240 65535
When it matters. Servers that originate lots of connections — reverse proxies, API gateways making upstream calls, scrapers. If you're running out of ephemeral ports you'll see EADDRNOTAVAIL errors and ss -s will show many TIME-WAIT sockets. Widening the range to ~55k roughly doubles the headroom.
When it doesn't. Pure inbound services (a web server with no upstream) — the source port is the client's, not yours.
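The "about 28k" and "~55k" figures come straight from the range bounds. A one-line helper (port_count is a hypothetical name) shows the arithmetic:

```shell
# port_count LOW HIGH: number of ephemeral ports in an inclusive range.
port_count() {
    echo $(( $2 - $1 + 1 ))
}

port_count 32768 60999   # prints 28232
port_count 10240 65535   # prints 55296
```

On a live host, port_count $(cat /proc/sys/net/ipv4/ip_local_port_range) reports the pool currently in effect.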
net.ipv4.tcp_fin_timeout — leave it alone
The classic "tuning guide" advice: drop this from 60s to 15s. Don't.
# net.ipv4.tcp_fin_timeout = 60 # default; do not lower without a reason
tcp_fin_timeout controls how long an orphaned socket sits in FIN-WAIT-2 waiting for the remote close. Lowering it doesn't reduce TIME-WAIT (different state), and on lossy or slow paths a too-low value can break legitimate slow clients. Modern advice: leave the default.
net.ipv4.tcp_tw_reuse — yes; tcp_tw_recycle — never
net.ipv4.tcp_tw_reuse = 1
Allows the kernel to reuse a socket in TIME-WAIT for a new outgoing connection when the timestamps prove it's safe. Helpful for connection-heavy clients. Defaults to 2 (loopback only) on recent kernels; 1 enables it everywhere — usually fine.
tcp_tw_recycle (note the different name) was a different and broken feature that interacted badly with NAT and was removed in kernel 4.12 (2017). If a guide tells you to enable it, the guide is older than your kernel.
net.ipv4.tcp_keepalive_* — for long-lived connections
net.ipv4.tcp_keepalive_time = 300 # default 7200 (2h) — too long for most apps
net.ipv4.tcp_keepalive_intvl = 30 # default 75
net.ipv4.tcp_keepalive_probes = 5 # default 9
The kernel sends a keepalive probe after tcp_keepalive_time seconds of idle. The default of 2 hours is fine for desktops but useless for servers behind NAT/firewall middleboxes that drop idle connections in 5–15 minutes. Lowering this catches dead peer/middlebox state quickly so the application can reconnect.
This only applies when the application sets SO_KEEPALIVE on the socket. Most languages/frameworks expose a knob to set it.
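A useful way to read the three values together is the worst-case time before the kernel gives up on a silent peer: the idle time plus one interval per unanswered probe. The sketch below uses a hypothetical helper name, keepalive_detect_secs, and treats that sum as an approximation.

```shell
# keepalive_detect_secs TIME INTVL PROBES: approximate seconds of silence
# before a dead peer is declared (idle wait, then PROBES probes INTVL apart).
keepalive_detect_secs() {
    echo $(( $1 + $2 * $3 ))
}

keepalive_detect_secs 7200 75 9   # prints 7875 (kernel defaults, over 2 hours)
keepalive_detect_secs 300 30 5    # prints 450  (values above, 7.5 minutes)
```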
net.core.rmem_* / net.core.wmem_* and net.ipv4.tcp_rmem / tcp_wmem — socket buffers
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
Three values for the TCP buffers: minimum, default, maximum. The kernel auto-tunes within this range. Defaults are fine for typical RTTs and 1Gbit links. Raise these only if you are pushing high-bandwidth traffic on high-RTT paths (cross-region replication, video, large object transfers) and you've measured the bandwidth-delay product is hitting the ceiling. Blindly raising buffer maxes on a busy host wastes RAM and can hide application bugs.
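Whether the ceiling is actually in the way comes down to the bandwidth-delay product: link rate times round-trip time. A rough calculator, assuming decimal megabits and a hypothetical helper name bdp_bytes:

```shell
# bdp_bytes MBIT_PER_SEC RTT_MS: bandwidth-delay product in bytes.
# (Mbit/s * 1,000,000 / 8) bytes per second * (RTT_MS / 1000) seconds
# simplifies to MBIT * 125 * RTT_MS.
bdp_bytes() {
    echo $(( $1 * 125 * $2 ))
}

bdp_bytes 1000 80     # prints 10000000:  1 Gbit/s at 80 ms fits under 16777216
bdp_bytes 10000 100   # prints 125000000: 10 Gbit/s at 100 ms does not
```

A single flow can only fill the path if the buffer maximum is at least the BDP, so the second case is the kind of measurement that justifies raising the limits.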
net.ipv4.tcp_congestion_control — BBR vs CUBIC
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
CUBIC is the default; BBR is Google's congestion control that often performs noticeably better on lossy or long-distance paths. Pair it with the fq qdisc (net.core.default_qdisc = fq): fq was mandatory for BBR before kernel 4.13 and remains the recommended pacing setup. Worth trying for outbound-heavy services serving global users. Test under your actual workload before committing: there are corner cases (very congested links with mixed CUBIC/BBR flows) where BBR is unfair to CUBIC traffic.
net.ipv4.tcp_syncookies — already on, don't disable
net.ipv4.tcp_syncookies = 1
Already the default. Mitigates SYN floods. Only disable if you know exactly why (some niche network appliances), and then turn it back on.
net.netfilter.nf_conntrack_max — for NAT/firewall hosts
net.netfilter.nf_conntrack_max = 524288
Connection-tracking table size. Only relevant if nf_conntrack is loaded — i.e. you're running a firewall, NAT, or any iptables/nftables rule that uses state/ct matches. The default (sized from RAM) is usually fine for endpoints; routers and busy NAT boxes need it raised to avoid nf_conntrack: table full, dropping packet. Increase net.netfilter.nf_conntrack_buckets proportionally (typically nf_conntrack_max / 4).
net.ipv4.ip_forward — for routers only
# net.ipv4.ip_forward = 1 # only enable on actual routers / NAT hosts
Off by default; enabling it makes the box forward packets between interfaces. Required for NAT, gateways, VPN servers. Don't enable it on a server that's not supposed to be a router — it changes the security posture.
Reverse path filtering
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
Drops packets whose source address wouldn't route back through the same interface. Default on modern distros. Disable only on multi-homed hosts with asymmetric routing (rare and intentional).
Memory / VM parameters
Memory tuning is the area where blind copy-paste hurts the most because the right answer depends on what kind of memory pressure you have. Read the explanations.
vm.swappiness — when to swap
vm.swappiness = 10 # for database / latency-sensitive servers
# vm.swappiness = 60 # default, fine for general-purpose
# vm.swappiness = 1 # absolutely-no-swap-unless-OOM (pre-systemd-OOMD era)
A 0–100 knob (the range was extended to 0–200 in kernel 5.8) that biases the kernel toward swapping anonymous pages vs reclaiming page cache. Lower = prefer to drop cache. Higher = more willing to swap.
Common myths:
- vm.swappiness=0 doesn't disable swap. It just makes the kernel only swap when free memory is critically low. Genuinely disabling swap is swapoff -a and removing the swap line from /etc/fstab.
- swappiness=10 is not always better than 60. For a database whose working set fits in RAM, a low value keeps hot data resident, which is good. For a build/CI box that benefits from page cache for compilation artefacts, the default is fine. Don't change without a reason.
vm.dirty_background_ratio and vm.dirty_ratio — write-back behaviour
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
These control how much of memory can hold dirty (unwritten) page cache before the kernel starts writing it back (background_ratio) and before processes are forced to write synchronously (dirty_ratio). Defaults are typically 10/20.
Why lower them: big-RAM servers with the defaults can accumulate tens of GB of dirty pages, and when dirty_ratio is hit, every writing process blocks until the kernel has flushed enough of that backlog, which shows up as a multi-second latency spike. Lowering the ratios means more frequent, smaller flushes: smoother latency at the cost of slightly more I/O.
For databases on dedicated hardware, the related vm.dirty_bytes and vm.dirty_background_bytes (set absolute byte values rather than percentages) are usually a better fit because they don't scale with RAM.
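To see why percentages scale badly, it helps to turn a ratio into an absolute figure. The sketch below uses a hypothetical helper, dirty_limit_mib, and deliberately simplifies: the kernel applies the percentage to dirtyable memory (roughly free plus reclaimable), not to total RAM, so these numbers are upper bounds.

```shell
# dirty_limit_mib MEM_GIB RATIO: approximate dirty-page ceiling in MiB
# if RATIO percent were applied to all of MEM_GIB (an upper bound).
dirty_limit_mib() {
    echo $(( $1 * 1024 * $2 / 100 ))
}

dirty_limit_mib 256 20   # prints 52428: about 51 GiB at the default dirty_ratio
dirty_limit_mib 256 10   # prints 26214: still about 26 GiB after halving it
```

On a 256 GiB host even the lowered ratio allows tens of GiB of backlog, which is the case for switching to the byte-based settings.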
vm.overcommit_memory and vm.overcommit_ratio
# vm.overcommit_memory = 0 # default — heuristic; fine for most servers
# vm.overcommit_memory = 1 # always overcommit; for Redis and similar
# vm.overcommit_memory = 2 # never overcommit; require commit ratio
overcommit_memory=1 is what Redis and a few other workloads explicitly want: they fork to do background snapshots, and the heuristic can refuse the fork even though the child won't actually touch most of the parent's pages. If you run Redis without this set, redis-server logs a warning at startup. For general-purpose servers, leave it alone.
vm.max_map_count — memory map count per process
vm.max_map_count = 262144
Required for Elasticsearch (which mmaps many segments) and some JVM workloads. Default is usually 65530, which Elasticsearch refuses to start with. The Elasticsearch docs prescribe 262144; that's the right number.
vm.min_free_kbytes — emergency reserve
# vm.min_free_kbytes = 65536
The amount of memory the kernel keeps strictly free for atomic allocations. Default is auto-calculated from RAM and usually correct. Raising it on hosts with a lot of network traffic and a high nf_conntrack count can prevent page allocation failure messages — but only if you've seen them. Don't pre-emptively change.
File descriptors and process limits
fs.file-max — system-wide open file limit
fs.file-max = 2097152
The kernel-wide cap on open file descriptors across the entire system. The default on modern kernels is already huge (often >9 million on a typical 16GB box). You almost never need to raise this — what you usually want is a higher per-process limit, which is ulimit -n / LimitNOFILE, not fs.file-max.
cat /proc/sys/fs/file-nr shows <allocated> <unused> <max> — if <allocated> is far below <max>, this knob isn't your bottleneck.
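Turning those three fields into a utilisation figure makes the "far below" judgement explicit. A small sketch, where fd_usage is a hypothetical name and the input is expected on stdin in the file-nr layout:

```shell
# fd_usage: read "<allocated> <unused> <max>" on stdin and print how much
# of the system-wide file-descriptor ceiling is in use.
fd_usage() {
    read -r allocated unused max
    echo "$allocated of $max ($(( allocated * 100 / max ))%)"
}

echo "3456 0 100000" | fd_usage   # prints: 3456 of 100000 (3%)
```

Against the live kernel that would be fd_usage < /proc/sys/fs/file-nr.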
fs.nr_open — per-process FD ceiling
fs.nr_open = 1048576
The hard ceiling for ulimit -n for any process. Raising LimitNOFILE to a value above fs.nr_open silently fails. If you want individual services to be able to open >1M FDs (rare, but real for some proxies), raise this.
Per-process limits — not sysctl
A reminder, because this is the most common mix-up:
# /etc/security/limits.conf — login shells
* soft nofile 65536
* hard nofile 65536
# systemd unit override — services
# /etc/systemd/system/myapp.service.d/override.conf
[Service]
LimitNOFILE=65536
LimitNPROC=4096
limits.conf is enforced by PAM at login time — services started by systemd ignore it. For services, set Limit* in the unit. Verify with cat /proc/<pid>/limits.
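For scripted checks, the relevant row can be pulled out of the limits file directly. A minimal sketch, assuming the usual column layout of /proc/<pid>/limits; nofile_limits is a hypothetical name and takes the file path as its argument:

```shell
# nofile_limits FILE: print the soft and hard "Max open files" limits
# from a /proc/<pid>/limits style file.
nofile_limits() {
    awk '/^Max open files/ { print $4, $5 }' "$1"
}
```

Calling nofile_limits /proc/self/limits shows what the current shell was given; substituting a service's PID shows whether LimitNOFILE took effect.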
Workload-specific starting points
These are starting points, not gospel. Measure against your actual workload.
General-purpose Linux server (small/medium VPS)
# /etc/sysctl.d/99-tuning.conf
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_keepalive_time = 300
vm.swappiness = 10
That's it. Anything beyond this on a generic VPS is over-tuning.
Web server / reverse proxy (nginx, HAProxy, Caddy)
net.core.somaxconn = 16384
net.ipv4.tcp_max_syn_backlog = 16384
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 60 # default; left explicit for documentation
net.ipv4.tcp_keepalive_time = 300
# Buffers — only if you've measured BDP requires them
# net.core.rmem_max = 16777216
# net.core.wmem_max = 16777216
And: raise the proxy's own backlog (nginx listen ... backlog=16384) to match. The application's listen backlog is the one that bites first; sysctl is the second line.
Database server (PostgreSQL, MySQL, MariaDB)
vm.swappiness = 10
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
# Or, on big-RAM boxes, prefer absolute values:
# vm.dirty_background_bytes = 67108864 # 64 MB
# vm.dirty_bytes = 268435456 # 256 MB
vm.overcommit_memory = 0 # default — leave alone for PG/MySQL
fs.file-max = 2097152
Plus: disable Transparent Huge Pages (transparent_hugepage=never at boot) — most DB engines explicitly recommend this. THP is not a sysctl; it's a /sys/ knob and a kernel boot parameter.
NAT gateway / firewall
net.ipv4.ip_forward = 1
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_buckets = 262144 # ~max/4
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240 65535
Container host (Docker, Podman, k8s node)
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
# Inotify often runs out under heavy container churn
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 8192
# More PIDs available to a busy node
kernel.pid_max = 4194304
Kubernetes' own preflight checks the bridge-netfilter and ip_forward settings. The inotify limits are the most common silent failure on container hosts (file watches exhausted → file system events stop being delivered).
Security-related parameters
These don't affect performance but are part of any sensible hardening pass.
# Kernel pointer / introspection restrictions
kernel.kptr_restrict = 2
kernel.dmesg_restrict = 1
kernel.yama.ptrace_scope = 1
kernel.unprivileged_bpf_disabled = 1
# Network anti-spoof / sane defaults
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv6.conf.all.accept_source_route = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
net.ipv4.tcp_syncookies = 1
# Log impossible source addresses
net.ipv4.conf.all.log_martians = 1
Most of these are already the defaults on Debian 12 / Ubuntu 22.04+ / Rocky 9 / Alma 9. Including them in your /etc/sysctl.d/99-hardening.conf makes the intent explicit and survives a kernel default that changes in the wrong direction.
A few /sys/ knobs worth knowing about
These aren't sysctl but they live in the same neighbourhood and affect performance.
Block I/O scheduler
# Show the schedulers available for a device (the active one is in brackets)
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none
# Set persistently with a udev rule, e.g. /etc/udev/rules.d/60-ioscheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
For NVMe, none is correct — the device's own queueing handles it. For SSD/SATA, mq-deadline or bfq are the modern choices. For latency-sensitive databases, none on NVMe is what you want.
CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Set everywhere (transient — set via cpupower or tuned for persistence)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance keeps the CPU at top frequency — eliminates the latency variance from frequency scaling decisions. For latency-sensitive servers, worth the few extra watts.
Transparent Huge Pages
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
# Boot-time disable (preferred for DBs)
# In /etc/default/grub:
# GRUB_CMDLINE_LINUX="... transparent_hugepage=never"
Most database vendors recommend madvise or never. THP can cause latency spikes from compaction even though it nominally reduces TLB miss cost.
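Both the scheduler file and the THP file mark the active choice with square brackets, which is awkward to consume in scripts. A small filter can extract it; active_choice is a hypothetical name and the sketch assumes exactly one bracketed entry per line.

```shell
# active_choice: read a "[a] b c" style line on stdin and print the
# bracketed (currently active) entry.
active_choice() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

echo "[mq-deadline] kyber bfq none" | active_choice   # prints mq-deadline
echo "always madvise [never]" | active_choice         # prints never
```

For example, active_choice < /sys/kernel/mm/transparent_hugepage/enabled reports the THP mode in effect.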
How to test changes safely
The discipline that separates real tuning from cargo-culting:
- Baseline the metric you're trying to move. "Latency feels high" isn't a baseline. p50/p95/p99 over a representative window, with the workload that matters. Record it.
- Change one thing. Two changes at once means you don't know which one moved the metric.
- Apply at runtime first (sysctl -w ...), not in a d/ file. Easier to revert.
- Run the same test. Same load, same duration, same time of day if the workload is diurnal.
- Compare. If the change helped and you can explain why — make it persistent. If it didn't help, revert and don't ship.
- Persist with a comment. A line like vm.dirty_ratio = 10 # was 20 default; reduced 2026-05; observed multi-second latency spikes during checkpoints is far more useful than a bare value to whoever inherits this box.
A reasonable smoke test for a change to a network parameter: drive the load you care about (wrk, oha, ab, your real client), watch ss -s, ss -lnt, ss -tin, nstat, and the application's own latency histograms. For memory parameters: vmstat 1, iostat -xz 1, cat /proc/meminfo, and the application's own slow-query / GC logs. For file descriptors: lsof -p <pid> and cat /proc/<pid>/limits.
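For the baseline itself, percentiles can be computed from a plain list of latency samples without any extra tooling. This is a sketch using the nearest-rank method; percentile is a hypothetical name, it expects one numeric sample per line on stdin, and it takes an integer percentile as its argument.

```shell
# percentile P: print the P-th percentile (nearest-rank) of the numbers
# read from stdin, one per line.
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            i = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
            if (i < 1) i = 1
            print v[i]
        }'
}
```

With samples in a hypothetical latencies.txt, running percentile 50, percentile 95 and percentile 99 against it before and after a change gives the numbers worth recording.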
Xitoring's server monitoring collects a lot of this automatically via Xitogent — useful both as the baseline (so you have a real "before") and as the regression detector (so a tuning change that helps p99 latency but causes a memory leak shows up before users do). Pair with disk-health monitoring and CPU temperature alerts on hosts where the tuning is also pushing thermal limits.
Common mistakes
The recurring ways "kernel tuning" goes wrong.
Editing /etc/sysctl.conf and running sysctl -p, but the values come from /etc/sysctl.d/
Then your edit doesn't apply. sysctl --system is the correct apply command on any modern distro. Better still, don't edit sysctl.conf — drop a file into sysctl.d/.
Persisting without applying (or vice versa)
sysctl -w only changes runtime state. If you don't write the value to a d/ file, a reboot reverts it. Conversely, writing to a d/ file without sysctl --system leaves the running kernel on the old value. Both halves matter.
Two files in /etc/sysctl.d/ setting the same key
50-default.conf says vm.swappiness = 60; 99-mine.conf says vm.swappiness = 10. The 99- file wins (load order). Fine, but if you forget the prefix matters and name your file tuning.conf, lexical ordering sorts it after every digit-prefixed file, including 99-cloudimg-ipv6.conf and 99-sysctl.conf, so it silently overrides whatever those set. Name it 10-tuning.conf instead and it loses to all of them. Always use a numeric prefix; 99- is the conventional "highest priority" range for operator overrides.
Tuning ulimit / nofile via sysctl
fs.file-max is the kernel-wide ceiling — raising it does not raise the per-process limit. If a service hits "too many open files" at 1024 FDs, the fix is LimitNOFILE (systemd) or limits.conf (login shells), not sysctl.
Copying values from a tuning guide written for kernel 3.x
tcp_tw_recycle is gone. tcp_fin_timeout=15 is questionable. vm.swappiness=0 doesn't disable swap. net.ipv4.tcp_max_tw_buckets rarely needs raising on a 16GB+ box. Check the kernel default before assuming you need an override; sysctl <param> shows what's running right now.
Tuning to fix the wrong bottleneck
Your p99 latency is bad. You raise tcp_rmem. p99 doesn't move because the bottleneck was a single hot Postgres index, not network buffering. Profile first. perf top, bpftrace, the application's own slow logs, and iostat -xz 1 will tell you whether the bottleneck is CPU, I/O, network, lock contention, or app code. Sysctl won't fix CPU bottlenecks or app bugs.
Persisting a value that the running kernel ignores
If you write net.bridge.bridge-nf-call-iptables = 1 to a sysctl file but the br_netfilter module isn't loaded, sysctl --system will warn and the value won't be set. Either modprobe br_netfilter (and add to /etc/modules-load.d/) or accept that the value won't persist until the module is loaded. The same goes for nf_conntrack_max and the conntrack module.
Disabling IPv6 to "simplify"
# net.ipv6.conf.all.disable_ipv6 = 1 # rarely the right answer in 2026
Many services and tools assume IPv6 is available. Disabling it breaks things subtly (Postfix, Java apps binding [::]:port, container networking). If you genuinely don't have IPv6 connectivity, prefer to leave it enabled but unconfigured — most software handles that correctly.
Reading the kernel docs
When you don't know what a parameter does, sysctl itself doesn't tell you — but the kernel ships its own documentation:
# Networking parameters
less /usr/share/doc/linux-doc/networking/ip-sysctl.rst # Debian/Ubuntu
# Or browse online: kernel.org/doc/html/latest/networking/ip-sysctl.html
# VM / memory
less /usr/share/doc/linux-doc/admin-guide/sysctl/vm.rst
# Generic admin guide
less /usr/share/doc/linux-doc/admin-guide/sysctl/kernel.rst
If the package isn't installed, apt install linux-doc or dnf install kernel-doc (the package name varies) gets it. The official Documentation/admin-guide/sysctl/ directory in the kernel source is the same content. It's denser than blog posts but it's correct and current to the kernel you're running, which most blog posts aren't.
A complete worked change
You're running an nginx reverse proxy that has started overflowing its accept queue under traffic spikes. What follows is a walk-through of a change you can defend.
# 1. Confirm the symptom
# Look at TcpExt's "ListenOverflows" / "ListenDrops" counters
nstat -az | grep -i listen
# TcpExtListenOverflows 1924 ...
# TcpExtListenDrops 1924 ...
# Look at the running listen-backlog
ss -lnt sport = :443
# State Recv-Q Send-Q ...
# LISTEN 0 511 *:443
# Send-Q = 511 = nginx's default listen backlog.
# Recv-Q rising = pending queue depth.
# 2. Try a runtime change first
sudo sysctl -w net.core.somaxconn=16384
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=16384
# Update nginx to match — listen ... backlog=16384;
sudo nginx -t && sudo systemctl reload nginx
# 3. Drive load again, look at the same counters
nstat -az | grep -i listen
# Want: ListenOverflows / ListenDrops not increasing
If the change holds up, persist it:
sudo tee /etc/sysctl.d/99-nginx-tuning.conf >/dev/null <<'EOF'
# Raised 2026-05-10 — nginx accept-queue overflow under spike load.
# Paired with `listen ... backlog=16384` in nginx config.
net.core.somaxconn = 16384
net.ipv4.tcp_max_syn_backlog = 16384
EOF
sudo sysctl --system
sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
The comment is the load-bearing part. Six months from now, when you or someone else asks "why is this here?", the file answers without a git-blame archaeology session.
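The "not increasing" check in step 3 is easier to trust when it is a number rather than an eyeball comparison. A sketch of one way to do that, assuming nstat's usual "name value rate" line layout; listen_overflows is a hypothetical helper name:

```shell
# listen_overflows: read `nstat -az` style output on stdin and print the
# absolute TcpExtListenOverflows counter.
listen_overflows() {
    awk '$1 == "TcpExtListenOverflows" { print $2 }'
}
```

Capture before=$(nstat -az | listen_overflows), drive the load, capture after the same way, and echo $(( after - before )) should print 0 if the larger backlog is holding.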
Summary
- Linux exposes runtime kernel knobs through sysctl (mirrored in /proc/sys/). Persistent changes go into /etc/sysctl.d/<NN>-<name>.conf, applied with sudo sysctl --system.
- Modern kernel defaults are good. The list of parameters that actually matter on a typical server is short: a few network knobs (somaxconn, tcp_max_syn_backlog, ip_local_port_range, tcp_tw_reuse, tcp_keepalive_time), a few VM knobs (vm.swappiness, dirty ratios), and fs.file-max / fs.nr_open. Workload-specific parameters (conntrack, bridge-netfilter, inotify, max_map_count) only matter if you run that workload.
- Per-process FD limits are not sysctl. They live in limits.conf (login shells) and systemd Limit* directives (services).
- /sys/ knobs (block scheduler, CPU governor, THP) sit alongside sysctl and matter for latency-sensitive workloads.
- Tuning without measurement is theater. Baseline, change one thing, measure again, persist with a comment that explains why.
- Most "kernel tuning guides" on the internet are old. tcp_tw_recycle is gone, tcp_fin_timeout=15 is questionable, vm.swappiness=0 doesn't disable swap. Always check the running default before overriding it.
The kernel gives you a thousand knobs because the workloads it has to serve span embedded sensors to NVMe-attached racks. Knowing which knobs are load-bearing for your workload — and leaving the rest alone — is the actual skill.