
    How to fine-tune Linux kernel parameters

    By Dana

    "How do I tune the Linux kernel?" is one of those questions that hides a much harder one: for what workload, on what hardware, against what bottleneck? The literal mechanics — sysctl -w, drop a file into /etc/sysctl.d/, run sysctl -p — are easy and take five minutes to learn. The hard parts are knowing which knobs are worth touching on a modern kernel, why they matter, how to measure that the change helped, and how to back it out when it didn't. This guide covers the mechanics, then walks through the parameters that are actually load-bearing on real servers in 2026, what they mean, sensible starting values for common workloads, and the mistakes that make "tuning" a net loss.

    Two things to internalize before you change anything:

    1. The kernel's defaults have improved enormously. Many "must-tune" parameters from old guides (tcp_tw_recycle, tcp_fin_timeout=15, aggressive vm.swappiness=10 for every server) are either gone, harmful, or already the modern default. Check the running default before assuming you need to override it.
    2. Tuning without measurement is theater. A change that "might help", shipped without a before-and-after benchmark, is just configuration drift. Measure first, change one thing, measure again. If you can't quantify the improvement, the change shouldn't ship.

    The examples below assume a recent Linux distro (kernel 5.15+ — Ubuntu 22.04/24.04, AlmaLinux 9, Rocky 9, Debian 12). Older kernels have additional pitfalls noted inline.


    What kernel parameters actually are

    Linux exposes a few thousand runtime tunables through three related interfaces:

    • /proc/sys/ — virtual files. Read with cat, write with echo > file. Each file is one parameter.
    • sysctl — the userspace command that reads and writes those same files using dotted names. net.ipv4.tcp_syncookies corresponds to /proc/sys/net/ipv4/tcp_syncookies.
    • /sys/: a separate filesystem for hardware and driver knobs (block scheduler, CPU governor, IRQ affinity). Not controlled by sysctl; edited directly, through udev rules, or by a profile daemon such as tuned.

    The first two are the same data, two interfaces. The third is a different beast — when people say "kernel parameters" they usually mean sysctl, and that's what most of this guide is about. A short section near the end covers the /sys/ knobs that genuinely matter.
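
    The mapping between the two names can be sketched as a small shell function. This is only an illustration: sysctl_to_path is a hypothetical helper name, not part of the sysctl tooling, and it assumes no component of the key contains a literal dot (keys under an interface such as eth0.100 would need special handling).

```shell
# Hypothetical helper: convert a dotted sysctl name to its /proc/sys path.
# Assumes no path component contains a literal dot.
sysctl_to_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

sysctl_to_path net.ipv4.tcp_syncookies
```

    Reading the printed path with cat should return the same value that sysctl reports for the dotted name.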

    Three more interfaces frequently get conflated with kernel tuning but aren't:

    • ulimit / /etc/security/limits.conf: per-process resource limits enforced by PAM. Not kernel parameters, but tied to them (fs.file-max is a kernel-wide ceiling, ulimit -n is the per-process one).
    • systemd unit Limit* directives — the systemd equivalent of limits.conf for services started by systemd. LimitNOFILE=, LimitNPROC=, etc. These override limits.conf for systemd-managed services.
    • Boot-time kernel parameters (/etc/default/grub GRUB_CMDLINE_LINUX) — things like transparent_hugepage=, mitigations=, intel_iommu=. Different mechanism, same family of decisions.

    Knowing which interface owns a setting saves you from "I added nofile=65535 to sysctl.conf and it didn't help" — because that's a limits.conf value.


    Reading and writing parameters: the mechanics

    Look at a current value

    # By dotted name
    sysctl net.ipv4.tcp_syncookies
    # net.ipv4.tcp_syncookies = 1
    
    # Or read the proc file directly
    cat /proc/sys/net/ipv4/tcp_syncookies
    # 1
    
    # Dump everything (huge — pipe to grep)
    sysctl -a 2>/dev/null | grep tcp_
    

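    Run as an unprivileged user, sysctl -a prints "permission denied" errors for keys that only root may read; the 2>/dev/null hides them. In procps, sysctl -A (capital) is only an alias for -a, so it filters nothing. Run the command with sudo if you need the restricted keys as well.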

    Change a value at runtime (does NOT survive reboot)

    # Either of these works — pick one and be consistent
    sudo sysctl -w net.ipv4.tcp_syncookies=1
    echo 1 | sudo tee /proc/sys/net/ipv4/tcp_syncookies
    

    Runtime changes are immediate and can be reverted by rewriting the previous value. This is the right place to test a change before persisting it. Note: not every parameter accepts a runtime change — a small set (mostly module-load-time) require a reboot or a module reload.
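
    One way to make "rewrite the previous value" reliable is to record the values before touching them. The sketch below is a minimal example under stated assumptions: snapshot_keys is a hypothetical helper, and PROC_SYS is an invented override variable that exists only so the function can be pointed at a test directory instead of /proc/sys.

```shell
# Hypothetical helper: print "key = value" lines for the given sysctl keys.
# PROC_SYS is an assumed override for testing; it defaults to /proc/sys.
snapshot_keys() {
    root="${PROC_SYS:-/proc/sys}"
    for key in "$@"; do
        path="$root/$(printf '%s' "$key" | tr . /)"
        if [ -r "$path" ]; then
            # Multi-value keys are tab-separated in /proc; normalise to spaces
            printf '%s = %s\n' "$key" "$(tr '\t' ' ' < "$path")"
        else
            printf '# %s: not readable (module not loaded?)\n' "$key"
        fi
    done
}

# Example: snapshot_keys net.core.somaxconn vm.swappiness > /tmp/before.conf
```

    Because the output uses the same key = value form as a sysctl.d file, the saved file can be fed to sudo sysctl -p /tmp/before.conf to put the old values back.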

    Persist a value across reboots

    The modern path is a file in /etc/sysctl.d/, not editing /etc/sysctl.conf. The conventions:

    • One file per topic or role: 99-server-tuning.conf, 90-network.conf.
    • Filename starts with a 2-digit prefix that controls load order — higher numbers win when the same key appears in multiple files. 99- is the conventional "my customizations" range.
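    • The system loads files from /etc/sysctl.d/, /run/sysctl.d/, and /usr/lib/sysctl.d/. Files from all three are merged and applied in filename order; when two directories hold a file with the same name, the copy in /etc wins, then /run, then /usr/lib.
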
    sudo tee /etc/sysctl.d/99-tuning.conf >/dev/null <<'EOF'
    # Network
    net.core.somaxconn = 4096
    net.ipv4.tcp_max_syn_backlog = 4096
    net.ipv4.ip_local_port_range = 10240 65535
    
    # Memory
    vm.swappiness = 10
    vm.dirty_background_ratio = 5
    vm.dirty_ratio = 10
    EOF
    
    # Apply without reboot
    sudo sysctl --system        # loads everything from all the d/ paths
    # Or just the file you wrote
    sudo sysctl -p /etc/sysctl.d/99-tuning.conf
    

    The difference between sysctl --system and sysctl -p:

    • sysctl -p (no argument) reads only /etc/sysctl.conf — it ignores /etc/sysctl.d/. This trips people up: they edit /etc/sysctl.conf, run sysctl -p, and assume /etc/sysctl.d/ got applied. It didn't.
    • sysctl -p <file> reads exactly that file.
    • sysctl --system reads every file from every sysctl.d directory in the proper order. Use this when applying a change you've persisted to a d/ file.

    Verify a value took effect

    sysctl net.core.somaxconn
    # net.core.somaxconn = 4096
    

    If sysctl --system reported a value but reading it back shows the old one, something is overriding you — usually a higher-numbered file in /etc/sysctl.d/ or a runtime override from a service. sudo sysctl --system 2>&1 | grep -i error surfaces parse errors and unknown-parameter warnings.
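
    When a value keeps getting overridden, it helps to list every file that sets the key. The function below is a rough sketch: who_sets is a hypothetical name, the default search paths are the standard ones named earlier in this guide, and it only lists matches (it does not work out which file wins).

```shell
# Hypothetical helper: list every file/line that assigns a given sysctl key.
# Usage: who_sets KEY [PATH...]   (PATH defaults to the standard locations)
who_sets() {
    key="$1"; shift
    [ "$#" -gt 0 ] || set -- /usr/lib/sysctl.d /run/sysctl.d /etc/sysctl.d /etc/sysctl.conf
    # Escape dots so they match literally in the regex
    esc=$(printf '%s' "$key" | sed 's/\./\\./g')
    # A leading "-" is allowed in sysctl.d files (ignore errors for that key)
    grep -rHnE "^[[:space:]]*-?${esc}[[:space:]]*=" "$@" 2>/dev/null
}

# Example: who_sets vm.swappiness
```

    Each output line has the form path:line:content, so the file that loads last among the matches is the one to look at first.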


    Network parameters worth understanding

    This is where most "performance tuning" guides spend their words, and where the most cargo-culted defaults live. The list below is opinionated: each entry is something that actually matters on a server under real load.

    net.core.somaxconn — accept queue depth

    The maximum number of pending connections in a listening socket's accept queue. The kernel default is 4096 on recent kernels (it was 128 for years, which was the basis of every "raise somaxconn" article ever written).

    net.core.somaxconn = 4096
    

    When it matters. Reverse proxies, busy web servers, anything with high connection burstiness. For a listening socket, ss -lnt reports the current accept-queue length in Recv-Q and the configured backlog limit in Send-Q. If Recv-Q sits near Send-Q, the accept queue is full and the kernel is dropping new connections at the end of the handshake. The application also has to pass a matching backlog to listen(): min(somaxconn, listen_backlog) is the effective cap. nginx defaults to 511; raise both, not just one.

    When it doesn't. If your accept queue never fills, raising this does nothing.
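
    The Recv-Q versus Send-Q check can be automated with a short filter over ss -lnt output. This is a sketch under assumptions: full_listeners is a hypothetical name, the 80% threshold is an arbitrary starting point, and the column layout is the one ss prints for listening TCP sockets (State, Recv-Q, Send-Q, Local Address:Port).

```shell
# Hypothetical filter: read `ss -lnt` output on stdin and print listeners
# whose accept queue (Recv-Q) is at or above PCT percent of the backlog (Send-Q).
# Usage: ss -lnt | full_listeners [PCT]     (PCT defaults to 80, an assumption)
full_listeners() {
    awk -v pct="${1:-80}" '
        $1 == "LISTEN" && $3 > 0 && $2 * 100 >= $3 * pct {
            print $4, $2 "/" $3
        }'
}

# Example: ss -lnt | full_listeners 50
```

    An empty result during peak load suggests the accept queue is not the bottleneck, in which case raising somaxconn is unlikely to change anything.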

    net.ipv4.tcp_max_syn_backlog — half-open connection queue

    Pending connections that have received SYN but not yet completed the handshake.

    net.ipv4.tcp_max_syn_backlog = 4096
    

    Pair with somaxconn. Under SYN flood or simply heavy connection churn, the SYN queue fills and the kernel starts using SYN cookies (which tcp_syncookies=1 enables — already the default).

    net.ipv4.ip_local_port_range — ephemeral port pool

    The port range used for outgoing connections (the source port). Default is typically 32768 60999 — about 28k ports.

    net.ipv4.ip_local_port_range = 10240 65535
    

    When it matters. Servers that originate lots of connections — reverse proxies, API gateways making upstream calls, scrapers. If you're running out of ephemeral ports you'll see EADDRNOTAVAIL errors and ss -s will show many TIME-WAIT sockets. Widening the range to ~55k roughly doubles the headroom.

    When it doesn't. Pure inbound services (a web server with no upstream) — the source port is the client's, not yours.
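
    The "about 28k" and "~55k" figures are simple arithmetic on the range bounds. A minimal sketch (port_count is a hypothetical helper name):

```shell
# Hypothetical helper: number of ports in an inclusive range LOW..HIGH.
port_count() {
    echo $(( $2 - $1 + 1 ))
}

port_count 32768 60999   # typical default range: prints 28232
port_count 10240 65535   # widened range from this guide: prints 55296
```

    On a live host, port_count $(sysctl -n net.ipv4.ip_local_port_range) reports the size of the pool currently in effect.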

    net.ipv4.tcp_fin_timeout — leave it alone

    The classic "tuning guide" advice: drop this from 60s to 15s. Don't.

    # net.ipv4.tcp_fin_timeout = 60   # default; do not lower without a reason
    

    tcp_fin_timeout controls how long an orphaned socket sits in FIN-WAIT-2 waiting for the remote close. Lowering it doesn't reduce TIME-WAIT (different state), and on lossy or slow paths a too-low value can break legitimate slow clients. Modern advice: leave the default.

    net.ipv4.tcp_tw_reuse — yes; tcp_tw_recycle — never

    net.ipv4.tcp_tw_reuse = 1
    

    Allows the kernel to reuse a socket in TIME-WAIT for a new outgoing connection when the timestamps prove it's safe. Helpful for connection-heavy clients. Defaults to 2 (loopback only) on recent kernels; 1 enables it everywhere — usually fine.

    tcp_tw_recycle (note the different name) was a different and broken feature that interacted badly with NAT and was removed in kernel 4.12 (2017). If a guide tells you to enable it, the guide is older than your kernel.

    net.ipv4.tcp_keepalive_* — for long-lived connections

    # Kernel defaults: time 7200 (2h, too long for most apps), intvl 75, probes 9.
    # Comments sit on their own line; sysctl.d only documents whole-line comments.
    net.ipv4.tcp_keepalive_time = 300
    net.ipv4.tcp_keepalive_intvl = 30
    net.ipv4.tcp_keepalive_probes = 5
    

    The kernel sends a keepalive probe after tcp_keepalive_time seconds of idle. The default of 2 hours is fine for desktops but useless for servers behind NAT/firewall middleboxes that drop idle connections in 5–15 minutes. Lowering this catches dead peer/middlebox state quickly so the application can reconnect.

    This only applies when the application sets SO_KEEPALIVE on the socket. Most languages/frameworks expose a knob to set it.
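
    How quickly a dead peer is noticed follows from the three values: the idle time before the first probe, plus one interval per unanswered probe. A small sketch of that arithmetic (keepalive_detect_secs is a hypothetical helper name):

```shell
# Hypothetical helper: worst-case seconds until an idle dead peer is detected.
# Arguments: tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes
keepalive_detect_secs() {
    echo $(( $1 + $2 * $3 ))
}

keepalive_detect_secs 7200 75 9   # kernel defaults: prints 7875 (about 2h11m)
keepalive_detect_secs 300 30 5    # values from this guide: prints 450 (7.5 min)
```

    If the middlebox in front of the server drops idle flows after five minutes, tcp_keepalive_time has to sit below that timeout for the probes to keep the flow alive.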

    net.core.rmem_* / net.core.wmem_* and net.ipv4.tcp_rmem / tcp_wmem — socket buffers

    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    

    Three values for the TCP buffers: minimum, default, maximum. The kernel auto-tunes within this range. Defaults are fine for typical RTTs and 1Gbit links. Raise these only if you are pushing high-bandwidth traffic on high-RTT paths (cross-region replication, video, large object transfers) and you've measured the bandwidth-delay product is hitting the ceiling. Blindly raising buffer maxes on a busy host wastes RAM and can hide application bugs.
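
    The bandwidth-delay product is the amount of data in flight needed to keep a path full: bandwidth multiplied by round-trip time. The sketch below does that calculation in bytes; bdp_bytes is a hypothetical helper, and the link speeds and RTTs are example inputs, not measurements.

```shell
# Hypothetical helper: bandwidth-delay product in bytes.
# Arguments: bandwidth in Mbit/s, round-trip time in milliseconds.
# (Mbit/s * 1e6 / 8) bytes/s * (ms / 1000) s  ==  Mbit/s * ms * 125
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

bdp_bytes 1000 1     # 1 Gbit/s, 1 ms LAN: prints 125000
bdp_bytes 1000 80    # 1 Gbit/s, 80 ms cross-region: prints 10000000
bdp_bytes 10000 80   # 10 Gbit/s, 80 ms: prints 100000000
```

    Compared with the 16777216-byte maximum above, the first two cases fit comfortably while the third does not, which is the kind of path where raising the buffer ceiling is worth measuring.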

    net.ipv4.tcp_congestion_control — BBR vs CUBIC

    net.core.default_qdisc = fq
    net.ipv4.tcp_congestion_control = bbr
    

    CUBIC is the default; BBR is Google's congestion control that often performs noticeably better on lossy or long-distance paths. It is normally paired with the fq qdisc (net.core.default_qdisc = fq); kernels since 4.13 can pace TCP without fq, but fq remains the usual recommendation. Worth trying for outbound-heavy services serving global users. Test under your actual workload before committing: there are corner cases (very congested links with mixed CUBIC/BBR flows) where BBR is unfair to CUBIC traffic.
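
    Before setting bbr it is worth confirming that the running kernel offers it, since the algorithm usually ships as a loadable module. A minimal sketch (cc_available is a hypothetical helper; the sysctl key and module name in the usage comment are the standard ones):

```shell
# Hypothetical helper: succeed if algorithm $1 appears in the remaining arguments.
cc_available() {
    want="$1"; shift
    for cc in "$@"; do
        [ "$cc" = "$want" ] && return 0
    done
    return 1
}

# Example (word splitting of the sysctl output is intentional):
#   cc_available bbr $(sysctl -n net.ipv4.tcp_available_congestion_control) \
#       || sudo modprobe tcp_bbr
```

    If the module had to be loaded by hand, it also needs an entry under /etc/modules-load.d/ so the setting can be applied at boot.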

    net.ipv4.tcp_syncookies — already on, don't disable

    net.ipv4.tcp_syncookies = 1
    

    Already the default. Mitigates SYN floods. Only disable if you know exactly why (some niche network appliances), and then turn it back on.

    net.netfilter.nf_conntrack_max — for NAT/firewall hosts

    net.netfilter.nf_conntrack_max = 524288
    

    Connection-tracking table size. Only relevant if nf_conntrack is loaded — i.e. you're running a firewall, NAT, or any iptables/nftables rule that uses state/ct matches. The default (sized from RAM) is usually fine for endpoints; routers and busy NAT boxes need it raised to avoid nf_conntrack: table full, dropping packet. Increase net.netfilter.nf_conntrack_buckets proportionally (typically nf_conntrack_max / 4).
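
    Whether the table needs raising is easy to check, because the kernel exposes the current entry count next to the limit. The sketch below computes the fill percentage; conntrack_pct is a hypothetical helper, and the paths in the usage comment exist only while the nf_conntrack module is loaded.

```shell
# Hypothetical helper: conntrack table usage as an integer percentage.
# Arguments: current entry count, configured maximum.
conntrack_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Example on a live host:
#   conntrack_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                 "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
conntrack_pct 393216 524288   # prints 75
```

    A table that peaks well below its limit under real traffic is a sign the default is already adequate.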

    net.ipv4.ip_forward — for routers only

    # net.ipv4.ip_forward = 1     # only enable on actual routers / NAT hosts
    

    Off by default; enabling it makes the box forward packets between interfaces. Required for NAT, gateways, VPN servers. Don't enable it on a server that's not supposed to be a router — it changes the security posture.

    Reverse path filtering

    net.ipv4.conf.all.rp_filter = 1
    net.ipv4.conf.default.rp_filter = 1
    

    Drops packets whose source address wouldn't route back through the same interface. Default on modern distros. Disable only on multi-homed hosts with asymmetric routing (rare and intentional).


    Memory / VM parameters

    Memory tuning is the area where blind copy-paste hurts the most because the right answer depends on what kind of memory pressure you have. Read the explanations.

    vm.swappiness — when to swap

    # For database / latency-sensitive servers
    vm.swappiness = 10
    # vm.swappiness = 60      # default, fine for general-purpose
    # vm.swappiness = 1       # absolutely-no-swap-unless-OOM (pre-systemd-OOMD era)
    

    A 0–200 knob on kernel 5.8 and later (0–100 before that) that biases the kernel toward swapping anonymous pages vs reclaiming page cache. Lower = prefer to drop cache. Higher = more willing to swap.

    Common myths:

    • vm.swappiness=0 doesn't disable swap. It just makes the kernel only swap when free memory is critically low. Genuinely disabling swap is swapoff -a and removing the swap line from /etc/fstab.
    • swappiness=10 is not always better than 60. For a database whose working set fits in RAM, a low value keeps hot data resident — good. For a build/CI box that benefits from page cache for compilation artefacts, the default is fine. Don't change without a reason.

    vm.dirty_background_ratio and vm.dirty_ratio — write-back behaviour

    vm.dirty_background_ratio = 5
    vm.dirty_ratio = 10
    

    These control how much of memory can hold dirty (unwritten) page cache before the kernel starts writing it back (background_ratio) and before processes are forced to write synchronously (dirty_ratio). Defaults are typically 10/20.

    Why lower them: big-RAM servers with the defaults can accumulate 10s of GB of dirty pages, and when dirty_ratio is hit every write blocks until the kernel flushes that backlog — visible as a multi-second latency spike. Lowering the ratios means more frequent, smaller flushes — smoother latency at the cost of slightly more I/O.

    For databases on dedicated hardware, the related vm.dirty_bytes and vm.dirty_background_bytes (set absolute byte values rather than percentages) are usually a better fit because they don't scale with RAM.
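
    To see why percentages scale badly, it helps to turn a ratio into an absolute figure. The sketch below is an approximation: dirty_limit_mib is a hypothetical helper, and it uses total RAM as the base, whereas the kernel applies the ratio to available memory, so the real threshold is somewhat lower.

```shell
# Hypothetical helper: rough upper bound, in MiB, of dirty page cache allowed
# by a percentage ratio. Arguments: memory in KiB, ratio in percent.
# Uses the given memory as the base; the kernel's real base is available memory.
dirty_limit_mib() {
    echo $(( $1 * $2 / 100 / 1024 ))
}

dirty_limit_mib 16777216 10    # 16 GiB at 10%: prints 1638
dirty_limit_mib 268435456 20   # 256 GiB at 20%: prints 52428
```

    On a live host the first argument can come from the MemTotal line of /proc/meminfo, which is already expressed in kB.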

    vm.overcommit_memory and vm.overcommit_ratio

    # vm.overcommit_memory = 0       # default — heuristic; fine for most servers
    # vm.overcommit_memory = 1       # always overcommit; for Redis and similar
    # vm.overcommit_memory = 2       # never overcommit; require commit ratio
    

    overcommit_memory=1 is what Redis and a few other workloads explicitly want: they fork to do background snapshots, and the heuristic can refuse the fork even though the child won't actually touch most of the parent's pages. If you run Redis without this set, redis-server logs a warning on startup. For general-purpose servers, leave it alone.

    vm.max_map_count — memory map count per process

    vm.max_map_count = 262144
    

    Required for Elasticsearch (which mmaps many segments) and some JVM workloads. Default is usually 65530, which fails Elasticsearch's bootstrap checks in production mode. The Elasticsearch docs prescribe 262144; that's the right number.

    vm.min_free_kbytes — emergency reserve

    # vm.min_free_kbytes = 65536
    

    The amount of memory the kernel keeps strictly free for atomic allocations. Default is auto-calculated from RAM and usually correct. Raising it on hosts with a lot of network traffic and a high nf_conntrack count can prevent page allocation failure messages — but only if you've seen them. Don't pre-emptively change.


    File descriptors and process limits

    fs.file-max — system-wide open file limit

    fs.file-max = 2097152
    

    The kernel-wide cap on open file descriptors across the entire system. The default on modern kernels is already huge (often >9 million on a typical 16GB box). You almost never need to raise this — what you usually want is a higher per-process limit, which is ulimit -n / LimitNOFILE, not fs.file-max.

    cat /proc/sys/fs/file-nr shows <allocated> <unused> <max> — if <allocated> is far below <max>, this knob isn't your bottleneck.
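
    The three file-nr fields can be reduced to a single usage percentage. A minimal sketch (fd_usage_pct is a hypothetical helper that reads the three whitespace-separated fields on stdin):

```shell
# Hypothetical helper: system-wide file handle usage as an integer percentage.
# Reads "<allocated> <unused> <max>" (the /proc/sys/fs/file-nr format) on stdin.
fd_usage_pct() {
    awk '{ printf "%d\n", $1 * 100 / $3 }'
}

# Example: fd_usage_pct < /proc/sys/fs/file-nr
```

    A result in the low single digits confirms that fs.file-max has plenty of headroom and the limit being hit is a per-process one.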

    fs.nr_open — per-process FD ceiling

    fs.nr_open = 1048576
    

    The hard ceiling for ulimit -n for any process. Raising LimitNOFILE to a value above fs.nr_open silently fails. If you want individual services to be able to open >1M FDs (rare, but real for some proxies), raise this.

    Per-process limits — not sysctl

    A reminder, because this is the most common mix-up:

    # /etc/security/limits.conf — login shells
    *               soft    nofile  65536
    *               hard    nofile  65536
    
    # systemd unit override — services
    # /etc/systemd/system/myapp.service.d/override.conf
    [Service]
    LimitNOFILE=65536
    LimitNPROC=4096
    

    limits.conf is enforced by PAM at login time — services started by systemd ignore it. For services, set Limit* in the unit. Verify with cat /proc/<pid>/limits.
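
    The verification step can be narrowed to the one line that matters. A short sketch (nofile_limits is a hypothetical helper that reads a /proc/<pid>/limits file on stdin and prints the soft and hard open-file limits):

```shell
# Hypothetical helper: print "<soft> <hard>" open-file limits from the
# /proc/<pid>/limits format supplied on stdin.
nofile_limits() {
    awk '/^Max open files/ { print $4, $5 }'
}

# Example: nofile_limits < /proc/self/limits
```

    Running it against the service's main PID right after a restart shows whether the LimitNOFILE override actually reached the process.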


    Workload-specific starting points

    These are starting points, not gospel. Measure against your actual workload.

    General-purpose Linux server (small/medium VPS)

    # /etc/sysctl.d/99-tuning.conf
    net.core.somaxconn = 4096
    net.ipv4.tcp_max_syn_backlog = 4096
    net.ipv4.ip_local_port_range = 10240 65535
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_keepalive_time = 300
    
    vm.swappiness = 10
    

    That's it. Anything beyond this on a generic VPS is over-tuning.

    Web server / reverse proxy (nginx, HAProxy, Caddy)

    net.core.somaxconn = 16384
    net.ipv4.tcp_max_syn_backlog = 16384
    net.ipv4.ip_local_port_range = 10240 65535
    net.ipv4.tcp_tw_reuse = 1
    # tcp_fin_timeout stays at the kernel default (60); listed for documentation
    net.ipv4.tcp_fin_timeout = 60
    net.ipv4.tcp_keepalive_time = 300
    
    # Buffers — only if you've measured BDP requires them
    # net.core.rmem_max = 16777216
    # net.core.wmem_max = 16777216
    

    And: raise the proxy's own backlog (nginx listen ... backlog=16384) to match. The application's listen backlog is the one that bites first; sysctl is the second line.

    Database server (PostgreSQL, MySQL, MariaDB)

    vm.swappiness = 10
    vm.dirty_background_ratio = 5
    vm.dirty_ratio = 10
    # Or, on big-RAM boxes, prefer absolute values:
    # vm.dirty_background_bytes = 67108864       # 64 MB
    # vm.dirty_bytes = 268435456                 # 256 MB
    
    # Default; leave alone for PG/MySQL
    vm.overcommit_memory = 0
    fs.file-max = 2097152
    

    Plus: disable Transparent Huge Pages (transparent_hugepage=never at boot) — most DB engines explicitly recommend this. THP is not a sysctl; it's a /sys/ knob and a kernel boot parameter.

    NAT gateway / firewall

    net.ipv4.ip_forward = 1
    net.netfilter.nf_conntrack_max = 1048576
    # Buckets sized at roughly max/4
    net.netfilter.nf_conntrack_buckets = 262144
    
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.ip_local_port_range = 10240 65535
    

    Container host (Docker, Podman, k8s node)

    net.bridge.bridge-nf-call-iptables = 1
    net.bridge.bridge-nf-call-ip6tables = 1
    net.ipv4.ip_forward = 1
    
    # Inotify often runs out under heavy container churn
    fs.inotify.max_user_watches = 524288
    fs.inotify.max_user_instances = 8192
    
    # More PIDs available to a busy node
    kernel.pid_max = 4194304
    

    Kubernetes' own preflight checks the bridge-netfilter and ip_forward settings. The inotify limits are the most common silent failure on container hosts (file watches exhausted → file system events stop being delivered).


    The security-related sysctls below don't affect performance but belong in any sensible hardening pass.

    # Kernel pointer, dmesg, ptrace and BPF restrictions
    kernel.kptr_restrict = 2
    kernel.dmesg_restrict = 1
    kernel.yama.ptrace_scope = 1
    kernel.unprivileged_bpf_disabled = 1
    
    # Network anti-spoof / sane defaults
    net.ipv4.conf.all.rp_filter = 1
    net.ipv4.conf.default.rp_filter = 1
    net.ipv4.conf.all.accept_source_route = 0
    net.ipv4.conf.default.accept_source_route = 0
    net.ipv6.conf.all.accept_source_route = 0
    net.ipv4.conf.all.accept_redirects = 0
    net.ipv6.conf.all.accept_redirects = 0
    net.ipv4.conf.all.send_redirects = 0
    net.ipv4.icmp_echo_ignore_broadcasts = 1
    net.ipv4.icmp_ignore_bogus_error_responses = 1
    net.ipv4.tcp_syncookies = 1
    
    # Log impossible source addresses
    net.ipv4.conf.all.log_martians = 1
    

    Most of these are already the defaults on Debian 12 / Ubuntu 22.04+ / Rocky 9 / Alma 9. Including them in your /etc/sysctl.d/99-hardening.conf makes the intent explicit and survives a kernel default that changes in the wrong direction.


    A few /sys/ knobs worth knowing about

    These aren't sysctl but they live in the same neighbourhood and affect performance.

    Block I/O scheduler

    # Show the available schedulers; the active one is in brackets
    cat /sys/block/sda/queue/scheduler
    # [mq-deadline] kyber bfq none
    
    # Set persistently with a udev rule, e.g. /etc/udev/rules.d/60-ioscheduler.rules
    ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"
    ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
    

    For NVMe, none is correct — the device's own queueing handles it. For SSD/SATA, mq-deadline or bfq are the modern choices. For latency-sensitive databases, none on NVMe is what you want.
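
    When checking many devices, the bracketed entry can be extracted mechanically. A minimal sketch (active_sched is a hypothetical helper that reads the scheduler file's content on stdin):

```shell
# Hypothetical helper: print the active I/O scheduler, i.e. the name that the
# kernel shows inside square brackets in /sys/block/<dev>/queue/scheduler.
active_sched() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Example, one line per block device:
#   for f in /sys/block/*/queue/scheduler; do
#       printf '%s %s\n' "$f" "$(active_sched < "$f")"
#   done
```

    This makes it easy to confirm after a reboot that the udev rule applied to every device it was meant to match.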

    CPU governor

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    # Set everywhere (transient — set via cpupower or tuned for persistence)
    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    

    performance keeps the CPU at top frequency — eliminates the latency variance from frequency scaling decisions. For latency-sensitive servers, worth the few extra watts.

    Transparent Huge Pages

    cat /sys/kernel/mm/transparent_hugepage/enabled
    # [always] madvise never
    
    # Boot-time disable (preferred for DBs)
    # In /etc/default/grub:
    #   GRUB_CMDLINE_LINUX="... transparent_hugepage=never"
    

    Most database vendors recommend madvise or never. THP can cause latency spikes from compaction even though it nominally reduces TLB miss cost.


    How to test changes safely

    The discipline that separates real tuning from cargo-culting:

    1. Baseline the metric you're trying to move. "Latency feels high" isn't a baseline. p50/p95/p99 over a representative window, with the workload that matters. Record it.
    2. Change one thing. Two changes at once means you don't know which one moved the metric.
    3. Apply at runtime first (sysctl -w ...) — not in a d/ file. Easier to revert.
    4. Run the same test. Same load, same duration, same time of day if the workload is diurnal.
    5. Compare. If the change helped and you can explain why — make it persistent. If it didn't help, revert and don't ship.
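    6. Persist with a comment. A comment line such as "# vm.dirty_ratio was 20 (default); reduced 2026-05 after multi-second latency spikes during checkpoints", placed directly above vm.dirty_ratio = 10, is far more useful than a bare value to whoever inherits this box. Keep the comment on its own line: the sysctl.d format only documents whole-line comments.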

    A reasonable smoke test for a change to a network parameter: drive the load you care about (wrk, oha, ab, your real client), watch ss -s, ss -lnt, ss -tin, nstat, and the application's own latency histograms. For memory parameters: vmstat 1, iostat -xz 1, cat /proc/meminfo, and the application's own slow-query / GC logs. For file descriptors: lsof -p <pid> and cat /proc/<pid>/limits.
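
    For the before/after comparison, a percentile over raw latency samples is often enough. The sketch below uses the nearest-rank method; percentile is a hypothetical helper, it expects one numeric sample per line on stdin, and the file names in the usage comment are placeholders.

```shell
# Hypothetical helper: nearest-rank percentile of numeric samples on stdin.
# Usage: percentile P < samples.txt      (P is an integer from 1 to 100)
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            i = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integers
            if (i < 1) i = 1
            print v[i]
        }'
}

# Example:
#   percentile 99 < before.txt
#   percentile 99 < after.txt
```

    Comparing the same percentile from the same test run before and after the change gives a number to put in the comment that justifies the setting.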

    Xitoring's server monitoring collects a lot of this automatically via Xitogent — useful both as the baseline (so you have a real "before") and as the regression detector (so a tuning change that helps p99 latency but causes a memory leak shows up before users do). Pair with disk-health monitoring and CPU temperature alerts on hosts where the tuning is also pushing thermal limits.


    Common mistakes

    The recurring ways "kernel tuning" goes wrong.

    Editing /etc/sysctl.conf and running sysctl -p, but the values come from /etc/sysctl.d/

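    sysctl -p re-reads only /etc/sysctl.conf, so nothing under /etc/sysctl.d/ is re-applied, and a d/ file that sets the same key can still override your edit the next time the full set is loaded. sysctl --system is the correct apply command on any modern distro. Better still, don't edit sysctl.conf; drop a file into sysctl.d/.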

    Persisting without applying (or vice versa)

    sysctl -w only changes runtime state. If you don't write the value to a d/ file, a reboot reverts it. Conversely, writing to a d/ file without sysctl --system leaves the running kernel on the old value. Both halves matter.

    Two files in /etc/sysctl.d/ setting the same key

    50-default.conf says vm.swappiness = 60; 99-mine.conf says vm.swappiness = 10. The 99- file wins (load order). Fine. But if you forget the prefix matters and name your file tuning.conf, lexical ordering sorts it after every numbered file, including 99-cloudimg-ipv6.conf and 99-sysctl.conf, so where it lands relative to other files is an accident of its name. Always use a numeric prefix; 99- is the conventional "highest priority" range for operator overrides.
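
    The ordering is plain byte-wise filename sorting, which is easy to reproduce. A small sketch (load_order is a hypothetical helper; it only models filename order within one directory and ignores the precedence between directories):

```shell
# Hypothetical helper: sort file names from stdin the way a byte-wise
# (C locale) filename sort would; later lines are applied later.
load_order() {
    LC_ALL=C sort
}

printf '%s\n' tuning.conf 99-sysctl.conf 50-default.conf 99-cloudimg-ipv6.conf | load_order
# Prints 50-default.conf, 99-cloudimg-ipv6.conf, 99-sysctl.conf, tuning.conf
```

    Piping ls /etc/sysctl.d/ through the same function shows where a new file would land before you create it.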

    Tuning ulimit / nofile via sysctl

    fs.file-max is the kernel-wide ceiling — raising it does not raise the per-process limit. If a service hits "too many open files" at 1024 FDs, the fix is LimitNOFILE (systemd) or limits.conf (login shells), not sysctl.

    Copying values from a tuning guide written for kernel 3.x

    tcp_tw_recycle is gone. tcp_fin_timeout=15 is questionable. vm.swappiness=0 doesn't disable swap. net.ipv4.tcp_max_tw_buckets rarely needs raising on a 16GB+ box. Check the kernel default before assuming you need an override; sysctl <param> shows what's running right now.

    Tuning to fix the wrong bottleneck

    Your p99 latency is bad. You raise tcp_rmem. p99 doesn't move because the bottleneck was a single hot Postgres index, not network buffering. Profile first. perf top, bpftrace, the application's own slow logs, and iostat -xz 1 will tell you whether the bottleneck is CPU, I/O, network, lock contention, or app code. Sysctl won't fix CPU bottlenecks or app bugs.

    Persisting a value that the running kernel ignores

    If you write net.bridge.bridge-nf-call-iptables = 1 to a sysctl file but the br_netfilter module isn't loaded, sysctl --system will warn and the value won't be set. Either modprobe br_netfilter (and add it to /etc/modules-load.d/) or accept that the value won't be applied until the module is loaded. The same goes for nf_conntrack_max and the conntrack module.
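
    A quick pre-flight check can list the keys in a file that the running kernel does not currently expose. This is a sketch under assumptions: missing_keys is a hypothetical helper, PROC_SYS is an invented override used only for testing, and keys with dots inside a component (such as VLAN interface names) are not handled.

```shell
# Hypothetical helper: print keys from a sysctl.d-style file that have no
# matching entry under /proc/sys (typically because a module is not loaded).
# PROC_SYS is an assumed override for testing; it defaults to /proc/sys.
missing_keys() {
    root="${PROC_SYS:-/proc/sys}"
    # Drop comments and blank lines, keep the key left of "=", strip a leading "-"
    sed -e 's/^[[:space:]]*//' -e '/^[#;]/d' -e '/^$/d' \
        -e 's/[[:space:]]*=.*//' -e 's/^-//' "$1" |
    while IFS= read -r key; do
        [ -e "$root/$(printf '%s' "$key" | tr . /)" ] || printf '%s\n' "$key"
    done
}

# Example: missing_keys /etc/sysctl.d/99-tuning.conf
```

    Any key it prints points at a module to load before sysctl --system can apply the file cleanly.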

    Disabling IPv6 to "simplify"

    # net.ipv6.conf.all.disable_ipv6 = 1     # rarely the right answer in 2026
    

    Many services and tools assume IPv6 is available. Disabling it breaks things subtly (Postfix, Java apps binding [::]:port, container networking). If you genuinely don't have IPv6 connectivity, prefer to leave it enabled but unconfigured — most software handles that correctly.


    Reading the kernel docs

    When you don't know what a parameter does, sysctl itself doesn't tell you — but the kernel ships its own documentation:

    # Networking parameters
    less /usr/share/doc/linux-doc/networking/ip-sysctl.rst   # Debian/Ubuntu
    # Or browse online: kernel.org/doc/html/latest/networking/ip-sysctl.html
    
    # VM / memory
    less /usr/share/doc/linux-doc/admin-guide/sysctl/vm.rst
    
    # Generic admin guide
    less /usr/share/doc/linux-doc/admin-guide/sysctl/kernel.rst
    

    If the package isn't installed, apt install linux-doc or dnf install kernel-doc (the package name varies) gets it. The official Documentation/admin-guide/sysctl/ directory in the kernel source is the same content. It's denser than blog posts but it's correct and current to the kernel you're running, which most blog posts aren't.


    A complete worked change

    You're running an nginx reverse proxy that has started dropping connections under traffic spikes because its accept queue overflows. This is a walk-through of a change you can defend.

    # 1. Confirm the symptom
    # Look at the TcpExt "ListenOverflows" / "ListenDrops" counters
    nstat -az | grep -i listen
    # TcpExtListenOverflows           1924  ...
    # TcpExtListenDrops               1924  ...
    
    # Look at the running listen-backlog
    ss -lnt sport = :443
    # State    Recv-Q  Send-Q ...
    # LISTEN   0       511    *:443
    
    # Send-Q = 511 = nginx's default listen backlog.
    # Recv-Q rising = pending queue depth.
    
    # 2. Try a runtime change first
    sudo sysctl -w net.core.somaxconn=16384
    sudo sysctl -w net.ipv4.tcp_max_syn_backlog=16384
    
    # Update nginx to match — listen ... backlog=16384;
    sudo nginx -t && sudo systemctl reload nginx
    
    # 3. Drive load again, look at the same counters
    nstat -az | grep -i listen
    # Want: ListenOverflows / ListenDrops not increasing
    

    If the change holds up, persist it:

    sudo tee /etc/sysctl.d/99-nginx-tuning.conf >/dev/null <<'EOF'
    # Raised 2026-05-10 — nginx accept-queue overflow under spike load.
    # Paired with `listen ... backlog=16384` in nginx config.
    net.core.somaxconn = 16384
    net.ipv4.tcp_max_syn_backlog = 16384
    EOF
    
    sudo sysctl --system
    sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
    

    The comment is the load-bearing part. Six months from now, when you or someone else asks "why is this here?", the file answers without a git-blame archaeology session.
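
    The "not increasing" check in step 3 is easier to trust when it is a number. The sketch below pulls one counter out of nstat -az output so two readings can be subtracted; nstat_value is a hypothetical helper, and the counter name is the one shown in the walk-through.

```shell
# Hypothetical helper: print the absolute value of one counter from
# `nstat -az` output supplied on stdin (columns: name, value, rate).
nstat_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Example:
#   before=$(nstat -az | nstat_value TcpExtListenOverflows)
#   ... drive the same load again ...
#   after=$(nstat -az | nstat_value TcpExtListenOverflows)
#   echo "overflows during test: $(( after - before ))"
```

    A delta of zero across a load test that previously produced overflows is the evidence worth quoting in the file's comment.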


    Summary

    • Linux exposes runtime kernel knobs through sysctl (mirrored in /proc/sys/). Persistent changes go into /etc/sysctl.d/<NN>-<name>.conf, applied with sudo sysctl --system.
    • Modern kernel defaults are good. The list of parameters that actually matter on a typical server is short: a few network knobs (somaxconn, tcp_max_syn_backlog, ip_local_port_range, tcp_tw_reuse, tcp_keepalive_time), a few VM knobs (vm.swappiness, dirty ratios), and fs.file-max / fs.nr_open. Workload-specific parameters (conntrack, bridge-netfilter, inotify, max_map_count) only matter if you run that workload.
    • Per-process FD limits are not sysctl. They live in limits.conf (login shells) and systemd Limit* directives (services).
    • /sys/ knobs (block scheduler, CPU governor, THP) sit alongside sysctl and matter for latency-sensitive workloads.
    • Tuning without measurement is theater. Baseline, change one thing, measure again, persist with a comment that explains why.
    • Most "kernel tuning guides" on the internet are old. tcp_tw_recycle is gone, tcp_fin_timeout=15 is questionable, vm.swappiness=0 doesn't disable swap. Always check the running default before overriding it.

    The kernel gives you a thousand knobs because the workloads it has to serve span embedded sensors to NVMe-attached racks. Knowing which knobs are load-bearing for your workload — and leaving the rest alone — is the actual skill.