
Start with the Decision Tree

The biggest mistake in network troubleshooting is jumping to conclusions. An engineer sees high latency and immediately blames the application, or sees packet loss and rebuilds the routing table. Systematic layer-by-layer diagnosis takes longer to start but dramatically reduces total time to resolution. The model is simple: work up the stack. Is the link up? Is routing correct? Are connections completing? Is the application responding? Answer each question before moving to the next layer.

This guide is organized as exactly that — a decision tree you can walk top to bottom. Each section includes the commands to run and what their output tells you. At the bottom you will find a quick-reference cheat sheet for the moments when you need the command and cannot remember the flags.
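The walk itself can be scripted. A minimal sketch of a top-to-bottom triage pass, where the interface name, target host, and checked listener port are placeholders to adjust for your environment:

```shell
#!/bin/sh
# Sketch of a layer-by-layer triage walk. IFACE, TARGET, and the port in
# the L4 check are placeholders, not fixed recommendations.
IFACE="${IFACE:-eth0}"
TARGET="${TARGET:-8.8.8.8}"

# check LABEL CMD...  — print PASS/FAIL and return the command's status,
# so the caller stops at the first failing layer.
check() {
    label=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $label"
    else
        echo "FAIL: $label (investigate this layer before moving up)"
        return 1
    fi
}

run_triage() {
    check "L2 link up"        sh -c "ip link show $IFACE | grep -q 'state UP'" &&
    check "L3 route exists"   ip route get "$TARGET" &&
    check "L3 reachability"   ping -c 1 -W 2 "$TARGET" &&
    check "L4 listener on 22" sh -c "ss -tln | grep -q ':22 '"
}
```

Each `&&` enforces the ordering: a FAIL short-circuits the rest, which is exactly the discipline the decision tree asks for.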

Layer 2: Physical and Link Layer

Is the interface up and is the link detected?

# Link state and basic interface stats
ip link show eth0
# Look for: "state UP" and "LOWER_UP" flags
# "NO-CARRIER" means the cable is unplugged or the switch port is down

# More detail including speed, duplex, auto-negotiation
ethtool eth0
# Look for: Speed, Duplex, Link detected: yes/no
# Speed/duplex mismatches (e.g. one end forced to 100/Full while the other
# auto-negotiates down to 100/Half) cause severe throughput degradation
# and high error counters

Are there hardware errors on the interface?

# Interface error counters
ip -s link show eth0
# The second stats block shows: RX errors, dropped, overrun, mcast
# TX errors, dropped, carrier, collisions

# Driver-level error counters (more granular, vendor-specific)
ethtool -S eth0 | grep -iE "error|drop|miss|bad|crc|over"

# Watch for incrementing errors in real time
watch -n2 'ethtool -S eth0 | grep -iE "error|drop|crc"'

CRC errors indicate physical layer problems: a bad cable, a dirty fiber connector, RF interference, or a failing NIC. Even a 0.01% CRC error rate on a 10GbE link will cause measurable TCP throughput degradation because every errored frame must be retransmitted. On a production server, CRC errors should be exactly zero. If they are not, start with cable replacement before anything else.
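Slowly incrementing counters are easy to miss in a single `ethtool -S` snapshot. One way to catch them is to diff two samples taken a few seconds apart; a sketch (the exact counter names vary by NIC driver, so the grep pattern in the usage lines is an assumption):

```shell
#!/bin/sh
# Diff two "counter: value" snapshots (ethtool -S format) and print only
# the counters that incremented between sample 1 and sample 2.
# Usage: crc_delta before.txt after.txt
crc_delta() {
    awk -F': *' '
        NR == FNR { before[$1] = $2; next }       # first file: remember values
        ($1 in before) && $2 > before[$1] {        # second file: report increases
            printf "%s +%d\n", $1, $2 - before[$1]
        }
    ' "$1" "$2"
}

# Typical live usage (counter names are driver-specific):
# ethtool -S eth0 | grep -iE "error|crc" > /tmp/s1
# sleep 10
# ethtool -S eth0 | grep -iE "error|crc" > /tmp/s2
# crc_delta /tmp/s1 /tmp/s2   # any output means errors are still incrementing
```

Silence from `crc_delta` over a ten-second window is a much stronger "link is clean" signal than a one-off snapshot of cumulative counters.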

Duplex mismatch is a classic trap. The usual cause is one side forced to full duplex while the other is left to auto-negotiate: the auto side cannot complete negotiation and falls back to half duplex. The half-duplex side then logs collisions and late collisions, the full-duplex side logs CRC and runt errors, throughput caps around 10-20% of link capacity, and softirq handling drives CPU up. The symptoms look like a DDoS but are self-inflicted.
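A quick way to flag the half-duplex side is to pair the negotiated duplex from `ethtool` with the collision counter. A sketch, assuming you pass the collision count in yourself (the counter lives in the last column of the TX stats block of `ip -s link`):

```shell
#!/bin/sh
# Flag a likely duplex mismatch: negotiated half duplex plus nonzero
# collisions is the classic signature. Reads "ethtool <if>"-style text on
# stdin; the collision count is passed as $1.
duplex_check() {
    collisions=$1
    duplex=$(awk -F': *' '/Duplex:/ { print $2 }')
    if [ "$duplex" = "Half" ] && [ "$collisions" -gt 0 ]; then
        echo "SUSPECT: half duplex with $collisions collisions; check far end for forced full duplex"
    else
        echo "OK: duplex=$duplex collisions=$collisions"
    fi
}

# Live usage sketch (substitute the collision count from ip -s link show eth0):
# ethtool eth0 | duplex_check 523
```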

Layer 3: Network Layer

Is routing correct?

# Show the full routing table
ip route show table main

# Show which route will be used for a specific destination
ip route get 8.8.8.8
# Output shows: source IP, gateway, interface — confirms path selection

# Show all routing tables (useful if policy routing is in use)
ip rule list
ip route show table all | grep -v "^cache"

MTU issues

MTU mismatches cause a particularly insidious class of problem: large packets fail silently while small packets succeed. The canonical symptom is that ping works but large file transfers stall, or SSH connects but hangs when you try to run a command. This happens because ICMP "fragmentation needed" messages are being filtered somewhere on the path, so Path MTU Discovery never learns the smaller link MTU.

# Test if large packets can traverse the path (DF bit set, no fragmentation)
# Start at 1472 (1500 MTU - 28 bytes for IP+ICMP headers) and reduce
ping -M do -s 1472 -c 4 target.example.com
ping -M do -s 1400 -c 4 target.example.com
ping -M do -s 1200 -c 4 target.example.com

# Find the exact PMTU to a destination
# Keep reducing until ping succeeds; add 28 for the actual path MTU
ping -M do -s 1453 -c 2 target.example.com   # fails
ping -M do -s 1452 -c 2 target.example.com   # succeeds → PMTU is 1480

# Check the PMTU cache
ip route show cache | grep mtu
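Manually bisecting payload sizes gets tedious. The search can be automated with a binary search over the ping sizes used above (where available, `tracepath target` reports the PMTU directly and is a simpler alternative). A sketch:

```shell
#!/bin/sh
# Binary-search the largest ICMP payload that passes with the DF bit set,
# then add 28 bytes of IP+ICMP header to get the path MTU.
# probe() is separated out so it can be swapped for testing.
probe() { ping -M do -s "$1" -c 1 -W 2 "$2" >/dev/null 2>&1; }

find_pmtu() {
    target=$1; lo=552; hi=1472; best=0
    while [ "$lo" -le "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        if probe "$mid" "$target"; then
            best=$mid; lo=$((mid + 1))     # passed: try bigger
        else
            hi=$((mid - 1))                # dropped: try smaller
        fi
    done
    if [ "$best" -eq 0 ]; then
        echo "no DF-bit payload got through" >&2; return 1
    fi
    echo $((best + 28))                    # payload + IP/ICMP headers
}

# Usage: find_pmtu target.example.com   # prints the path MTU, e.g. 1480
```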

When it is an attack, not a fault — know within 2 seconds

Flowtriq detects attacks like this in under 2 seconds, classifies them automatically, and alerts your team instantly. 7-day free trial.

Start Free Trial →

Layer 4: Transport Layer

Connection state analysis with ss

ss (socket statistics) is the modern replacement for netstat. It queries the kernel directly and is significantly faster on hosts with thousands of connections:

# Summary of all socket states
ss -s

# All listening TCP sockets with process names
ss -tlnp

# Count of connections by state
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn

# Find connections to a specific remote port (e.g., database at port 5432)
ss -tn dst :5432

# Show connections with send/receive buffer usage
ss -tmn | head -30

SYN flood indicators

# Count half-open connections (SYN-RECV: server sent SYN-ACK, awaiting the final ACK)
ss -tan | grep SYN-RECV | wc -l
# Normal: 0-10 at any time
# Under SYN flood: can reach tens of thousands

# Monitor the SYN cookies counter (enabled when SYN backlog overflows)
watch -n1 'netstat -s | grep -i "syn cookies"'

# Full TCP statistics including retransmissions and failed connections
netstat -s | grep -iE "retrans|fail|reset|syn|backlog"

TIME_WAIT buildup

A large number of sockets in TIME_WAIT state is normal for a busy server: TIME_WAIT ensures delayed packets from closed connections are not mistaken for new connection data. The nominal duration is 2 × MSL (Maximum Segment Lifetime); on Linux it is a fixed 60 seconds (the kernel constant TCP_TIMEWAIT_LEN; note that net.ipv4.tcp_fin_timeout, often cited here, controls FIN_WAIT_2, not TIME_WAIT). Problems arise when the TIME_WAIT count approaches the size of the ephemeral port range (by default 32768–60999, about 28,000 ports). At that point, new outbound connections start failing with "Cannot assign requested address".

# Count TIME_WAIT sockets
ss -tan | grep TIME-WAIT | wc -l

# Enable TIME_WAIT socket reuse for outbound connections
# (safe for most server-to-server workloads; relies on TCP timestamps,
# which are enabled by default)
sysctl -w net.ipv4.tcp_tw_reuse=1

# Expand the local port range to delay exhaustion
sysctl -w net.ipv4.ip_local_port_range="10240 65535"

Layer 7: Application Layer

Is the process listening?

# Show all listening sockets with the process that owns them
ss -tlnp
# or the older equivalent:
netstat -tlnp

# Confirm a specific port is open and which process owns it
ss -tlnp 'sport = :443'

# Check if the process is actually running
systemctl status nginx
ps aux | grep nginx

What is the process doing?

# Trace system calls for a running process (attach to PID)
# Look for accept(), read(), write() calls and their return values
strace -p $(pgrep -f "nginx: worker" | head -1) -e trace=network -f 2>&1 | head -50

# Check file descriptor usage (processes hitting fd limits cause EMFILE errors)
ls /proc/$(pgrep nginx | head -1)/fd | wc -l
cat /proc/$(pgrep nginx | head -1)/limits | grep "open files"

# System-wide fd limit
sysctl fs.file-max
sysctl fs.file-nr  # allocated / unused / max

Attack vs Legitimate Fault: How to Tell the Difference

Many DDoS attack symptoms overlap with legitimate fault symptoms. The key differentiators are directionality, source distribution, and timing.

  • Link saturation: Legitimate saturation shows a gradual increase correlated with user activity (time of day, marketing events). Attack-induced saturation appears suddenly, often from unusual geographic sources, and the PPS-to-BPS ratio is skewed (many small packets = flood; few large packets = amplification).
  • CPU spike: Legitimate CPU spikes correspond to application load (requests/second, database queries). Attack-induced CPU spikes come from softirq processing (check top for high si in the CPU line) even when application request rates are low or zero.
  • Connection table exhaustion: Legitimate exhaustion builds gradually and correlates with connection rates. SYN flood exhaustion appears immediately and ss -s shows thousands of SYN-RECV sockets with no corresponding ESTABLISHED sockets completing.
  • Timing: Network hardware failures happen at random times. DDoS attacks often begin during low-staff hours (nights, weekends) and stop at predictable intervals.
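The PPS-to-BPS skew mentioned above reduces to one number: average packet size over a short window. Two /proc/net/dev samples a second apart give it directly. A sketch (the interface name is a placeholder, and the thresholds in the comments are rough rules of thumb, not fixed cutoffs):

```shell
#!/bin/sh
# Average inbound packet size between two samples. Tiny averages (<100
# bytes) during a traffic spike suggest a small-packet flood; averages
# near 1500 bytes suggest volumetric/amplification traffic.

# rx_sample IFACE — print "RXbytes RXpackets" from one /proc/net/dev read.
# (Assumes the interface field ends in ":" with whitespace after it.)
rx_sample() {
    awk -v ifc="$1:" '$1 == ifc { print $2, $3 }' /proc/net/dev
}

# avg_pkt_size BYTES1 PKTS1 BYTES2 PKTS2 — bytes per packet over the window.
avg_pkt_size() {
    db=$(( $3 - $1 )); dp=$(( $4 - $2 ))
    if [ "$dp" -gt 0 ]; then echo $(( db / dp )); else echo 0; fi
}

# Live usage sketch (eth0 is a placeholder):
# s1=$(rx_sample eth0); sleep 1; s2=$(rx_sample eth0)
# avg_pkt_size $s1 $s2
```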

Flowtriq's continuous monitoring resolves the ambiguity automatically. When a traffic anomaly is detected, it captures the first 30 seconds of packet data before most engineers would even notice the symptom. The classification engine identifies whether the pattern matches known attack signatures (SYN flood, UDP amplification, ICMP flood, etc.) and surfaces that in the alert, so you arrive at the incident already knowing whether you are dealing with a fault or an attack.

When to Call the ISP vs Fix It Yourself

  • Call the ISP: Loss appearing at a transit hop in mtr that you do not control, link down events that correspond to provider maintenance windows, BGP prefix withdrawals visible in your routing table, volumetric DDoS traffic that saturates your upstream port before reaching your host.
  • Fix it yourself: CRC errors on your own cabling, duplex mismatches on your NIC or switch port, MTU mismatches between your hosts, TIME_WAIT exhaustion, application fd limit hits, host-based firewall rules blocking legitimate traffic, kernel buffer underruns from inadequate sysctl tuning.
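Before calling the ISP, it helps to pin the loss to a specific hop. A sketch that scans `mtr --report` output and flags hops above a loss threshold (the column layout assumed here is mtr's default report format: hop, host, Loss%, then timing columns; remember that loss at an intermediate hop which does not persist to the final hop is usually ICMP rate-limiting, not real loss):

```shell
#!/bin/sh
# Flag mtr report hops whose loss exceeds a threshold (default 5%).
# Usage: mtr --report -c 30 target.example.com | lossy_hops [threshold]
lossy_hops() {
    thresh=${1:-5}
    awk -v t="$thresh" '
        /\|--/ {                               # hop lines look like "2.|-- host  13.3%  ..."
            hop = $1;  sub(/\..*/, "", hop)    # "2.|--" -> "2"
            loss = $3; sub(/%/, "", loss)
            if (loss + 0 > t)
                printf "hop %s (%s) loss=%s%%\n", hop, $2, loss
        }
    '
}
```

If the only flagged hops sit inside the provider's network and the loss carries through to the destination, that is evidence for the ISP ticket; loss starting at hop 1 points back at your own gear.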

Quick Diagnosis Cheat Sheet

# L2 — Link state and errors
ip link show eth0 && ethtool eth0 && ethtool -S eth0 | grep -iE "error|drop|crc"

# L3 — Routing and reachability
ip route get 8.8.8.8
ping -M do -s 1472 -c 3 target.example.com   # MTU check

# L4 — Connection state
ss -s
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn
ss -tan | grep SYN-RECV | wc -l

# L7 — Process and fd state
ss -tlnp
cat /proc/$(pgrep nginx|head -1)/limits | grep "open files"

# DDoS indicators
watch -n1 'grep eth0: /proc/net/dev'         # cumulative counters; read PPS/BPS from the per-second deltas
netstat -s | grep -iE "syn cookies|retrans"
mtr --report -c 30 8.8.8.8

Pro tip: Keep a baseline snapshot of ss -s, netstat -s, and cat /proc/net/dev output from a healthy host. When troubleshooting, comparing live output against a known-good baseline makes anomalies immediately obvious and eliminates the "is this normal?" guessing game under pressure.
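A sketch of that baseline workflow: save the snapshots once from a healthy host, then diff live output against them during an incident (the baseline directory path is a placeholder):

```shell
#!/bin/sh
# Save a known-good snapshot of socket and protocol counters, then diff
# live state against it during an incident. BASEDIR is a placeholder path.
BASEDIR="${BASEDIR:-/var/tmp/net-baseline}"

baseline_save() {
    mkdir -p "$BASEDIR"
    ss -s      > "$BASEDIR/ss-s.txt"
    netstat -s > "$BASEDIR/netstat-s.txt"
}

baseline_diff() {
    echo "=== ss -s: baseline vs now ==="
    ss -s | diff "$BASEDIR/ss-s.txt" - || true
    echo "=== netstat -s: baseline vs now ==="
    netstat -s | diff "$BASEDIR/netstat-s.txt" - || true
}
```

Counters will always drift a little; what you are looking for in the diff is a counter that jumped by orders of magnitude, or a socket state (SYN-RECV, TIME-WAIT) whose count was near zero at baseline.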

Protect your infrastructure with Flowtriq

Per-second DDoS detection, automatic attack classification, PCAP forensics, and instant multi-channel alerts. $9.99/node/month.

Start your free 7-day trial →