Nobody gets fired for a sophisticated zero-day. People get fired for the basics: the port that was left open, the alert that was ignored, the backup that was never tested. After talking to dozens of infrastructure teams post-incident, we keep seeing the same patterns over and over. Here are the ten mistakes that actually end careers, and what to do instead.
1. Ignoring Alerts Until Users Report Downtime
This is the single most common way infrastructure engineers lose credibility. An alert fires at 2:47 AM. You glance at it, decide it is probably a false positive, and go back to sleep. At 7:15 AM your VP of Engineering is asking why the API has been down for four hours and customers are posting about it on Twitter.
The root cause is almost never laziness. It is alert fatigue. When your monitoring system fires 200 alerts a day and 195 of them are noise, you train yourself to ignore all of them. This is a system design problem, not a people problem, but you are the one who gets blamed.
Fix it:
- Audit every alert that fired in the last 30 days. Delete or tune any alert that did not require human action.
- Implement severity tiers. Critical alerts page you. Warning alerts go to Slack. Info alerts go to a dashboard nobody is obligated to watch.
- Set alert thresholds based on actual baseline data, not gut feelings. If your normal CPU usage is 60%, an alert at 70% is noise. An alert at 92% sustained for 5 minutes is signal.
- Track your alert-to-incident ratio monthly. If fewer than 1 in 10 alerts result in action, your thresholds are wrong.
The goal is not zero alerts. The goal is that every alert that fires is worth waking someone up for. If it is not worth waking someone up, it is not an alert — it is a log entry.
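The alert-to-incident ratio above is easy to track if you keep a simple disposition log. A minimal sketch, assuming a hypothetical log with one line per alert in the form name,disposition, where disposition is "actioned" or "ignored":

```shell
# Computes the alert-to-action ratio from a hypothetical disposition log
# read on stdin. The 1-in-10 threshold matches the rule of thumb above.
alert_ratio() {
  awk -F, '
    { total++ }
    $2 == "actioned" { acted++ }
    END {
      ratio = (total > 0) ? acted / total : 0
      printf "alerts=%d actioned=%d ratio=%.2f\n", total, acted, ratio
      if (ratio < 0.1)
        print "WARNING: fewer than 1 in 10 alerts led to action - retune thresholds"
    }'
}

# Usage: alert_ratio < alerts.csv
```

Run it monthly against the same log you audit in the first step; if the warning fires, the next tuning pass is overdue.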
2. Running Production Without Any DDoS Detection
"We are not important enough to get DDoS'd" is the infrastructure equivalent of "I don't need a seatbelt because I'm a good driver." According to Cloudflare's 2025 threat report, DDoS attacks increased 117% year-over-year. Netscout recorded over 17 million attacks in the first half of 2025 alone. The median attack duration is under 10 minutes — short enough that by the time you notice, diagnose, and respond manually, the damage is done.
The attacks that take down small and mid-size infrastructure are not sophisticated. They are commodity UDP floods and SYN floods launched by teenagers with $20 booter subscriptions. Your infrastructure does not need to be a high-value target. It just needs to be online.
Fix it:
- Deploy detection at the node level. You cannot mitigate what you cannot see.
- Establish a baseline for your normal traffic patterns — PPS, bandwidth, protocol distribution — so you can detect deviations automatically.
- Have a mitigation path ready before an attack happens: upstream blackhole, scrubbing service, or CDN failover.
- Test your detection by running controlled traffic generation against a staging node. If your monitoring does not fire, your monitoring is broken.
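The baseline-deviation idea can be sketched in a few lines of shell using the kernel's interface counters. The 3x multiplier, the interface name, and the baseline file path are illustrative assumptions, not recommendations:

```shell
# Flags an anomaly when measured packets-per-second exceeds a multiple
# of the recorded baseline. Kept as a pure function so it is testable.
pps_anomaly() {
  pps=$1
  baseline=$2
  if [ "$pps" -gt $((baseline * 3)) ]; then
    echo "ALERT: ${pps} pps exceeds 3x baseline (${baseline} pps)"
    return 0
  fi
  return 1
}

# Live sampling on Linux: read the rx counter twice, one second apart.
# r1=$(cat /sys/class/net/eth0/statistics/rx_packets)
# sleep 1
# r2=$(cat /sys/class/net/eth0/statistics/rx_packets)
# pps_anomaly $((r2 - r1)) "$(cat /var/lib/pps.baseline)"
```

A real detector would also track bandwidth and protocol mix, but even this crude check beats finding out from a customer.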
3. Exposing Management Interfaces to the Public Internet
This one should be extinct by now, and yet Shodan still indexes millions of exposed SSH daemons, MySQL instances, Redis servers, phpMyAdmin panels, and Kubernetes dashboards on every scan. Every single one of these is a breach waiting to happen.
Run this on your production servers right now:
# List all TCP ports listening on all interfaces (0.0.0.0 or ::)
ss -tlnp | grep -E '0\.0\.0\.0|::' | awk '{print $4, $NF}'
# Example output that should terrify you:
# 0.0.0.0:22 users:(("sshd",pid=1234,fd=3))
# 0.0.0.0:3306 users:(("mysqld",pid=5678,fd=22))
# 0.0.0.0:6379 users:(("redis-server",pid=9012,fd=6))
# :::8080 users:(("java",pid=3456,fd=128))
If MySQL, Redis, Elasticsearch, or any database is listening on 0.0.0.0, you have a problem. If your admin panel is accessible without a VPN, you have a problem. If SSH is on port 22 with password authentication enabled, you have a very common problem that bots are already exploiting.
Fix it:
- Bind management services to 127.0.0.1 or a private network interface.
- Put admin panels behind a VPN or SSH tunnel. WireGuard takes 10 minutes to set up.
- Use firewall rules to whitelist management access by source IP.
- Run ss -tlnp as part of your deployment pipeline. If a new service binds to 0.0.0.0, fail the deploy.
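The deploy-pipeline gate can be sketched as a small filter over ss output. This is a sketch, not a drop-in check: the allowlist (22, 80, 443) is an assumption you should adjust for your stack, and the column position assumes ss -tlnH on a modern Linux:

```shell
# Reads `ss -tlnH` output on stdin and fails if any listener is bound to
# a wildcard address on a port outside the allowlist.
gate_wildcard_listeners() {
  allow=" 22 80 443 "
  awk '{print $4}' | while read -r addr; do
    port=${addr##*:}
    host=${addr%:*}
    case "$host" in
      "0.0.0.0"|"*"|"[::]")
        case "$allow" in
          *" $port "*) ;;  # explicitly allowed wildcard listener
          *) echo "FORBIDDEN: wildcard listener on port $port"; exit 1 ;;
        esac ;;
    esac
  done
}

# In the deploy pipeline:
# ss -tlnH | gate_wildcard_listeners || { echo "deploy blocked"; exit 1; }
```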
4. No Incident Response Plan
When an attack hits at 3 AM and you are winging it — Googling iptables rules, trying to remember which upstream provider to call, guessing which Slack channel to post in — you are not responding to an incident. You are panicking in public. And everyone watching can tell the difference.
An incident response plan does not need to be a 40-page document. It needs to answer six questions:
- Who gets notified? — Primary on-call, escalation path, who has root access at 3 AM.
- How do we assess severity? — Clear criteria: P1 is customer-facing outage, P2 is degraded, P3 is internal only.
- What are the first 5 minutes? — Verify the alert, check dashboards, open a war room channel.
- What can we do immediately? — Pre-approved mitigation actions: enable rate limiting, activate blackhole route, failover to secondary.
- Who communicates externally? — Status page updates, customer comms, social media.
- When is it over? — Clear resolution criteria, not "it seems fine now."
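The severity criteria can even be encoded so nobody has to make a judgment call half-asleep. A deliberately trivial sketch, assuming the P1/P2/P3 definitions above:

```shell
# Maps two yes/no answers to the severity tiers defined above:
# customer-facing outage -> P1, degraded service -> P2, internal only -> P3.
severity() {
  customer_facing=$1  # is the outage customer-facing? yes/no
  degraded=$2         # is service degraded? yes/no
  if [ "$customer_facing" = "yes" ]; then
    echo "P1"
  elif [ "$degraded" = "yes" ]; then
    echo "P2"
  else
    echo "P3"
  fi
}
```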
Print this out. Tape it to the wall next to the on-call laptop. Practice it quarterly. When the next incident hits, you will look like a professional instead of a deer in headlights.
5. Skipping Security Patches for Uptime
The irony is exquisite: you skip a patch because you cannot afford 5 minutes of downtime, and then the unpatched vulnerability causes 8 hours of downtime plus a data breach. This is not a hypothetical. It is the exact sequence of events behind some of the largest breaches in history.
Equifax (2017) was breached through CVE-2017-5638, an Apache Struts vulnerability that had a patch available two months before the breach. The WannaCry ransomware (2017) exploited CVE-2017-0144, patched by Microsoft 59 days before the worm spread. Log4Shell (CVE-2021-44228) had active exploitation within hours of disclosure, but organizations that patched within the first 48 hours were largely unaffected.
Fix it:
- Automate patching for non-critical systems. unattended-upgrades on Debian/Ubuntu handles security patches automatically.
- For critical systems, establish a patch SLA: critical CVEs within 48 hours, high within 7 days, medium within 30 days.
- Use rolling restarts and blue-green deployments to patch without downtime. If your architecture cannot handle a single node restarting, that is a separate problem you also need to fix.
- Subscribe to the CVE feeds for every piece of software in your stack. Not RSS — direct email alerts.
# Check for pending security updates on Debian/Ubuntu
apt list --upgradable 2>/dev/null | grep -i security
# On RHEL/CentOS
dnf updateinfo list security
# See when your system was last updated
stat /var/log/apt/history.log | grep Modify
# If that date is more than 30 days ago, you have a problem
6. Using the Same Credentials Everywhere
One SSH key for every server. One password for every database. The root password written on a sticky note that has been photographed in the background of three Zoom calls. This is how a single compromised credential turns into a total infrastructure takeover.
The attack path is trivial: compromise one low-security staging server, find the SSH key in ~/.ssh/, discover it works on every production server, game over. Or worse: a developer leaves the company, and their personal SSH key still has root access to 47 servers because nobody tracks which keys are deployed where.
Fix it:
- Use unique SSH keys per environment. Better yet, use short-lived SSH certificates (HashiCorp Vault, step-ca, or Teleport).
- Deploy a secrets manager for database credentials, API keys, and service tokens. Even a self-hosted solution like Vault is better than plaintext in .env files.
- Audit deployed SSH keys quarterly: find /home -name "authorized_keys" -exec wc -l {} \; tells you how many keys are trusted on each account.
- Rotate credentials on a schedule, not just when someone leaves. If a credential is older than 90 days, rotate it.
- Never share credentials over Slack, email, or any channel that retains history. Use a one-time secret sharing tool.
# Audit: Find all authorized_keys files and count entries
find /home -name "authorized_keys" -exec sh -c \
'echo "$(wc -l < "$1") keys in $1"' _ {} \;
# Audit: Find SSH keys that have no passphrase
# (if ssh-keygen can read it without prompting, it is unencrypted)
find /home -name "id_*" ! -name "*.pub" -exec sh -c \
'ssh-keygen -y -P "" -f "$1" >/dev/null 2>&1 && echo "UNPROTECTED: $1"' _ {} \;
7. No Network Segmentation
A flat network is an attacker's paradise. One compromised web server gives direct access to the database server, the monitoring stack, the CI/CD system, and the backup infrastructure. There is no lateral movement required when everything is already on the same Layer 2 segment with no firewall rules between them.
Network segmentation is the difference between "we lost one web server" and "we lost everything." It is also the difference between a contained incident and a reportable breach, because most compliance frameworks (PCI DSS, SOC 2, HIPAA) explicitly require network segmentation.
Fix it:
- At minimum, separate your network into three zones: public-facing (web servers, load balancers), application tier (app servers, workers), and data tier (databases, caches, backups).
- Use VLANs or separate subnets with firewall rules between them. The data tier should only accept connections from the application tier on specific ports.
- Your management network (SSH, monitoring, deployment) should be a separate segment accessible only via VPN or bastion host.
- Deny all inter-zone traffic by default, then explicitly allow only what is needed. Document every allowed flow.
If you cannot implement VLANs (some cloud providers make this painful), host-based firewalls are your fallback. Every server should have iptables or nftables rules that restrict which other servers can connect to which ports. It is not as clean as network-level segmentation, but it is infinitely better than nothing.
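As a sketch of that host-based fallback, here is what the data-tier rules might look like in nftables. The subnets and ports are illustrative assumptions, not a recommendation for your network:

```shell
# Data-tier host: drop everything by default, then allow only the app
# subnet to reach MySQL and the management subnet to reach SSH.
# 10.0.2.0/24 (app tier) and 10.0.9.0/24 (management) are hypothetical.
nft add table inet datatier
nft add chain inet datatier input '{ type filter hook input priority 0 ; policy drop ; }'
nft add rule inet datatier input ct state established,related accept
nft add rule inet datatier input iif "lo" accept
nft add rule inet datatier input ip saddr 10.0.2.0/24 tcp dport 3306 accept
nft add rule inet datatier input ip saddr 10.0.9.0/24 tcp dport 22 accept
```

Note the default-drop policy: every allowed flow is an explicit rule you can read, which is exactly the "document every allowed flow" discipline from the list above.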
8. Logging Everything but Reading Nothing
You have 4 TB of logs in Elasticsearch. You are paying $800/month for log storage. And nobody has looked at them since the cluster was set up. Congratulations, you have built an expensive compliance checkbox that provides zero security value.
Logs without alerting are just disk usage. The purpose of logging is not to have logs — it is to detect anomalies, investigate incidents, and provide an audit trail. If you cannot answer "what happened at 2:47 AM last Tuesday" within 5 minutes using your log infrastructure, your logging is not working.
Fix it:
- Define what you actually need to log. Authentication events, privilege escalation, network connections, configuration changes, and application errors. Everything else is optional.
- Set up log-based alerts for the events that matter: failed SSH logins from new IPs, sudo usage by non-admin accounts, unexpected outbound connections, file integrity changes in /etc.
- Implement log rotation and retention policies. 90 days of searchable logs and 1 year of compressed archives covers most compliance requirements.
- Schedule a weekly 15-minute log review. Look at top authentication failures, top blocked firewall hits, and any new error patterns. Make it a habit, not a heroic effort.
# Quick audit: are your logs actually being read?
# Check the last access time on key log files
stat -c '%n: last accessed %x' /var/log/auth.log /var/log/syslog /var/log/kern.log
# If the access time matches the modify time, nobody is reading them.
# Set up a basic failed-auth alert with a one-liner:
journalctl -u sshd -f | grep --line-buffered "Failed password" | \
while read -r line; do
    echo "$line" | mail -s "SSH Auth Failure" [email protected]
done &
9. No Backup or Disaster Recovery Testing
There is a saying in the industry: "You do not have backups. You have backup files. You have backups only after you have successfully restored from them." This is the Schrödinger's backup problem — your backup is simultaneously working and broken until you try to restore it, and most people never try.
The failure modes are endless and creative. The backup job has been silently failing for six months because the disk filled up. The backup completes successfully but the MySQL dump is corrupted because it was taken without --single-transaction on an InnoDB database. The backup is perfect but the restore documentation assumes a server configuration that no longer exists. The backup is stored on the same physical host as the production data it is supposed to protect.
Fix it:
- Test a full restore quarterly. Not a spot check — a complete restore to a clean server. Time it. Document every step. Fix every step that requires tribal knowledge.
- Monitor your backup jobs. A backup job that fails silently is worse than no backup at all, because you have false confidence.
- Store backups in a different geographic region and on a different provider than your production infrastructure.
- Calculate your actual Recovery Time Objective (RTO) and Recovery Point Objective (RPO). If your business needs 1-hour RPO and your backups run daily, you have a gap that nobody has acknowledged.
- Test that your backups are encrypted and that you have the decryption keys stored separately. Encrypted backups with lost keys are just random data.
# Quick backup health check
# 1. When did the last backup run?
ls -la /var/backups/ | head -5
# 2. Is the backup file a reasonable size? (not 0 bytes, not truncated)
du -sh /var/backups/latest.sql.gz
# 3. Can you actually read it?
gunzip -t /var/backups/latest.sql.gz && echo "OK" || echo "CORRUPT"
# 4. The real test: restore to a scratch database
mysql -u root -e "CREATE DATABASE backup_test;"
gunzip -c /var/backups/latest.sql.gz | mysql -u root backup_test
mysql -u root -e "SELECT COUNT(*) FROM backup_test.users;"
mysql -u root -e "DROP DATABASE backup_test;"
10. Not Learning from Incidents
The attack is over. The servers are back up. Everyone is exhausted. The last thing anyone wants to do is sit in a meeting and talk about what went wrong. So you do not. And six months later, the exact same thing happens again, in the exact same way, and the exact same person gets paged at 3 AM to fight the exact same fire.
Organizations that do not run post-mortems are doomed to repeat their incidents. Worse, they lose institutional knowledge every time someone leaves the team. The engineer who figured out that the root cause was a misconfigured BGP announcement is gone now, and nobody wrote it down.
Fix it by running a post-mortem for every P1 and P2 incident. Use this template:
Post-Mortem Template
Incident: [One-line description]
Date: [When it happened]
Duration: [Time to detection] + [Time to resolution]
Severity: [P1/P2/P3] — [Customer impact summary]
Timeline: Minute-by-minute log of what happened, what was tried, and when it was resolved. Use timestamps.
Root Cause: The actual technical root cause. Not "human error" — go deeper. Why was the human able to make that error?
What Went Well: Detection speed, communication, teamwork — acknowledge what worked.
What Went Poorly: Gaps in monitoring, slow escalation, missing runbooks.
Action Items: Specific, assigned, with deadlines. "Improve monitoring" is not an action item. "Add PPS threshold alert on edge routers at 150% of baseline, assigned to Jamie, due March 20" is an action item.
The key rule of post-mortems: blameless, not accountable-less. You are not looking for someone to punish. You are looking for systems to fix so that the same human mistake becomes impossible or at least detectable. If your post-mortem concludes with "tell Dave to be more careful," you have failed at the exercise.
The Common Thread
Every mistake on this list shares one trait: it is a known problem with a known fix that was deprioritized until it was too late. Nobody gets fired for a novel zero-day exploit that defeats multiple layers of defense. People get fired for the SSH port that was open to the world, the alert that was ignored, the backup that was never tested.
The fix is not heroism. It is hygiene. Spend one hour per week on security fundamentals — patching, access review, log review, backup testing — and you will avoid 90% of the incidents that end careers. The remaining 10% will be genuinely interesting problems that make you better at your job instead of ending it.