DDoS or Platform Failure? A Diagnostic Flowchart for Differentiating Attack vs Operational Outage

2026-02-13

Rapidly determine whether an outage is a DDoS, a CDN misconfiguration, a BGP/DNS routing problem, or an internal failure, using a tactical decision tree and tooling checklist for 2026.

When every minute offline costs millions: detect attack vs platform failure fast

When your monitoring dashboards and alerting queues light up, the first question you must answer is simple and urgent: is this a large-scale DDoS or a platform/third-party failure? The distinction matters: mitigation steps, communication with upstream providers, and SLA decisions all change. This guide arms engineering and SRE teams with a pragmatic decision tree, concrete commands, and a tooling checklist to classify outages quickly as DDoS, CDN misconfiguration, BGP/DNS routing problems, or an internal platform failure.

Why this matters in 2026

Late 2025 and early 2026 saw high-profile incidents where major properties appeared offline simultaneously — often traced back to provider configuration errors or routing anomalies rather than pure volumetric attacks. At the same time, adversaries use increasingly sophisticated multi-vector, L7 and AI-driven attack tooling that can mimic operational failure signatures. The result: false assumptions, delayed mitigations, and poor communications. In 2026, fast, evidence-based classification is an operational imperative.

How to use this diagnostic flowchart

This is an inverted-pyramid decision tree: start with cheap, high-signal checks that rule out the largest classes of issues, then progress to deeper packet/route analysis and mitigation. Below you'll find: a stepwise flow, the exact commands and telemetry to collect, a prioritized tooling checklist, and recommended playbook actions for each classification.

Decision tree (high-level)

Run these checks in parallel where possible. Assign a single incident commander to collate answers and mark the classification stage.

  1. Confirm symptom scope
    • Are internal service health checks failing or only external user traffic?
    • Is the issue global, regional, or a single POP/availability zone?
  2. Surface DNS & TLS
    • Can you resolve your authoritative domain? (dig +trace, dig @1.1.1.1 A/AAAA/CNAME)
    • Is your certificate valid at edge? (openssl s_client -connect host:443)
  3. Check BGP & routing
    • Are your prefixes visible in route collectors? (routeviews, RIPE RIS, bgp.he.net)
    • Look for AS path changes or withdrawals.
  4. Edge/CDN signs
    • Are certain POPs unreachable, or is the service returning origin errors (502/503) from edge?
    • Do provider status pages show maintenance or incidents?
  5. Traffic analysis
    • Is traffic volumetric (massive spike in bandwidth) or request-pattern abnormal (many slow POSTs, repeated nonces)?
    • Check NetFlow/sFlow, CDN edge logs, and WAF logs for signature patterns.
  6. Internal health & telemetry
    • Are internal metrics (CPU, memory, thread pools, DB latency) spiking in a way consistent with genuine load?
    • Do application logs show internal exceptions or only proxy errors?

Detailed diagnostic steps and commands

1. Confirm scope and quick triage

  • Check global synthetic probes (uptime checks, synthetic users). If synthetics across multiple regions fail, the problem is probably at the network, edge, or provider layer. For building reliable probes and automating first-level checks, teams often consult hybrid edge workflow guidance.
  • Ask: are internal telemetry endpoints reachable? If internal services are healthy and only external traffic fails, suspect CDN/BGP/DNS or an upstream provider.
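A minimal triage sketch along these lines, assuming a hypothetical edge URL and internal health endpoint (substitute your own):

  #!/usr/bin/env bash
  # Quick scope triage: compare external (edge) reachability with internal health.
  # Hostnames below are placeholders; substitute your own edge URL and internal
  # health endpoint.
  EDGE_URL="https://www.example.com/healthz"
  INTERNAL_URL="http://internal-lb.example.internal:8080/healthz"

  echo "== External (edge) =="
  curl -s -o /dev/null --max-time 10 \
    -w "HTTP %{http_code}  dns %{time_namelookup}s  connect %{time_connect}s  total %{time_total}s\n" \
    "$EDGE_URL" || echo "external check failed"

  echo "== Internal =="
  curl -s -o /dev/null --max-time 5 \
    -w "HTTP %{http_code}  total %{time_total}s\n" \
    "$INTERNAL_URL" || echo "internal check failed"

  # Internal healthy + external failing => suspect CDN/BGP/DNS or an upstream provider.
  # Both failing => suspect internal platform failure or a shared dependency.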

2. DNS & TLS checks — fast and high-signal

  • dig +trace example.com; dig @8.8.8.8 +short example.com
  • nslookup or host to verify authoritative servers are responding
  • openssl s_client -connect host:443 -servername host — look for TLS handshake failures surfaced at edge
  • If DNS resolution fails from many public resolvers but authoritative answers exist, suspect delegation or resolver blockage. If you need to run a registrar or ownership check as part of triage, see how to conduct due diligence on domains.
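A combined sketch of these DNS and TLS checks, using example.com as a placeholder domain and a handful of common public resolvers:

  #!/usr/bin/env bash
  # DNS and TLS spot checks from several public resolvers.
  DOMAIN="example.com"
  HOST="www.example.com"

  # Follow delegation from the root down to the authoritative servers.
  dig +trace "$DOMAIN" | tail -n 20

  # Compare answers across public resolvers; divergence suggests delegation
  # or resolver-level problems rather than an origin outage.
  for resolver in 1.1.1.1 8.8.8.8 9.9.9.9; do
    echo "== $resolver =="
    dig @"$resolver" +short A "$DOMAIN"
    dig @"$resolver" +short AAAA "$DOMAIN"
  done

  # TLS handshake at the edge: look for handshake failures or an unexpected
  # certificate chain served by the CDN.
  echo | openssl s_client -connect "$HOST:443" -servername "$HOST" 2>/dev/null \
    | openssl x509 -noout -subject -issuer -dates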

3. BGP and routing checks — route visibility

  • Check route collectors: https://bgp.he.net, https://bgpmon.io, https://routeviews.org, RIPE RIS
  • Confirm origin AS and prefix announcements. If prefixes are withdrawn or hijacked, this indicates BGP issues. For architectural patterns that reduce routing blast radius, review edge-first patterns for 2026 cloud architectures, which also covers RPKI and origin validation best practices.
  • Use whois -h whois.radb.net <prefix> to inspect route objects.
  • Check for sudden AS path changes or unexpected origin ASes — signs of hijacks/route leaks.
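A route-visibility sketch built on these checks; the prefix and origin AS are placeholders, and the RIPEstat data API call is an assumption about a publicly available endpoint rather than a required tool:

  #!/usr/bin/env bash
  # Route visibility checks for an announced prefix.
  PREFIX="192.0.2.0/24"     # placeholder: your announced prefix
  ORIGIN_AS="AS64500"       # placeholder: your expected origin AS

  # IRR route objects registered for the prefix.
  whois -h whois.radb.net "$PREFIX"

  # How the prefix currently appears to route collectors (RIPEstat data API,
  # assumed available; verify in your environment).
  curl -s "https://stat.ripe.net/data/routing-status/data.json?resource=$PREFIX" \
    | python3 -m json.tool | head -n 40

  # Sanity check: does the visible origin match $ORIGIN_AS? A missing
  # announcement suggests a withdrawal; a different origin suggests a
  # possible hijack or route leak.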

4. Edge/CDN validation

  • Query CDN provider APIs/status pages. Check for reported incidents (in 2026 many providers publish automated incident telemetry). For playbooks on handling provider outages and public comms, see notification and recipient safety playbooks.
  • Fetch headers from edge using curl -I -v and inspect header footprints (x-cache, via, server).
    • 502/503 from the CDN edge often indicates an origin connectivity or configuration problem; a 503 with no edge headers at all is more likely a provider outage.
  • Check whether only certain POPs are affected: use regional probes or RUM data.
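A sketch of the edge-vs-origin comparison; the hostname, origin IP, and header names are illustrative, since CDN header footprints vary by provider:

  #!/usr/bin/env bash
  # Compare a response served through the CDN edge with one fetched directly
  # from the origin.
  HOST="www.example.com"
  ORIGIN_IP="203.0.113.10"   # placeholder: your origin's public IP

  echo "== Via CDN edge =="
  curl -sI "https://$HOST/" | grep -iE '^(HTTP/|server|via|x-cache|cf-ray|x-served-by)'

  echo "== Direct to origin (bypassing CDN resolution) =="
  curl -sI --resolve "$HOST:443:$ORIGIN_IP" "https://$HOST/" | head -n 5

  # Edge returns 502/503 while the origin answers directly => suspect CDN/edge.
  # Both fail => suspect the origin itself or the network path.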

5. Traffic analysis & NetFlow

  • Collect NetFlow/sFlow records and summarise top source IPs, ASNs, ports and protocols.
  • A volume spike concentrated across many IPs within a single ASN could be mis-routed traffic or a reflector; volume from many ASNs with randomized patterns is more likely a distributed DDoS.
  • Look for L7 signatures in access logs: repeated identical user-agents, slowloris patterns, cache-busting query strings.
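A rough summarisation sketch, assuming nfdump-formatted flow files and an nginx combined-format access log; adjust paths and field positions for your environment:

  #!/usr/bin/env bash
  # Summarise flow data and access logs to separate volumetric floods from
  # L7 request-pattern anomalies.
  FLOW_DIR="/var/cache/nfdump"              # placeholder flow store
  ACCESS_LOG="/var/log/nginx/access.log"    # placeholder access log

  # Top talkers by bytes (requires nfdump and nfdump-formatted flow files).
  nfdump -R "$FLOW_DIR" -s srcip/bytes -n 20

  echo "== Top client IPs =="
  awk '{print $1}' "$ACCESS_LOG" | sort | uniq -c | sort -rn | head -n 20

  echo "== Top user-agents =="
  # Assumes the nginx "combined" log format, where the user-agent is the
  # sixth double-quoted field.
  awk -F'"' '{print $6}' "$ACCESS_LOG" | sort | uniq -c | sort -rn | head -n 20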

6. Internal platform and application checks

  • Inspect application logs and stack traces. A surge of 5xx with internal exceptions points to a platform failure.
  • Check queue/backpressure metrics, DB connection pools, leader election systems. If these degrade, the root cause may be application scaling or a cascade failure.
  • If internal control plane (orchestration) is compromised/unreachable, treat as internal failure.
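A sketch of these internal checks, assuming a Prometheus endpoint and conventional metric names such as http_requests_total; adapt the log path and query to your stack:

  #!/usr/bin/env bash
  # Distinguish genuine load from an internal failure using application logs
  # and Prometheus.
  APP_LOG="/var/log/app/application.log"     # placeholder log path
  PROM="http://prometheus.internal:9090"     # placeholder Prometheus URL

  # Count recent exceptions: a surge of stack traces points at the platform,
  # not the edge.
  grep -cE 'Exception|Traceback|ERROR' "$APP_LOG"

  # Ratio of 5xx responses over the last 5 minutes, via the Prometheus HTTP API.
  QUERY='sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
  curl -sG "$PROM/api/v1/query" --data-urlencode "query=$QUERY"

  # Also inspect DB connection pool and queue depth metrics; sustained
  # saturation without a matching traffic increase suggests a cascade failure.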

Classification matrix: signatures and actions

Use this quick reference once you have core telemetry.

  • DDoS
    • Signals: massive bandwidth spike, many source ASNs, diverse geographies, odd packet sizes, clear spikes in WAF/edge rule triggers.
    • Immediate actions: engage scrubbing partner or CDN scrubbing, add rate-limits, deploy WAF rules, blackhole specific flows only if necessary.
  • CDN misconfiguration / provider outage
    • Signals: edge responses with provider error headers, 502/503 from the CDN, synthetic failure only from external clients, provider status page reports.
    • Immediate actions: failover to alternative POPs/origins, enforce origin direct routes, switch DNS to secondary provider or adjust CNAMEs where safe.
  • BGP / routing issue
    • Signals: prefixes withdrawn, unexpected AS origins in route collectors, regional reachability problems, traceroutes halting at a particular ASN.
    • Immediate actions: contact upstream transit/IX peers, request community-based mitigation (blackholing or de-aggregation), advertise emergency more-specific prefixes if appropriate. Architecture choices that enable multi-provider resilience can materially shorten recovery times — see edge-first patterns for multi-homing strategies.
  • DNS/DNSSEC problem
    • Signals: dig +trace shows failed delegation, resolver-specific failures, NXDOMAIN spikes, DNSSEC validation errors.
    • Immediate actions: verify SOA/NS records, assess recent DNS changes, rollback to prior configuration, communicate with registrar/hostmaster. Domain ownership and registrar interactions are covered in domain due diligence.
  • Internal platform failure
    • Signals: internal health metrics degrade, stateful services fail, application exceptions spike, persistent errors even when edge/route appears healthy.
    • Immediate actions: roll back recent deployments, scale critical services, restore from stable leader, shift traffic via load balancers to healthy pools.

Tooling checklist — what to have ready (2026 edition)

Modern outages require both classic tools and newer programmable observability.

  • Network & routing: access to BGP collectors (routeviews, RIPE RIS), ASNs of peers, BGP session monitoring, RPKI validators
  • DNS: zone control panel access, registrar credentials, secondary DNS providers, DNS query logs (e.g., from authoritative servers)
  • Traffic capture: tcpdump/tshark on edge and origin, sFlow/NetFlow exporters, packet store for forensic review
  • Edge/CDN: provider incident API keys, ability to change CNAMEs quickly, edge log streaming (edge WAF, edge cache logs)
  • Mitigation: scrubbing partners contact list, support SLAs, automation hooks (API keys) for on-call to enable mitigation
  • Visibility: eBPF-based host and edge telemetry (a 2026 best practice), RUM, and synthetic probes across multiple transit providers; build on the patterns in Hybrid Edge Workflows for probe placement and host-level signals
  • Security: WAF rule repository, rate-limit policies, IP reputation feeds, automated blocklists
  • Communication: incident bridge templates, status page editing rights, downstream partner contacts. For communication templates and safety considerations when large platforms go down, consult this platform-down playbook.

Playbook: primary mitigations per class

DDoS

  • Engage scrubbing: route traffic to scrubbing centers (BGP announcement changes or provider routing).
  • Throttle by edge: apply per-IP rate limits, challenge/ratelimit suspicious clients, and raise WAF protections.
  • Coordinate with upstreams: request RTBH (remote triggered blackholing) for confirmed malicious prefixes.
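As a stopgap while scrubbing is being engaged, a host-level per-source limit can shed some load; the threshold below is illustrative, and volumetric floods still need to be absorbed upstream at the CDN/scrubbing layer:

  #!/usr/bin/env bash
  # Stopgap per-source rate limit at the host firewall (run as root).
  set -euo pipefail

  # Drop new HTTPS connections from any single source exceeding ~50 SYNs/second.
  iptables -A INPUT -p tcp --dport 443 --syn \
    -m hashlimit --hashlimit-above 50/second --hashlimit-burst 100 \
    --hashlimit-mode srcip --hashlimit-name https-flood -j DROP

  # Review and remove the rule once scrubbing/WAF mitigations are in place:
  #   iptables -L INPUT -n --line-numbers
  #   iptables -D INPUT <rule-number>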

CDN failure

  • Fail over to secondary CDN or direct-to-origin via DNS TTL flipping (if you control DNS and clients can respect TTLs).
  • Bypass edge caching for critical endpoints and use origin-serving minimal responses to keep business-critical flows alive.
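Before flipping DNS, a quick sketch to confirm the record's current TTL (hostname is a placeholder), since cached answers bound how quickly the failover takes effect:

  #!/usr/bin/env bash
  # Inspect the live CNAME/A records and their remaining TTLs.
  HOST="www.example.com"
  dig +noall +answer CNAME "$HOST"
  dig +noall +answer A "$HOST"
  # The second column of each answer line is the TTL in seconds; resolvers
  # may keep serving the old answer for up to the zone's configured TTL.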

BGP / routing incidents

  • Use route collectors to confirm scope; contact peers and IX operators. Coordinate community tags for mitigation and filtering.
  • Temporarily advertise more-specific prefixes from diverse upstreams if you control multiple ASNs to regain reachability.

DNS

  • Reinstate prior zone file if a recent change caused the issue. Use secondary authoritative servers to restore resolution.
  • If registrar issues or DNSSEC failures occur, open the registrar ticket immediately and communicate to customers about expected TTL-based recovery. Domain ownership checks and registrar contacts are part of a good escalation — see domain due diligence.
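A small sketch for verifying that all authoritative servers agree on the zone serial after a rollback (example.com is a placeholder):

  #!/usr/bin/env bash
  # Compare SOA serials across all authoritative nameservers for the zone.
  ZONE="example.com"

  for ns in $(dig +short NS "$ZONE"); do
    serial=$(dig @"$ns" +short SOA "$ZONE" | awk '{print $3}')
    echo "$ns serial=$serial"
  done
  # Mismatched serials mean secondaries have not yet transferred the restored
  # zone; resolvers will keep serving stale answers until TTLs expire.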

Evidence collection & post-incident classification

Classification must be backed by immutable evidence. Forensics should include:

  • Timestamped NetFlow/pcap, CDN edge logs, WAF logs, DNS query logs
  • BGP snapshots showing withdraws/hijacks, route collector exports (RIBs)
  • Configuration snapshots and deployment audit trails
  • Internal telemetry and alert timelines from Prometheus/Datadog/NewRelic

Store preserved data in a secure, tamper-evident location and summarise findings in a reproducible incident report. This supports post-incident claims, vendor accountability, and SLA calculations. For storage and long-term retention considerations, teams should align with technical and procurement teams and review guidance such as a CTO’s guide to storage costs when choosing retention infrastructure.
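A minimal preservation sketch along these lines, with placeholder paths for the pcap and log exports:

  #!/usr/bin/env bash
  # Bundle incident evidence with a checksum manifest so later analysis can
  # demonstrate the data was not altered.
  INCIDENT="incident-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "/forensics/$INCIDENT"

  # Placeholder sources; copy your pcap, flow, and log exports here.
  cp /var/tmp/edge-*.pcap /var/log/nginx/access.log "/forensics/$INCIDENT/" 2>/dev/null || true

  # Checksum manifest plus a compressed, timestamped archive.
  ( cd "/forensics/$INCIDENT" && sha256sum * > SHA256SUMS )
  tar -czf "/forensics/$INCIDENT.tar.gz" -C /forensics "$INCIDENT"
  sha256sum "/forensics/$INCIDENT.tar.gz"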

Case study: applying the flowchart (January 2026 provider incident)

In January 2026, multiple consumer platforms reported outages coinciding with a Cloudflare incident that impacted many customers. Rapid classification following the above checks showed:

  • Edge headers indicating provider-originated 502/503 errors and synthetic checks failing only from external probes.
  • BGP and DNS remained stable for most impacted prefixes — ruling out a prefix hijack.
  • Traffic analysis showed no volumetric spike; instead, edge-level errors emerged from the provider. The conclusion: CDN/provider misconfiguration, not a DDoS.

Teams with pre-approved failover CNAMEs and secondary TLS certificates were able to restore partial service within minutes, while others experienced longer outages because they treated it as a DDoS and engaged scrubbing unnecessarily. Lesson: accurate classification reduces scope and recovery time.

Capabilities to build for 2026

  • Edge-first observability: eBPF and programmable telemetry at edge and origin to produce rich L7 signals in real time. See edge-first patterns for 2026 for architecture guidance.
  • Automated mitigation orchestration: tie detection to APIs for scrubbing, BGP community changes, and CDN rule toggles to remove manual friction. Implementation patterns are covered in hybrid edge workflow playbooks like Hybrid Edge Workflows.
  • RPKI and origin validation: accelerate RPKI adoption across upstreams to reduce risk of hijacks and route leaks. See edge-first patterns for practical steps.
  • Multi-provider resilience: active-active CDNs, multi-homed BGP with different transit providers, and automated traffic steering based on health scores.
  • AI-assisted anomaly detection: leverage generative and ML-driven systems to surface novel attack patterns but don’t rely on them for final classification — human verification remains essential. For automation patterns and model integration, teams reference resources like automation with modern AI tooling.

Communications & SLA considerations

Classification affects external messaging and legal obligations. If a provider outage is the root cause, coordinate statements with that provider. If a DDoS from third parties caused the impact, refer to your mitigation contracts and be ready to detail actions taken for SLA credits. Keep status pages factual and timestamped: include scope, classification, mitigation steps, and expected recovery. Transparency builds trust.

Fast, evidence-led classification saves time and reduces collateral damage. Treat every outage like a forensic exercise, not just an operational emergency.

Actionable checklist — first 15 minutes

  1. Assign incident commander and recorder.
  2. Run global synthetic probes and gather RUM headers. For probe automation and placement patterns see Hybrid Edge Workflows.
  3. Run dig +trace and openssl s_client checks from multiple resolvers/locations.
  4. Pull BGP RIB snapshots and check prefix visibility on route collectors.
  5. Collect edge logs, NetFlow, and WAF logs; snapshot recent deployments.
  6. Contact CDN/provider/ISP support channels if their telemetry indicates an incident. Use incident bridge templates and comms playbooks such as this platform-down playbook to coordinate statements.

Key takeaways

  • Start with high-signal checks: DNS/TLS, BGP visibility, and edge headers provide the fastest classification clues.
  • Collect immutable evidence: NetFlow/pcap, edge logs, and route collector snapshots are the backbone for post-incident analysis.
  • Coordinate mitigation with classification: scrubbing for DDoS, failover for CDN issues, BGP fixes for route problems, and rollback for internal failures.
  • Invest in 2026-ready capabilities: edge telemetry, automated mitigation APIs, and RPKI/BGP hygiene to reduce risk and mean time to recovery.

Next steps & call to action

Use this decision tree to build or refine your incident runbooks this quarter. Start with a tabletop exercise simulating each classification, then automate the first-level checks into your incident templating. If you want a ready-made checklist tailored to your topology (multi-CDN, multi-AS, hybrid cloud), request our incident classification template and tooling manifest — we’ll help you map commands, contacts and runbooks to your environment.

Get the checklist: email your SRE lead or visit our resources page to download the 2026 Incident Classification Kit with scripts, dashboard queries and post-incident report templates.
