When a SCADA server goes down, you are not just losing a dashboard — you are blind. Alarms stop firing, historian data stops recording, and a compressor trip or tank overflow can run unnoticed for hours. High availability is the discipline of making sure that never happens. This guide explains how SCADA redundancy and failover actually work, how the major platforms approach them architecturally, and exactly what to demand in an uptime SLA before you sign.
High availability (HA) in SCADA means the monitoring and alarming system keeps operating through hardware failures, software crashes, network outages, and maintenance — with no gap an operator would notice. In practice, it is achieved through redundancy: a second copy of every critical component (server, historian, communication path) that takes over automatically when the primary fails. Disaster recovery (DR) is the related but distinct discipline of restoring the system after a larger event — a fire, a flood, a ransomware incident — typically from replicated data at a second location.
The stakes are asymmetric. A SCADA outage does not usually break anything by itself — the PLCs keep running their logic locally. What you lose is visibility and alarming. If a high-pressure alarm would have fired during the outage window, nobody gets the call. For pipelines, gas processing, water systems, and any operation with environmental or safety exposure, that window is the entire risk. This is why reliability-focused buyers evaluate HA before features: a feature you cannot see during an outage does not exist. Our guide to the best SCADA systems for mission-critical environments covers the broader selection criteria; this article goes deep on the redundancy layer specifically.
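Availability is usually expressed in "nines." The difference between them is bigger than it looks:
| Availability | Downtime per year | Downtime per 30-day month |
|---|---|---|
| 99% ("two nines") | ~3.65 days | ~7.2 hours |
| 99.9% ("three nines") | ~8.8 hours | ~43 minutes |
| 99.99% ("four nines") | ~53 minutes | ~4.3 minutes |
| 99.999% ("five nines") | ~5.3 minutes | ~26 seconds |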
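The arithmetic behind the nines is simple: the unavailable fraction times the length of the period. The sketch below makes that explicit. It assumes a 365-day year and a 30-day month. Real SLAs define their own measurement period, so treat the output as an approximation rather than a contractual figure.

```python
# Convert an availability percentage into permitted downtime.
# Assumption: a 365-day year and a 30-day month. Real SLAs define their own
# measurement period, so treat these figures as approximations.

HOURS_PER_YEAR = 365 * 24   # 8,760 hours
HOURS_PER_MONTH = 30 * 24   # 720 hours


def allowed_downtime_minutes(availability_pct: float, period_hours: float) -> float:
    """Minutes of downtime a period can contain while still meeting availability_pct."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_hours * 60.0


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        per_year = allowed_downtime_minutes(pct, HOURS_PER_YEAR)
        per_month = allowed_downtime_minutes(pct, HOURS_PER_MONTH)
        print(f"{pct}%: {per_year:,.1f} min/year, {per_month:,.1f} min/month")
```

Each added nine cuts the permitted downtime by a factor of ten. That is why going from 99.9% to 99.99% usually changes the architecture (automatic failover), not just the hardware budget.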
The three standby models differ in how fast the backup takes over and how much data you lose. Hot standby fails over in seconds with near-zero data loss; cold standby takes hours and loses everything since the last backup. This single architectural choice determines whether a server failure is a non-event or an incident report.
| Attribute | Hot Standby | Warm Standby | Cold Standby |
|---|---|---|---|
| Backup state | Running, fully synchronized in real time | Running, synchronized periodically | Powered off or bare hardware/image |
| Failover trigger | Automatic — heartbeat detection | Automatic or manual | Manual — someone drives to the server room |
| Typical recovery time | Seconds | Minutes | Hours to days |
| Data loss on failover | Near zero | Minutes of history | Everything since last backup |
| Alarm coverage gap | Effectively none | Short gap possible | Full gap until restore completes |
| Relative cost | Highest — duplicate licensed server | Moderate | Lowest upfront, highest in a real failure |
Hot standby runs a second, fully licensed SCADA server in parallel with the primary. The pair exchanges a heartbeat signal and continuously synchronizes runtime state — tag values, alarm states, operator sessions. When the primary stops responding, the standby promotes itself within seconds and clients reconnect automatically. Operators may see nothing more than a brief refresh. This is the model included in the Merobix Enterprise plan for on-premise deployments.
Warm standby keeps a second server running with periodic synchronization — configuration replicated nightly, historian shipped in batches. Failover is faster than a cold restore but you lose the data between synchronizations, and the alarm engine may need minutes to rebuild state. It is a reasonable compromise for operations that can tolerate a short blind window.
Cold standby is a backup server on a shelf, or a VM image and last night's backup. It is what most operations actually have, usually without admitting it. In a real failure the recovery involves finding the image, restoring it, re-licensing the software, re-pointing the field communications, and discovering which parts of the configuration were newer than the backup. Budget hours at best.
Automatic failover rests on three mechanisms working together: failure detection, state synchronization, and client redirection. Understanding them tells you which vendor claims are real and which are marketing.
The redundant pair exchanges a heartbeat — typically every one to five seconds over a dedicated link. Missed heartbeats past a threshold trigger promotion of the standby. The hard problem is split-brain: if the heartbeat link itself fails while both servers are healthy, both may believe they are primary, and both may poll the PLCs and log conflicting history. Mature implementations use a second arbitration path (a network witness or shared quorum) to prevent it. Ask every vendor how their redundancy handles split-brain; a blank look is diagnostic.
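To make the split-brain argument concrete, here is a minimal decision sketch for a pair plus a witness. It is not any vendor's implementation. The names `NodeView`, `HEARTBEAT_INTERVAL_S`, and `MISSED_BEATS_BEFORE_FAILOVER` are hypothetical, and the tuning values are illustrative. The rule is that the standby promotes only if the heartbeat is lost *and* the witness confirms the primary is gone. A primary that loses both the standby and the witness assumes it is the isolated node and stops polling.

```python
from dataclasses import dataclass

# Hypothetical tuning values; real platforms expose their own settings.
HEARTBEAT_INTERVAL_S = 2.0
MISSED_BEATS_BEFORE_FAILOVER = 3


@dataclass
class NodeView:
    """What one member of the redundant pair can observe right now."""
    seconds_since_peer_heartbeat: float  # measured on the dedicated redundancy link
    witness_reachable: bool              # can this node reach the witness?
    witness_sees_peer: bool              # does the witness still hear from the peer?


def heartbeat_lost(view: NodeView) -> bool:
    return view.seconds_since_peer_heartbeat > HEARTBEAT_INTERVAL_S * MISSED_BEATS_BEFORE_FAILOVER


def standby_should_promote(view: NodeView) -> bool:
    """Promote only when the heartbeat is lost AND the witness confirms the primary is gone."""
    if not heartbeat_lost(view):
        return False
    if not view.witness_reachable:
        # The standby may be the isolated node; it cannot prove the primary is down.
        return False
    # If the witness still hears the primary, only the heartbeat link failed: stay passive.
    return not view.witness_sees_peer


def primary_should_demote(view: NodeView) -> bool:
    """A primary that loses both the standby and the witness stops polling field devices."""
    return heartbeat_lost(view) and not view.witness_reachable
```

When only the heartbeat link fails, both nodes still reach the witness. The witness still hears the primary, so nobody promotes and nobody demotes. A production design also has to handle timing: the isolated primary must stop polling before the standby's promotion completes, which real systems typically enforce with leases or fencing.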
Failover is only seamless if the standby already knows everything the primary knew: current tag values, alarm acknowledgment states, setpoint changes, and user sessions. Configuration synchronization (replicating projects and tag databases) is table stakes; runtime synchronization is what separates hot standby from warm. Alarm state matters most — if acknowledgments do not replicate, a failover can re-annunciate hundreds of already-handled alarms, burying operators at exactly the wrong moment.
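The difference between replicating alarm *acknowledgments* and replicating only configuration can be shown in a few lines. This sketch assumes a hypothetical update stream in which the primary stamps every alarm change with an increasing sequence number, and the standby applies the same stream. The tag names are made up. After promotion, the standby annunciates only alarms that are active and still unacknowledged.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AlarmState:
    active: bool
    acknowledged: bool
    seq: int  # sequence number assigned by the primary, used to order updates


@dataclass
class AlarmTable:
    """Runtime alarm state; the primary and the standby each hold one copy."""
    alarms: Dict[str, AlarmState] = field(default_factory=dict)
    last_seq: int = 0  # highest update applied; a standby could use it to request a resync

    def apply(self, alarm_id: str, active: bool, acknowledged: bool, seq: int) -> None:
        """Apply one replicated update, ignoring stale or duplicate messages."""
        current = self.alarms.get(alarm_id)
        if current is not None and seq <= current.seq:
            return
        self.alarms[alarm_id] = AlarmState(active, acknowledged, seq)
        self.last_seq = max(self.last_seq, seq)

    def to_annunciate(self) -> List[str]:
        """Alarms a newly promoted server should present: active and not yet acknowledged."""
        return sorted(
            alarm_id
            for alarm_id, state in self.alarms.items()
            if state.active and not state.acknowledged
        )
```

Without the acknowledgment field, a promoted standby would have to treat every active alarm as new. That is exactly the flood of re-annunciated alarms the paragraph above warns about.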
Operator clients and field communications must find the new primary without manual intervention. Web-based clients handle this most gracefully — the browser reconnects to a service address that now routes to the standby. Thick-client architectures need failover lists configured on every workstation. On the field side, the standby must take over polling of PLCs and RTUs without device reconfiguration, which is why redundant SCADA pairs share a virtual address or the drivers themselves manage the switch. Our SCADA server guide covers the underlying server architecture in more depth.
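For thick clients, the "failover list configured on every workstation" boils down to a loop like the one below. This is a hypothetical sketch, not a vendor client API. It assumes each server can be probed for its current role (`"primary"`, `"standby"`, or unreachable). The client walks the list, skips the passive member, and retries briefly, because there may be no primary for a few seconds during promotion.

```python
import time
from typing import Callable, Optional, Sequence

# probe(server) returns "primary", "standby", or None when the server is unreachable.
Probe = Callable[[str], Optional[str]]


def pick_server(servers: Sequence[str], probe: Probe) -> Optional[str]:
    """Return the first server in the failover list that reports itself as primary."""
    for server in servers:
        if probe(server) == "primary":
            return server
    return None  # no primary right now (for example, mid-promotion)


def connect_with_retry(
    servers: Sequence[str],
    probe: Probe,
    attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Walk the failover list repeatedly until a primary answers or attempts run out."""
    for attempt in range(attempts):
        chosen = pick_server(servers, probe)
        if chosen is not None:
            return chosen
        if attempt < attempts - 1:
            sleep(delay_s)  # give the standby time to finish promoting
    return None
```

A web client with a single service address avoids this logic entirely. The routing layer does the equivalent of `pick_server` once, for every browser.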
Server failover protects live monitoring; historian replication protects the record. If your historian runs only on the primary server, every failover — even a clean one — leaves a hole in the trend data, and holes in trend data become holes in regulatory reports and production accounting.
Three patterns exist. Dual-write historians record on both members of the redundant pair simultaneously, then reconcile — no gap, at the cost of duplicate storage. Store-and-forward buffering at the data source (the gateway or driver layer) holds data during any server outage and backfills when the historian returns — this also covers network outages, which are far more common than server failures. Replication to a second site copies the historian to a geographically separate location for disaster recovery. Large multi-site operators often add historian federation on top — querying several site historians as one logical database — which Merobix supports on the Enterprise plan.
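Of the three patterns, store-and-forward is the easiest to picture in code. Below is a minimal gateway-side sketch. `StoreAndForward` and the `(timestamp, tag, value)` record layout are hypothetical. It buffers samples whenever the historian rejects a send, then backfills oldest-first once sends succeed again. It makes one assumption worth questioning: when the buffer is full, it discards the *oldest* samples and counts them.

```python
from collections import deque
from itertools import islice
from typing import Callable, Deque, List, Tuple

Sample = Tuple[float, str, float]  # (timestamp, tag, value); hypothetical record layout


class StoreAndForward:
    """Gateway-side buffer that holds samples while the historian is unreachable."""

    def __init__(self, send: Callable[[List[Sample]], bool], capacity: int, batch_size: int = 500):
        self._send = send  # returns True only if the historian accepted the batch
        self._buffer: Deque[Sample] = deque(maxlen=capacity)
        self._batch_size = batch_size
        self.dropped = 0  # samples lost because the buffer overflowed

    def publish(self, sample: Sample) -> None:
        if len(self._buffer) == self._buffer.maxlen:
            self.dropped += 1  # deque(maxlen=...) discards the oldest sample on append
        self._buffer.append(sample)
        self.flush()

    def flush(self) -> None:
        """Send buffered samples oldest-first so backfill arrives in time order."""
        while self._buffer:
            batch = list(islice(self._buffer, self._batch_size))
            if not self._send(batch):
                return  # historian still down; keep everything and retry on the next call
            for _ in batch:
                self._buffer.popleft()
```

Because every new sample queues behind anything already buffered, the historian never receives live data ahead of older backfill. Whether to drop the oldest or the newest data on overflow is a site decision. Some operations would rather keep the beginning of an outage, when the triggering event happened.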
When evaluating platforms, ask one concrete question: if the historian is unreachable for four hours, what happens to those four hours of data? The right answer involves buffering at the edge and automatic backfill, not "the data is lost."
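Answering the four-hour question also means sizing the edge buffer. A rough estimate is tags × average samples per second × outage length × bytes per sample. The sketch below uses hypothetical inputs: 2,000 tags, an average of one change every five seconds per tag, and 32 bytes per stored sample. It pads the result for bursts. With those inputs, the estimate is 5.76 million samples, or about 369 MB with 2x headroom. Your real per-sample size and change rates depend on the platform and deadband settings.

```python
SECONDS_PER_HOUR = 3600


def samples_to_buffer(tag_count: int, samples_per_tag_per_s: float, outage_hours: float) -> float:
    """Samples the gateway must hold to ride through an outage of the given length."""
    return tag_count * samples_per_tag_per_s * outage_hours * SECONDS_PER_HOUR


def buffer_bytes(samples: float, bytes_per_sample: int, safety_factor: float = 2.0) -> float:
    """Storage needed for those samples, padded by safety_factor for bursts."""
    return samples * bytes_per_sample * safety_factor


if __name__ == "__main__":
    # Hypothetical site: 2,000 tags, one change every 5 s per tag on average,
    # 32 bytes per stored sample, sized for a four-hour historian outage.
    n = samples_to_buffer(tag_count=2000, samples_per_tag_per_s=0.2, outage_hours=4)
    print(f"{n:,.0f} samples, about {buffer_bytes(n, bytes_per_sample=32) / 1e6:,.0f} MB with 2x headroom")
```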
Merobix, Ignition, AVEVA System Platform, Rockwell FactoryTalk View SE, and Siemens WinCC all offer genuine redundancy — the differences are in how it is delivered, who maintains it, and whether it is included in the price or licensed separately. Here is the honest architectural comparison:
| Platform | Redundancy Model | Backup Licensing | Uptime SLA | Who Maintains It |
|---|---|---|---|---|
| Merobix Cloud | Redundant hosted infrastructure, managed by vendor | Included — flat plan | 99.9% contractual | Merobix |
| Merobix Enterprise On-Prem | Hot standby server pair, air-gap compatible | Included in Enterprise plan | Architecture-dependent (your infrastructure) | Your team, with Merobix support |
| Ignition | Master/backup gateway pair with automatic failover | Backup gateway licensed separately (publicly listed pricing) | None — self-hosted | Your team / integrator |
| AVEVA System Platform | Redundant application engines, object-level failover, tiered historians | Separately licensed components | None for on-prem; cloud offerings vary | Your team + integrator |
| FactoryTalk View SE | Redundant HMI servers and data servers | Separately licensed | None — self-hosted | Your team + integrator |
| Siemens WinCC | Redundant server pair with archive synchronization | Redundancy option licensed separately | None — self-hosted | Your team + integrator |
Ignition has the most accessible redundancy story among the traditional platforms: a master/backup gateway pair with automatic failover that is well documented and widely deployed, with the backup license carrying publicly listed pricing. If you have an in-house team comfortable running servers, Ignition redundancy is straightforward to stand up and genuinely reliable. The trade-off is that it is still your infrastructure: your OS patching, your certificates, your split-brain testing, and no vendor uptime SLA. See our Merobix vs Ignition comparison for the full head-to-head.
AVEVA System Platform offers arguably the deepest redundancy architecture in the industry — failover at the individual application-object level, redundant data acquisition, and tiered historian replication. It is the reference design for refinery and power-plant control rooms, and for that class of facility it has earned its reputation. The cost is complexity: these deployments are integrator-led, multi-month projects with commensurate budgets.
FactoryTalk View SE and Siemens WinCC both provide solid redundant-server options that integrate tightly with their respective PLC ecosystems. They make the most sense where the plant is already standardized on Rockwell or Siemens hardware and on-site IT support exists at each facility.
Merobix approaches the problem from the opposite direction: for cloud deployments, redundancy is not something you buy, configure, or test — it is built into the hosted platform and backed by a contractual 99.9% uptime SLA, with gateway-level store-and-forward buffering protecting data through network outages. For operations that require on-premise deployment — air-gapped networks, strict data residency — the Enterprise plan includes hot standby redundancy on your servers or VMs rather than licensing it as an add-on. Where the traditional platforms are stronger: if you need object-level failover granularity across a refinery-scale control room, AVEVA remains the established choice. What Merobix eliminates is the scenario where redundancy was quoted, deprioritized to save budget, and quietly dropped — the plan either includes it or the SLA covers it. The full feature matrix is on the plans page.
The choice is really about who carries the operational burden of staying up. With a cloud SLA, the vendor owns redundancy end to end — infrastructure, failover testing, patching, monitoring — and is contractually accountable for the result. With on-premise redundancy, you own all of it and gain something the cloud cannot give you: complete control, air-gap compatibility, and full data residency on your own hardware.
Choose cloud with an SLA when your sites are distributed, your connectivity is cellular, and you do not have (or do not want to fund) a team to babysit redundant server pairs. A 99.9% SLA from a vendor whose business depends on meeting it will beat the real-world uptime of most self-maintained single servers — and of a surprising number of self-maintained redundant pairs whose failover was last tested at commissioning. Cloud deployments also go live in days rather than months; Merobix cloud deployments are typically live in 3–5 days.
Choose on-premise redundancy when policy or physics demands it: air-gapped control networks, contractual data-residency requirements, or facilities where monitoring must survive a total WAN outage. In that case, buy hot standby, put the pair on independent power and network paths, and put failover testing on the maintenance calendar — quarterly, unannounced, during business hours. Redundancy that is never tested is cold standby with better marketing. The full architectural trade-off is covered in our cloud vs on-premise SCADA comparison, and Merobix supports both models — same platform, same team.
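An SLA is only as good as its specifics. Before signing with any vendor (Merobix included), get these items in writing: a specific uptime figure (99.9% or better), the maintenance windows that do not count against it, the remedies you receive if the target is missed, and the architecture behind the number (automatic failover with no operator intervention, historian replication, and alarm delivery that keeps working during failover).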
The failover test: Whatever platform you evaluate, run one test before you buy — have the vendor (or your integrator) kill the primary server while you watch the operator screen and hold a live alarm condition. Time the failover, check whether the alarm still delivered, and check the historian afterward for a gap. Five minutes of testing tells you more than fifty pages of architecture documentation. Merobix will run this scenario in a guided demo, and you can quantify what an outage-free year is worth with the ROI calculator.
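Checking the historian for a gap after the test can be scripted rather than eyeballed on a trend. The sketch below assumes you have exported the timestamps of one tag that is logged at a fixed rate, such as a counter or a deliberately changing test tag. Report-by-exception tags can legitimately go quiet when the value does not change. It flags any spacing between consecutive samples longer than the expected interval times a tolerance. `find_gaps` is a hypothetical helper, not a historian API.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def find_gaps(
    timestamps: Iterable[datetime],
    expected_interval: timedelta,
    tolerance: float = 2.0,
) -> List[Tuple[datetime, datetime]]:
    """Return (last sample before gap, first sample after gap) for every hole in the record.

    A gap is any spacing between consecutive samples longer than
    expected_interval * tolerance.
    """
    limit = expected_interval * tolerance
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]
```

An empty result over the failover window is the outcome you are looking for. Any returned pair tells you exactly how long the record was blind.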
Platforms that treat high availability and disaster recovery as core features rather than add-ons include Merobix (99.9% uptime SLA on cloud plans, hot standby redundancy included in the Enterprise on-premise plan), Ignition (master/backup gateway redundancy), AVEVA System Platform (redundant application engines with object-level failover), and Rockwell FactoryTalk View SE (redundant HMI and data servers). The key distinction is whether redundancy is built into the plan you buy or licensed separately — on several traditional platforms the backup server requires its own paid license, so confirm what your quote actually includes before signing.
Every major SCADA vendor provides failover and redundancy in some form: Merobix, Inductive Automation (Ignition), AVEVA, Rockwell Automation (FactoryTalk), Siemens (WinCC), and GE (iFIX) all offer redundant configurations. They differ in how redundancy is delivered and priced. Cloud platforms like Merobix build redundancy into the hosted service and back it with a 99.9% uptime SLA, so there is nothing for you to configure. On-premise platforms require you to buy, build, and maintain the redundant pair yourself — Merobix Enterprise on-premise includes hot standby redundancy in the plan, while several traditional vendors license the backup server separately.
Hot standby means a second, fully synchronized SCADA server runs in parallel with the primary and takes over automatically within seconds of a failure — no operator action, minimal data loss. Cold standby means backup hardware or a restorable image exists but must be manually started and loaded with a recent backup, which typically takes hours and loses all data since the last backup. Warm standby sits between the two: the backup runs and receives periodic synchronization but can lose several minutes of data on failover. For continuous operations — pipelines, gas plants, water systems — hot standby is the only configuration that eliminates the monitoring gap.
For distributed operations that want high availability without building server infrastructure, Merobix is the strongest choice in 2026 — the cloud platform carries a 99.9% uptime SLA with redundancy managed by Merobix, and the Enterprise on-premise plan includes hot standby redundancy for air-gapped or data-residency deployments. For large single plants with in-house SCADA engineering teams, Ignition offers well-documented gateway redundancy, and AVEVA System Platform provides the most battle-tested redundancy architecture for refinery-scale control rooms. The right answer depends on whether you want to operate the redundant infrastructure yourself or have the vendor guarantee uptime contractually.
Start with the number that matters: the contractual uptime commitment. Demand a written SLA with a specific figure (99.9% or better), defined maintenance windows, and remedies for missed targets. Then verify the architecture behind the number — automatic failover with no operator intervention, historian replication so no data is lost during a failure, and alarm delivery that keeps working during failover. Finally, test it: during your pilot or demo, ask the vendor to kill the primary server while you watch. A platform built for continuous uptime survives that test without a blank screen. Merobix offers guided demos and pilots where you can run exactly that scenario.
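Defined maintenance windows matter because they change how the contractual percentage is calculated. The sketch below shows one common convention, assumed here rather than taken from any particular contract: agreed maintenance time is removed from both the measured period and the downtime. It also assumes outages do not overlap each other and maintenance windows do not overlap each other. `measured_availability` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]  # (start, end)


def overlap(a: Interval, b: Interval) -> timedelta:
    """Length of time two intervals share (zero if they do not touch)."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return max(end - start, timedelta(0))


def measured_availability(period: Interval, outages: List[Interval], maintenance: List[Interval]) -> float:
    """Availability in percent, excluding agreed maintenance windows from both sides.

    Assumes outages do not overlap each other and maintenance windows do not
    overlap each other; contracts define exclusions differently, so adapt this.
    """
    excluded = sum((overlap(period, m) for m in maintenance), timedelta(0))
    counted_time = (period[1] - period[0]) - excluded
    downtime = timedelta(0)
    for outage in outages:
        clipped = (max(outage[0], period[0]), min(outage[1], period[1]))
        if clipped[1] <= clipped[0]:
            continue  # outage falls outside the reporting period
        forgiven = sum((overlap(clipped, m) for m in maintenance), timedelta(0))
        downtime += (clipped[1] - clipped[0]) - forgiven
    return 100.0 * (1.0 - downtime / counted_time)
```

Consider a 30-day period with one two-hour maintenance window and 1.5 hours of unplanned downtime outside it. The function returns about 99.79%, which falls short of a 99.9% commitment. Under the contract's actual definitions, that shortfall is where the remedies clause applies.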
99.9% uptime SLA in the cloud, hot standby on-premise — flat, custom-quoted plans with no per-tag or per-client fees.