kanj technologies

Enterprise Disaster Recovery in minutes, not hours

How we strengthened resilience across a major telecommunications provider's Security Operations Centre and wider organisation, ensuring swift recovery and uninterrupted security operations.

PROJECT

Disaster Recovery & Resilience

BIG WINS

Enterprise-grade VMware SRM replicating 27 critical servers.
Met recovery targets: RPO of 5 minutes and RTO of 15 minutes.
Failover capability proven in under 2 minutes.
Business continuity confidence built through regular DR testing.

The same Fortune 500 telecommunications provider mentioned in the previous case study needed to ensure its UK Security Operations Centre (SOC) could withstand a major outage without disrupting monitoring and incident response across more than 200 sites. The SOC aggregated alarms and logs from critical systems such as access control and security infrastructure, where prolonged downtime would present an unacceptable operational and reputational risk.

Kanj Technologies was engaged to design and implement a disaster recovery platform that would deliver rapid, predictable failover between a primary SOC and a secondary site, meeting stringent recovery objectives while supporting 24/7 operations.

The challenge

The SOC was responsible for monitoring a complex estate spanning satellite offices, communications depots, corporate headquarters and customer-facing locations. Each site depended on the SOC’s ability to ingest and analyse alarms and logs in near real time.

An outage at the primary SOC – whether due to infrastructure failure, local incident or wider disruption – would immediately degrade the organisation’s ability to detect and respond to security events. In addition to the operational risk, the organisation faced mounting expectations from insurers, auditors and regulators to evidence a robust and tested disaster recovery capability.

The brief set clear technical and operational requirements. The solution needed to protect twenty-seven mission-critical servers, provide a recovery point objective of around five minutes and support a recovery time objective measured in minutes rather than hours. It also had to support the controlled relocation of SOC staff to a secondary site approximately thirty minutes away, enabling them to continue operations with minimal interruption in the event of a major incident.

The solution

Kanj Technologies designed and implemented an enterprise-grade disaster recovery platform based on VMware Site Recovery Manager (SRM), integrating storage replication, network design and operational runbooks into a single coherent solution.

At the infrastructure layer, twenty-seven critical SOC servers were replicated from the primary to the secondary site using near real-time synchronisation. This ensured that log aggregation, monitoring tools and core SOC applications were continuously protected, with changes committed to the recovery environment within minutes.

Connectivity between the two locations was provided by a dedicated, encrypted 10 Gbps link. By isolating replication traffic from general business usage, Kanj Technologies ensured consistent performance and removed a common source of contention and instability in disaster recovery designs.
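As a rough illustration of why a dedicated link matters, the sketch below estimates how much changed data a 10 Gbps link could carry within one five-minute recovery point window. The link speed and window come from the figures above; the 70% usable-throughput figure is an assumption for illustration only, not a measured value from this project. At full line rate the ceiling works out to 375 GB per window.

```python
# Back-of-envelope estimate only; real replication throughput depends on
# encryption overhead, protocol efficiency and storage performance.

LINK_GBPS = 10            # link speed stated in the case study
RPO_SECONDS = 5 * 60      # five-minute recovery point objective


def max_change_gb(link_gbps: float, window_seconds: float,
                  efficiency: float = 1.0) -> float:
    """Upper bound on data (decimal GB) the link can carry in one window.

    `efficiency` is a hypothetical usable fraction of line rate.
    """
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second * window_seconds * efficiency / 1e9


if __name__ == "__main__":
    print("Theoretical ceiling (GB):", max_change_gb(LINK_GBPS, RPO_SECONDS))
    # Assumed 70% usable throughput, purely for illustration.
    print("At 70% efficiency (GB):", max_change_gb(LINK_GBPS, RPO_SECONDS, 0.7))
```

If the estate's change rate per window sits well below this ceiling, replication can keep pace with the recovery point objective; sharing the link with general business traffic would erode that margin unpredictably.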

VMware SRM was then used to orchestrate failover and failback. Kanj Technologies developed runbooks that defined application start-up order, network mappings and IP address changes, allowing the entire recovery process to be executed in a controlled, repeatable manner. This removed reliance on manual procedures and helped guarantee that recovery times remained within the defined objectives, even under pressure.
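The idea of a defined application start-up order can be sketched as a dependency graph resolved into a sequence. The example below is a minimal illustration using Python's standard library. It is not SRM's own configuration format, and the server names and dependencies are hypothetical, not taken from the project.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical servers, each mapped to the servers it depends on.
# These names are illustrative only.
DEPENDS_ON = {
    "database": set(),
    "log-collector": {"database"},
    "siem": {"database", "log-collector"},
    "soc-console": {"siem"},
}


def startup_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which every server starts after its dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, server in enumerate(startup_order(DEPENDS_ON), start=1):
        print(f"{step}. start {server}")
```

Encoding the order as data rather than tribal knowledge is what makes the recovery repeatable: the same sequence is produced every time, and a circular dependency is caught before a failover rather than during one.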

The technical design was complemented by a structured programme of testing. Bi-annual disaster recovery exercises were conducted in which servers at the primary site were intentionally powered down, recovery was initiated to the secondary site and application servers were brought online to verify functionality. These tests also incorporated staff relocation to the secondary SOC, validating not only the technology but the processes, communication flows and decision-making required during a real incident.
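One way to turn each exercise into auditable evidence is to record a few timestamps and compare them against the recovery objectives. The sketch below assumes the stated targets of a 5-minute RPO and a 15-minute RTO. The record structure, field names and example timestamps are hypothetical and do not describe the tooling used on this project.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RPO = timedelta(minutes=5)    # recovery point objective from the case study
RTO = timedelta(minutes=15)   # recovery time objective from the case study


@dataclass
class DrTestResult:
    """Hypothetical record of one disaster recovery exercise."""
    last_replicated_at: datetime   # last change known to be at the secondary site
    outage_started_at: datetime    # primary servers powered down
    service_restored_at: datetime  # applications verified at the secondary site

    @property
    def data_loss_window(self) -> timedelta:
        return self.outage_started_at - self.last_replicated_at

    @property
    def recovery_time(self) -> timedelta:
        return self.service_restored_at - self.outage_started_at

    def meets_objectives(self) -> bool:
        return self.data_loss_window <= RPO and self.recovery_time <= RTO


if __name__ == "__main__":
    # Illustrative timestamps only, not results from a real exercise.
    result = DrTestResult(
        last_replicated_at=datetime(2024, 1, 1, 9, 58),
        outage_started_at=datetime(2024, 1, 1, 10, 0),
        service_restored_at=datetime(2024, 1, 1, 10, 2),
    )
    print("Data loss window:", result.data_loss_window)
    print("Recovery time:", result.recovery_time)
    print("Objectives met:", result.meets_objectives())
```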

The results

The new disaster recovery platform delivered a material uplift in resilience and assurance for the SOC and the wider organisation. Failover tests consistently brought application servers online at the secondary site in under two minutes, comfortably within the agreed recovery time objective.

Incident response is now underpinned by a predictable, well-rehearsed process. This has translated into lower operational risk, reduced potential financial loss and greater confidence among senior stakeholders.

The regular, documented disaster recovery exercises provide clear evidence for boards, auditors and insurers that the SOC can withstand a major incident and continue to operate. At the same time, SOC teams have gained familiarity and confidence in the recovery process, reducing ambiguity when tests – or real-world events – occur.

Overall, the organisation now benefits from a disaster recovery capability that matches the criticality of its security operations: rapid, reliable and demonstrably effective when required.
