Senior Engineer, IT Disaster Recovery - Stride, Inc.
Atlanta, GA
About the Job
Job Description
As a Sr. IT Disaster Recovery Engineer, you will be responsible for ensuring the availability and resiliency of critical IT systems and services in the event of a disaster or other disruptive event. Your main duties will include designing and implementing disaster recovery plans, testing and validating recovery procedures, and leading recovery efforts to improve overall resiliency.
You will work closely with other IT professionals, including architects, system engineers, and security specialists, to identify critical systems and services, assess their risk of failure, and develop recovery strategies. You will also collaborate with business leaders to ensure that recovery plans align with business priorities and objectives.
To be successful in this role, you should have extensive experience in disaster recovery planning, including expertise in modern cloud strategies, backup and recovery solutions, replication strategies, and service containerization. You should also have strong project management skills and be able to lead and coordinate recovery efforts across multiple teams and stakeholders.
In addition, you should have excellent communication and interpersonal skills, as you will need to work closely with stakeholders at all levels of the organization to develop and implement recovery plans. You should also have a solid understanding of IT governance frameworks, such as NIST, ISO 27031 and ISO 22301, COBIT, and ITIL, and be able to ensure compliance with relevant regulations and standards.
Essential Functions: Reasonable accommodations may be made to enable individuals with disabilities to perform the essential duties.
- Lead IT risk assessments and tabletop exercises, and facilitate continuity exercises.
- Coordinate and monitor all disaster recovery testing exercises to ensure activities progress according to event plans, issues are logged, and status reports are provided to stakeholders.
- Support disaster recovery and preparedness efforts to mitigate, prepare for, respond to, and recover from significant events and incidents that impact Stride.
- Partner with leadership, application, and technical teams to identify and define gaps, validate requirements, and define solutions that meet or exceed expected recovery time objectives (RTO) and recovery point objectives (RPO).
- Work with leadership, application, and technical teams to document disaster recovery (DR) processes and procedures across Stride's technical environment, ensuring that policies, plans, procedures, and strategies provide an effective framework for restoring critical systems and data that meets or exceeds established business, client, and audit requirements.
- Support the creation and integration of Stride's Resiliency & Chaos Engineering strategy, processes, tools, and execution.
- Work with technical teams to ensure that disaster recovery solutions are adequate, in place, maintained, and tested as part of the regular operational lifecycle.
- Oversee disaster recovery plans, document and report preparedness status to management, and track agreed remediation items to closure.
- Provide expert guidance to, and coordinate the efforts of, relevant technology (infrastructure and application), business, and other function leaders in developing, documenting, and validating recovery procedures and plans.
Supervisory Responsibilities:
This position has no formal supervisory responsibilities.
Minimum Required Qualifications:
- 8+ years' experience supporting or performing a Business Continuity Management or IT Disaster Recovery role.
- Bachelor's degree and/or the equivalent combination of education and experience.
- Understanding of cloud infrastructure, databases, and application development and design.
- Independent, action-oriented, and engaged, with a focus on identifying ways to improve resiliency.
- Functional knowledge of frameworks such as NIST, ISO 27031 and ISO 22301, COBIT, and ITIL.
- Experience with site reliability engineering (SRE), disaster recovery testing (DiRT), and chaos engineering practices.
- Thorough knowledge and understanding of business continuity and disaster recovery planning techniques, technologies, and best practices, as well as the methods used to perform risk analyses and business impact analyses.
- Strong familiarity with AWS services relevant to DR/HA and resilient architectures, including AWS Config, AWS CloudFormation, Elastic Load Balancing, Auto Scaling, AWS Resilience Hub, and AWS Elastic Disaster Recovery.
- Experience working with enterprise risk management solutions (such as ServiceNow, Archer, or Resolver).
Desired Qualifications:
- Domain knowledge of chaos engineering/fault injection and disaster recovery best practices.
- Skilled on Comp
Source: Stride, Inc.