Site Reliability Engineering (SRE) - TechDigital Corporation
Orlando, FL
About the Job
Responsibilities:
Lead the design, implementation, and management of complex systems architecture that emphasizes reliability, scalability, and performance.
Collaborate closely with engineering teams to set and uphold service-level objectives (SLOs) and work on continuous improvements to achieve these goals.
Mentor and guide junior members of the SRE/PRE team, fostering their technical growth and professional development.
Solve intricate technical challenges across the entire technology stack, from hardware and infrastructure to applications and databases.
Develop and implement robust automation solutions for deployment, configuration management, and infrastructure provisioning.
Play a pivotal role in capacity planning, performance tuning, and optimizing systems for seamless scalability.
Drive the establishment of comprehensive monitoring, alerting, and logging strategies to ensure prompt identification and resolution of issues.
Participate in on-call rotations and respond promptly to incidents, taking ownership of resolution and post-incident analysis.
Continuously advance best practices and processes, promoting a culture of reliability and operational excellence.
Collaborate with stakeholders to ensure alignment between development and operations, contributing to product evolution and enhancements.

Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
7+ years of experience in an SRE, PRE, or similar role, with a proven track record in driving system reliability and performance.
Proficiency in programming languages such as Python, Go, or similar for automation and tool development.
Expertise in cloud platforms (e.g., AWS, GCP, Azure) and container technologies (e.g., Kubernetes, Docker).
Deep understanding of networking, operating systems, and distributed systems architecture.
Experience with infrastructure as code tools (e.g., Terraform, Ansible) for provisioning and configuration management.
Strong grasp of observability tools and practices (e.g., Prometheus, Grafana, ELK stack).
Exceptional troubleshooting skills and the ability to diagnose complex technical issues.
Outstanding communication skills to collaborate effectively with diverse teams.
Proactive mindset and a focus on delivering exceptional customer experiences.

Optional:
Relevant certifications such as Certified Kubernetes Administrator, AWS DevOps Professional, or similar.

(1.) To ensure customer engagement or satisfaction and referenceability.
(2.) To plan for Program and Delivery Management and ensure that the agreed deliverables in terms of margin are met.
(3.) To anchor process improvement or compliance (human error reporting) and other organizational initiatives (automation, Lean IT implementation).
(4.) To guide, manage, develop, and engage the team, thereby ensuring employee retention.
(5.) To ensure upskilling or creation of resources through internal academies or trainings and growth rotation.
Source: TechDigital Corporation