Sr. SRE - Herbalife
Torrance, CA 90502
About the Job
THE ROLE:
The SRE Principal Engineer III will work a hybrid schedule, with a requirement to be onsite at our Torrance office as needed. This role is responsible for leading, designing, and implementing robust Site Reliability Engineering (SRE) practices to ensure high availability, scalability, and resilience of critical business systems and applications. The SRE Principal Engineer III will focus on improving system reliability through automation, monitoring, and performance tuning, working closely with development and operations teams to nurture a culture of continuous improvement and operational excellence.
The SRE team consists of:
● SRE Engineers
● Deployment Automation
● Incident Response and Postmortem Analysis
● Observability and Monitoring
This role will drive the adoption of standard methodologies in multi-cloud and hybrid-cloud platforms, using services from major cloud providers such as Microsoft Azure, Amazon Web Services (AWS), Oracle Cloud Infrastructure (OCI), Google Cloud Platform (GCP), and Alibaba Cloud. The SRE Principal Engineer III will focus on automation, incident management, performance monitoring, and optimizing infrastructure to support scalable, reliable systems. The position will also be responsible for fostering teamwork between development, operations, and security teams to streamline system operations across the organization.
DETAILED RESPONSIBILITIES/DUTIES:
● Lead the implementation and optimization of SRE practices, ensuring system reliability, performance, and scalability.
● Architect and maintain automation for infrastructure provisioning, deployment, and incident response.
● Establish and enforce SLOs (Service Level Objectives) and SLIs (Service Level Indicators) for key services.
● Collaborate with development teams to design and deliver reliable software systems, ensuring that production environments are optimized for uptime and performance.
● Create and maintain monitoring, alerting, and observability solutions to provide real-time insights into system health and performance.
● Respond to production incidents, perform root cause analysis, and implement corrective measures to prevent recurrence.
● Continuously improve system performance, capacity planning, and reliability through infrastructure tuning and automation.
● Facilitate post-incident reviews, fostering a blameless culture that focuses on learning from incidents.
● Collaborate with security teams to ensure infrastructure meets compliance requirements, security standards, and best practices.
● Foster a collaborative environment across development, operations, and security teams to enhance operational efficiency and knowledge sharing.
● Drive the adoption of automation tools and frameworks to minimize manual intervention and optimize systems.
SKILLS AND BACKGROUND REQUIRED TO BE SUCCESSFUL:
● Proven expertise in SRE practices, with a focus on automation, incident management, observability, and infrastructure scalability.
● Extensive knowledge of cloud platforms (Azure, AWS, GCP, Alibaba) and hybrid-cloud environments, with a focus on reliability and performance optimization.
● Experience with automation tools and scripting languages, such as Python, Go, Terraform, or Ansible, for managing infrastructure and incident response.
● Strong understanding of containerization (Docker, Kubernetes) and orchestration systems.
● Solid grasp of monitoring and observability tools (Prometheus, Grafana, Dynatrace, Splunk) to ensure real-time system health monitoring.
● Proficiency in maintaining system functionality, optimizing performance, and effectively addressing technical challenges.
● Strong background in incident management, root cause analysis, and postmortem processes to improve system resilience.
● Deep understanding of security and compliance requirements, and the ability to ensure production environments meet industry standards.
● Experience with Agile and DevOps methodologies to ensure fast, reliable delivery of services.
Experience:
● 10+ years of experience in IT, with a focus on SRE, DevOps, or infrastructure engineering roles.
● Extensive hands-on experience with cloud infrastructure management and automation tools such as Terraform, CloudFormation, or equivalent.
● Proficiency in scripting and automation languages like Python, Bash, Go, or Ruby for infrastructure automation.
● Proven experience leading the operation of large-scale systems, ensuring reliability, high availability, and scalability.
● Expertise in container orchestration technologies, including Kubernetes, OpenShift, and Docker Swarm.
● Deep knowledge of monitoring and observability platforms (Prometheus, Grafana, ELK, Dynatrace), including experience building and maintaining alerting and dashboard systems.
● Strong understanding of version control systems and CI/CD practices to optimize code deployment as it relates to infrastructure.
● Demonstrated ability to optimize performance in multi-cloud and hybrid-cloud environments, ensuring uptime and performance at scale.
Certificates / Training Preferred:
● Relevant cloud certifications such as AWS Certified Solutions Architect, Azure Solutions Architect Expert, or Google Cloud Professional Cloud Architect.
● SRE-related certifications like Certified Kubernetes Administrator (CKA) or Google Professional Cloud DevOps Engineer.
Education:
Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent experience.