Site Reliability Engineer - Edvenswa Tech
Boston, MA
About the Job
Role: Site Reliability Engineer
Location: Boston, MA
Duration: Long-term contract
As a Site Reliability Engineer, you will be responsible for conducting Root Cause Analysis meetings, fostering a
blame-free environment to ensure comprehensive information about events and their resolutions is gathered
effectively. This role requires the ability to navigate complex technical issues while promoting open and transparent
discussions among team members. You will utilize trends and metrics to identify improvement opportunities within
existing frameworks, tools, and processes to improve systems continuously.
Responsibilities:
- Be part of the SRE team focused on Root Cause Analysis of critical production outages to improve resiliency.
- Lead problem tickets and improvements to major software components, systems, and features to improve the availability, scalability, latency, and efficiency of client systems.
- Engage in and improve the service lifecycle from inception and design to deployment, operation, and refinement based on lessons learned through deep dives.
- Perform hands-on troubleshooting of VMware, Kubernetes, and system software functionality, performance, and configuration issues.
- Be a trusted technical advisor who leads complex root cause analysis investigations from beginning to end until improvement implementation.
- Demonstrate sound knowledge of gathering logs and facilitating the root cause analysis with cross-functional teams.
- Assist internal teams with corrective actions and improvement tickets, and influence their completion goals.
- Flexibility to work occasional out-of-hours shifts, including weekends, may be required depending on criticality and workload demands.
Qualifications:
- Bachelor's degree in software engineering, information systems, computer science, or a related field.
- 10+ years of experience working on ITSM tools such as Jira, ServiceNow, etc.
- 8+ years of infrastructure engineering experience, with a record demonstrating hands-on troubleshooting in large-scale solutions, on-prem distributed systems, and custom-developed software applications.
- 8+ years of experience in operating production systems, including troubleshooting, testing, and automation.
- 5+ years of experience leading technical Root Cause Analysis (a software focus is a plus).
- Team player with excellent communication skills and the ability to prioritize multiple tasks.
- Experience with executive communication, report writing, and presentation skills to non-technical audiences.
- Strong technical background in container technologies such as Kubernetes, attention to detail, and excellent problem-solving abilities.
- Experience in the advanced use of tools like Prometheus, Grafana, LogicMonitor, Elastic, and Power BI is a plus.
Source: Edvenswa Tech