Site Reliability Engineer - HCL Global Systems Inc.
Westlake, TX
About the Job
Skills:
Datadog
Kubernetes
AWS (EKS) and Azure (AKS); AWS preferred
On-call experience running incidents
Development background: Ansible, Python, Node.js, JavaScript, Jenkins, Groovy
The Expertise and Skills we're Looking For
- Bachelor's degree or higher in a technology-related field (e.g. Engineering, Computer Science, etc.) required
- 5-8+ years of hands-on experience deploying and/or supporting highly distributed multi-tiered systems at scale
- Hands-on experience with Public Cloud environments, preferably AWS and Azure. Certifications a plus
- Hands-on experience with container orchestration, preferably with Kubernetes
- Working experience with batch processing using tools such as Control-M and Informatica
- Ability to solve application issues on Unix/Linux with J2EE, WebSphere, Tomcat and SQL
- Exposure to basic OS-level scripting languages such as Korn shell, Bash, and JScript
- Familiarity with ITIL processes such as incident, change, and problem management
- Ability to balance delivery with ad hoc workloads and re-evaluate priorities
- Solid understanding of Cloud Computing and DevOps concepts including CI/CD pipelines
- Hands-on experience with one or more observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.)
- Use Datadog, Catchpoint, Splunk, and Grafana for application observability and for monitoring applications and infrastructure
- Experience with instrumentation, with systems skills in building and operating monitoring, logging, and alerting services for distributed systems at scale
- Proven experience in maintaining the scalability and resiliency of complex environments
- Proven experience in implementing advanced observability practices and techniques at scale
- Provide enterprise cloud and platform engineering support for production environments, and participate in an on-call rotation to provide solutions
- Experience in cloud development (AWS and Azure) and migration; experience building and operating highly resilient platforms in public cloud environments
- Ability to triage, complete root cause analysis, and be decisive under pressure
- Experience managing and interpreting large datasets using query languages and visualization tools
- Strong communication skills, with the ability to reach both technical and non-technical audiences
- Ability to learn new software, methods, and practices and bring them to our developers
- Ability to work with a variety of individuals and groups, both in person and virtually, in a constructive and collaborative manner, and to build and maintain effective relationships
- Proven experience performing chaos testing to build confidence in the system's capability to withstand turbulent conditions in production
- Strong understanding of API testing tools (SoapUI, Postman)
- Understanding of Agile Methodology
- Experience managing systems using infrastructure as code tools (IAM, ARM, Terraform, Chef)
- Manage a large fleet of on-premises servers (including security and patching oversight)
- Handle hundreds of SSL certificates for all applications in scope
- Use Ansible and Python to automate day-to-day activities; web development with Django and JavaScript
Source: HCL Global Systems Inc.