Site Reliability Engineer - HCL Global Systems Inc.

Westlake, TX

About the Job

Skills:
Datadog
Kubernetes
AWS (EKS) and Azure (AKS); AWS preferred
On-call experience running incidents
Development background: Ansible, Python, Node.js, JavaScript, Jenkins, Groovy

The Expertise and Skills We're Looking For

Bachelor's degree or higher in a technology-related field (e.g. Engineering, Computer Science, etc.) required
5-8+ years of hands-on experience deploying and/or supporting highly distributed multi-tiered systems at scale
Hands-on experience with Public Cloud environments, preferably AWS and Azure. Certifications a plus
Hands-on experience with container orchestration, preferably with Kubernetes
Working experience with batch processing using tools like Control-M, Informatica, etc.
Ability to solve application issues on Unix/Linux with J2EE, WebSphere, Tomcat and SQL
Exposure to basic OS-level scripting languages such as Korn shell, Bash, or JScript
Familiarity with ITIL processes like Incident management, Change/Problem management
Balancing delivery with ad hoc workloads and re-evaluating priorities
Solid understanding of Cloud Computing and DevOps concepts including CI/CD pipelines
Hands-on experience with one or more observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.)
Use Datadog, Catchpoint, Splunk, and Grafana for application observability and for monitoring of applications and infrastructure
Experience with instrumentation, plus systems skills in building and operating monitoring, logging, and alerting services for distributed systems at scale
Proven experience in maintaining the scalability and resiliency of complex environments
Proven experience in implementing advanced observability practices and techniques at scale
Provide enterprise cloud and platform engineering support for production environments, and participate in an on-call rotation to provide solutions
Experience in cloud development and migration (AWS and Azure); experience building and operating highly resilient platforms in public cloud environments
Ability to triage, complete root cause analysis, and be decisive under pressure
Experience managing and interpreting large datasets using query languages and visualization tools
Strong communication skills, with the ability to reach both technical and non-technical audiences
Ability to learn new software, methods, and practices and bring them to our developers
Ability to work with a variety of individuals and groups, both in person and virtually, in a constructive and collaborative manner and build and maintain effective relationships
Proven experience performing chaos testing to build confidence in the system's capability to withstand turbulent conditions in production
Strong understanding of API testing tools (SoapUI, Postman)
Understanding of Agile Methodology
Experience managing systems using infrastructure as code tools (IAM, ARM, Terraform, Chef)
Manage a large fleet of on-prem servers (including security and patching oversight)
Manage hundreds of SSL certificates for all applications in scope
Use Ansible and Python to automate day-to-day activities; web development with Django and JavaScript

Source: HCL Global Systems Inc.
