SME - OpenTelemetry - Datum Software, Inc
REMOTE, VA 20170
About the Job
SME - OpenTelemetry
Type: 100% Remote
Time Zone: EST
Overview:
We are seeking an experienced Monitoring Tools and OpenTelemetry Subject Matter Expert (SME) to design, implement, and optimize monitoring solutions that enhance observability within the Enterprise Command Center (ECC). The SME will collaborate with the Incident Management team to troubleshoot and resolve incidents effectively.
Key Responsibilities:
- Lead the design and implementation of monitoring solutions using industry-standard tools such as Splunk.
- Customize monitoring configurations to meet organizational requirements.
- Implement and integrate OpenTelemetry across various applications and services for enhanced observability.
- Optimize monitoring solutions for efficiency and accuracy while minimizing system performance impact.
- Design and implement application and infrastructure performance monitoring in an AWS Cloud environment.
- Create monitors and dashboards to track application and infrastructure performance.
- Conduct deep statistical analyses of performance data to identify capacity and performance bottlenecks.
- Configure alerting mechanisms within monitoring tools to proactively identify and address potential issues.
- Develop comprehensive documentation for monitoring tool configurations and OpenTelemetry implementations.
- Provide training to incident management teams on using monitoring tools and interpreting OpenTelemetry data.
- Set up monitoring dashboards for incident detection and alerting.
- Perform end-to-end analysis of transactions in an observability environment.
- Troubleshoot incidents and identify root causes using wire data analytics, application performance management, and event correlation tools.
- Diagnose and resolve incidents by providing factual data from various monitoring and instrumentation systems.
Qualifications:
- Strong understanding of IT cloud infrastructure, including AWS Cloud, middleware, database, storage, and networking.
- In-depth knowledge of IT infrastructure, networking, security concepts, and application architecture.
- Hands-on experience with OpenTelemetry instrumentation and telemetry data collection.
- Proven expertise as a Splunk SME with in-depth knowledge of Splunk architecture and components.
- Excellent troubleshooting and problem-solving skills.
- Strong documentation skills and attention to detail.
- Experience in proactively monitoring hardware, software, and environmental alerts.
- Ability to analyze dashboards and monitoring tools to identify trends and patterns in application/infrastructure health.
- Proficiency with monitoring tools such as Splunk, Dynatrace, Catchpoint, Moogsoft, xMatters, SolarWinds, and ExtraHop.
- Expertise in microservice-based applications deployed in the cloud using AWS services like Lambda and ECS Fargate.
- Familiarity with AWS services such as IAM, EC2, S3, RDS, Redshift, and CloudWatch.
- Experience with transaction-level monitoring using Dynatrace and Splunk, including creating search queries and dashboards.
- Ability to onboard new data sources into Splunk and analyze data for anomalies and trends.
- Ability to implement best practices for managing a distributed, clustered Splunk environment.
- Familiarity with distributed tracing and logging solutions.
- Knowledge of cloud platforms (AWS, Azure) and their integration with monitoring tools.
- AWS Certified Solutions Architect - Associate certification or higher.
- Experience in incident management environments, including triaging incidents in a 24/7/365 setup.
- Proficient in UNIX/Linux shell scripting and Python; working knowledge of JavaScript or Perl for customizing monitoring configurations.
- Certification in relevant monitoring tools or OpenTelemetry is a plus.
- Bachelor’s Degree or equivalent required.
- Minimum of 8 years of related experience.
"All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran."
Source: Datum Software, Inc