Staff Site Reliability Engineer - Saildrone
Alameda, CA 94501
About the Job
About Us
At Saildrone, we sustainably explore, map, and monitor the oceans to understand, protect, and preserve our world. We provide real-time access to critical data from any ocean on earth, 24/7/365, and use proprietary software applications to transform that data into actionable insights and intelligence. Our fleet of uncrewed surface vehicles (USVs), powered by renewable wind and solar energy, has a minimal carbon footprint and operates without the need for a crewed support vessel.
Saildrone works with governments, civil agencies, foundations, universities, and private companies around the globe to provide better information about our oceans and seas. Our missions range from sailing into the eye of a category 4 hurricane to obtain new data about how storms intensify, collecting new CO2 data in hard-to-reach areas, and counting fish biomass to inform sustainable fishery management, to mapping the ocean floor and reducing illegal fishing and drug trafficking. As a result of our work, we have been included on TIME's list of the 100 Most Influential Companies in 2024 and Fast Company’s list of the "World’s Most Innovative Companies for 2022," and have earned the Ocean Awards’ Innovation Award, presented by the Blue Marine Foundation. Saildrone's hurricane mission was included as one of The New York Times' "Top 21 Things that happened (for the first time) in 2021" and Popular Science's "100 Greatest Innovations of 2021," and entered the Guinness Book of World Records for the "highest windspeed recorded by a USV."
We are based in Alameda, CA, with offices in Washington, DC, and St. Petersburg, FL, and operate our missions worldwide. Saildrone is backed by top-tier investors in the frontier tech and sustainability sectors, including Social Capital, Capricorn, Lux Capital, BOND Capital, and Emerson Collective.
This is an exciting opportunity with a fast-growing team at the cutting-edge intersection of big data services and autonomous hardware. You will be an integral part of a high-performing multi-disciplinary team that delivers high impact for humanity and future generations.
The Role
We are seeking a talented Staff Site Reliability Engineer with a strong focus on observability and mentorship to join our dynamic team. In this role, you will act as a team tech lead, guiding engineering efforts to ensure the reliability, scalability, and performance of our systems while fostering a culture of continuous learning and improvement across the Software group. Your expertise in observability tools and practices will play a crucial role in scaling up Saildrone’s Site Reliability Engineering team, helping to ensure the quality of service that our customers have come to expect.
Responsibilities
- Monitoring Architecture: Design and implement robust monitoring frameworks to track the health and performance of applications and infrastructure.
- Observability Practices: Establish observability best practices, leveraging tools such as Datadog, Prometheus, Grafana, or similar to provide actionable insights.
- Alerting Strategies: Develop and maintain effective alerting strategies to ensure prompt incident response while minimizing noise.
- Incident Management: Lead incident response efforts, conducting thorough postmortems and root cause analyses to prevent future occurrences.
- Performance Optimization: Analyze system performance metrics and logs to identify bottlenecks and implement solutions for optimization.
- Collaboration: Work closely with development, operations, and product teams to integrate observability into the development lifecycle and improve system reliability.
- Documentation: Create and maintain comprehensive documentation of monitoring setups, incident responses, and SRE best practices.
- Capacity Planning: Collaborate on capacity planning efforts to ensure the infrastructure can scale to meet growing demands.
- Tooling and Automation: Identify opportunities for automation in monitoring and alerting processes to improve efficiency and reliability.
- Mentorship: Provide guidance and mentorship to our new SRE team and to the Software group as a whole, sharing expertise in monitoring, observability, and incident management.
Minimum Experience
- 8+ years of SRE experience. BA/BS in a related field or equivalent experience.
Required Skills
- Strong knowledge of AWS services and managing cloud-based infrastructure at scale.
- Strong experience with monitoring and observability tools (e.g., Datadog, Grafana, Prometheus).
- Strong proficiency with log management and analysis tools (e.g., Datadog Logs, ELK Stack, Splunk).
- Skills in scripting languages (e.g., Python, Bash) for automation and custom monitoring solutions.
- Strong experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
- Strong proficiency with Kubernetes, Helm Charts, and Helm deployment patterns.
- Understanding of key performance metrics and monitoring aspects (e.g., CPU usage, memory consumption, latency, error rates).
- Expertise in setting up alerts, handling incidents, and performing root cause analysis.
- High attention to detail for accurate monitoring, alert configuration, and performance tuning.
- Experience with monitoring databases (e.g., MySQL, PostgreSQL, MongoDB) and understanding related performance metrics.
- Effective communication skills to collaborate with cross-functional teams and report on system health and incidents.
- Excellent problem-solving skills and a proactive mindset.
Desired Skills and Experience
- AWS certifications (e.g., AWS Certified Solutions Architect, AWS Certified DevOps Engineer).
- Experience with other cloud platforms (Azure, Google Cloud Platform).
- Knowledge of networking fundamentals, including DNS, load balancing, and content delivery networks (CDNs).
- Ability to anticipate potential issues and implement proactive monitoring strategies.
Physical Requirements
- Work is performed on a computer and requires the ability to operate a keyboard and other peripheral devices.
Location: This is a hybrid position in Alameda, CA. Our waterfront office offers beautiful views of San Francisco Bay in always-sunny Alameda. Even our walls have good karma: our offices mix software development with a hardware production line in the former airplane hangar used to film 'The Matrix'.
Benefits:
- Medical, dental, and vision plans for you and your dependents.
- Short and relaxing ferry ride from the Ferry Building for SF residents.
- Enhanced parental leave programs.
- Competitive benefits, including excellent medical coverage, life insurance, and a 401(k) plan.
Catch up on the latest news about us:
East Bay company Saildrone provided essential weather tool for Hurricane Milton – CBS News
The Tiny Craft Mapping Superstorms at Sea – The New York Times
TIME 100 Most Influential Companies 2024: Saildrone
An Underwater Mountain was Newly Discovered off California Coast – San Francisco Chronicle
Hacking the Anthropocene with Survivalist Robots [VIDEO] – Freethink
An Unprecedented View Inside a Hurricane – EOS
Saildrone’s First Aluminum Surveyor Autonomous Vessel Splashes Down for Navy Testing – TechCrunch
USVs Could Deter IUU Fishing – USNI Proceedings
Saildrone Vehicles Track Whales around Offshore Wind Power – Workboat
Mullen, Former Joint Chiefs Chairman, to Lead Board for Unmanned Tech Firm Saildrone – Breaking Defense
The Navy Is Using Robot Ships to Deter Human Smuggling out of Haiti – Defense One
Saildrone's Quiet Voyage: Autonomous Vehicle Aids Great Lakes Fish Stock Study – Up North Live
Saildrone Featured Videos Playlist
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
At Saildrone, we value diversity and are committed to creating an inclusive workplace that welcomes people from all backgrounds, experiences, and perspectives. We believe that a diverse and inclusive team leads to innovation and better problem-solving. We encourage applications from candidates of all genders, ethnicities, races, sexual orientations, disabilities, and backgrounds.
Individual compensation packages are based on geographic location, scope of the role, relevant experience, and the ability to deal with complexity and problem solve within our organization, among other factors.
All employees are required to provide proof of authorization to work in the U.S. within their first 3 days of work. Please note that the Company does not sponsor employees for work visas or permanent resident cards to work in the U.S. If you need sponsorship for a work visa or green card, you will not be qualified for employment with Saildrone.
Any unsolicited resumes/candidate profiles submitted through our website or to personal email accounts of employees of Saildrone are considered property of Saildrone and are not subject to payment of agency fees.