The Role
We’re looking for a Staff Site Reliability Engineer (SRE) to raise the reliability, scalability, and security bar across the Lyrebird platform.
This is a senior, high-impact role focused on designing and evolving the systems and practices that keep Lyrebird fast, safe, and available. You’ll work across infrastructure, application reliability, observability, incident response, and platform enablement - partnering closely with Engineering, Security, and Product.
This is not a “keep the lights on” role. You’ll drive meaningful improvements to how we build, deploy, and operate our services in production - with real autonomy and ownership.
About Lyrebird Health
Lyrebird Health is transforming the quality and accessibility of healthcare by automating clinicians’ most time‑consuming tasks. Thousands of clinicians across many disciplines already use Lyrebird, and that number is growing every day.
They trust us to deliver a fast, reliable, and secure experience. We value that trust above all else and strive to earn it while continuing to amaze our users.
What You'll Do
* Reliability & Production Engineering
* Own reliability outcomes across core services and customer‑facing systems
* Define, implement, and evolve SLOs/SLIs, alerting strategy, and error budgets
* Lead initiatives to improve uptime, latency, and overall system resilience
* Proactively identify reliability risks and drive mitigation plans to completion
* Observability & Incident Response
* Improve end‑to‑end observability (metrics, logs, traces) so issues are detected early and diagnosed quickly
* Lead incident response for high‑severity events and guide teams through calm, effective mitigation
* Drive post‑incident reviews that result in measurable, lasting improvements
* Build a culture of operational excellence: fewer incidents, faster recovery, better learning
* Platform Enablement
* Develop internal tooling and paved paths that make “doing the right thing” the easiest option
* Improve the developer experience around deployments, rollbacks, environment consistency, and service ownership
* Partner with engineers to uplift production‑readiness across new and existing services
* Infrastructure & Automation
* Improve infrastructure reliability and maintainability using Infrastructure as Code
* Strengthen deployment workflows and reduce operational toil through automation
* Help shape architecture decisions with a reliability and scalability lens
* Security & Compliance Support
* Embed security and compliance principles into platform practices (access controls, auditability, safe‑by‑default designs)
* Work closely with Security and Engineering leadership to support regulatory and enterprise requirements without slowing down delivery
What We’re Looking For:
* 8+ years of engineering experience, with significant depth in SRE, platform, or production systems
* Strong experience operating and improving systems in production (including incident response)
* Proven ability to lead cross‑team initiatives and influence engineering standards
Technical Strength
You don’t need to tick every box, but you should be strong across most:
* Cloud/Infrastructure
* AWS (ECS, EC2, VPC, IAM, RDS/Aurora, S3, CloudWatch)
* Infrastructure as Code (Terraform)
* Observability
* Strong grasp of monitoring and alerting principles
* Experience with logs, metrics, and traces, and with building meaningful dashboards
* Familiarity with OpenTelemetry and modern observability tooling
* Systems & Operational Excellence
* Knowledge of reliability patterns: graceful degradation, retries, backoff, timeouts, load shedding, capacity planning
* Strong debugging instincts across distributed systems
* Practical approach to risk management and tradeoffs
* Software Engineering
* Ability to build tools and automation (TypeScript, Go, Python, or similar)
* Familiarity with CI/CD and safe rollout strategies (feature flags, canary, blue/green)
Bonus Skills (Nice to Have):
* Experience supporting security frameworks (SOC 2, ISO 27001, HIPAA‑style environments)
* Experience with service mesh patterns, multi‑account AWS environments, or multi‑region design
* Experience working with healthcare or regulated domains
* Experience scaling engineering practices as an organization grows
Who You Are:
* You’re deeply accountable - you take ownership of outcomes, not just tasks
* You value simplicity and reliability over cleverness
* You’re calm and effective in incidents, and you raise the quality bar afterward
* You communicate clearly across engineering and non‑engineering stakeholders
* You’re pragmatic: you know when to move fast, and when to slow down to reduce risk
Why This Role Is Different:
* Staff‑level scope with real influence across engineering
* Direct impact on reliability for a product clinicians depend on every day
* Work on meaningful problems where security, performance, and trust matter
* High ownership environment with room to shape how the company operates at scale
At Lyrebird, you won’t just respond to incidents - you’ll design the systems and standards that prevent them.
We’re building a team that reflects the diversity of the people who’ll benefit from our work. If you’re from an underrepresented background in tech, we especially encourage you to apply - even if you don’t meet every single requirement.