Site Reliability Engineer
Trainline
About us
We are champions of rail, inspired to build a greener, more sustainable future of travel. Trainline enables millions of travellers to find and book the best value tickets across carriers, fares, and journey options through our highly rated mobile app, website, and B2B partner channels.
Great journeys start with Trainline 🚄
Now Europe's number 1 downloaded rail app, with over 125 million monthly visits and £5.9 billion in annual ticket sales, we collaborate with 270+ rail and coach companies in over 40 countries. We want to create a world where travel is as simple, seamless, eco-friendly and affordable as it should be.
Today, we're a FTSE 250 company driven by our incredible team of over 1,000 Trainliners from 50+ nationalities, based across London, Paris, Barcelona, Milan, Edinburgh and Madrid. With our focus on growth in the UK and Europe, now is the perfect time to join us on this high-speed journey.
Introducing Reliability & Operations Engineering 👋
Trainline is a fast-growing tech company powering world-class digital journeys for millions of customers. Our platform runs primarily on AWS, built on cloud-native architecture, modern CI/CD pipelines, and strong DevOps and SRE practices.
The Reliability & Operations Engineering team (ReliabilityOps) brings together SRE, Incident Management, and Database Reliability to keep our platform observable, reliable, scalable, and resilient. We partner closely with product engineering teams to enable safe delivery, respond to incidents, and continuously strengthen system reliability.
We're looking for a mid-level Site Reliability Engineer to help drive this forward. You'll bring solid production experience, a growth mindset, and a willingness to challenge and be challenged - contributing to platform reliability while developing broader technical ownership with support from senior engineers.
As an SRE at Trainline, you'll be working on...🚄
Developing an understanding of system architecture, dependencies, and failure modes across the Trainline platform
Participating in production incident response, supporting investigation, mitigation, communication, and coordinated service restoration
Contributing to post-incident reviews and follow-up actions to improve reliability, scalability, and resilience
Taking part in the SRE on-call rotation
Designing, building, and maintaining observability using metrics, logs, events, and traces to support effective detection and diagnosis
Improving monitoring and alerting by aligning signals to business and customer impact, reducing noise and improving mean time to detection (MTTD)
Ensuring relevant operational data is surfaced quickly and clearly during live incidents
Making informed tooling and technology choices using SRE principles, balancing team and business needs
Supporting AWS-hosted infrastructure and shared platform services using infrastructure-as-code and CI/CD tooling
Collaborating with product engineering teams to ensure services are operationally ready and deployed safely
Advising on reliability and resilience practices
Writing and maintaining reliable, well-structured code and scripts to support reliability and observability goals
Prioritising work effectively and collaborating using agile processes to deliver against team and business goals
Our Tech Stack 🔑
AWS
New Relic
ELK stack
Grafana
Incident.io
Docker, ECS
Terraform
GitHub Actions
We'd love to hear from you if you have...🔍
Experience of SRE concepts such as SLI, SLO and error budgets.
Hands-on experience with observability tooling such as New Relic, Elastic (ELK Stack), Influx, Grafana or similar.
Experience working with cloud providers (preferably AWS).
Experience troubleshooting Linux operating systems.
Experience of scripting in at least one language (preferably Python).
Understanding of load balancing and reverse proxy concepts, including upstream configuration, upstream health checks, and worker and data flow.
Understanding of application architecture concepts (threading, queuing, readiness checks, health checks, circuit breakers, timeouts, exponential backoff, throttling).
Experience building, maintaining and evolving time series data, including handling retention and cardinality and applying deviation, moving averages and other functions.
Experience with build, deployment & configuration management tooling such as GitHub Actions and Terraform.
More information:
Enjoy fantastic perks like private healthcare & dental insurance, a generous work from abroad policy, 2-for-1 share purchase plans, an EV Scheme to further reduce carbon emissions, extra festive time off, and excellent family-friendly benefits.
We prioritise career growth with clear career paths, transparent pay bands, personal learning budgets, and regular learning days. Jump on board and supercharge your career from day one!
We operate a hybrid working model and ask that Trainliners work from the office a minimum of 60% of their time over a 12-week period. We also have a 28-day Work from Abroad policy.
Our values represent the things that matter most to us and what we live and breathe every day, in everything we do:
💭 Think Big - We're building the future of rail
✔️ Own It - We focus on every customer, partner and journey
🤝 Travel Together - We're one team
♻️ Do Good - We make a positive impact
We know that having a diverse team makes us better and helps us succeed. And we mean all forms of diversity - gender, ethnicity, sexuality, disability, nationality and diversity of thought. That's why we're committed to creating inclusive places to work, where everyone belongs and differences are valued and celebrated.
Interested in finding out more about what it's like to work at Trainline? Why not check us out on LinkedIn, Instagram and Glassdoor!
Application managed by Trainline