Site Reliability Engineer (SRE) Lead

September 27, 2024

Job Description

About us.

Trumid is a fast-growing financial technology company and fixed-income electronic trading platform, bringing efficiency to credit trading through data, technology, and intuitively designed products. Founded in 2014 by a team of fixed-income market experts, we have become one of the top three corporate bond e-trading platforms. 1,300+ traders transact on Trumid monthly from an extensive and expanding client network of 850 buy- and sell-side institutions.

With a rich history of innovation, we pride ourselves on staying nimble and agile as we grow. We take a tech-first, client-driven approach, collaborating closely with our users and iterating quickly toward optimal solutions. With client engagement at its highest levels and our pace of product development faster than ever, this is an exciting and transformative time at Trumid.

Our business model thrives on participation and connection, and so does our culture. We work together towards common goals – constantly pushing into unexplored areas and new ways of thinking. To succeed at Trumid, you must be curious, passionate about your craft, ambitious, collaborative, and driven. Learn more at www.trumid.com.

The opportunity.

Trumid is looking for a Lead Site Reliability Engineer (SRE) to ensure our systems’ reliability, scalability, and performance as we continue to grow. This role offers a unique opportunity to shape our fast-growing firm’s reliability practices and infrastructure. You will be crucial in optimizing our existing infrastructure, implementing new technologies, and enhancing our incident response capabilities.

As a Lead SRE, you will oversee the stability and performance of our trading platform, which serves a large and growing client base. You’ll work closely with development and DevOps teams to build scalable solutions and automate processes to enhance system reliability. You will also play a critical role in incident management, problem resolution, and capacity planning, ensuring that our systems meet our users’ high expectations.

This role is ideal for someone passionate about reliability, automation, and efficiency. You will have the chance to lead initiatives that directly impact our platform’s stability and user experience, ensuring that we maintain the highest levels of service availability.

Responsibilities will include:
  • Transform the SRE function to evolve, simplify, and scale existing solutions. Innovate and create new solutions and practices where needed.
  • Drive improvements in system reliability, scalability, and performance through innovative solutions and industry best practices.
  • Lead incident response efforts, including troubleshooting, resolution, and post-mortem analysis to prevent future incidents.
  • Automate repetitive tasks to reduce manual intervention and improve operational efficiency.
  • Collaborate closely with software development, DevOps, and infrastructure teams to embed reliability into the development lifecycle.
  • Design, implement, and maintain highly available, scalable, and resilient infrastructure to meet the demands of our growing client base.
  • Develop and maintain monitoring, logging, and alerting frameworks to ensure system health and to identify and resolve issues preemptively.
  • Conduct capacity planning and performance tuning to support future growth.

About you.

  • SRE expert with foundational knowledge of SRE best practices.
  • Demonstrated hands-on experience managing large-scale, highly available cloud-based systems.
  • Deep understanding of cloud components in at least one of the major cloud providers (e.g., AWS, GCP, Azure), including infrastructure, services, and tooling.
  • Expertise in containerization and orchestration tools (e.g., Docker, Kubernetes) and experience with deployment strategies such as blue-green and canary deployments.
  • Strong knowledge of CI/CD pipelines and experience in integrating reliability practices within CI/CD processes.
  • Proficient with monitoring and observability tools (e.g., Prometheus, Grafana, Alertmanager) to ensure system health and to create effective alerting mechanisms.
  • Experience with Infrastructure as Code (IaC) tools like Terraform and Ansible and experience automating infrastructure deployment and management.
  • Excellent problem-solving skills, with a focus on diagnosing complex issues in large-scale distributed systems.
  • Strong scripting and programming skills in Python, Bash, Go, or similar languages.
  • Strong communication and collaboration skills, capable of working effectively with cross-functional teams in a fast-paced environment.
  • Passion for reliability, automation, and continuous improvement.
  • Bachelor’s degree in computer science (or equivalent) and at least 5 years of professional experience at a fast-paced, tech-oriented company. Experience with financial and trading systems is a plus but not required.

Employee Benefits.

  • Highly competitive compensation
  • Fully paid medical, dental, and vision coverage
  • Remote work
  • Team-oriented and collaborative company culture

Trumid is an equal-opportunity employer.

In compliance with the New York City Pay Transparency Law, the base salary range for this role in New York City is between $220,000 and $300,000. This range does not include discretionary bonuses or other compensation or benefits offered with this job. Several factors are considered when determining a candidate’s salary.