Are you passionate about open source, search and recommendation systems, and AI integration? Do you thrive on running large systems for global applications, automating every aspect of them, and interfacing with the major cloud providers? If so, we want you to join our team at Vespa.ai as a Site Reliability Engineer!
About Vespa.ai:
Vespa.ai is a team of passionate builders. We maintain and develop the Apache 2.0 licensed open-source project Vespa. Vespa lets our users run big data and AI online, at any scale, with unbeatable performance.
Vespa is a fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query. Integrated machine-learned model inference lets you apply AI to make sense of data in real time. Together with Vespa’s proven scaling and high availability, this empowers you to create production-ready search applications at any scale and with any combination of features. Our users and customers are #1 in e-commerce, content, and financial services globally, and Vespa is used by companies like Spotify, Yahoo, Wix, and many more.
In addition to our open-source platform, Vespa.ai runs Vespa Cloud, a robust SaaS offering that allows businesses to harness the power of our technology with ease.
At Vespa.ai, we are extremely focused on automating whatever we do to be able to grow fast with high quality. In all roles, we scale using technology, not simply larger teams. We take pride in being small, nimble, and the most productive.
Position Overview:
As a Site Reliability Engineer at Vespa.ai, you will play a crucial role in ensuring the availability, reliability, and performance of our SaaS platform for customers around the world. You will collaborate with cross-functional teams to design, implement, and maintain robust infrastructure solutions that meet our high availability requirements. The ideal candidate will have a strong background in system architecture, automation, and a passion for optimizing and scaling systems.
The Vespa Services Team develops the infrastructure for Vespa Cloud. This includes AWS/GCP/Azure automation using Terraform as well as auth and security integration with services like Auth0 and Teleport. Other examples are billing integration with credit card providers, compliance automation, and custom maintainer modules with code written in Java.
An ideal candidate dislikes doing things twice and automates using Java or scripts, backed by proper monitoring such as alerts, badges, and dashboards. Experience with monitoring and alerting tools such as Grafana and OpsGenie is a big plus.
Responsibilities:
System Architecture and Design:
- Collaborate with development and operations teams to design scalable and reliable infrastructure solutions.
- Evaluate and implement best practices for system architecture to ensure high availability and performance.
Automation and Infrastructure as Code:
- Develop and maintain automation scripts and tools for provisioning, configuration, and deployment.
- Implement Infrastructure as Code (IaC) practices to manage and version infrastructure components.
Monitoring and Incident Response:
- Implement and maintain robust monitoring and alerting systems to proactively identify and address potential issues.
- Participate in a 24×7 on-call rotation to ensure continuous availability and performance of Vespa Cloud services, and act as an escalation point for customer production incidents.
Capacity Planning and Performance Optimization:
- Conduct capacity planning to anticipate and address future growth.
- Optimize system performance and resource utilization to meet or exceed service level objectives.
Security and Compliance:
- Work closely with the security team to implement and maintain security best practices.
- Ensure compliance with industry standards and regulations.
Collaboration and Documentation:
- Collaborate with cross-functional teams, including development, operations, and support, to address system-related challenges.
- Maintain comprehensive documentation for system architecture, processes, and procedures.
Qualifications:
- Proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role.
- Strong programming skills.
- Strong knowledge of system architecture, cloud infrastructure, and networking.
- Proficiency in scripting languages (e.g., Python, Bash) and automation tools (e.g., Ansible, Terraform).
- Experience with containerization and orchestration tools (e.g., Docker, Kubernetes).
- Familiarity with monitoring and logging tools (e.g., Prometheus, ELK stack).
- Excellent problem-solving and troubleshooting skills.
- Familiarity with distributed systems.
Why Join Us:
- Be part of a cutting-edge team working on innovative search and recommendation technology.
- Contribute to the development of a high-performance, open-source platform with a global impact.
- Collaborate with a talented team of engineers, product experts, and sales professionals.
- Competitive salary, benefits, and opportunities for professional growth.
If you are excited about the intersection of open source, search and recommendation systems, and AI integration, and have a genuine passion for quality and automation, we would love to hear from you! Apply now to join the Vespa Team and play a key role in shaping the future of our industry.
Note: Vespa.ai is an equal-opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We believe in fostering a collaborative and inclusive environment where every team member has the opportunity to make a significant impact.