Senior AI Ops Engineer
What Skills Do I Need?
- Technical Background: Bachelor's degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience).
- 5+ Years of Experience: Experience in ML Engineering, Data Science, AI Engineering, Platform Engineering, or a related role, with proven responsibility for the reliability, performance, and operational excellence of production AI and Machine Learning systems.
- AI & LLM Expertise: Practical experience with LLMs, RAG architectures, embeddings, prompt engineering, evaluation frameworks, and inference trade-offs. Deep familiarity with LLM concepts such as tokenization, temperature, context windows, and latent representations.
- Operations & Observability: Experience with CI/CD for AI systems, model versioning, experiment tracking, performance monitoring, and incident response.
- Analytical & Comparative Mindset: Strong ability to evaluate competing AI systems, vendors, and architectures using measurable performance indicators.
- Data-Driven Mindset: Experience with AI observability and evaluation platforms (e.g., LangSmith, Arize, HoneyHive), driving improvements through structured metrics rather than intuition.
- Builder Mentality: Comfortable maintaining and customizing AI information systems, including prompt repositories, skills libraries, orchestration tools, no-code/low-code platforms (e.g., n8n, Zapier, Replit, Glean), and system integrations.
- Security & Compliance Awareness: Understanding of secure system design, data privacy, and operational controls within financial or regulated environments.
- Collaborative Spirit: Proven ability to act as the connective layer between Engineering, Product, Data, and business stakeholders.
Original Advert
The AI Ops Engineer is a highly technical role responsible for the reliability, scalability, and continuous improvement of dLocal's AI ecosystem across the entire organization. As AI adoption expands company-wide, we need a hands-on expert who can operate, maintain, evaluate, and evolve our AI systems in production, ensuring consistent quality, robustness, security, and measurable business impact for teams across dLocal.
In this role, you will act as both an enabler and a technical authority. You will work alongside Engineering and cross-functional teams to ensure our AI systems, whether built in-house or leveraging best-in-class third-party tools, operate at the highest standards of performance and efficiency. You will maintain and customize AI-powered information systems, including prompt libraries, skills frameworks, orchestration layers, integrations, evaluation pipelines, and observability tooling.
This position requires strong analytical capabilities, deep systems thinking, and the ability to rigorously compare models, architectures, and tools to consistently achieve the best possible outcomes.
What Will I Be Doing?
- Operate & Maintain AI Systems: Ensure the reliability, scalability, and observability of AI-powered services deployed on AWS, including LLM-based systems and agentic workflows used across the organization.
- Architect Agent Behavior: Design and version-control complex system prompts, ensuring agents have clear personas, robust guardrails, and precise tool definitions.
- Curate Knowledge Context: Manage the "Golden Corpus" for our agents and RAG systems, optimizing data chunking and metadata strategies to ensure accurate information retrieval and proper execution context.
- Architect & Optimize AI Workflows: Design and continuously improve prompt libraries, skills repositories, orchestration frameworks, and automation pipelines that power internal AI tools.
- Model Evaluation & Benchmarking: Evaluate and compare AI models (LLMs and foundation models) across quality, latency, cost, safety, and robustness.
- Implement Automated Evals: Build "Ground Truth" datasets and design "LLM-as-a-Judge" pipelines to rigorously test agent performance before deployment.
- Enablement & Best Practices: Provide reusable components, documentation, and operational standards that empower engineering teams and internal stakeholders to safely leverage AI capabilities.
- Experimentation & Continuous Improvement: Drive structured experimentation cycles (A/B testing, offline evals, shadow testing) to iteratively improve system performance.
- Governance & Guardrails: Implement versioning strategies, access controls, auditability, and responsible AI guardrails aligned with a regulated fintech environment.
Application managed by dLocal