The Search Conversational Experiences team builds Elastic’s new conversational (agentic) platform that lets customers chat with their own data in Elasticsearch. We own the quality layer for RAG, agents and tools, retrieval/citations, streaming, memory, and, crucially, the evaluation signals that turn open-ended questions into grounded, reliable answers.

As a Senior Data Scientist, you’ll be part of a cross-functional team (backend, DS, PM, UX) driving chat quality end-to-end: designing and running evaluation pipelines, improving prompts and tool behaviors, and turning measurements into product decisions that customers can feel. You’ll help tackle frontier problems: folding RAG and vector search into an agent’s knowledge base, dynamically enriching model context to boost groundedness, shaping agent routing and tool-selection policies, lighting up agent-driven visualizations on top of Elasticsearch data, and exploring multimodality and reasoning strategies where they truly move the needle. This is an applied role: you will prototype, evaluate, and partner with engineers to ship.

DUTIES

• Design and maintain offline/online evaluation pipelines for conversational search: golden sets, rubric/LLM-as-judge calibration, groundedness/citation checks, and A/B tests.
• Build and compare retrieval and re-ranking baselines (sparse and dense), query understanding, and semantic rewrites; land improvements with clear metrics.
• Use results to drive product decisions: model selection, efficient agent routing, tool gating, and agent customization for Elastic use cases in search and beyond.
• Instrument dashboards and telemetry so helpfulness, faithfulness, latency, and cost trade-offs are visible and trustworthy; guard against regressions in CI.
• Collaborate tightly with backend engineers on contracts (ES|QL, citations, telemetry), and with PM/UX to translate findings into shipped features.
• Share outcomes clearly (docs, notebooks, PRs) and mentor peers in experiment design and evaluation craft.

WHAT YOU BRING

• 5–8 years in applied DS/ML with strong IR/NLP experience (RAG, dense/sparse retrieval, re-ranking, vector search).
• Proficiency in Python, PyTorch/Transformers, and Pandas; reproducible experiments (e.g., MLflow), versioned datasets, and clean, reviewable code.
• Hands-on evaluation expertise: offline metrics (nDCG/MRR/Recall@k), LLM-as-judge calibration, groundedness/citation scoring, and online A/B testing.
• Experience turning experimental results into clear product calls (models, routing, tools) and communicating them crisply to cross-functional partners.
• Practical Elasticsearch experience (or similar); ES|QL familiarity is a plus.
• Comfort working in a distributed, async-first environment; strong written communication; low-ego collaboration.
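For candidates wondering what "offline metrics (nDCG/MRR/Recall@k)" looks like in practice, the standard definitions fit in a few lines of plain Python. This is an illustrative sketch of the textbook formulas, not Elastic’s evaluation code; the function names and argument shapes are the author’s own.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevances, in rank order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG normalized by the ideal (descending-sorted) DCG."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mrr(ranked_hits):
    """Mean reciprocal rank: each query contributes 1/rank of its first relevant hit.

    `ranked_hits` is a list of per-query 0/1 flags in rank order.
    """
    total = 0.0
    for hits in ranked_hits:
        for i, hit in enumerate(hits):
            if hit:
                total += 1.0 / (i + 1)
                break
    return total / len(ranked_hits)

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)
```

Running these per query over a golden set and averaging gives the system-level numbers used to compare retrieval and re-ranking baselines.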
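Similarly, a minimal groundedness/citation check of the kind the duties describe can be as simple as verifying that every citation in a generated answer refers to a document the retriever actually returned. The `[doc_id]` citation marker convention below is a hypothetical example for illustration, not Elastic’s actual citation contract.

```python
import re

def check_citations(answer, retrieved_ids):
    """Score citation grounding for one answer.

    Returns (grounded_fraction, unsupported_ids), where grounded_fraction is the
    share of cited ids present in `retrieved_ids`. An answer with no citations
    at all scores 0.0 under this (deliberately strict) convention.
    """
    cited = re.findall(r"\[([\w-]+)\]", answer)  # assumes [doc_id]-style markers
    if not cited:
        return 0.0, []
    unsupported = [c for c in cited if c not in retrieved_ids]
    grounded = (len(cited) - len(unsupported)) / len(cited)
    return grounded, unsupported
```

In a real pipeline this kind of deterministic check typically runs alongside an LLM-as-judge pass: the regex check catches fabricated citations cheaply, while the judge scores whether the cited text actually supports the claim.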