 
        
Data Scientist / 6-month contract (extendable) / London, UK / £30-40 per hour
Role Overview
 * Join a dynamic team driving a Firm-wide GenAI initiative aimed at advancing our people solutions. Collaborate with engineers, data scientists, designers, product managers, and stakeholders to deliver a critical product that supports the development, engagement, and retention of exceptional talent. As the Data Scientist (DS) in charge of testing, you’ll ensure the quality, reliability, and performance of cutting-edge features during an exciting phase of growth, supporting a rapidly expanding user base. This role is ideally suited to candidates in US or European time zones.
 * As the DS for LLM testing, you will define and execute the technical vision and strategy for AI controls and testing. Your responsibilities will include continuous monitoring, evaluation, and reporting of LLM features to ensure compliance with internal standards, best practices, and external regulations. You’ll play a key role in risk assessment and mitigation, guiding the responsible development and deployment of LLMs.
 * You will design and implement test cases for LLM governance and development, enabling your team to define features and mitigate risks. Collaborating with cross-functional teams, you’ll develop tools, automation strategies, and data pipelines to support scalable LLM management. Additionally, you’ll create standardized reporting templates for both technical and senior leadership audiences, ensuring clear communication of results.
 * Your work will involve close collaboration with tool owners and senior management to present findings, assess risk implications, and propose enhancements to AI tools.
Responsibilities
 * Lead testing efforts for the platform, focusing on LLM output testing to ensure reliability, accuracy, and performance
 * Develop and maintain a comprehensive, representative dataset of inputs and expected outputs for each prompt in the tool (i.e., a benchmark dataset)
 * Develop and maintain comprehensive testing strategies, including semantic similarity, Q&A validation, claims verification, LLM judge evaluations, and metrics like ROUGE
 * Collaborate with engineering, product, and data science teams to define testing requirements, thresholds, and standards
 * Design and implement robust test cases aligned with business goals and user needs
 * Write and maintain automated tests in Python using frameworks like pytest (prior experience with Opik is not required)
 * Monitor and improve test stability to support application changes
 * Establish and track QA KPIs, such as test coverage and stability, to measure and communicate platform quality
 * Stay updated on industry best practices for GenAI/LLM testing and integrate them into QA processes
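
To make the benchmark-dataset and ROUGE responsibilities above concrete, here is a minimal sketch of what one automated check might look like in pytest. Everything in it is an assumption for illustration only: the sample data, the names BENCHMARK, ROUGE1_THRESHOLD and rouge1_f1, and the 0.5 pass mark are hypothetical, and the ROUGE-1 score is a simplified whitespace-token version (no stemming or punctuation handling) rather than the output of any particular library.

```python
from collections import Counter

# Hypothetical benchmark: each case pairs the expected output for a prompt
# with the output the model actually produced. Data is made up.
BENCHMARK = [
    {"expected": "the review cycle opens in march",
     "actual": "the review cycle opens in march"},
    {"expected": "feedback is due by friday",
     "actual": "feedback must be submitted by friday"},
]

ROUGE1_THRESHOLD = 0.5  # assumed pass mark; in practice agreed with the team


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 (simplified ROUGE-1) on lowercased whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def test_benchmark_meets_rouge1_threshold():
    # pytest collects functions named test_*; a plain assert reports failures.
    for case in BENCHMARK:
        score = rouge1_f1(case["expected"], case["actual"])
        assert score >= ROUGE1_THRESHOLD, case
```

Saved in a file whose name starts with test_, this would be picked up by running pytest in the same directory. A lexical metric like this only measures word overlap, which is why the posting pairs it with semantic similarity and LLM judge evaluations.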
Skills
 * Python Proficiency: Strong experience in writing and maintaining Python code
 * LLM/GenAI Testing Expertise: Experience in testing LLM outputs, including semantic similarity, Q&A validation, claims verification, LLM judges, and evaluation metrics like ROUGE
 * Testing Frameworks: Understanding of automated testing tools (e.g., pytest)
 * Test Strategy Development: Proven ability to design and implement test strategies for complex systems
General
 * Leadership: Demonstrated ability to lead your own workstream and drive quality initiatives in fast-paced environments
 * Stakeholder Collaboration: Strong communication skills to align technical and non-technical stakeholders on testing needs and standards
 * Execution-Oriented: Self-driven with a “get stuff done” mindset, able to work independently and adapt quickly
 * Agile Mindset: Familiarity with agile principles and product development processes
 * Global Collaboration: Comfortable working with a global team and accommodating occasional early or late meetings
Education
Bachelor's degree in a quantitative field such as Computer Science, Engineering, Statistics, or Mathematics, or in a related field, is required. An advanced degree is a strong plus.