Evaluation & Insights Engineer
London, England, United Kingdom
Machine Learning and AI
Imagine what you could do here. At Apple, great new ideas have a way of becoming extraordinary products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish! Are you passionate about music, movies, and the world of Artificial Intelligence and Machine Learning? So are we! Join our Human-Centered AI team for Apple Media Services. In this role, you'll represent the user perspective on new features, review and analyze data, and evaluate AI models powering everything from search and recommendations to other innovative features. Collaborate with Data Scientists, Researchers, and Engineers to drive improvements across our platforms.
Description
We are looking for an Evaluation & Insights Engineer for the Human-Centered AI team to help evaluate and improve AI systems by combining data science, model behavior analysis, and qualitative insights. In this role, you will analyze AI outputs, develop evaluation frameworks, design qualitative assessments, and translate findings into actionable improvements for product and engineering teams. This role blends deep technical expertise with strong analytical judgment to assess, interpret, and improve the behavior of advanced AI models. You will work cross-functionally with Engineering, Project Management, Product, and Research teams to ensure that the AI experience is reliable, safe, and aligned with human expectations.
Responsibilities
* AI Evaluation & Data Analysis - Lead complex evaluations of model behavior, identifying issues in reasoning, factuality, interaction quality, safety, fairness, and user alignment.
* Build evaluation datasets, annotation schemas, and guidelines for qualitative assessments.
* Develop qualitative and semi-quantitative scoring rubrics for measuring human-perceived quality (e.g., helpfulness, factuality, clarity, trustworthiness).
* Run structured evaluations of model iterations and summarize strengths/weaknesses based on qualitative evidence.
* Translate qualitative findings into clear loss patterns and actionable insights.
* Data Science & Modeling - Collaborate with model developers to refine model behavior using findings from qualitative outputs.
* Use statistical and computational methods to identify patterns in qualitative data (e.g., assigning loss patterns, error taxonomies, thematic categorization).
* Integrate qualitative evaluations with quantitative metrics (e.g., Precision@k, MRR, perplexity, accuracy, performance KPIs).
* Build dashboards, scripts, or workflows that codify evaluation metrics and automate portions of qualitative assessments.
* Framework & Pipeline Development - Create scalable pipelines for reviewing, annotating, and analyzing model outputs.
* Define evaluation frameworks that capture nuanced human factors (e.g., uncertainty, trust calibration, conversational quality, interpretability).
* Develop automated evaluation pipelines that collect, automatically judge, and analyze model outputs against evaluation guidelines at scale.
* Develop processes to track feature quality and model performance over time and flag regressions.
* Work with product teams to ensure AI behaviors align with real-world user expectations.
* Cross-Functional Collaboration - Work with ML and data scientists, software developers, project managers, and other teams at Apple to understand requirements and translate them into scalable, reliable, and efficient evaluation frameworks.
Minimum Qualifications
* Bachelor’s or Master’s degree in Data Science, Computer Science, Linguistics, Cognitive Science, HCI, Psychology, or a related field, and 5+ years of relevant job experience.
* Proficiency in Python for data analysis (pandas, NumPy, Jupyter, etc.).
* Experience working with large datasets and designing model-evaluation pipelines, taxonomies, categorization schemes, or structured rating frameworks.
* Analytical Strength: Ability to interpret unstructured data (text, transcripts, user sessions) and stitch together qualitative and quantitative findings into actionable guidance.
Preferred Qualifications
* Experience working directly with LLMs, generative AI systems, or NLP models.
* Familiarity with evaluations specific to AI quality, hallucination detection, or model alignment.
* Experience building internal tools, scripts, or dashboards for evaluation workflows.
* Familiarity with prompt engineering, RAG systems, or model fine-tuning.
* Experience evaluating LLMs, multimodal models, or other generative AI systems at scale.
* Expertise in designing annotation guidelines and managing large-scale annotation projects.
* Background in human factors, social science, or qualitative assessment methodologies.
At Apple, we’re not all the same. And that’s our greatest strength. We draw on the differences in who we are, what we’ve experienced and how we think. Because to create products that serve everyone, we believe in including everyone. Therefore, we are committed to treating all applicants fairly and equally. As a registered Disability Confident employer, we will work with applicants to make any reasonable accommodations. Apple will consider for employment all qualified applicants with criminal backgrounds in a manner consistent with applicable law.