AI Evaluation in Action: Lessons from Real-World Implementers

AI systems are increasingly being deployed in public-facing contexts, assisting doctors with triage, guiding users through government services, supporting legal processes, and answering a wide range of questions at scale. As their use grows, the need for rigorous evaluation becomes both urgent and complex for organizations building, funding, or deploying these technologies. Evaluating AI in practice is challenging because these systems evolve quickly, interact with users differently across contexts, influence behavior in subtle ways, and operate under tight resource and time constraints.

To address these challenges, IDinsight, in collaboration with the Center for Global Development and The Agency Fund and with support from the Gates Foundation, is developing a living AI evaluation playbook. The aim is to create a practical resource that helps organizations evaluate AI systems rigorously and sustainably, while reflecting the realities faced by social sector organizations. An early version of this playbook was based on a four-level AI evaluation framework formulated by The Agency Fund, J-PAL, and the Center for Global Development, which evaluates AI across model outputs, product performance, user behavior, and ultimate development outcomes.

To understand how organizations evaluate AI in practice, interviews were conducted with practitioners working in health, social protection, justice, and behavior change. These conversations revealed not a single model of evaluation but recurring patterns shaped by organizational mission, practical constraints, and implementation realities. A key insight is that evaluation often begins by addressing the most consequential question or risk, whether safety, desirability, or usability, rather than strictly following a linear framework. For example, RightWalk Foundation focused on whether users could navigate a government apprenticeship portal effectively, while Cliniva prioritized early user feedback on a WhatsApp-based health support tool before fully optimizing underlying AI models. Most teams conducted model checks alongside broader evaluation, resulting in a multi-pronged approach that balances risk, feasibility, and clarity for decision-making.

Evaluation of product performance and user behavior is closely intertwined. Teams track indicators such as task completion, user adherence, and system corrections that reflect both product function and user engagement. Pinky Promise, for instance, evaluates medication adherence and symptom resolution, integrating these measures as signals of both system performance and user trust. Dalberg Data Insights observed that examining user interactions within workflows reveals whether product design, AI behavior, and automated processes effectively support meaningful outcomes. These integrated measures often provide more actionable insights than separating product and user metrics.
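As a rough illustration of how such integrated indicators might be computed, the Python sketch below summarizes task completion and system corrections from session records. The `Session` fields and the `summarize` function are hypothetical names chosen for this example; they do not describe any interviewed organization's actual data model or pipeline.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """One user interaction with the product (hypothetical schema)."""
    user_id: str
    task_completed: bool  # did the user finish the intended task?
    corrections: int      # times the system's output had to be corrected


def summarize(sessions):
    """Return combined product/user indicators for a list of sessions."""
    n = len(sessions)
    if n == 0:
        # No data yet: report counts only, leave rates undefined.
        return {
            "sessions": 0,
            "task_completion_rate": None,
            "corrections_per_session": None,
        }
    completed = sum(s.task_completed for s in sessions)
    corrections = sum(s.corrections for s in sessions)
    return {
        "sessions": n,
        "task_completion_rate": completed / n,
        "corrections_per_session": corrections / n,
    }


if __name__ == "__main__":
    demo = [
        Session("a", True, 0),
        Session("b", False, 2),
        Session("c", True, 1),
        Session("d", True, 1),
    ]
    print(summarize(demo))
```

Reporting both numbers side by side reflects the point above: a high completion rate paired with frequent corrections could suggest users are succeeding despite the system rather than because of it.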

Domain experts play a central role in evaluating AI accuracy and safety, especially in high-stakes fields like healthcare and law. Automated methods alone are insufficient; experts define desired outcomes, identify unacceptable failures, and review AI outputs. Intelehealth leverages physician review and LLM-based evaluation to monitor clinical decision support tools, while AdalatAI relies on legal experts to curate training data. Though expert review is resource-intensive, it is critical for ensuring safe and reliable AI performance.
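One way a team combining expert review with LLM-based evaluation might check whether the automated judge tracks expert judgment is to compare their labels on a shared sample. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic. It is an assumption made for illustration only: the source does not say which agreement measure, if any, Intelehealth or other organizations use, and the label values are invented.

```python
from collections import Counter


def cohens_kappa(expert_labels, judge_labels):
    """Chance-corrected agreement between two raters on the same items."""
    if len(expert_labels) != len(judge_labels) or not expert_labels:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(expert_labels)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(e == j for e, j in zip(expert_labels, judge_labels)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    expert_counts = Counter(expert_labels)
    judge_counts = Counter(judge_labels)
    expected = sum(
        expert_counts[label] * judge_counts[label] for label in expert_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical labels for four AI outputs reviewed by both raters.
    expert = ["safe", "safe", "unsafe", "unsafe"]
    judge = ["safe", "safe", "unsafe", "safe"]
    print(cohens_kappa(expert, judge))
```

A low kappa on a periodically refreshed expert-labeled sample would be a signal to keep more outputs under human review, which is one way to ration scarce expert time.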

Impact is assessed through pragmatic proxies that support learning in evolving systems. Organizations track intermediate indicators such as diagnostic accuracy, consultation time, workflow completion, or productivity gains as early signals of progress toward long-term goals like health improvements or access to justice. Because AI systems change continuously, traditional long-term impact evaluations are often impractical, so teams rely on frequent, directional signals to guide iteration and decision-making. Resource constraints and the complexity of rigorous trials mean many organizations favor outcome-focused evaluation methods that are feasible alongside ongoing product development.

Overall, AI evaluation in practice is shaped by trade-offs between risk, feasibility, and learning priorities. Insights from practitioners indicate that effective evaluation is iterative, context-aware, and focused on generating actionable understanding rather than comprehensive certification. The AI Evaluation Playbook aims to capture these lessons, evolving alongside technological advancements and field experience. Ongoing practitioner input remains central to refining this resource, supporting organizations to evaluate AI for social impact in ways that are practical, meaningful, and sustainable.

IDinsight encourages organizations that are building or deploying AI for social impact and wish to share experiences or contribute case studies to reach out to Sid Ravinutala ([email protected]) or Isha Fuletra ([email protected]).