AI grant proposals have a structural problem: they are easy to write convincingly and hard to evaluate rigorously. The technology is unfamiliar to most programme officers. The terminology is new. The metrics are unclear. And the sector is moving fast enough that everyone feels behind.
The result is that a lot of AI funding is chasing signals of sophistication (chatbots, large language models, voice AI) rather than evidence of impact. Maggie Johnson, who runs Google.org, calls funder readiness the under-built side of the equation. We talk endlessly about whether grantees are "AI-ready." We rarely ask whether we are.
These five questions will not make AI grant evaluation easy. They will help you tell the difference between an organisation that has thought this through and one that has not.
What specific workflow does this AI intervention improve — and how will you measure whether it improved?
The weakest AI proposals describe what the technology does. The strongest describe what changes in the real world as a result. "We are building an AI Teacher Coach" is a technology description. "We are reducing the time teachers spend on lesson planning by 40%, freeing up three hours per week for direct student support" is a workflow improvement with a measurable outcome.
This distinction matters because adoption metrics are not outcome metrics. Reach is not change. A million users of a tool that nobody acts on is not impact — it is engagement.
A named workflow. A specific metric. A baseline measurement already taken. A plan for measuring change at three, six, and twelve months.
The proposal describes capabilities, not changes.
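The arithmetic behind a claim like the lesson-planning example can be checked before any technology is discussed. The sketch below is a minimal illustration, not a prescribed tool: the `WorkflowMetric` class, the 7.5-hour baseline, and the check-in readings are all hypothetical. It shows that a 40% reduction frees three hours per week only if the measured baseline is 7.5 hours, which is why the baseline needs to exist before the grant starts.

```python
from dataclasses import dataclass


@dataclass
class WorkflowMetric:
    """A named workflow, its baseline, and the reduction the proposal promises."""
    workflow: str
    unit: str
    baseline: float          # measured before the intervention starts
    target_reduction: float  # 0.40 means "reduce by 40%"

    def freed(self) -> float:
        """Amount freed per period if the target is met."""
        return self.baseline * self.target_reduction

    def reduction_at(self, measured: float) -> float:
        """Observed reduction relative to baseline at a check-in."""
        return (self.baseline - measured) / self.baseline

    def on_track(self, measured: float) -> bool:
        """True when the observed reduction meets or exceeds the target."""
        return self.reduction_at(measured) >= self.target_reduction


# Hypothetical numbers: the baseline and the check-in readings are invented.
metric = WorkflowMetric("lesson planning", "hours/week",
                        baseline=7.5, target_reduction=0.40)
checkins = {3: 6.5, 6: 5.5, 12: 4.4}  # month -> measured hours/week

print(f"hours freed if target met: {metric.freed():.1f}")
for month, hours in checkins.items():
    print(f"month {month}: {metric.reduction_at(hours):.0%} reduction, "
          f"target met: {metric.on_track(hours)}")
```

A proposal that cannot fill in `baseline` has not yet measured the workflow it claims to improve.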
What does the data architecture look like — and who controls the data?
AI systems are only as good as the data they operate on. And in social sector contexts, data ownership is a governance question as much as a technical one. Who controls the data? What happens to it if the nonprofit's funding changes? What are the consent frameworks for beneficiary data? Does the eventual government partner have requirements that have not yet been mapped?
Ownership should track accountability: whoever is responsible when the system fails should also own it. This sounds obvious until you watch what happens in practice. A nonprofit builds an AI system on a foundation grant. The grant ends. The model costs are unaffordable, the engineering capacity is gone, and the government partner who was meant to absorb the system is on a different fiscal cycle. The system goes dark. The data goes nowhere.
A useful sanity check is to ask: if the grantee disappeared tomorrow, would the system still be usable by the people who depend on it? If not, you funded a tool, not infrastructure. That may be the right call, but it should be a deliberate one.
A clear description of where data lives, who owns it, what consent frameworks are in place, and what data sharing agreements look like with technology partners and government partners. A defined plan for data stewardship beyond the grant horizon — nonprofit stewardship, joint stewardship with a transfer plan, Digital Public Good architecture, or government ownership from day one.
Vague answers about data ownership, or no mention of government data requirements in a government-adjacent deployment.
What is the evaluation plan — across all four levels?
The Agency Fund's GenAI Evaluation Playbook is the most useful framework in circulation right now. It maps four levels of evaluation:
- Level 1. Does the AI system perform as intended? (Technical performance.)
- Level 2. Does the product engage and retain users?
- Level 3. Does it change user knowledge and behaviour?
- Level 4. Do users with access to the product actually improve development outcomes?
Most AI proposals include Level 1 evaluation. Very few include Levels 3 and 4. But Levels 3 and 4 are the only ones that matter for social impact. Insist on evaluation frameworks — evals, in the technical vocabulary — being defined at the proposal stage, not retrofitted at the report stage.
An evaluation plan that includes all four levels, with specific metrics for each, a timeline, and an independent evaluation component for Level 4.
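One way to make this check concrete is to treat the evaluation plan as structured data and list what is missing. The sketch below is an illustration under stated assumptions: `LevelPlan`, `find_gaps`, and the sample plan are hypothetical names and values, and the level descriptions are paraphrased from the list above rather than quoted from the Playbook.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The four levels, paraphrased from the list above.
LEVELS = {
    1: "technical performance",
    2: "user engagement and retention",
    3: "change in user knowledge and behaviour",
    4: "improvement in development outcomes",
}


@dataclass
class LevelPlan:
    metrics: List[str] = field(default_factory=list)
    timeline_months: Optional[int] = None
    independent_evaluator: bool = False


def find_gaps(plan: Dict[int, LevelPlan]) -> List[str]:
    """Return a readable list of what the evaluation plan is missing."""
    gaps = []
    for level, name in LEVELS.items():
        entry = plan.get(level)
        if entry is None or not entry.metrics:
            gaps.append(f"Level {level} ({name}): no specific metrics")
            continue
        if entry.timeline_months is None:
            gaps.append(f"Level {level} ({name}): no timeline")
        if level == 4 and not entry.independent_evaluator:
            gaps.append("Level 4: no independent evaluation component")
    return gaps


# Hypothetical proposal that only promises engagement tracking.
weak_plan = {
    2: LevelPlan(metrics=["monthly active users", "user feedback"],
                 timeline_months=12),
}
for gap in find_gaps(weak_plan):
    print(gap)
```

Run against the sample plan, the function reports gaps at Levels 1, 3, and 4. The same checklist works on paper; the point is that each level needs its own metrics and timeline, not a shared sentence.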
"We will track engagement metrics and user feedback." That is Level 2 at best.
What is the government adoption pathway — and have you had that conversation yet?
In the Indian social sector, and across most of the Global South, most AI interventions achieve impact at scale only if they are eventually adopted into government delivery systems. But government adoption requires a very different architecture from a nonprofit-run pilot. Data sovereignty compliance. Procurement-compatible infrastructure. Government IT integration. A transfer plan with conditions and a timeline.
Organisations that treat government adoption as an afterthought face painful renegotiations or quiet shutdowns when the grant cycle ends. Organisations that design for it from day one move faster and have more durable impact. The Indian Digital Public Goods stack — Sunbird, ABHA under ABDM, the ONEST registry — has made this design discipline easier to adopt than at any previous moment. An intervention designed to be DPG-compatible from day one inherits national reach. An intervention designed in isolation has to build it.
A named government counterpart who has been in the conversation. A clear articulation of what government adoption requires technically and politically. A data architecture designed for transfer, not just for pilot operation. Where appropriate, DPG compatibility from day one.
"We will figure out government adoption once we have proven the model." That is almost always too late.
What does failure look like — and what is the redress mechanism?
AI systems in social sector contexts fail in specific ways that are different from conventional programme failures. They produce incorrect outputs. They perform differently across demographic groups — often worst for the populations the programme exists to serve. They fail silently — continuing to operate while producing wrong answers that no one catches. And when they fail in high-stakes contexts — developmental flagging for children, medical triage support, citizen feedback systems used to allocate resources — the harm is real and often invisible.
The funder who does not ask about failure modes is implicitly accepting that the grantee has not thought about them either.
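A single headline accuracy figure can hide exactly the group-level differences described above. The sketch below uses synthetic records, invented purely for illustration, to show an overall accuracy of 80% that splits into 100% for one geography and 50% for another. The field names (`geography`, `gender`, `predicted`, `actual`) and the helper functions are assumptions, not part of any particular evaluation toolkit.

```python
from collections import defaultdict


def accuracy(records):
    """Share of records where the prediction matches the actual outcome."""
    return sum(r["predicted"] == r["actual"] for r in records) / len(records)


def accuracy_by_group(records, key):
    """Accuracy computed separately for each value of the given field."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {group: accuracy(members) for group, members in groups.items()}


def make_rows(n, geography, gender, correct):
    """Build n synthetic records; 'correct' sets whether predicted matches actual."""
    return [{"geography": geography, "gender": gender,
             "predicted": 1, "actual": 1 if correct else 0}] * n


# Synthetic data, invented for illustration only.
records = (
    make_rows(3, "urban", "female", True)
    + make_rows(3, "urban", "male", True)
    + make_rows(2, "rural", "male", True)
    + make_rows(2, "rural", "female", False)
)

print(accuracy(records))                        # 0.8
print(accuracy_by_group(records, "geography"))  # {'urban': 1.0, 'rural': 0.5}
print(accuracy_by_group(records, "gender"))     # {'female': 0.6, 'male': 1.0}
```

In this toy data every error falls on rural women, a pattern that neither the overall figure nor either single breakdown fully reveals. Asking for results disaggregated by more than one dimension is how a funder surfaces it.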
Specific failure scenarios identified and named. Human-in-the-loop checkpoints for any system making consequential decisions, with the design discipline that the failure mode is "ask a human" rather than "guess." A redress process for people affected by incorrect outputs. Regular performance auditing in deployment context, not just at launch.
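The "ask a human rather than guess" discipline can be sketched as a routing rule. This is a minimal illustration, assuming the system exposes a confidence score between 0 and 1 for each output; the `Routing` class, the `route` function, the labels, and the 0.9 threshold are all hypothetical. A raw model score is not necessarily well calibrated, so any real threshold would need to be set through testing in the deployment context.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Routing:
    decided_by: str       # "model" or "human"
    label: Optional[str]  # None until a human reviewer decides
    reason: str


def route(label: str, confidence: float, threshold: float = 0.9) -> Routing:
    """Accept the model's output only when it is confident; otherwise escalate."""
    if not 0.0 <= confidence <= 1.0:
        # A malformed score (including NaN) is escalated, never guessed at.
        return Routing("human", None, "invalid confidence score")
    if confidence < threshold:
        return Routing("human", None,
                       f"confidence {confidence:.2f} below {threshold:.2f}")
    return Routing("model", label, "confidence at or above threshold")


# Hypothetical outputs from a screening tool.
print(route("no concern", 0.97))
print(route("refer for assessment", 0.55))
```

The design choice worth probing in a proposal is the default: when the system is unsure or receives something unexpected, the case goes to a person, and the volume of escalated cases is itself a metric to audit.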
"The model is accurate to X%." Accuracy on a benchmark is not the same as safety in deployment. Push for deployment-context testing, disaggregated by gender, geography, and any other relevant population segment.
A note on your own readiness
Google.org's Maggie Johnson makes a point most funders skip past: your internal AI capability is a prerequisite for supporting grantees' AI capability.
You cannot evaluate what you do not understand. You cannot guide a portfolio through a technology transition you have not navigated yourself. You cannot credibly ask portfolio organisations to embed AI into their core workflows if you have not done it in your own — and the same applies to data governance, evaluation discipline, and failure-mode planning.
Before asking these questions of grantees, it is worth asking them of your own organisation.
The cost of getting this right is modest: usually one technical hire, one rethink of the grant lifecycle, and one honest audit of internal practices. The cost of getting it wrong is paid downstream, by the parents waiting for an autism evaluation, the children whose teachers were promised support that never arrived, and the citizens whose voice data sat in a database no one could maintain.
If you are a funder reading this and you cannot answer these five questions in the context of your own portfolio, that is the work to do before the next cheque.
Further reading
- The Agency Fund, GenAI Evaluation Playbook.
- Google.org, AI Readiness Playbook for Funders.
- Kevin Starr (Mulago Foundation), Scale Really Matters (Stanford Social Innovation Review).
Tilted Ground companion frameworks: Three Kinds of Scale, AI Readiness for Funders, and Funding AI in the Global South: Six Case Studies.