
Stop Grading AI Output. Start Grading the Thinking Behind It.

By Nathan Critchett · December 24, 2025

Last semester, a sophomore turned in the best essay his English teacher had seen in twenty years. Clear thesis. Elegant transitions. Evidence woven into argument like thread into fabric. (This is a composite scenario drawn from multiple educators' experiences.)

She gave it an A. Then she asked him to explain his second paragraph.

He couldn't.

Not "couldn't articulate it well." Couldn't. He didn't know what the paragraph said. He'd read it, sure, the way you read a menu. He consumed the words. He didn't construct the thinking. He typed a prompt, received an output, changed the font, and submitted.

This is not a story about cheating. This is a story about a grading system that has become structurally incapable of measuring what matters.

The Quiet Crisis That Looks Like Productivity

Walk through any school right now and you'll see something that looks like progress. Essays come in polished. Lab reports are immaculate. Presentations have better slide design than most corporate decks.

Everything looks better. And a growing number of students can't explain their own work.

This is the crisis that doesn't trigger alarms. There's no spike in failing grades. No dramatic behavioral red flag. Just a slow, steady hollowing: students who produce impressive artifacts and build zero cognitive capacity in the process.

The outputs have never looked better. The thinking has never been thinner.

The Dependency Trap

As we detail in our whitepaper Cognitive Offloading: How AI Is Simultaneously Enhancing and Eroding Student Thinking, the core risk isn't that students use AI. It's that AI use without cognitive demand atrophies the thinking muscles students need most.

Here's the loop, and once you see it, you'll see it everywhere:

Student uses AI. Output looks professional. Teacher grades the product. Student gets an A. Student learns that AI equals good grades. Cognitive muscles never engage. Repeat.

This isn't a bug in student behavior. It's a feature of the system. We built a grading architecture that rewards products. AI produces better products than most humans. Students (who are rational actors optimizing for the incentives we set) did exactly what the system told them to do.

The student isn't cheating. The student is optimizing. The system is the problem.

Every teacher who's caught a student "using AI" has actually caught a measurement failure. The assignment measured the wrong thing. The rubric scored the wrong thing. The grade reflected the wrong thing. And the student, following the incentives perfectly, delivered exactly what was asked for.

What You're Actually Grading

Here's the question that reframes everything: if the AI can produce the output, what does grading the output tell you?

Almost nothing. It tells you the AI works.

A well-written essay proves that GPT-4 can write well. A correct lab report proves that Claude can format data. A polished presentation proves that an LLM can organize information coherently. None of these prove that a student can think.

And thinking is supposed to be the point.

The uncomfortable truth is that most assessment rubrics, the ones teachers have used for years, the ones that predate AI entirely, were always proxies. We graded the essay as a proxy for the thinking behind it. We graded the lab report as a proxy for scientific reasoning. We graded the presentation as a proxy for synthesis and communication.

The proxy worked when students had to do the cognitive work to produce the artifact. AI broke the proxy. The artifact no longer requires the thinking. So the artifact no longer measures the thinking.

From Product to Architecture

Assessment has to shift. Not eventually. Now. From grading the product to grading the architecture: the quality of the thinking that directed the AI and evaluated its results. For the full assessment framework, see our whitepaper The Centaur Classroom: Designing Human-AI Learning That Builds Thinkers, Not Dependents.

What does that mean in practice?

Old rubric: "Is this essay well-written?" New rubric: "Can this student explain why they structured the argument this way? What alternatives did they consider? Why did they reject them?"

Old rubric: "Is the answer correct?" New rubric: "Can this student identify when AI-generated answers are wrong? Can they explain how they'd check?"

Old rubric: "Did they complete the assignment?" New rubric: "Did they demonstrate judgment the AI couldn't provide?"

The shift is from evaluating outputs to evaluating decisions. Why this thesis and not that one? Why this evidence and not that one? Where did the AI get it wrong, and how do you know? What did you change, and what was your reasoning?

These questions can't be answered by prompting an AI. They require the student to have actually done the cognitive work, to have thought about thinking. To have made choices and understood why they made them.

The Cheating Conversation Dissolves

Here's the part that should make every administrator's life easier: this framework eliminates the cheating problem entirely.

Not by catching cheaters. Not by deploying detection software. Not by threatening consequences. It eliminates cheating by making the concept irrelevant.

When "use AI well" IS the assignment, there's nothing to cheat on. The adversarial dynamic (students sneaking AI use past integrity systems) disappears. You can't secretly use AI on an assignment that requires you to use AI. You can't shortcut the thinking when the thinking is the deliverable.

Every dollar spent on AI detection software. Every hour spent on academic integrity hearings. Every ounce of trust eroded between teachers and students over "did you or didn't you." All of it becomes unnecessary when you stop grading outputs and start grading architecture.

The detection arms race (AI detectors vs. AI humanizers vs. better detectors vs. better humanizers) is an infinite loop with no winner. Step off the treadmill. Change what you measure.

This Is Harder

Let's be honest about the cost: grading architecture is harder than grading products.

Evaluating a finished essay takes minutes. Evaluating the thinking behind it takes a conversation. It requires teachers who can assess reasoning processes, not just final artifacts. It requires rubrics that capture decision quality, not just output quality. It requires time: more time per student, more cognitive effort per assessment.

This is real. It's not a reason to avoid the shift. It's a reason to support teachers through it.

A teacher who can evaluate whether a student's editorial judgment is developing, whether they're getting better at spotting AI errors, structuring arguments, and choosing evidence, that teacher is doing something no AI can do. They're assessing the one thing that matters: the quality of a human mind's engagement with a problem.

That's teaching. The rest is processing.

One Thing This Week

You don't need a committee. You don't need board approval. You don't need a new platform.

Pick ONE assessment. Any subject, any grade. And redesign the rubric around four questions:

  1. Prompting quality: Did the student give the AI good instructions, or vague ones? Can they explain why they prompted the way they did?
  2. Error detection: Can the student identify where the AI got something wrong, made something up, or produced shallow reasoning?
  3. Editorial judgment: Did the student make deliberate changes to the AI output? Can they defend those changes?
  4. Reasoning about choices: Can the student explain the decisions they made: what they kept, what they cut, what they restructured, and why?

Grade those four things. Weight them heavily. Make the AI output itself worth almost nothing.
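The weighting idea above can be sketched as a few lines of code. This is a minimal illustration, not a prescribed rubric: the criterion names, weights, and scoring scale are all hypothetical numbers chosen to show one thing, that a student with flawless AI output but weak thinking still scores poorly.

```python
# Hypothetical rubric weights: the four thinking criteria dominate,
# and the polish of the AI-generated artifact itself counts for almost nothing.
WEIGHTS = {
    "prompting_quality": 0.25,
    "error_detection": 0.25,
    "editorial_judgment": 0.25,
    "reasoning_about_choices": 0.20,
    "output_polish": 0.05,  # the AI output itself
}

def rubric_score(marks: dict[str, float]) -> float:
    """Combine per-criterion marks (each 0-100) into one weighted grade."""
    return sum(WEIGHTS[criterion] * mark for criterion, mark in marks.items())

# A perfectly polished artifact with thin thinking behind it:
polished_but_hollow = {
    "prompting_quality": 30,
    "error_detection": 20,
    "editorial_judgment": 25,
    "reasoning_about_choices": 20,
    "output_polish": 100,
}
print(rubric_score(polished_but_hollow))  # 27.75 out of 100
```

Under these example weights, a perfect output score moves the grade by only five points, so copying the machine's work stops being a winning strategy.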

Then watch what happens. Students will start reading AI outputs critically instead of copying them. They'll start thinking about structure instead of accepting whatever the machine produces. They'll start developing the exact cognitive muscles that AI threatens to atrophy.

Not because you banned AI. Because you made thinking the assignment.

The cheating conversation is a symptom. The grading system is the disease. Fix the system and the symptom disappears.
