Tech News

Can AI tools assess coding assignments?


Credit: Creative Images Lab/Getty

One evening, my partner Boyan Li sat at the kitchen table marking student submissions for a coding course he was teaching as part of his PhD at Harvard Medical School in Boston, Massachusetts. The assignment required students to implement a computational-biology algorithm on a given data set. Each submission demanded more than a quick check. He ran the code, examined the output and traced the logic line by line. Some submissions were clearly correct; others were clearly wrong. But many fell into a grey zone: they were partly right, but uneven in their execution or reasoning. These were the hardest to assess, and the most time-consuming.

As a higher-education researcher, I watched this process with professional interest. What seemed to be a purely technical task — running code and checking outputs — was revealed to be deeply interpretative. Assessing coding assignments involves deciding what counts as understanding, what counts as error and how much variation is acceptable. This resonated with my own research on student learning and development, which views educational activities as inherently relational: even something as seemingly mechanical as marking becomes a dialogue between the examiner and the learner.

Seeing this interplay of technical skill and human judgement led me to ask: can generative artificial intelligence (genAI) assist in assessing without erasing the interpretative work that makes it meaningful?

Experimenting with AI

Coding assignments seem to be especially well-suited to AI tools. Unlike essays, computer code follows clear structures and strict rules, making it easier to evaluate. My partner tested this idea using OpenAI’s ChatGPT 5.4. He gave it the assignment prompt alongside the reference solution and asked it to assess a student’s code for accuracy. In practice, ChatGPT mainly compared the student’s code with the reference solution and struggled to recognize valid alternative approaches. It often focused on minor issues — such as lower computational efficiency — rather than evaluating whether the student understood the underlying algorithm, which was the main learning objective.

Observing my partner’s frustration, I realized that ChatGPT was missing important context. I suggested that he provide information about common student mistakes and clarify which minor issues could be ignored.


His existing workflow proved especially helpful here: before marking, he writes his own code and then looks at the instructor’s reference solution. This helps him to anticipate what students might struggle with, which are often the same parts that he initially makes mistakes on. Patterns also emerged during meetings with students. Students often came to him with similar questions, and some brought AI-generated answers that they did not fully understand. These recurring points of confusion revealed key bottlenecks in the process of correctly implementing the whole algorithm — insights that would have been difficult to identify from the reference solution alone.
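In practice, this kind of contextual information can be bundled into the prompt given to the AI tool. The sketch below shows one way that might look; the function name, section labels and example inputs are illustrative assumptions, not details from the course described here.

```python
def build_grading_prompt(assignment, reference, student_code,
                         common_mistakes, ignorable_issues):
    """Assemble a grading prompt that gives the model the context it needs.

    Beyond the assignment and reference solution, it lists recurring
    student mistakes to watch for and minor issues to disregard.
    """
    sections = [
        "You are assessing a student's code for conceptual understanding "
        "of the algorithm, not for surface similarity to the reference.",
        "Assignment:\n" + assignment,
        "Reference solution:\n" + reference,
        "Common student mistakes to look for:\n"
        + "\n".join("- " + m for m in common_mistakes),
        "Minor issues to ignore:\n"
        + "\n".join("- " + i for i in ignorable_issues),
        "Student submission:\n" + student_code,
    ]
    return "\n\n".join(sections)
```

A prompt built this way steers the model towards the learning objective (does the student understand the algorithm?) and away from penalizing valid alternative approaches or minor inefficiencies.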

Integrating these insights improved the AI tool’s usefulness. It could suggest further test cases, probing whether a student’s solution passed the marking-rubric checkpoints but failed on ‘edge cases’ — in which, for instance, an algorithm might be given extreme (but valid) input values. For one assignment, students implemented an algorithm to align a genome sequence. One student submitted lengthy, hard-to-read code that passed all three rubric checkpoints. ChatGPT, however, identified a flaw in the program’s logic and, after extended reasoning, proposed an edge case in which it would yield incorrect results. Without AI, this mistake might have gone unnoticed or required hours of manual inspection.
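To illustrate what an edge-case check can look like, here is a toy global-alignment scorer using standard Needleman–Wunsch scoring; this is a generic sketch, not the course's actual assignment or rubric. Extreme but valid inputs, such as an empty sequence, are exactly the cases a rubric's ordinary checkpoints can miss.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score a global alignment of sequences a and b (Needleman-Wunsch).

    Uses dynamic programming with a single rolling row to keep memory
    proportional to the shorter dimension.
    """
    m, n = len(a), len(b)
    # First row: aligning an empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [i * gap] + [0] * n
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[n]

# Edge cases a marker might probe: empty inputs must still score sensibly.
assert align_score("", "") == 0          # nothing to align
assert align_score("", "ACG") == -6      # pure gap penalty, 3 * -2
assert align_score("ACGT", "ACGT") == 4  # perfect match, 4 * +1
```

A submission that handles typical sequences but crashes or returns the wrong score on an empty input would pass ordinary checkpoints yet fail checks like these.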
