Testing MiniMax M2.7 via API on three real ML and coding workflows

I recently got access to some MiniMax M2.7 API credits, so I decided to plug this model directly into Claude Code and run it on three workflows I do regularly. The same tasks were run using Claude Opus 4.7 as the comparison baseline.

The three workflows: scaffolding an entry for an active Kaggle competition, drafting and auditing knowledge-base notes for my Obsidian vault, and updating an old PyTorch project that became outdated. I wanted to find out how well M2.7 works inside an agentic loop when the task has clear boundaries. The results were consistent across the three runs: M2.7 was useful when the constraints were explicit, and the output format was concrete. It stumbled when important context was left implicit, though some of the same gaps appeared with Opus 4.7 as well.

For the more open-ended cases, I would still keep a human review pass in the loop.

Setup

I added a claude-mm command that points Claude Code at the MiniMax API and ran M2.7 with thinking set to max in the CC interface. I ran on MiniMax’s Plus tier (High-Speed, $40/month), where the context window and per-day throughput no longer became bottlenecks for multi-step agentic work.

claude-mm () { ANTHROPIC_BASE_URL = "https://api.minimax.io/anthropic" \ ANTHROPIC_AUTH_TOKEN = " $MINIMAX_API_KEY " \ ANTHROPIC_MODEL = "MiniMax-M2.7" \ ANTHROPIC_DEFAULT_SONNET_MODEL = "MiniMax-M2.7" \ ANTHROPIC_DEFAULT_OPUS_MODEL = "MiniMax-M2.7" \ ANTHROPIC_DEFAULT_HAIKU_MODEL = "MiniMax-M2.7" \ ANTHROPIC_SMALL_FAST_MODEL = "MiniMax-M2.7" \ API_TIMEOUT_MS = "3000000" \ CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = "1" \ claude " $@ " }

In agentic work, the harness can be as important as the model itself. Most of the failures I describe below had similar reasons: the prompt did not explicitly state a constraint the task depended on, and the model filled the gap with a plausible default. In practice, model quality and harness design are hard to separate. A stronger model may infer missing constraints; a better harness may make those constraints explicit. I treated this as a workflow test, not a pure model benchmark.

Refactoring an old PyTorch project

The first workflow was a refactor: my pytorch_tempest repo is a framework for training neural nets using Hydra + PyTorch Lightning. I wanted to update dependencies, modernize the tooling, and clean up the code issues that had accumulated over time. The merged result is PR: refactoring old code and updating dependencies.

... continue reading