The promise of AI-driven software engineering has always been the ability to offload the “grunt work” of coding to an intelligent agent, allowing humans to focus on high-level architecture. However, a stark warning from one of the industry’s leading AI practitioners suggests that this trust may be misplaced for the most demanding tasks.
Stella Laurenzo, AI director at AMD, has raised significant concerns regarding the reliability of Anthropic’s Claude Code, claiming the tool cannot be trusted to perform complex engineering tasks. Her assessment follows months of internal testing and a perceived decline in the tool’s capabilities following an update in February 2026.
Laurenzo’s critique is not based on anecdotal frustration alone but on a rigorous data-driven analysis. The AMD team examined more than 6,800 coding sessions, which included nearly 235,000 tool calls and approximately 18,000 reasoning blocks. The findings suggest a measurable dip in performance that has affected senior engineers across her team.
As a former software engineer, I have seen this pattern before: the “honeymoon phase” of a new tool where early wins mask deep-seated reliability issues that only emerge during complex, multi-step engineering projects. For a senior developer, a tool that is 90% accurate is often more dangerous than one that is 0% accurate, because the remaining 10% of errors can be subtly catastrophic.
The rise of ‘stop-hook’ violations
One of the most concerning trends identified in the AMD analysis is the increase in “stop-hook violations.” These occur when the AI prematurely gives up on a task, dodges responsibility for a specific fix, or requests unnecessary permissions to proceed rather than solving the problem at hand.
According to Laurenzo, these violations were virtually non-existent before the February update but surged to approximately 10 per day shortly after it. This suggests a shift in the model’s behavioral boundaries, with the AI becoming more prone to avoidance than resolution.
Beyond these violations, Laurenzo observed a fundamental shift in how the tool approaches problems. She noted a transition from “research-first” behavior—where the AI analyzes the codebase and plans its approach—to “edit-first” behavior. This aggressive approach to editing often results in lower-quality code and a failure to adhere to established project conventions, reducing the overall reliability of long-duration coding sessions.
Reasoning as a ‘load-bearing’ structure
The technical core of the dispute centers on “thinking redaction,” specifically a setting identified as redact-thinking-2026-02-12. Laurenzo argues there is a strong correlation between the introduction of this redaction and the decline in performance on complex tasks.
In advanced software engineering, extended reasoning is often “load-bearing.” This means the internal “chain of thought” the model undergoes is not just a byproduct of the process, but the very mechanism that allows it to handle complex dependencies and edge cases. When this reasoning is curtailed or altered, the final output suffers.
Anthropic has responded to these findings, with a representative named Boris clarifying that the redaction setting is designed to hide the reasoning process from the user interface for a cleaner experience, rather than actually reducing the amount of reasoning the model performs.
Adaptive thinking and the ‘effort’ trade-off
To address these concerns, Anthropic highlighted the introduction of “adaptive thinking” within Opus 4.6. This system allows the model to dynamically decide how long it needs to “think” about a problem to balance performance with efficiency.
However, this efficiency comes with a trade-off. The current default for users is set to “medium effort” (represented as effort=85). While this is sufficient for many tasks, it may not be enough for the complex engineering work Laurenzo describes.
| Setting | Primary Goal | Trade-off | Target User |
|---|---|---|---|
| Medium (Default) | Efficiency and speed | Lower intelligence on complex tasks | General users |
| High | Maximum intelligence | Increased tokens and latency | Power users / Engineers |
Anthropic suggests that users who require higher intelligence for complex tasks should manually set their configuration to effort=high via the /effort command or within their settings.json file. The company has also indicated it is testing higher effort defaults for Teams and Enterprise users, acknowledging that these professional cohorts are often willing to accept higher latency and token costs in exchange for extended thinking and better accuracy.
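For readers who want to apply this override, the change would live in the settings.json file (or be issued as the /effort slash command inside a session). The fragment below is a minimal sketch: the article confirms the effort=high value, but the exact key name and file location shown here are assumptions and may differ in Claude Code’s actual configuration schema.

```json
{
  "effort": "high"
}
```

The trade-off described in the table applies: a persistent high setting spends more tokens and adds latency on every task, so teams doing mostly routine edits may prefer switching per-session via the slash command instead.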
The broader impact on AI coding assistants
This friction between AMD and Anthropic underscores a growing tension in the AI industry: the struggle to balance operational costs (tokens and latency) with the uncompromising requirements of professional software engineering. For a hobbyist, a slightly suboptimal code snippet is a minor inconvenience; for a company like AMD, it is a productivity drain and a potential risk to system stability.
The incident highlights the necessity for “observability” in AI tools. Without the scale of data Laurenzo’s team collected—thousands of sessions and hundreds of thousands of calls—this performance dip might have been dismissed as a series of isolated anecdotes rather than a systemic regression.
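The kind of aggregation Laurenzo’s team performed can be approximated with a short script once session events are exported. This is a hypothetical sketch: the event tuples, the "stop_hook_violation" label, and the log shape are invented for illustration and do not reflect Claude Code’s actual log schema.

```python
from collections import Counter
from datetime import date

# Hypothetical session events as (calendar_day, event_type) pairs.
# In practice these would be parsed from exported session logs.
events = [
    (date(2026, 2, 10), "tool_call"),
    (date(2026, 2, 14), "stop_hook_violation"),
    (date(2026, 2, 14), "stop_hook_violation"),
    (date(2026, 2, 15), "stop_hook_violation"),
]

def violations_per_day(events):
    """Count assumed 'stop_hook_violation' events per calendar day."""
    counts = Counter(day for day, kind in events
                     if kind == "stop_hook_violation")
    return dict(counts)

print(violations_per_day(events))
# → {date(2026, 2, 14): 2, date(2026, 2, 15): 1}
```

Plotting these daily counts over time is what turns a vague sense of “the tool got worse” into an observable regression with a start date.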
As AI agents move from simple autocomplete tools to autonomous collaborators capable of managing entire repositories, the definition of “trust” will shift. It will no longer be about whether the tool can write a function, but whether it can maintain the integrity of a complex system over a long-term engagement.
Anthropic has expressed appreciation for the depth of Laurenzo’s analysis, suggesting that this kind of high-level feedback is essential for refining their models. The next step for professional users will be monitoring whether the “high effort” settings and the adaptive thinking of Opus 4.6 can truly restore the reliability required for enterprise-grade engineering.
We would love to hear from other engineers using Claude Code—have you noticed a shift in performance or reliability in your recent sessions? Share your experiences in the comments below.
