Claude Code Benchmark: Dynamic Languages are Faster and Cheaper

by Priyanka Patel

For developers integrating artificial intelligence into their daily workflows, the choice of programming language may now influence not just the runtime performance of the application, but the actual cost and speed of writing the code. A comprehensive new Claude Code programming language benchmark reveals that dynamic languages are significantly more efficient—and cheaper—to generate using AI than their statically typed counterparts.

The study, conducted by Ruby committer Yusuke Endoh, analyzed how Claude Code (powered by Opus 4.6) handled the implementation of a simplified version of Git across 13 different languages. Over 600 individual runs, the data showed a consistent trend: dynamic languages like Ruby, Python, and JavaScript were the fastest and most stable to produce, while statically typed languages were between 1.4 and 2.6 times slower and more expensive to generate.

As a former software engineer, I’ve seen the industry lean heavily toward static typing for the sake of maintainability and safety in large-scale systems. Still, this benchmark suggests a “type tax” when it comes to LLM-assisted development. The overhead isn’t just about the extra characters in a type annotation; it appears to be a cognitive burden on the model itself, increasing the number of “thinking tokens” required to reason through type constraints before producing a final answer.

The Efficiency Gap: Dynamic vs. Static

The experiment was designed to isolate language-level differences by removing the variable of external libraries. To achieve this, Endoh used a custom hash algorithm instead of the industry-standard SHA-256, ensuring the AI couldn’t simply rely on varying levels of library support across different ecosystems. The task was split into two phases: an initial build implementing basic Git functions (init, add, commit, and log), followed by an expansion to include more complex features like status, diff, checkout, and reset.
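To make the shape of the task concrete, here is a minimal, illustrative sketch in Ruby of a content-addressed object store of the kind a toy Git needs. This is not Endoh's benchmark code: the class name `MiniStore` and the FNV-1a-style digest are assumptions standing in for the study's bespoke hash algorithm, chosen only to show why no external library is required.

```ruby
# Illustrative sketch (not the actual benchmark implementation): a toy
# content-addressed store, as needed for "git add"-style blob storage.
class MiniStore
  def initialize
    @objects = {} # digest => content
  end

  # Custom digest standing in for the benchmark's bespoke algorithm:
  # a simple 64-bit FNV-1a-style hash, deliberately not SHA-256.
  def digest(data)
    h = 0xcbf29ce484222325
    data.each_byte do |b|
      h ^= b
      h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    end
    h.to_s(16)
  end

  # "add": store a blob under its content hash and return the id.
  def add(content)
    id = digest(content)
    @objects[id] = content
    id
  end

  # Look up a stored blob by id.
  def fetch(id)
    @objects[id]
  end
end

store = MiniStore.new
id = store.add("hello, benchmark")
puts id
puts store.fetch(id) == "hello, benchmark"
```

Because the digest is implemented by hand, every language starts from the same baseline, regardless of how rich its crypto ecosystem is.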

The results placed Ruby, Python, and JavaScript in a tier of their own regarding speed and cost-effectiveness. Ruby emerged as the most efficient, averaging $0.36 per run with a completion time of 73.1 seconds. Python and JavaScript followed closely behind, both staying under 82 seconds and costing less than $0.40 per iteration.

Average Cost and Speed per Implementation Run
Language     Avg. Cost (USD)   Avg. Time (Seconds)   Stability
Ruby         $0.36             73.1                  High (0 failures)
Python       $0.38             74.6                  High (0 failures)
JavaScript   $0.39             81.1                  High (0 failures)
Go           $0.50             101.6                 Moderate
Rust         $0.54             Variable              Low (test failures)
C            $0.74             Variable              Moderate

Beyond these top three, both cost and variance rose sharply. Go averaged $0.50 per run but exhibited a standard deviation of 37 seconds, suggesting less predictability in how the AI approached the language. C was the most expensive mainstream language tested at $0.74 per run, largely because the AI generated 517 lines of code—more than double the 219 lines required for the Ruby implementation.

Quantifying the ‘Type Tax’

One of the most revealing aspects of the research is how the introduction of strict type checking affects AI performance. When the benchmark added mypy strict checking to Python, the generation process slowed down by 1.6 to 1.7 times. The penalty was even more severe for Ruby; adding Steep type checking made the process 2.0 to 3.2 times slower than plain Ruby.
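As an illustration of where that extra work comes from, here is a small, hypothetical example of the same Ruby method with and without a Steep-checked signature. The method name `blob_header` and the RBS signature are my own illustrative choices, not taken from the study; they show the kind of additional constraint the model must reason about.

```ruby
# Plain Ruby: the model only has to produce working behavior.
def blob_header(type, size)
  "#{type} #{size}\0"
end

# With Steep, a separate .rbs file would additionally declare, e.g.:
#
#   def blob_header: (String type, Integer size) -> String
#
# The checker then rejects calls like blob_header(:blob, "12"),
# so the model must also keep every call site consistent with the
# declared types — extra constraints to satisfy before the answer
# type-checks, on top of making the tests pass.

puts blob_header("blob", 12).bytes.last # trailing NUL byte
```

The annotations themselves are only a few extra tokens; the benchmark's finding is that the reasoning needed to satisfy them is where the cost accumulates.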

A similar trend appeared in the web ecosystem. TypeScript was notably more expensive to generate than JavaScript, averaging $0.62 compared to $0.39, despite the two producing similar total line counts. This suggests that the cost increase is not driven by the volume of text generated, but by the computational effort the model spends on reasoning about type constraints during the “thinking” phase.

For engineering teams, this creates a trade-off between the speed of the AI-assisted “flow” and the long-term rigor of the codebase. While a 30-second difference between a Ruby and a Go run might seem negligible in isolation, those gaps compound during iterative prototyping, where a developer might prompt the AI dozens of times per hour.
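A back-of-the-envelope calculation makes the compounding concrete. Using the Ruby and Go averages from the table above, and an assumed iteration rate of 30 prompts per hour (an illustrative figure, not from the study):

```ruby
# Per-run averages from the benchmark table above.
ruby_cost, ruby_time = 0.36, 73.1   # USD, seconds
go_cost, go_time     = 0.50, 101.6

prompts_per_hour = 30 # hypothetical iteration rate

# How the per-run gaps compound over an hour of prompting.
extra_cost = (go_cost - ruby_cost) * prompts_per_hour
extra_wait = (go_time - ruby_time) * prompts_per_hour / 60.0 # minutes

puts format("Extra cost per hour: $%.2f", extra_cost)
puts format("Extra waiting per hour: %.1f minutes", extra_wait)
```

At that pace, the per-run gap grows to a few dollars and roughly a quarter hour of waiting per hour of work — small per prompt, but noticeable across a day of iterative prototyping.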

Prototyping Scale and Technical Constraints

The benchmark has not been without its critics. Because the generated programs were roughly 200 lines of code, some developers argue the results apply only to small-scale prototyping. In larger, enterprise-grade codebases, the safety nets provided by static typing often reduce the time spent debugging, which could offset the slower generation speed.

There were also rare instances of instability in stricter languages. Out of 600 total runs, only three produced failures: two in Rust and one in Haskell. In one particular Rust failure, the AI hallucinated, claiming the tests themselves were incorrect despite the fact that all other Rust trials had passed successfully.

Endoh has been transparent about the limitations of the study, noting his own role as a Ruby committer and acknowledging that the experiment—supported by the Anthropic Claude for Open Source Program—measured generation speed and cost rather than the ultimate quality or maintainability of the code.

What This Means for AI Workflows

As AI agents move from simple autocomplete tools to autonomous coders capable of managing entire repositories, the “cost of reasoning” becomes a primary operational concern. The findings suggest that for rapid iteration and early-stage prototyping, dynamic languages provide a smoother, more cost-effective experience.

However, the industry is likely to see a shift as models evolve. If future iterations of LLMs can reduce generation times to sub-second levels, the speed gap between Ruby and Rust may become irrelevant. Until then, the choice of language remains a strategic decision involving a balance of developer velocity, token expenditure, and software reliability.

The complete dataset, including execution logs and the full source code for all 13 languages, is available for public review in the benchmark repository on GitHub.

We will continue to monitor how upcoming model updates from Anthropic and other AI labs impact these efficiency metrics. Stay tuned for further updates as larger-scale benchmarks are developed to test these theories on enterprise-sized projects.

Do you prioritize generation speed or type safety in your AI coding workflow? Share your experiences in the comments below.
