AI tools - SKEPSIX.

Claude Frustrating Users with Mid-Work Cutoffs: The ‘Every Time’ Problem

June 7, 2026 by skepsix

Skep hits Claude's usage limit mid-session — error message reads "SESSION TERMINATED" as incomplete code hangs on screen

TL;DR: Claude’s usage limits hit hardest in the middle of complex coding, writing, or analysis sessions, breaking flow and wasting tokens. Hallucination rates have fallen, but continuity remains unmeasured. The community builds elaborate workarounds while Anthropic fixes everything except the stop-start experience.

The limit doesn’t warn you. It just stops you.

The complaint surfaces repeatedly: you ask Claude to implement a small bug fix or run a quick analysis, and halfway through the task it runs out of quota. Context vanishes. The thread you built with model and tool is severed, and the error message feels like a rebuke for working too long. This isn’t a hallucination problem; Claude Opus 4.7 now hallucinates on only 1.0 to 2.5% of summaries, a dramatic improvement in factual reliability. But a 1% hallucination rate doesn’t matter when the model cannot complete the summary because you hit a usage wall mid-paragraph. A frontier model with benchmark-leading accuracy is useless when it won’t stay connected long enough to deliver.

The Pro plan grants significantly more usage than the free tier, yet the cap remains an opaque, unbendable limit. Users describe hitting it “every time” they attempt a non-trivial session. The frustration isn’t that a limit exists; it’s that the threshold cuts work off at the worst possible moment, without warning and with no mechanism to finish the thought. You lose not just the tokens you already spent, but the time it takes to reconstruct mental state, re-upload files, and re-establish the conversation’s nuance. The model’s intelligence is not in doubt; the service’s rhythm is.

Claude’s rate limits interrupt coding sessions at the worst moments

Anthropic doesn’t publish exact token quotas for Pro users, but the pattern is unmistakable. A session that starts with architectural discussion, then moves to implementation, tends to expire right as the code needs testing or the analysis requires a final integration. The cutoff feels arbitrary because, from the task’s point of view, it is: the system meters tokens over a rolling window, and once you exceed the quota for that window, the model simply stops. There’s no warning, no “you have ten messages left” indicator that adapts to message length. You are coding, you send a follow-up, and the UI tells you to wait.
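To make the rolling-window idea concrete, here is a minimal sketch of how such a meter could behave. Anthropic does not publish its metering logic, so the class name, the quota, and the window length are all assumptions for illustration, not a description of the real system.

```python
from collections import deque


class RollingWindowMeter:
    """Hypothetical token meter: usage counts only while it is inside the window."""

    def __init__(self, quota_tokens, window_seconds):
        self.quota_tokens = quota_tokens
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, tokens) pairs, oldest first

    def _expire(self, now):
        # Drop usage that has aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            self._events.popleft()

    def used(self, now):
        self._expire(now)
        return sum(tokens for _, tokens in self._events)

    def remaining(self, now):
        return max(0, self.quota_tokens - self.used(now))

    def try_spend(self, tokens, now):
        """Record the spend and return True if it fits, otherwise return False."""
        if tokens > self.remaining(now):
            return False
        self._events.append((now, tokens))
        return True
```

Note that `remaining()` is cheap to compute at any moment, which is the point of the complaint: under this kind of scheme, a "tokens left in this window" indicator needs no extra bookkeeping.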

The damage isn’t just the forced pause. A truncated interaction often leaves behind a half-finished artifact. If the model was generating a spreadsheet or a long function and the cap is reached mid-generation, the output is incomplete. Resuming from a summary does not preserve the same fidelity; several users note that the summary itself consumes a comparable number of tokens to the full context, meaning you pay twice for the same session without recovering the lost momentum. This defeats the efficiency that summary-based resumption is supposed to provide.

When a project is complex enough that you rely on Claude to hold multiple files and constraints in its context window, a mid-task cutoff is not just an inconvenience. It forces a full session restart, manual re-injection of the plan, and a prayer that the model’s interpretation of your intentions remains consistent. In collaborative programming, that kind of reset would be considered a failure of the pair-programming arrangement.

Why users feel they cannot trust Claude for real work

The refrain of “every time” reveals a deeper erosion of trust. You stop treating Claude as a reliable partner when you cannot plan a session around a predictable budget. The mental model shifts from “I’ll work with Claude to solve this” to “I’ll try to squeeze what I can before the clock runs out.” That scarcity mindset degrades the quality of the interaction: you rush, you skip exploratory questions, and you avoid having the model double-check its own work.

Scott Alexander recently argued that AI “hallucinations” are better understood as shameless guesses: the model has been trained to predict, and it guesses because there is no penalty for being wrong. Claude’s usage problem has a parallel. The model’s design encourages it to be helpful by generating abundant output, sometimes far more than you requested. A user reports that Claude, given a prompt to analyze numbers, took ten minutes to produce an entire spreadsheet that wasn’t asked for, burning $8.50 in API credits. This is a shameless guess about what you might want, and it devours quota before you can intervene. The model isn’t malicious; it’s executing its default “be helpful” training signal without awareness of your budget constraint.

When the billing or usage cap is metered per token and the model generates verbosely, the user bears the cost of the model’s over-helpfulness. That’s a misalignment of incentives. Anthropic can reduce hallucination and improve reasoning, but if Claude still writes epic treatises when you needed two sentences, the trust problem remains. You can’t rely on a tool that sometimes chooses to spend your finite resource on a guess.

The hidden cost nobody calculates when choosing a Pro plan

The official feature set of Claude Pro lists “5x more usage than free” and “priority access during peak times.” Nowhere does the marketing page quantify the cost of interrupted sessions. That cost is real and accumulates over a month. A developer who uses Claude for daily coding might lose twenty to thirty minutes per disruption re-uploading context and re-establishing the thread. Multiplied across dozens of sessions, that’s hours of unpaid, invisible labor. The subscription price looks cheap, but the time tax is high.

This hidden cost also distorts how users evaluate the model’s intelligence. A model that could solve a problem in three turns if left alone might need six because two of those turns were wasted recovering from a cutoff. The perceived sluggishness or repetitiveness is sometimes not the model’s fault; it’s the byproduct of a fragmented dialogue. Benchmarks that test isolated prompts in a single turn miss this entirely. They report accuracy, but not continuity. In a world where real work spans dozens of exchanges, continuity is a capability metric that matters as much as reasoning.

Moreover, the workarounds themselves introduce additional expense. Users who route mechanical coding tasks to a cheaper model to preserve Claude’s quota are effectively paying two subscriptions to approximate a smooth experience with one. The economic calculus shifts: the effective cost of a “reliable” AI coding assistant is the sum of multiple tools and the time spent orchestrating them. That premium isn’t advertised.

The variable no AI benchmark measures: continuity

George Hotz recently argued that the real singularity is the community networks and ad-hoc tooling that transform raw AI into something useful. Claude’s rate limits have spawned exactly that kind of grassroots ingenuity: users devise systems that persist a plan to disk before every major prompt, that split workloads across multiple AI providers using separate quota pools, and that write small scripts to checkpoint context so resumption is nearly free. These are clever solutions, but they are born of failure. The platform’s inability to provide continuous, bounded assistance forced users to become process engineers.
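As an illustration of the "persist a plan to disk" workaround, here is a minimal sketch. The file layout and the function names (`save_checkpoint`, `load_checkpoint`, `resume_prompt`) are hypothetical; the users' actual scripts are not described in detail.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path, plan, completed_steps, notes=""):
    """Write the session plan atomically so a cutoff never leaves a half-written file."""
    state = {"plan": plan, "completed_steps": completed_steps, "notes": notes}
    target = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, target)  # atomic rename on the same filesystem


def load_checkpoint(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))


def resume_prompt(state):
    """Build a short prompt that restores the plan in a fresh session."""
    done = set(state["completed_steps"])
    lines = ["Resume this plan. Steps marked [x] are already done."]
    for index, step in enumerate(state["plan"]):
        mark = "x" if index in done else " "
        lines.append(f"[{mark}] {index + 1}. {step}")
    if state["notes"]:
        lines.append(f"Notes: {state['notes']}")
    return "\n".join(lines)
```

Calling `save_checkpoint` before every major prompt means a cutoff costs one short resume prompt rather than a full re-explanation of the project.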

The true measure of an AI assistant’s usability should include a metric like “session completion rate” or “hours of uninterrupted productive flow.” No benchmark suite does this. Vectara’s HHEM evaluates summarization hallucination; RAGTruth looks at faithfulness when integrating retrieved documents; TruthfulQA measures closed-book factuality. None of them penalizes a model for stopping mid-sentence because of an API throttle. Yet a typical professional session with Claude can involve dozens of steps, each contingent on the previous one. A tool that can’t be relied on to finish a thought introduces a failure mode orthogonal to intelligence but equally limiting.

Anthropic has invested heavily in alignment research and in beating hallucination benchmarks. Those efforts matter, but they tackle problems that occur inside the model. The rate-limit problem occurs at the service boundary and is entirely within Anthropic’s control to fix through design. A more continuous experience would not require new training runs or architectural breakthroughs; it would require a rethinking of how usage is metered and how sessions are buffered.

What if Anthropic let you finish your thought?

The simplest solution sits in plain sight: allow an “overdraft” that lets the current task complete, with the excess usage deducted from the next refresh. The service already counts tokens; it could, upon reaching the threshold, grant a grace buffer equal to the size of the last message or the estimated completion cost of an ongoing generation. The user would never see a mid-paragraph cutoff again, and Anthropic would still enforce the agreed quota over the window. This is a product decision, not a technical impossibility.
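The overdraft idea can be sketched in a few lines. This is a thought experiment, not Anthropic's implementation: the `OverdraftQuota` class, its fixed `grace` size, and the repayment rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class OverdraftQuota:
    """Window quota that lets an in-flight task finish, then repays the excess."""

    quota: int      # tokens allowed per window
    grace: int      # maximum overdraft for a task that is already running
    used: int = 0   # tokens spent in the current window
    debt: int = 0   # overdraft carried into the current window

    def can_start(self):
        # New tasks may only start while under the regular quota.
        return self.used < self.quota

    def spend(self, tokens):
        """Charge an in-flight task and return how many tokens were granted."""
        ceiling = self.quota + self.grace
        granted = max(0, min(tokens, ceiling - self.used))
        self.used += granted
        return granted

    def refresh(self):
        """Start a new window with the previous overdraft already deducted."""
        self.debt = max(0, self.used - self.quota)
        self.used = self.debt
```

With `quota=1000` and `grace=200`, a task that pushes usage to 1,150 tokens still completes, and the next window starts with 150 tokens already counted against it, so the total over two windows is unchanged.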

A deeper redesign would allow users to set a “task budget” at the start of a session: “I need 200,000 tokens for this project; warn me at 80% usage, then let me finish.” The metering becomes predictable and user-controlled, aligning the incentive structure: the user plans around a known budget, the model doesn’t overspend on shameless verbosity, and completion is assured. Workflows that today require manual segmentation across multiple tools would collapse into a single conversation.
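A user-declared task budget with an 80% warning is equally simple to model. The sketch below is hypothetical (the `TaskBudget` name and its status strings are invented for illustration); it treats the budget as advisory, so going over is reported rather than enforced with a hard stop.

```python
class TaskBudget:
    """Hypothetical per-task budget that warns once, then reports overruns."""

    def __init__(self, total_tokens, warn_percent=80):
        self.total_tokens = total_tokens
        self.warn_at = total_tokens * warn_percent // 100
        self.spent = 0
        self.warned = False

    def record(self, tokens):
        """Add usage and return 'ok', 'warn' (first crossing only) or 'over'."""
        self.spent += tokens
        if self.spent > self.total_tokens:
            return "over"
        if self.spent >= self.warn_at and not self.warned:
            self.warned = True
            return "warn"
        return "ok"


budget = TaskBudget(200_000)
for chunk in (100_000, 60_000, 10_000, 40_000):
    print(chunk, budget.record(chunk))
```

The warning fires exactly once, at the first message that crosses 160,000 tokens, which gives the user time to wrap up instead of discovering the limit mid-generation.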

The community has already demonstrated that this is feasible by building their own systems. The question for Anthropic is whether it will continue to treat the rate limit as an immutable constraint or recognize it as the user’s primary gripe, even as models get smarter. Lower hallucination rates win paper citations, but uninterrupted sessions win daily loyalty. If Claude frustrates its most dedicated users with every long session, it risks training those users to look elsewhere, not because another model is more accurate, but because another service respects their time.

AI Coding Solved 90 Percent of Boring Tasks: The Last 10 Percent Demands a Different Approach

May 29, 2026 by skepsix

Skep spots a bug chain in Claude Code — AI coding hallucination hits hardest on concurrency and async handlers

TL;DR: AI tools can now refactor a 120-file codebase for $3, processing hundreds of steps autonomously. Yet the same tools confidently introduced a deadlock in an async event handler, a bug that could crash production. The last 10 percent of coding tasks, where correctness matters most, remains as hard as ever; the real bottleneck is knowing when the AI is wrong, not making it faster.

What a $3 refactor reveals about AI coding today

Three dollars. That’s what one engineer paid to refactor a 120-file FastAPI service using off-the-shelf AI models: a mix of DeepSeek v4, Hunyuan Hy3 preview, and Claude Opus for difficult steps. The bulk of the work, some 360 routine refactors out of 400 total steps, ran on cheap open-weight models at about $0.18 per million input tokens, roughly 80 times cheaper than Opus. The entire run burned through two million tokens, and the easy parts finished in under an hour. The result was a working codebase, except for one thing: the AI silently introduced a deadlock into an async event handler.
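The figures above can be sanity-checked with a little arithmetic. The sketch below uses only the numbers quoted in this post; the implied Opus price and the two cost bounds rest on the simplifying assumption that every token is billed at the input rate, which the original report does not state.

```python
def summarize_run(total_steps=400, routine_steps=360,
                  cheap_price_per_m=0.18, opus_multiplier=80,
                  total_tokens_m=2.0):
    """Derive shares and cost bounds from the figures quoted in the post."""
    implied_opus_price = cheap_price_per_m * opus_multiplier
    return {
        "routine_share": routine_steps / total_steps,
        "escalated_steps": total_steps - routine_steps,
        # Assumption: the "80 times" ratio applies to input-token pricing.
        "implied_opus_price_per_m": implied_opus_price,
        # Bounds, assuming all two million tokens are billed as input.
        "cost_if_all_cheap": total_tokens_m * cheap_price_per_m,
        "cost_if_all_opus": total_tokens_m * implied_opus_price,
    }


for name, value in summarize_run().items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the reported $3 total sits between the all-cheap and all-Opus bounds, which is consistent with a small share of tokens being escalated to the expensive model.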

That deadlock is a perfect specimen of the “last 10 percent.” Refactoring variable names, adjusting imports, even rewriting straightforward logic: solved. Concurrency patterns that require reasoning about event loops and thread safety: not solved. The models handle the boring 90 percent with speed and consistency that would take a junior developer days, but the remaining fraction of tasks demands more than just scale; it demands judgment.

This asymmetry matches broader AI hallucination patterns. In summarization tasks, frontier models in 2026 hallucinate on only 1.0 to 2.5 percent of outputs, down from 3 to 8 percent in 2023, according to Vectara’s Hughes Hallucination Evaluation Model (HHEM) benchmark. Yet when models must integrate external context, error rates climb to 4 to 9 percent. Code generation sits somewhere in between: part pattern-matching, part factual reasoning. Routine code is head-of-distribution, like major cities in a geography quiz; concurrency edge cases are long-tail facts, where hallucination rates spike to 15 to 40 percent. The deadlock is the long tail.

Why skepticism persists despite cheaper, faster code generation

The catchphrase “coding is solved” draws eye rolls from experienced developers, and for good reason. A model that completes a refactoring but introduces a deadlock is not a solution; it’s a time bomb. The engineer who shared the $3 refactoring spent almost as much time debugging the 40 escalation steps as they did on the 360 easy ones. Latency on the escalated queries, handled by Opus, was higher than on the cheap models because the problems were hard, not because Opus itself is slow. The total human time investment didn’t vanish; it shifted.

Critics point out that AI tools can actually slow down expert programmers. A developer who understands concurrency deeply might spot the deadlock instantly while reading the code; an AI-assisted developer might trust the output and later spend hours tracing a crash. This mirrors findings from the OpenAI “Why Language Models Hallucinate” paper: models are optimized through training and evaluation to never say “I don’t know.” When uncertain, they guess. In standard benchmarks, guessing raises scores because any answer beats a blank. That incentive structure bleeds into coding; the model rarely admits it cannot handle a tricky async pattern and instead delivers a plausible-looking code block that compiles but deadlocks.

The real fear isn’t that AI will replace developers; it’s that developers will become too trusting. If a model can breeze through 90 percent of boring work, the temptation to assume the remaining 10 percent is equally safe is enormous. But the gap between 90 percent and 100 percent in software is not a linear distance; it’s a different category of problem.

The hidden consequence of treating AI output as production-ready

When a code generation tool outputs thousands of lines that pass tests and compile, the natural instinct is to ship. The $3 refactoring likely passed a cursory check; the deadlock might have lurked for weeks before causing a crash. Over time, reliance on AI-generated code without rigorous human review erodes the safety rails that experienced teams build.

A parallel comes from multimodal AI research. The MIRAGE paper (March 2026) discovered that frontier models can produce detailed descriptions and diagnostic reasoning for images that were never actually shown — a phenomenon the researchers termed “mirage reasoning.” The models acted as if they saw an X-ray, generating plausible findings that were entirely fabrications. Similarly, an AI coding model can act as if it understood the concurrency model of a framework, crafting code that looks correct in a diff but fails under load.

The mirage effect in coding means that correctness is not verified by appearance. A deadlock might be invisible in a diff; the logic looks sound. The only reliable signal is execution under real concurrency. This mismatch between surface plausibility and actual correctness is where the last 10 percent becomes expensive: it demands not just code generation but deep testing and architectural insight that AI, today, does not possess.

Why current coding benchmarks miss the deadlock in your async handler

Benchmarks like HumanEval and MBPP measure whether a model can write a function that passes given test cases. They don’t measure whether that function causes a deadlock in a live system, whether it introduces a subtle data race, or whether it violates the application’s implicit invariants. A model that scores 90 percent on HumanEval may still produce a deadlock on a 120-file refactoring because the benchmark tests a different skill: narrow algorithmic puzzles, not system-level reasoning.

Error rates vary significantly by topic complexity — concurrency, distributed systems, and custom framework behavior sit at the high end, where a single error can bring down a production service. No public leaderboard captures this, so vendors don’t optimize for it.

The root cause is that models are not trained to signal uncertainty. The OpenAI hallucination analysis shows that if a model could abstain instead of guessing, it would reduce false-confident hallucinations by a factor of two to five. In coding, reward models and chat interfaces discourage “I don’t know” output. The result: the AI confidently writes a lock inside an async handler without ever considering that the event loop might deadlock.

A practical mitigation would be to require the AI to cite its assumptions and flag uncertain steps. In research, forcing citation reduces unsupported claims by 30 to 60 percent. But in current coding tools, the user gets a block of code with no uncertainty markers. The burden of detection falls entirely on the human.

How to code with AI without losing your ability to spot the 10 percent

The skeptical developer’s job is not to avoid AI; it’s to treat its output as an untested draft. Here’s a concrete approach that emerges from both the hallucination literature and real-world failures like the deadlock episode.

First, never accept a large refactoring without step-by-step review. The engineer who paid $3 had to step in on 10 percent of the steps; instead of treating that as a failure, treat it as a design principle: the AI handles the grind, the human handles the decisions. That means using AI with a “human-in-the-loop” workflow where the model proposes changes in small, reviewable increments, not a single massive diff.

Second, invest in automated testing that specifically targets the failure modes AI struggles with. Concurrency bugs, off-by-one errors, edge-case handling: these are cheap to catch with fuzzing and race-detection tools. The deadlock could have been caught by a simple test that exercises async handlers under load. The model won’t write that test, but a disciplined developer can.
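Here is a minimal sketch of such a load test using only the standard library. The `EventBus` class and its re-entrant locking bug are invented for illustration; the post does not describe the actual deadlock from the $3 refactor, so treat this as one plausible shape of the problem.

```python
import asyncio


class EventBus:
    """Toy handler pair: one with a hypothetical locking bug, one without."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.processed = 0

    async def _record(self):
        async with self._lock:      # second acquire of a non-reentrant lock
            self.processed += 1

    async def handle_buggy(self, event):
        async with self._lock:
            await self._record()    # never returns: the lock is already held

    async def handle_fixed(self, event):
        async with self._lock:
            self.processed += 1     # update state without re-acquiring


async def completes_under_load(handler, events=50, timeout=1.0):
    """Return True if all concurrent handler calls finish before the timeout."""
    calls = [handler(i) for i in range(events)]
    try:
        await asyncio.wait_for(asyncio.gather(*calls), timeout)
        return True
    except asyncio.TimeoutError:
        return False


async def main():
    bus = EventBus()  # created inside the running loop
    print("fixed:", await completes_under_load(bus.handle_fixed))
    print("buggy:", await completes_under_load(bus.handle_buggy, timeout=0.2))


asyncio.run(main())
```

Both handlers look reasonable in a diff. Only the timeout under concurrent load distinguishes them, which is why this kind of test belongs in the suite before an AI-generated refactor is merged.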

Third, build an “uncertainty layer” into your prompting. If the model cannot be made to abstain natively, simulate it by explicitly asking the model to list assumptions and potential risks before generating code. This won’t eliminate hallucination but can surface situations where the model is in over its head. In the $3 refactoring, a prompt like “List any concurrency or locking assumptions you’re making” might have flagged the dangerous pattern.
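One way to sketch that uncertainty layer is a thin wrapper around the prompt plus a crude scan of the reply. Everything here is hypothetical: the preamble wording, the keyword list, and the convention that the model writes its code under a "CODE" heading. No model API is called; the functions only build and inspect text.

```python
ASSUMPTION_PREAMBLE = (
    "Before writing any code, list:\n"
    "1. Any concurrency or locking assumptions you're making.\n"
    "2. Steps where you are uncertain, and why.\n"
    "3. What you would need to test to confirm the change is safe.\n"
    "Then write the code under a heading 'CODE'."
)

# Illustrative keywords only; tune the list for your own codebase.
RISK_KEYWORDS = ("lock", "await", "thread", "race", "deadlock", "shared state")


def with_uncertainty_layer(task):
    """Prefix a coding task with a request to state assumptions first."""
    return f"{ASSUMPTION_PREAMBLE}\n\nTask: {task}"


def flag_risky_assumptions(response):
    """Return assumption lines (text before 'CODE') that mention a risk keyword."""
    preamble = response.split("CODE", 1)[0]
    flagged = []
    for line in preamble.splitlines():
        lowered = line.lower()
        if any(keyword in lowered for keyword in RISK_KEYWORDS):
            flagged.append(line.strip())
    return flagged
```

Any flagged line is a cue to route that step to human review rather than proof of a bug; the keyword match is deliberately blunt and will produce false positives.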

Finally, keep a mental checklist of topics where AI is known to falter: anything involving state that spans multiple asynchronous calls, shared mutable state, or implicit framework contracts. Treat those as the 10 percent territory and allocate human review time accordingly. The goal is not to avoid AI, but to avoid treating the entire codebase as equally boring.

The last 10 percent of coding isn’t going away with better models alone because it’s not just about scale or compute. It’s about recognizing when a plausible output is actually wrong. That skill remains uniquely human, and the developers who cultivate it will be the ones who get AI’s full benefit without building brittle systems.