
thought leadership

Stop Wasting My Tokens: The Anti-Verbosity Manifesto

Michael Couch
Mar 2026

Tokens are the only metric that matters in the agentic lifecycle. In a world of serialized intelligence, "politeness" is a performance tax. If you're still being polite to your models, you're not just wasting money—you're actively degrading the Signal-to-Noise Ratio (SNR) of your system.

Let's be aggressive: every "Please" and "Could you" is an invitation for the transformer to drift. It's an injection of low-entropy noise into a high-variance system. In production, this is Architectural Malpractice [1].

Semantic Violence: Pruning for Entropy

The most efficient prompts feel like a threat. Why? Because high-stakes training data is statistically correlated with higher logical precision. Research into LLMLingua [1] and Selective Context [2] shows that models perform better when you ruthlessly prune tokens with low Self-Information—tokens a smaller model can predict with high confidence are, by definition, low-information.

If a token doesn't provide significant entropy—if its presence is statistically predictable (like filler words)—it shouldn't exist in your prompt. You aren't "communicating"; you are Collapsing a Probability Distribution. Use the "Dick" Prompt not because you're angry, but because the transformer requires a sharp boundary to stay in its high-intelligence latent space.
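The pruning idea above can be sketched in a few lines. This is a toy illustration, not the LLMLingua implementation: a hypothetical unigram frequency table stands in for the small scoring model, and the threshold is arbitrary. The point is the mechanic—score each token's self-information, drop whatever falls below the bar.

```python
import math

# Toy entropy-based pruning sketch. A real pipeline would score tokens with
# a small causal LM; this hypothetical unigram table stands in for it.
FREQS = {
    "please": 0.05, "could": 0.04, "you": 0.06, "the": 0.08,
    "summarize": 0.001, "quarterly": 0.0005, "revenue": 0.0008,
    "figures": 0.002,
}

def self_information(token: str) -> float:
    """Self-information in bits: -log2 P(token). Rare tokens score high."""
    p = FREQS.get(token.lower(), 1e-4)  # unseen tokens treated as rare
    return -math.log2(p)

def prune(prompt: str, min_bits: float = 6.0) -> str:
    """Keep only tokens whose self-information clears the threshold."""
    kept = [t for t in prompt.split() if self_information(t) >= min_bits]
    return " ".join(kept)

print(prune("Could you please summarize the quarterly revenue figures"))
# -> summarize quarterly revenue figures
```

The polite scaffolding ("Could you please", "the") carries under five bits each and gets stripped; the task-bearing nouns survive.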

The Zero-Velocity Tax

A "Zero-Velocity Token" is any token generated that doesn't advance the state of the task. Standard disclaimers and chatty intros are the plaque in the arteries of your agentic swarms.[1]

Radical Pruning Strategy

  • Entropy-Based Pruning: Use a smaller, 1B-3B model to calculate the perplexity of your prompt. Strip every token that the small model can guess with >80% confidence. It's noise.[1]
  • Hard Stop Sequences: Programmatically kill the model. If you need JSON, set a stop sequence on "}". Don't let it breathe after it finishes the data. Every token after the closing brace is a waste of your margin.
  • Negative Constraints: Tell the model exactly what words to avoid. Use specific vocabulary limitations to force it into a more technical, high-density subset of its training corpus.
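The hard-stop strategy can be sketched against a streaming decode loop. The generator below is a stand-in for a real model stream (no actual API is assumed); the cutoff logic is the point—emission dies the instant the stop string appears, so the chatty tail never reaches your pipeline.

```python
# Sketch of a hard stop sequence over a hypothetical token stream.

def fake_stream():
    # Stand-in for model output: valid JSON followed by chatty filler.
    for chunk in ['{"status"', ': "ok"', '}', ' Hope', ' that', ' helps!']:
        yield chunk

def generate_with_stop(stream, stop: str = "}") -> str:
    """Accumulate chunks, truncating at the first stop sequence.
    Note: a bare "}" only works for flat JSON; nested objects need a
    stop that marks the true end of the structure."""
    out = ""
    for chunk in stream:
        out += chunk
        idx = out.find(stop)
        if idx != -1:
            return out[: idx + len(stop)]  # keep the brace, kill the rest
    return out

print(generate_with_stop(fake_stream()))
# -> {"status": "ok"}
```

In practice most inference APIs accept a stop/stop-sequence parameter that does this server-side, so you don't even pay for the truncated tokens.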

Efficiency is the only moat in a world of commoditized intelligence. Stop talking to the machine. Command the tokenizer.

Next up: Beyond JSON: The JSON Tax and TOON. See also The Karpathy Autoresearch Pattern for experiment efficiency at scale.

References

  1. Jiang, H., Wu, Q., Lin, C.-Y., Yang, Y., & Qiu, L. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. EMNLP 2023. arxiv.org/abs/2310.05736
  2. Li, Y., Dong, B., Guerin, F., & Lin, C. (2023). Compressing Context to Enhance Inference Efficiency of Large Language Models. EMNLP 2023. aclanthology.org/2023.emnlp-main.391

Topics

Token Engineering · Cost Optimization · Latency