Stop Wasting My Tokens: The Anti-Verbosity Manifesto
Tokens are the only metric that matters in the agentic lifecycle. In a world of serialized intelligence, "politeness" is a performance tax. If you're still being polite to your models, you're not just wasting money; you're actively degrading the signal-to-noise ratio (SNR) of your system.
Let's be aggressive: every "Please" and "Could you" is an invitation for the transformer to drift. It's an injection of low-entropy noise into a high-variance system. In production, this is Architectural Malpractice.[1]
Semantic Violence: Pruning for Entropy
The most efficient prompts feel like a threat. Why? Because high-stakes training data is statistically correlated with higher logical precision. Research into LLMLingua[1] and Selective Context[2] shows that models hold up remarkably well when you ruthlessly prune tokens with low self-information: tokens a smaller model can predict with high confidence are, by definition, low-information.
If a token doesn't provide significant entropy—if its presence is statistically predictable (like filler words)—it shouldn't exist in your prompt. You aren't "communicating"; you are Collapsing a Probability Distribution. Use the "Dick" Prompt not because you're angry, but because the transformer requires a sharp boundary to stay in its high-intelligence latent space.
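The pruning criterion can be sketched in a few lines. This is a toy illustration, not LLMLingua itself: the `FILLER_PROBS` table is a hypothetical stand-in for the per-token probabilities a real 1B-3B scoring model would emit, and the 0.05 floor for unseen tokens is an assumption.

```python
import math

# Hypothetical stand-in for a small scoring model: in practice you'd take
# per-token probabilities from a 1B-3B LM; here a frequency table plays that role.
FILLER_PROBS = {
    "please": 0.95, "could": 0.9, "you": 0.92, "kindly": 0.9,
    "the": 0.85, "a": 0.85, "very": 0.88,
}

def self_information(token: str, probs: dict) -> float:
    """Bits of surprise: -log2 P(token). Predictable tokens carry few bits."""
    p = probs.get(token.lower(), 0.05)  # assume unseen tokens are informative
    return -math.log2(p)

def prune(prompt: str, probs: dict, confidence: float = 0.8) -> str:
    """Drop every token the scoring model predicts with > `confidence`."""
    threshold = -math.log2(confidence)  # bits carried by a token at P=0.8
    kept = [t for t in prompt.split() if self_information(t, probs) > threshold]
    return " ".join(kept)

print(prune("Could you please summarize the quarterly report", FILLER_PROBS))
# → summarize quarterly report
```

The politeness scaffolding scores under the threshold and is stripped; the content words survive because the small model can't guess them.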
The Zero-Velocity Tax
A "Zero-Velocity Token" is any token generated that doesn't advance the state of the task. Standard disclaimers and chatty intros are the plaque in the arteries of your agentic swarms.[1]
Radical Pruning Strategy
- Entropy-Based Pruning: Use a smaller, 1B-3B model to calculate the perplexity of your prompt. Strip every token that the small model can guess with >80% confidence. It's noise.[1]
- Hard Stop Sequences: Programmatically kill the model. If you need JSON, set a stop sequence on `"}"`. Don't let it breathe after it finishes the data. Every token after the closing brace is a waste of your margin.
- Negative Constraints: Tell the model exactly what words to avoid. Use specific vocabulary limitations to force it into a more technical, high-density subset of its training corpus.
Efficiency is the only moat in a world of commoditized intelligence. Stop talking to the machine. Command the tokenizer.
Next up: Beyond JSON: The JSON Tax and TOON. See also The Karpathy Autoresearch Pattern for experiment efficiency at scale.
References
- Jiang, H., Wu, Q., Lin, C.-Y., Yang, Y., & Qiu, L. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. EMNLP 2023. arxiv.org/abs/2310.05736
- Li, Y., Dong, B., Guerin, F., & Lin, C. (2023). Compressing Context to Enhance Inference Efficiency of Large Language Models. EMNLP 2023. aclanthology.org/2023.emnlp-main.391