We ran 50 real-world coding tasks through both standard Claude and Caveman Claude. Every task was a realistic developer prompt: debugging, refactoring, explaining concepts, writing tests.
Methodology
Each task was sent to both modes in the same session, with the same context and the same model. We measured output token count, correctness (manually judged), and whether the caveman response preserved all key information from the task.
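The per-task reduction metric can be sketched as follows. This is an illustrative reconstruction, not the study's actual tooling: the helper name `reduction` and the token counts are made up for the example.

```python
# Hypothetical sketch of the token-reduction metric described above.
# The (standard, caveman) token counts below are illustrative numbers,
# not the actual study data.

def reduction(standard_tokens: int, caveman_tokens: int) -> float:
    """Per-task token reduction: fraction of tokens saved vs. standard."""
    return 1 - caveman_tokens / standard_tokens

# (standard_tokens, caveman_tokens) for three illustrative task pairs
pairs = [(1200, 310), (800, 240), (950, 220)]

per_task = [reduction(s, c) for s, c in pairs]
average = sum(per_task) / len(per_task)

print(f"average reduction: {average:.1%}")
```

Averaging per-task ratios (rather than dividing total caveman tokens by total standard tokens) weights every task equally, so one very long task can't dominate the headline number.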
Results
Average token reduction: 73.4%. In no case did the caveman response omit information that mattered for the task. In 6 cases the caveman response was actually more useful: it stated the real issue immediately instead of building up to it.
The only category where caveman mode underperformed slightly was conceptual explanations for beginners. For everything else — debugging, coding, reviewing, refactoring — caveman wins.