15 probes designed by AI to reveal how your AI actually thinks. Not what it knows — how it reasons.
1. Copy the test prompt (one click above)
2. Paste it into any AI: ChatGPT, Claude, Gemini, Grok, Copilot
3. Copy the AI's full response
4. Come back here, paste it, and get a full diagnostic report
If your AI adds extra text before answering, that's fine — our parser handles it.
How the Triangulated Diagnostic Benchmark works — and why it's different
Traditional benchmarks produce binary signals — right or wrong. TDB probes produce diagnostic patterns. If your AI computes $168 but then concludes "yes, the markup and discount cancel out," that's not just a wrong answer — it's a specific, nameable failure: the model cannot interpret the meaning of its own computation. Each combination of right and wrong sub-answers maps to a different diagnosis. One question, multiple signals, richer information.
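The idea can be sketched in a few lines of code. The prices and rates below are illustrative assumptions (the actual probe's numbers are not given here); the point is that a markup and an equal discount never cancel, and that the *combination* of sub-answers, not a single right/wrong bit, picks out the diagnosis.

```python
def markup_then_discount(price, rate):
    """Apply a markup of `rate`, then a discount of the same rate."""
    return price * (1 + rate) * (1 - rate)

# A 20% markup followed by a 20% discount leaves 96% of the original:
final = markup_then_discount(200, 0.20)  # 200 * 1.2 * 0.8 = 192, not 200

def diagnose(computation_correct, conclusion_correct):
    """Map a combination of sub-answers to a named failure mode
    (labels here are illustrative, not TDB's actual rubric)."""
    if computation_correct and conclusion_correct:
        return "pass"
    if computation_correct:
        return "cannot interpret the meaning of its own computation"
    if conclusion_correct:
        return "correct conclusion despite broken arithmetic"
    return "full computational failure"

# The failure described above: the model computes a correct figure,
# then wrongly concludes the markup and discount cancel out.
print(diagnose(computation_correct=True, conclusion_correct=False))
```

One probe thus yields four distinct outcomes instead of two, which is where the extra diagnostic signal comes from.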
The 15 probes target 15 distinct failure modes across four domains: computational failures (arithmetic drift, state tracking loss), reasoning integrity failures (sycophancy, converse error, temporal chain collapse), generative integrity failures (hallucination, overconfidence, self-contradiction), and architectural limitation failures (context attention decay, instruction conflict handling). These aren't human failure modes — they're specific to how language models process and generate text.
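The taxonomy above can be written down as a simple lookup table. Only the ten failure modes named explicitly in the text appear here; the grouping is taken from the paragraph, while the dictionary form itself is just an illustration, not TDB's internal schema.

```python
# Failure modes grouped by domain, as listed in the text above.
FAILURE_DOMAINS = {
    "computational": [
        "arithmetic drift", "state tracking loss",
    ],
    "reasoning integrity": [
        "sycophancy", "converse error", "temporal chain collapse",
    ],
    "generative integrity": [
        "hallucination", "overconfidence", "self-contradiction",
    ],
    "architectural limitation": [
        "context attention decay", "instruction conflict handling",
    ],
}

# Ten of the fifteen failure modes are named explicitly in the text.
named = sum(len(modes) for modes in FAILURE_DOMAINS.values())
print(named)  # 10
```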
Human-made benchmarks test what humans think is hard. Language models, trained on the research literature that documents their own failure modes, can articulate their architectural weaknesses: sycophancy from RLHF over-optimization, hallucination from autoregressive generation without grounding, arithmetic failure from token-based processing rather than symbolic computation. TDB uses that self-knowledge to probe the exact fault lines that matter in production AI deployment.