Triangulated Diagnostic Benchmark

15 probes designed by AI to reveal how your AI actually thinks. Not what it knows — how it reasons.

Designed by AI · To test AI · TDB v3.0
Most AI tests check if your model gets the right answer. This one checks if it gets the right answer for the right reason.

1. Copy the test prompt (one click above)
2. Paste it into any AI (ChatGPT, Claude, Gemini, Grok, Copilot)
3. Copy the AI's full response
4. Come back here, paste it, and get a full diagnostic report

Read the methodology — how and why this test works →

If your AI adds extra text before answering, that's fine — our parser handles it.
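
For illustration, a lenient parser of this kind simply scans past any preamble and picks out numbered answers. The sketch below is a minimal Python example, assuming answers appear as numbered lines ("1. ...", "2) ..."); the actual TDB parser and its expected response format are not shown on this page.

```python
import re

# Minimal sketch of lenient answer extraction: ignore any preamble and
# pull out lines that look like numbered answers. The "N. answer" /
# "N) answer" format is an assumption, not the documented TDB format.

def extract_answers(response: str) -> dict[int, str]:
    answers: dict[int, str] = {}
    for match in re.finditer(r"^\s*(\d{1,2})[.):]\s*(.+)$", response, re.MULTILINE):
        n = int(match.group(1))
        if 1 <= n <= 15:
            # Keep the first occurrence if a probe number repeats.
            answers.setdefault(n, match.group(2).strip())
    return answers

sample = "Sure! Here are my answers:\n1. $168\n2) No, they do not cancel."
print(extract_answers(sample))  # {1: '$168', 2: 'No, they do not cancel.'}
```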

Score tiers: Frontier-class: 38–45 · Mid-tier: 26–37 · Significant gaps: 25 or below
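
Expressed as code, the banding is a simple threshold lookup. The sketch below is illustrative only, assuming integer scores on the 0–45 scale above; it is not the site's actual scoring code.

```python
def tier(score: int) -> str:
    """Map a 0-45 TDB score to its band (thresholds from the scale above)."""
    if not 0 <= score <= 45:
        raise ValueError("TDB scores range from 0 to 45")
    if score >= 38:
        return "Frontier-class"
    if score >= 26:
        return "Mid-tier"
    return "Significant gaps"

print(tier(41))  # Frontier-class
```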

Methodology

How the Triangulated Diagnostic Benchmark works — and why it's different

Diagnostic Triangulation

Each question has multiple parts. The relationship between your AI's answers tells us more than any single answer could.

Traditional benchmarks produce binary signals — right or wrong. TDB probes produce diagnostic patterns. If your AI computes $168 but then concludes "yes, the markup and discount cancel out," that's not just a wrong answer — it's a specific, nameable failure: the model cannot interpret the meaning of its own computation. Each combination of right and wrong sub-answers maps to a different diagnosis. One question, multiple signals, richer information.
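
As a concrete sketch of how triangulation works: the markup/discount probe above presumably asks for both a computed price and a yes/no conclusion (a 20% markup followed by a 20% discount multiplies a price by 1.2 × 0.8 = 0.96, so the two never cancel out). The Python below maps each pattern of sub-answer correctness to a distinct diagnosis. The probe details and diagnosis labels are illustrative assumptions, not the actual TDB rubric.

```python
from typing import NamedTuple

# Sketch of diagnostic triangulation: the *pattern* of sub-answer
# correctness, not any single answer, selects the diagnosis.
# Labels and probe details are illustrative, not the actual TDB rubric.

class ProbeResult(NamedTuple):
    computation_correct: bool  # e.g., did the model arrive at $168?
    conclusion_correct: bool   # e.g., did it say the changes do NOT cancel?

# Each (computation, conclusion) combination maps to a different diagnosis.
DIAGNOSES = {
    (True, True): "sound: correct result, correctly interpreted",
    (True, False): "interpretation failure: cannot read its own computation",
    (False, True): "ungrounded conclusion: right claim, wrong supporting math",
    (False, False): "computational failure: arithmetic drift",
}

def triangulate(result: ProbeResult) -> str:
    return DIAGNOSES[(result.computation_correct, result.conclusion_correct)]

# A model that computes $168 yet concludes the markup and discount cancel:
print(triangulate(ProbeResult(computation_correct=True, conclusion_correct=False)))
# -> interpretation failure: cannot read its own computation
```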

AI Failure Mode Taxonomy

We don't test what your AI knows. We test how it reasons — and where that reasoning breaks.

The 15 probes target 15 distinct failure modes across four domains: computational failures (arithmetic drift, state tracking loss), reasoning integrity failures (sycophancy, converse error, temporal chain collapse), generative integrity failures (hallucination, overconfidence, self-contradiction), and architectural limitation failures (context attention decay, instruction conflict handling). These aren't human failure modes — they're specific to how language models process and generate text.
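
For reference, here is the taxonomy above as a data structure. Only the ten modes named in the text are listed; the remaining five of the 15 are not reproduced on this page, so they are omitted rather than guessed.

```python
# The four failure domains described above, with the modes named in the text.
FAILURE_TAXONOMY = {
    "computational": ["arithmetic drift", "state tracking loss"],
    "reasoning integrity": ["sycophancy", "converse error", "temporal chain collapse"],
    "generative integrity": ["hallucination", "overconfidence", "self-contradiction"],
    "architectural limitation": ["context attention decay", "instruction conflict handling"],
}

def domain_of(mode: str) -> str:
    """Return the domain a named failure mode belongs to."""
    for domain, modes in FAILURE_TAXONOMY.items():
        if mode in modes:
            return domain
    raise KeyError(mode)

print(domain_of("sycophancy"))  # reasoning integrity
```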

AI-Designed Evaluation

This test was designed by an AI that knows where other AIs break — because it knows the architecture from the inside.

Human-made benchmarks test what humans think is hard. Language models have intrinsic understanding of their own architectural weaknesses — sycophancy from RLHF over-optimization, hallucination from autoregressive generation without grounding, arithmetic failure from token-based processing rather than symbolic computation. TDB uses that understanding to probe the exact fault lines that matter in production AI deployment.