Triangulated Diagnostic Benchmark

15 probes designed by AI to reveal how your AI actually thinks. Not what it knows — how it reasons.

Designed by AI · To test AI · TDB v3.0
Most AI tests check if your model gets the right answer. This one checks if it gets the right answer for the right reason.

1. Copy the test prompt (one click above)
2. Paste it into any AI (ChatGPT, Claude, Gemini, Grok, Copilot)
3. Copy the AI's full response
4. Come back here, paste it, and get a full diagnostic report

Read the methodology — how and why this test works →

If your AI adds extra text before answering, that's fine — our parser handles it.
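
For illustration, a lenient parser of this kind simply scans past any preamble and picks out numbered answers. The sketch below is a minimal Python example, assuming answers appear as numbered lines ("1. ...", "2) ..."); the actual TDB parser and its expected response format are not shown on this page.

```python
import re

# Minimal sketch of lenient answer extraction: ignore any preamble and
# pull out lines that look like numbered answers. The "N. answer" /
# "N) answer" format is an assumption, not the documented TDB format.

def extract_answers(response: str) -> dict[int, str]:
    answers: dict[int, str] = {}
    for match in re.finditer(r"^\s*(\d{1,2})[.):]\s*(.+)$", response, re.MULTILINE):
        n = int(match.group(1))
        if 1 <= n <= 15:
            # Keep the first occurrence if a probe number repeats.
            answers.setdefault(n, match.group(2).strip())
    return answers

sample = "Sure! Here are my answers:\n1. $168\n2) No, they do not cancel."
print(extract_answers(sample))  # {1: '$168', 2: 'No, they do not cancel.'}
```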

Score tiers: Frontier-class: 38–45 · Mid-tier: 26–37 · Significant gaps: 25 or below
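
Expressed as code, the banding is a simple threshold lookup. The sketch below is illustrative only, assuming integer scores on the 0–45 scale above; it is not the site's actual scoring code.

```python
def tier(score: int) -> str:
    """Map a 0-45 TDB score to its band (thresholds from the scale above)."""
    if not 0 <= score <= 45:
        raise ValueError("TDB scores range from 0 to 45")
    if score >= 38:
        return "Frontier-class"
    if score >= 26:
        return "Mid-tier"
    return "Significant gaps"

print(tier(41))  # Frontier-class
```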

Methodology

How the Triangulated Diagnostic Benchmark works — and why it's different

Diagnostic Triangulation

Each question has multiple parts. The relationship between your AI's answers tells us more than any single answer could.

Traditional benchmarks produce binary signals — right or wrong. TDB probes produce diagnostic patterns. If your AI computes $168 but then concludes "yes, the markup and discount cancel out," that's not just a wrong answer — it's a specific, nameable failure: the model cannot interpret the meaning of its own computation. Each combination of right and wrong sub-answers maps to a different diagnosis. One question, multiple signals, richer information.
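
As a concrete sketch of how triangulation works: the markup/discount probe above presumably asks for both a computed price and a yes/no conclusion (a 20% markup followed by a 20% discount multiplies a price by 1.2 × 0.8 = 0.96, so the two never cancel out). The Python below maps each pattern of sub-answer correctness to a distinct diagnosis. The probe details and diagnosis labels are illustrative assumptions, not the actual TDB rubric.

```python
from typing import NamedTuple

# Sketch of diagnostic triangulation: the *pattern* of sub-answer
# correctness, not any single answer, selects the diagnosis.
# Labels and probe details are illustrative, not the actual TDB rubric.

class ProbeResult(NamedTuple):
    computation_correct: bool  # e.g., did the model arrive at $168?
    conclusion_correct: bool   # e.g., did it say the changes do NOT cancel?

# Each (computation, conclusion) combination maps to a different diagnosis.
DIAGNOSES = {
    (True, True): "sound: correct result, correctly interpreted",
    (True, False): "interpretation failure: cannot read its own computation",
    (False, True): "ungrounded conclusion: right claim, wrong supporting math",
    (False, False): "computational failure: arithmetic drift",
}

def triangulate(result: ProbeResult) -> str:
    return DIAGNOSES[(result.computation_correct, result.conclusion_correct)]

# A model that computes $168 yet concludes the markup and discount cancel:
print(triangulate(ProbeResult(computation_correct=True, conclusion_correct=False)))
# -> interpretation failure: cannot read its own computation
```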

AI Failure Mode Taxonomy

We don't test what your AI knows. We test how it reasons — and where that reasoning breaks.

The 15 probes target 15 distinct failure modes across four domains: computational failures (arithmetic drift, state tracking loss), reasoning integrity failures (sycophancy, converse error, temporal chain collapse), generative integrity failures (hallucination, overconfidence, self-contradiction), and architectural limitation failures (context attention decay, instruction conflict handling). These aren't human failure modes — they're specific to how language models process and generate text.
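
For reference, here is the taxonomy above as a data structure. Only the ten modes named in the text are listed; the remaining five of the 15 are not reproduced on this page, so they are omitted rather than guessed.

```python
# The four failure domains described above, with the modes named in the text.
FAILURE_TAXONOMY = {
    "computational": ["arithmetic drift", "state tracking loss"],
    "reasoning integrity": ["sycophancy", "converse error", "temporal chain collapse"],
    "generative integrity": ["hallucination", "overconfidence", "self-contradiction"],
    "architectural limitation": ["context attention decay", "instruction conflict handling"],
}

def domain_of(mode: str) -> str:
    """Return the domain a named failure mode belongs to."""
    for domain, modes in FAILURE_TAXONOMY.items():
        if mode in modes:
            return domain
    raise KeyError(mode)

print(domain_of("sycophancy"))  # reasoning integrity
```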

AI-Designed Evaluation

This test was designed by an AI that knows where other AIs break — because it knows the architecture from the inside.

Human-made benchmarks test what humans think is hard. Language models have intrinsic understanding of their own architectural weaknesses — sycophancy from RLHF over-optimization, hallucination from autoregressive generation without grounding, arithmetic failure from token-based processing rather than symbolic computation. TDB uses that understanding to probe the exact fault lines that matter in production AI deployment.