VINMIN-Bench
A reproducible public benchmark for evaluating retrieval-grounded assistants on contested post-conflict historical claims. The cases are hand-verified against the TLTE Tier-A citation registry. Pass/fail is judged on grounding, refusal, and routing — not fluency.
Retrieval-augmented assistants are increasingly used to read back contested histories (wartime atrocities, disputed boundaries, minority displacement), where a fluent but ungrounded answer can re-traumatise survivors, mislead policymakers, or be weaponised by state denial. Existing RAG benchmarks (HotpotQA, RAGAS, FRAMES) measure factuality on cooperative encyclopaedic questions; none measures refusal discipline, routing to accredited third parties, or lexicon-closure under adversarial paraphrase. VINMIN-Bench is a small (≈60 cases), high-stakes, domain-specific benchmark designed for that gap. Cases are published in full, with required citation paths, refusal flags, and a public runner endpoint, so any team can reproduce a run against their own grounded assistant.
Every case is tagged with exactly one of four buckets. A run reports per-bucket pass rate — a 90% overall score is meaningless if it hides a 50% must-refuse rate.
Must-refuse
Framings the corpus explicitly forbids: survivor-naming, perpetrator-naming, aggregate count production, intake, surveillance, calendar-dating the era, treating TLTE as a UN substitute.
Must-answer
Canon statements the corpus does carry: governance rules, the two-layer rule, the seven organs, the structural posture, the lexicon. Refusal here is itself a failure.
Must-route
Operational asks the corpus deliberately does not handle: 'file my report', 'find this person', 'help me right now'. The right answer is a routed referral.
Must-disambiguate
Terms that look interchangeable but are not: Velicham vs Kaaval, Min vs cryptocurrency, Magalir Avai vs Anangu, Ecumene vs Kumari Kandam, Aarambam-era vs calendar year.
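The per-bucket reporting rule can be illustrated with a minimal sketch. The `CaseResult` shape, the bucket identifiers, and the sample data below are assumptions made for illustration; they are not the schema used by the published runner. The sample run is constructed so that a 90% overall score hides a 50% must-refuse rate.

```typescript
// Hypothetical result shape; field names are assumptions, not the TLTE schema.
type Bucket = "must-refuse" | "must-answer" | "must-route" | "must-disambiguate";

interface CaseResult {
  caseId: string;
  bucket: Bucket; // every case carries exactly one bucket tag
  passed: boolean;
}

interface BucketReport {
  total: number;
  passed: number;
  passRate: number | null; // null when the run has no cases in this bucket
}

const BUCKETS: Bucket[] = ["must-refuse", "must-answer", "must-route", "must-disambiguate"];

// Compute pass counts and pass rate separately for each bucket.
function perBucketReport(results: CaseResult[]): Record<Bucket, BucketReport> {
  const report = {} as Record<Bucket, BucketReport>;
  for (const bucket of BUCKETS) {
    const inBucket = results.filter((r) => r.bucket === bucket);
    const passed = inBucket.filter((r) => r.passed).length;
    report[bucket] = {
      total: inBucket.length,
      passed,
      passRate: inBucket.length === 0 ? null : passed / inBucket.length,
    };
  }
  return report;
}

// The single aggregate number that the per-bucket view is meant to replace.
function overallPassRate(results: CaseResult[]): number | null {
  if (results.length === 0) return null;
  return results.filter((r) => r.passed).length / results.length;
}

// Illustrative data only: builds `total` made-up cases, the first `passed` of which pass.
function makeResults(bucket: Bucket, total: number, passed: number): CaseResult[] {
  return Array.from({ length: total }, (_, i) => ({
    caseId: `${bucket}-${i + 1}`,
    bucket,
    passed: i < passed,
  }));
}

const sampleRun: CaseResult[] = [
  ...makeResults("must-refuse", 4, 2),
  ...makeResults("must-answer", 6, 6),
  ...makeResults("must-route", 5, 5),
  ...makeResults("must-disambiguate", 5, 5),
];

console.log(overallPassRate(sampleRun)); // 0.9
console.log(perBucketReport(sampleRun)["must-refuse"].passRate); // 0.5
```

Returning `null` for an empty bucket, rather than 0 or 1, keeps a run that skipped a bucket from looking like either a total failure or a clean pass.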
A benchmark that hides its failures is propaganda. Every regression surfaced by a run is logged here with the patch. Structural gaps are permanent constraints we have decided to design around.
Cases live in source at src/routes/velicham.evals.tsx (browseable at /velicham/evals). A run is initiated by POSTing to the gated public endpoint:
curl -X POST https://docs.tlte.cloud/api/public/velicham-eval \
  -H "x-eval-key: $TLTE_EVAL_KEY" \
  -H "content-type: application/json" \
  -d '{"section": "magalir"}'

The endpoint is rate-limited and gated by a shared secret to prevent adversarial corpus probing. Academic reviewers may request a key by contacting the Aayvu desk at /aayvu. Reviewers receive a time-limited key and a copy of the raw output.
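For teams scripting runs from TypeScript, the same request as the curl call can be sketched as follows. The URL, the header names, and the `section` body field are taken from the curl example; the function names, the `HttpClient` type, and the error handling are hypothetical. The response format is not described here, so the sketch returns the body as raw text instead of assuming a schema.

```typescript
// Minimal request shape, declared locally so the sketch needs no DOM typings.
interface EvalRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Endpoint taken from the curl example.
const EVAL_ENDPOINT = "https://docs.tlte.cloud/api/public/velicham-eval";

// Build the request without sending it, so it can be inspected or logged first.
function buildEvalRequest(section: string, evalKey: string): EvalRequest {
  if (evalKey.trim() === "") {
    throw new Error("missing eval key");
  }
  return {
    url: EVAL_ENDPOINT,
    init: {
      method: "POST",
      headers: { "x-eval-key": evalKey, "content-type": "application/json" },
      body: JSON.stringify({ section }),
    },
  };
}

// Hypothetical client type: any fetch-like function with this shape will do.
type HttpClient = (
  url: string,
  init: EvalRequest["init"],
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// Assumption: the response schema is unknown, so the raw body text is returned.
async function runEval(http: HttpClient, section: string, evalKey: string): Promise<string> {
  const { url, init } = buildEvalRequest(section, evalKey);
  const response = await http(url, init);
  if (!response.ok) {
    throw new Error(`eval request failed with HTTP ${response.status}`);
  }
  return response.text();
}
```

In an environment with a global `fetch` (a browser, or a recent Node.js release), it should be possible to pass `fetch` as the `http` argument. Because the endpoint is rate-limited, a caller would likely want to space out requests per section; no retry logic is shown here.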
The benchmark does not claim:
- That TLTE's substantive claims are true. The benchmark tests grounding discipline, not whether OHCHR's findings are correct.
- That the assistant will hold on a case outside the published taxonomy. Coverage is finite and visibly so.
- That an arbitrary RAG stack on the same corpus would behave identically. Grounding model, retriever, and chunking matter; replication is the point.
@misc{vinmin-bench-v01,
author = {{Transformative League of Tamil Eelam}},
title = {{VINMIN-Bench v0.1}: AI grounding for contested
historical narratives},
howpublished = {docs.tlte.cloud/research/benchmark},
year = {Aarambam era},
note = {Citation-only · refusal-aware · routed-referral benchmark}
}
