June 14, 2026 · 15 min read · Reasoning Layer team

The world model AI is missing, and the engine built to hold it

Tags: world-model, reasoning, trustworthy-ai, knowledge-graph, benchmarks

LLMs hallucinate because they have no model of the world, only a model of text. They predict what's plausible to say next, with no ground truth to check against and no way to say "I don't know." The cure isn't a bigger model or a cleverer prompt; it's grounding the agent in a real, machine-readable world model, and putting a layer beneath it that can both hold that model faithfully (n-ary, not just triples) and reason over it completely, at speed. That layer is the piece the AI stack is still missing, and it's what we built ReasoningLayer to be: the reasoning layer for AI you can actually trust. This post is the case for it, and the proof, benchmarked against the fastest knowledge engines in the field.

(These are measured results, and the places a competitor wins are called out as plainly as the places we win.)

TL;DR: the non-technical version.

The problem. Today's AI makes things up. It's fluent with words but has no model of the world to check itself against, and no honest way to say "I don't know."
The fix. Put a reasoning layer underneath it: a system that holds a structured model of your world, works out the consequences of what it knows, and shows its working for every answer.
What ReasoningLayer is. One engine that stores your knowledge and reasons over it, so it doesn't only find facts, it derives the ones that follow but nobody wrote down, and catches contradictions.
Why it's different. In head-to-head tests it's the fastest against six comparable engines and one of the few that reason completely and correctly; it holds a full model of the world (most can't); it learns the instant you tell it something (no retraining); it can admit when it isn't sure instead of guessing; and your data stays on your own infrastructure.
Kept honest. Every result is cross-checked against other independent engines, we name where a competitor wins, and we even write up a bug the tests caught in our own engine.
Bottom line. The "context graph" everyone is funding helps AI find the right facts; a reasoning layer is what lets it be right, and prove it. An answer you can check is an answer you can trust.

The rest of this post is the full case, with the numbers, but if you're not technical, the everyday examples just below and the plain-English boxes throughout carry the whole argument on their own.

Two viral examples make the failure concrete. Ask a chatbot: "The car wash is 50 metres from my house. I want to wash my car. Should I walk or drive?" The honest answer is drive: the car is the thing being washed, so the car has to get there. Yet this question went viral in 2025 precisely because so many models answered walk: they'd matched the pattern "short distance + how to get there = suggest walking" and missed the one fact a child has, that you can't wash the car you left at home. (The bigger "thinking" models often get it right now; plenty of smaller ones still don't.) The old surgeon riddle fails the same way from the other side: a father and son crash, the father dies, the surgeon says, "I can't operate, he's my son." Tweak it so the answer is handed over outright (the surgeon is the boy's father), or make it plainly absurd, and leading models still blurt the memorized twist, "the surgeon is his mother," overriding a fact stated in the very same sentence.

Same root cause both times: the model is completing text, not consulting a model of the situation. Give it a world model and neither error survives. "The goal is to wash the car" plus "a car being washed has to be at the car wash" forces drive; "this surgeon is the father" plus "nobody is both the father and the mother of the same child" makes "the surgeon is his mother" a flat contradiction the engine refuses rather than narrates. That gap, between predicting words and reasoning over an explicit model of the world, is the whole subject of this post.

Why this matters now: ontologies are how you stop AI from making things up

That failure is not a rough edge that more training will sand down. A growing body of work treats it as a structural limit of how LLMs work. The leading remedy is to stop letting the model answer from memory alone and instead ground it in a structured, verifiable knowledge layer, a knowledge graph backed by an ontology (the formal rulebook of a domain). The evidence and momentum are real:

It measurably works. A widely-cited data.world benchmark (Sequeda et al., 2023) found that backing an LLM with a knowledge graph made it ~3× more accurate at answering real business questions (≈54% vs ≈16% on raw data). They published it as "a technique for making LLMs 3× more accurate."
The big labs are building on it. Microsoft Research's GraphRAG (open-sourced mid-2024, 20k+ GitHub stars) builds a graph of LLM-extracted facts to answer questions naive retrieval can't, though, tellingly, it's graph-shaped retrieval, not an ontology with a reasoner: it surfaces connected facts, it doesn't compute the ones that logically follow (a distinction this whole post turns on).
The research consensus is forming. Surveys like "Can Knowledge Graphs Reduce Hallucinations in LLMs?" (NAACL 2024) and the roadmap "Unifying Large Language Models and Knowledge Graphs" (IEEE TKDE 2024) frame KGs + ontologies as the structured-knowledge complement to the LLM's fluent-but-unreliable memory.
Analysts see the shift. Gartner has projected that graph technologies will be used in 80% of data-and-analytics innovations by 2025, up from 10% in 2021.
And it's moving into agents. 2025–2026 work on ontology-constrained neurosymbolic agents grounds enterprise AI agents in an explicit ontology precisely to enforce correctness and cut hallucination at the reasoning level.

But here's the catch this post is about. An ontology is only as useful as the engine that can actually compute its consequences, completely, and fast enough to use in production. Plenty of systems claim to "use a knowledge graph" while either (a) just storing facts and never reasoning, or (b) reasoning so incompletely or so slowly that it's useless at scale. So the real question isn't "do you have a knowledge graph?" It's "does your triplestore reason correctly, does it scale, and can it even hold a full world model?" The first two we measure head-to-head in the three benchmarks below (spoiler: most engines fail one half or the other); the third, carrying a real world model like SUMO that needs more than binary triples, is where ReasoningLayer's design departs from the field, and we come back to it right after the benchmarks.

First, the plain words: triplestore, ontology, reasoner

A fact ("triple"). Knowledge graphs store information as millions of tiny three-part facts, subject → relationship → value:

Christopher Nolan → directed → Inception.

A triplestore. The database that holds those facts and answers questions by pattern: ask "Christopher Nolan → directed → ?film" and it returns everything that fits. (The query language is SPARQL, the SQL of fact-webs.) This is what's behind the answer box when you Google "films directed by Christopher Nolan."
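As a rough illustration of pattern matching over triples, here is a minimal Python sketch. The in-memory fact list and the `match` helper are assumptions made up for this example, not any engine's API; a real triplestore answers the same kind of pattern through SPARQL and indexes.

```python
# Toy triple store: each fact is (subject, relationship, value).
# FACTS and match() are illustrative only, not a real engine's API.
FACTS = [
    ("Christopher Nolan", "directed", "Inception"),
    ("Christopher Nolan", "directed", "Interstellar"),
    ("Denis Villeneuve", "directed", "Arrival"),
]

def match(facts, subject=None, relationship=None, value=None):
    """Return every fact that fits the pattern; None acts as a wildcard."""
    return [
        (s, r, v)
        for (s, r, v) in facts
        if subject in (None, s)
        and relationship in (None, r)
        and value in (None, v)
    ]

# Pattern: Christopher Nolan -> directed -> ?film
films = [v for (_, _, v) in match(FACTS, "Christopher Nolan", "directed")]
print(films)
```

Nothing here is derived: the store can only return facts that were literally written down, which is exactly the limitation a reasoner removes.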

An ontology: the real definition. An ontology is the formal, machine-readable rulebook of a domain: what kinds of things exist, how they relate, and, the part that matters, the logical rules that say what must be true. It is not the data. In a university ontology:

the data is "Bob is a graduate student", "Bob takes CS101";
the ontology is the rulebook: "every graduate student is a student", "being part of something is transitive", "a faculty member is anyone who teaches a course."

Ontologies are written in OWL (the W3C Web Ontology Language). Both reasoning benchmarks below ship a real OWL ontology of a university. Think of an ontology as machine-readable common sense for a domain.

A reasoner. The engine that reads those OWL rules and works out every fact they imply, facts nobody wrote down. A plain triplestore only finds facts that are literally stored; a reasoner also returns the ones that logically follow. That difference is the whole game, and it's what Tests 2 and 3 measure. (OWL has a standard efficient profile, OWL 2 RL, designed so a machine can compute all the implied facts completely and at scale, the target a serious reasoner aims for. Test 3 is exactly that profile.)
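To make the difference concrete, here is a small Python sketch of a reasoner as naive forward chaining: apply the rules over and over until nothing new appears. The two rules, the sample facts and the function names are assumptions for illustration; production reasoners use far more efficient strategies and the full OWL 2 RL rule set.

```python
# Stored facts (the data) as (subject, relationship, value) triples.
facts = {
    ("Bob", "type", "GraduateStudent"),
    ("Bob", "takes", "CS101"),
    ("Ann", "teaches", "CS101"),
    ("CS101", "type", "Course"),
}

# The rulebook (the ontology), reduced to two rule kinds for illustration.
SUBCLASS_OF = {"GraduateStudent": "Student", "Student": "Person", "Faculty": "Person"}

def apply_rules(known):
    """One pass: return the facts implied by `known`."""
    implied = set()
    for (s, r, v) in known:
        # Sub-class: every A is also a B.
        if r == "type" and v in SUBCLASS_OF:
            implied.add((s, "type", SUBCLASS_OF[v]))
        # Some-values-from: teaching at least one course makes you Faculty.
        if r == "teaches" and (v, "type", "Course") in known:
            implied.add((s, "type", "Faculty"))
    return implied

def materialise(stored):
    """Repeat the rules until no new fact appears (a fixpoint)."""
    known = set(stored)
    while True:
        new = apply_rules(known) - known
        if not new:
            return known
        known |= new

closure = materialise(facts)
students = sorted(s for (s, r, v) in closure if r == "type" and v == "Student")
print(students)
```

A plain lookup over `facts` finds no Student at all; the same lookup over `closure` finds Bob, and also learns that Ann is Faculty although nobody labelled her one.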

Throughout, our measure of correctness is never "we said so"; it's multiple independent engines producing the byte-for-byte identical answer. When seven separate programs agree, the answer is right.

How it works, in one picture

Here's the whole system on one page, every piece the rest of this post unpacks, and how they fit. You bring four things: your data, your rules and policy, the domain ontologies (OWL / SHACL), and a world model (SUMO). ReasoningLayer unifies them into one model, reasons over it to work out everything that logically follows, refusing to guess when it genuinely can't know, and serves an LLM or agent answers that arrive with their proof. Governance, history, and your data staying on your own infrastructure run underneath all of it.

The whole engine on one page: your data and rules unify into one model, ReasoningLayer reasons over it, and an LLM gets grounded, provable answers, with governance, history and your data under your control throughout.

Read it left to right and it's also the arc this post keeps coming back to: your data becomes knowledge (unified and reasoned), knowledge becomes insight (the facts nobody wrote down, derived and checked), and insight becomes answers you can trust (grounded, proven, and honest about their limits). The rest of the post zooms into each part, first the benchmarks that prove the reasoning is complete and fast, then the world model, then the trust properties.

Test 1, Speed: can it find facts fast? (WatDiv)

📖 In plain English

Picture a huge online-shopping dataset, about 11 million facts like "User 42 bought Product 9", "Product 9 is made by Retailer 7", "User 42 rated Product 9 five stars." A realistic question stitches several facts together:

"Find every user who bought a product from a retailer in this region and rated it 4 stars or higher."

WatDiv is 400 such questions, some tiny ("what's this one product's price?"), some sweeping ("dump every matching purchase in the catalogue"). No reasoning here: every fact is already written down. This purely tests how fast the engine finds and joins them, and whether all engines agree on the answer.

The result

Correctness: all 400 / 400 questions return the identical answer across all seven engines.

Speed: the real milliseconds. Median time to answer a question, by difficulty class. WatDiv's 20 query templates (each run many times) group into selective joins (linear / star / snowflake) and "complex" (large result sets):

Query class	ReasoningLayer	Virtuoso	GraphDB	RDF4J	Oxigraph	Fuseki	QLever
Linear (5)	1.15 ms	1.02 ms	1.56 ms	2.54 ms	2.27 ms	4.13 ms	2.95 ms
Star (7)	1.04 ms	0.73 ms	1.74 ms	1.90 ms	1.78 ms	2.72 ms	3.35 ms
Snowflake (5)	1.07 ms	0.99 ms	1.81 ms	1.64 ms	1.98 ms	2.42 ms	5.62 ms
Complex (3)	1.03 ms	5.42 ms	35.95 ms	45.69 ms	246.99 ms	98.68 ms	20.41 ms

Read the bottom row: on the hard questions ReasoningLayer answers in ~1 ms while the others take 5 to 247 ms. On the easy ones everyone is around a millisecond. Summed across all 20 templates, ReasoningLayer spends 33 ms total, against GraphDB's 422 ms, RDF4J's 594 ms, QLever's 638 ms, Virtuoso's 1,610 ms, Fuseki's 7,602 ms and Oxigraph's 16,347 ms.

The single clearest case is "C3", whose answer is 440,398 rows ("every purchase matching a broad filter"). It is the gap between an instant result and a spinning loader:

Engine	Time to return 440,398 rows
ReasoningLayer	12.8 ms (a blink)
GraphDB	334 ms
QLever	452 ms
RDF4J	487 ms
Virtuoso	1.58 s
Fuseki	7.4 s
Oxigraph	16.0 s
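For readers who prefer ratios to raw times, the following Python snippet turns the C3 table into "times slower than ReasoningLayer". The numbers are copied from the table (with seconds converted to milliseconds); only the arithmetic is new, and the results are approximate because the table values are rounded.

```python
# Times to return the 440,398-row C3 result, in milliseconds (from the table).
c3_ms = {
    "ReasoningLayer": 12.8,
    "GraphDB": 334.0,
    "QLever": 452.0,
    "RDF4J": 487.0,
    "Virtuoso": 1580.0,
    "Fuseki": 7400.0,
    "Oxigraph": 16000.0,
}

baseline = c3_ms["ReasoningLayer"]
ratios = {engine: ms / baseline for engine, ms in c3_ms.items()}
for engine, ratio in ratios.items():
    print(f"{engine:15s} {ratio:8.1f}x")
```

On this one query the gap runs from roughly 26× (GraphDB) to roughly 1,250× (Oxigraph).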

Where we don't win, honestly. You can see it in the table above: on the selective classes Virtuoso is a hair faster than ReasoningLayer, star joins 0.73 ms vs our 1.04 ms, and on those tiny lookups every fast engine is already quicker than the eye. Fair summary: ReasoningLayer and Virtuoso tie for the lead on trivial lookups (Virtuoso edges it); ReasoningLayer is alone in front on everything substantial: complex queries in 1.03 ms against Virtuoso's 5.42 ms and up.

The rules a reasoner has to apply, in everyday terms

Tests 2 and 3 check whether an engine can derive unwritten facts by applying an ontology's rules. The table below is every rule kind those tests (18 OWL2Bench questions, plus LUBM) exercise, nothing left out, each with a plain-English example. You don't need to memorise it: skim the right-hand column for the feel of it and read on. The only thing that matters is that a complete reasoner has to handle every one of these, and, as the results show, most engines don't.

Rule (OWL term)	What it means	Everyday example
Sub-class (is a kind of)	every A is also a B	A poodle is a kind of dog → every poodle is a dog. So "list the dogs" must include the poodles.
Sub-property (is a kind of relationship)	one relationship implies another	"is the mother of" is a kind of "is a parent of" → if Mary is Tom's mother, she's his parent.
Transitive (chains step by step)	if A→B and B→C, then A→C	The café is inside the building; the building is in Paris → the café is in Paris. The engine must follow the chain, however long.
Inverse (the other way round)	if A relates to B, B relates back to A (different name)	If Alice wrote the book, the book was written by Alice, even if only one direction was recorded.
Symmetric (mutual)	if A relates to B, then B relates to A (same name)	If Alice is married to Bob, Bob is married to Alice.
Asymmetric (one-way only)	if A relates to B, B cannot relate back to A	If Acme is a subsidiary of Globex, Globex cannot be a subsidiary of Acme.
Irreflexive (never to itself)	nothing relates to itself this way	Nobody is their own academic advisor.
Property chain (a shortcut over several steps)	one step then another implies a third	Enrolled in a department that's part of a university → you're a member of that university (and of every level in between).
Equivalence (symmetric and transitive)	a relation that quietly groups everyone into clusters	"shares a home town with", it ripples across everyone from the same town, turning a few links into a huge web of pairs.
Some-values-from (∃, defined by something you do)	doing X at least once makes you a Y	Anyone who teaches at least one course is a faculty member, even if nobody labelled them one.
All-values-from (∀, defined by what's true of all your links)	a category whose every link must be of a kind	A women's college is one where every enrolled student is a woman.
Max-cardinality (at most N)	membership capped by a count	A "light-load" student is one who takes at most N courses.
Has-value (a fixed value)	membership by one specific value	Anyone whose favourite cricket format is "T20" is a T20 fan.
Complement (the opposite category)	being NOT in one category puts you in another	A "non-science" discipline is exactly one that is not a science.
Union (either kind)	being either A or B makes you a C	A parent is a mother or a father → any mother is a parent.
Functional (at most one)	a thing can have only one value here	You have only one birth mother; if two names appear, they must be the same person.
Inverse-functional (a unique identifier)	a value pins down one thing	Two accounts with the same passport number belong to the same person.

A reasoner that handles these returns the complete, correct answer. One that doesn't silently returns a shorter, wrong answer, and looks "fast" only because it skipped the work.

Two honest nuances. (1) A few of these are consistency rules rather than fact-generators: asymmetric and irreflexive, for instance, mainly let the engine catch a contradiction (flag if the data ever claims Bob advises himself) rather than add new members. (2) At this dataset's scale a handful of the "category" rules (complement, all-values-from, max-cardinality, has-value) legitimately derive no new members: the correct answer is an empty list, and the check there is that the complete reasoners agree it's empty (a sloppy reasoner could wrongly invent members). Both still exercise the engine: it has to apply the rule correctly to get the right answer, empty or not.
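The first nuance, rules that catch contradictions instead of adding facts, can be sketched in a few lines of Python. The property names, sample data and the `violations` helper are all hypothetical; the sketch only shows the shape of the check, not how any benchmarked engine reports inconsistencies.

```python
# Illustrative data: (subject, relationship, value) triples.
facts = {
    ("Acme", "subsidiaryOf", "Globex"),
    ("Globex", "subsidiaryOf", "Acme"),   # violates asymmetry
    ("Bob", "advisedBy", "Bob"),          # violates irreflexivity
    ("Alice", "advisedBy", "Carol"),
}

ASYMMETRIC = {"subsidiaryOf"}
IRREFLEXIVE = {"advisedBy"}

def violations(known):
    """Return readable contradictions instead of deriving new facts."""
    found = []
    for (s, r, v) in sorted(known):
        if r in IRREFLEXIVE and s == v:
            found.append(f"irreflexive: {s} {r} {v}")
        # Report each asymmetric clash once (s < v picks one direction).
        if r in ASYMMETRIC and (v, r, s) in known and s < v:
            found.append(f"asymmetric: {s} {r} {v} and {v} {r} {s}")
    return found

for problem in violations(facts):
    print(problem)
```

Clean data yields an empty list, which is the "correct answer is empty" case from the second nuance: the rule was still applied, it just had nothing to flag.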

Test 2, Reasoning: does it work out the unwritten facts? (LUBM)

📖 In plain English

A university database records "Bob is a GraduateStudent." It does not record "Bob is a Student." You know instantly that every graduate student is a student; that's the sub-class rule from the table above. Ask "list all Students" and there are two kinds of database:

a reasoning engine answers "…including Bob, because he's a graduate student", the complete list;
a non-reasoning engine returns only people literally tagged "Student" and silently drops Bob, and thousands like him.

LUBM is 14 university questions that need exactly these rules, sub-class (GraduateStudent → Student), transitive ("the AI Lab is part of the CS Dept, which is part of the University → the AI Lab is part of the University"), inverse ("got a degree from" ⇄ "has alumnus"), and a couple of "defined by what you do" restrictions.
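The transitive and inverse rules named above can be sketched the same way. This is a hedged Python illustration with made-up property names (`partOf`, `degreeFrom`, `hasAlumnus`) and a deliberately naive loop; it is not LUBM's actual vocabulary or any engine's algorithm.

```python
# Stored facts only; names are illustrative stand-ins for LUBM-style data.
facts = {
    ("AI Lab", "partOf", "CS Dept"),
    ("CS Dept", "partOf", "University"),
    ("Bob", "degreeFrom", "University"),
}

TRANSITIVE = {"partOf"}
INVERSE_OF = {"degreeFrom": "hasAlumnus", "hasAlumnus": "degreeFrom"}

def close_under_rules(stored):
    """Apply the inverse and transitive rules until nothing new is derived."""
    known = set(stored)
    while True:
        new = set()
        for (s, r, v) in known:
            # Inverse: record the same link in the other direction.
            if r in INVERSE_OF:
                new.add((v, INVERSE_OF[r], s))
            # Transitive: s -> v and v -> v2 gives s -> v2.
            if r in TRANSITIVE:
                for (s2, r2, v2) in known:
                    if r2 == r and s2 == v:
                        new.add((s, r, v2))
        if new <= known:
            return known
        known |= new

closure = close_under_rules(facts)
print(("AI Lab", "partOf", "University") in closure)
print(("University", "hasAlumnus", "Bob") in closure)
```

Both derived facts are absent from the stored data, so an engine without these rules would return a shorter answer to "which groups belong to the University?" and "who are its alumni?".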

The result

Engine	Reasoning	Correct & complete
ReasoningLayer	yes	14 / 14
GraphDB 10.8	yes	14 / 14
Fuseki / Jena	yes	14 / 14
RDF4J	partial	6 / 14
Virtuoso 7.2	partial	5 / 14, and one answer is wrong (over-counts)
Oxigraph	none	4 / 14 (only questions needing no reasoning)
QLever	none	4 / 14 (only questions needing no reasoning)

Only three engines reason completely, ReasoningLayer, GraphDB, Fuseki, and all three return the identical answers (that agreement is the gold standard). The four at the bottom look fast only because they leave Bob out. Among the three that are actually correct, here's the real time per question, in milliseconds (warm median):

Question	ReasoningLayer	GraphDB	Fuseki / Jena
Q1	1.21 ms	1.77 ms	4.24 ms
Q2	1.10 ms	1.66 ms	1,528.95 ms
Q3	0.93 ms	1.56 ms	4.33 ms
Q4	1.14 ms	1.89 ms	25.26 ms
Q5	1.41 ms	3.02 ms	112.53 ms
Q6	7.17 ms	11.80 ms	24.05 ms
Q7	1.13 ms	1.90 ms	17.02 ms
Q8	11.98 ms	31.61 ms	2,680.98 ms
Q9	2.90 ms	9.26 ms	5,271.97 ms
Q10	1.03 ms	1.19 ms	3.29 ms
Q11	0.99 ms	1.44 ms	38.42 ms
Q12	1.00 ms	1.23 ms	126.34 ms
Q13	1.00 ms	1.02 ms	2.63 ms
Q14	5.40 ms	8.45 ms	18.91 ms
all 14	38 ms	78 ms	9,859 ms

ReasoningLayer is fastest on every one of the 14, all of them in 38 ms total, versus GraphDB's 78 ms and Fuseki's 9,859 ms (Q9 alone: 2.9 ms vs Fuseki's 5.3 seconds). All three return the same answers; ReasoningLayer just gets there first.
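The totals and the "fastest on every one" claim can be re-derived from the per-question table. The Python check below copies the published numbers and does nothing else; variable names are ours.

```python
# Warm-median times per LUBM question (ms), copied from the table above.
times_ms = {
    "ReasoningLayer": [1.21, 1.10, 0.93, 1.14, 1.41, 7.17, 1.13,
                       11.98, 2.90, 1.03, 0.99, 1.00, 1.00, 5.40],
    "GraphDB":        [1.77, 1.66, 1.56, 1.89, 3.02, 11.80, 1.90,
                       31.61, 9.26, 1.19, 1.44, 1.23, 1.02, 8.45],
    "Fuseki / Jena":  [4.24, 1528.95, 4.33, 25.26, 112.53, 24.05, 17.02,
                       2680.98, 5271.97, 3.29, 38.42, 126.34, 2.63, 18.91],
}

# Sum each column and round to whole milliseconds.
totals = {engine: round(sum(ms)) for engine, ms in times_ms.items()}
print(totals)

# Is ReasoningLayer strictly faster than both others on every question?
fastest_everywhere = all(
    rl < min(g, f) for rl, g, f in zip(*times_ms.values())
)
print(fastest_everywhere)
```

The sums come out at 38 ms, 78 ms and 9,859 ms, matching the "all 14" row, and the per-question comparison holds for all fourteen rows.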

What each of the 14 LUBM questions actually asks

#	What it asks (plain English)	What it needs
Q1	Graduate students taking one specific graduate course	nothing, a plain lookup
Q2	Grad students who belong to a department of a university they hold an undergrad degree from (a triangle)	a big join, no reasoning (no matches in this dataset)
Q3	Every publication by one assistant professor	a plain lookup
Q4	Everyone who is a professor, of any rank	sub-class (full / associate / assistant… are all professors)
Q5	Everyone who is a person, student, faculty, staff…	sub-class
Q6	Every student, including anyone who takes a course	sub-class + "defined by what you do"
Q7	Students taking a course taught by one associate professor	sub-class
Q8	Students in a department, with their email	sub-class
Q9	Students whose advisor teaches a course they take (a triangle)	sub-class, the heaviest query
Q10	Students taking one specific graduate course	sub-class (only grad students take grad courses)
Q11	Research groups that belong, directly or indirectly, to one university	transitive
Q12	The chairs of departments at one university	transitive + sub-property + a restriction, the hardest
Q13	People who are alumni of one university	inverse + sub-class
Q14	Every undergraduate student	a plain lookup, the largest simple result

"14 / 14" means the engine returned the complete, correct answer to all fourteen. The four engines that reason only partially or not at all score lower because they skip some or all of the sub-class, transitive and inverse steps, and silently drop the answers those steps would add.

Test 3: The hard rules (OWL2Bench, the OWL 2 RL profile)

📖 In plain English

LUBM uses the easy rules. OWL2Bench uses the genuinely hard ones, and far fewer engines survive. Two real examples from the benchmark:

The membership chain (where we found a bug in our own engine). The database knows "Alice is enrolled in the CS Department" and "the CS Department is part of Star University", but nobody wrote "Alice is a member of Star University." The property-chain rule says: enrolled in something that's part of an organization → member of that organization. So the engine must derive Alice's membership, and, because "is part of" is itself transitive (department → college → university), at every level up. One student belongs to several organizations at once, so the engine must keep all of them.

The home-town explosion (where a competitor fails entirely). "Shares a home town with" is symmetric and transitive, an equivalence rule. If Alice shares a town with Bob, and Bob with Carol, then Alice shares with Carol, both directions. A few thousand people quietly blow up into 1.3 million pairwise facts the engine must compute.
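Why a few thousand people turn into over a million facts is easy to see with a small Python sketch. It groups people into clusters with union-find and counts the ordered pairs a symmetric, transitive relation implies. One assumption to flag: the sketch counts self-pairs too (from a~b and b~a, transitivity gives a~a); the post does not say exactly how the benchmark's 1,311,932 rows break down, so treat this as an illustration of the quadratic growth, not a reproduction of that number.

```python
from collections import Counter

def find(parent, x):
    """Union-find lookup with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def derived_pair_count(links):
    """Count ordered pairs implied by a symmetric + transitive relation."""
    parent = {}
    for a, b in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(parent, a)] = find(parent, b)
    sizes = Counter(find(parent, x) for x in parent)
    # Within a cluster of n people everyone pairs with everyone,
    # including themselves, so each cluster contributes n * n pairs.
    return sum(n * n for n in sizes.values())

links = [("Alice", "Bob"), ("Bob", "Carol")]
print(derived_pair_count(links))  # 9 pairs from 2 stored links
```

A single town of 1,000 people linked in a chain (999 stored links) already yields 1,000,000 derived pairs, which is why this rule family is where incomplete or naive reasoners give up.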

OWL2Bench is 18 questions like these. The reference answer is whatever the two complete OWL-reasoners, ReasoningLayer and the industry-standard GraphDB, agree on; they agree on all 18.

The result

Engine	Reasoning	Correct & complete (vs reference)
ReasoningLayer	full OWL 2 RL	18 / 18
GraphDB 10.8	full OWL 2 RL	18 / 18 (the reference)
Jena / Fuseki	basic only	11 / 18
Oxigraph	none	8 / 18
Jena / Fuseki	strongest setting	never finishes

ReasoningLayer returns the byte-identical answer to GraphDB on all 18 questions, and is faster on 16 of them (~2× on average). The two where GraphDB ties or edges us are a plain single-fact lookup and one big four-way join, named, not hidden. On the 1.3-million-pair home-town question, ReasoningLayer is the faster of the two (1.4 s vs 1.8 s).

What each of the 18 OWL2Bench questions tests

#	The rule it tests (from the table above)	Correct answer (rows)
Q2	Property chain	7,421
Q3	Transitive	55
Q4	Functional (data)	2,486
Q5	Has-value	0
Q7	Inverse	1,684
Q8	Asymmetric	6
Q9	Complement	0
Q10	Symmetric	666
Q11	Irreflexive	2,422
Q12	Union	2,494
Q13	All-values-from	0
Q14	Max-cardinality	0
Q15	Inverse-functional	21
Q16	Functional (object)	21
Q19	Some-values-from	858
Q20	Equivalence, "shares a home town"	1,311,932
Q21	Multi-hop join + property chain	145
Q22	Multi-hop join + role	106

Each row maps to a rule in the table earlier in this post. "18 / 18" means an engine got every one right, including the four whose correct answer is 0 rows (where a sloppy reasoner would invent members), and Q20, where "shares a home town" explodes into 1.3 million pairs.

Apache Jena is the cautionary tale: its basic reasoner finishes but reaches only 11/18 (no chains, no transitive/symmetric properties, no equivalence). Its strongest reasoner, on the same data, never returns even the first answer: it gets stuck forever on the home-town explosion. (Notably Jena handles LUBM fine; it's this harder family of rules that breaks it, which is the whole point about "reasoners that actually scale.")

We eat our own dog food

The most trustworthy thing we can say is that this benchmark caught a real bug in ReasoningLayer, and we fixed it before writing this. On the membership-chain question, ReasoningLayer first reported 2,486 memberships; GraphDB reported 7,421. We didn't paper over the gap, we found we were keeping only one membership per student instead of the whole chain, rebuilt that part of the engine, and now match GraphDB exactly. After the fix, the full safety net is green:

OWL2Bench (hard reasoning): 18/18, identical to GraphDB
LUBM (basic reasoning): 14/14, still identical to GraphDB
W3C SPARQL conformance: 630/630 (SPARQL 1.1) and 267/267 (SPARQL 1.2), ReasoningLayer passes the industry standard in full, including the newer SPARQL 1.2 / RDF-star suite (quoted triples, directional language strings, the lot), not just our own benchmarks
64 reasoning unit tests (8 added for the fix)

A benchmark you can't fail isn't a benchmark.
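For the curious, the membership-chain rule and the undercount described above can be sketched in Python. Everything here is illustrative: the organization names, the single-parent `part_of` map (assumed acyclic), and especially `memberships_buggy`, which only mimics the symptom "one membership per student" and is not ReasoningLayer's actual code or its actual bug.

```python
# Illustrative facts for the membership chain.
enrolled_in = {"Alice": "CS Department"}
part_of = {  # assumed acyclic, one parent per organization
    "CS Department": "Engineering College",
    "Engineering College": "Star University",
}

def ancestors(org):
    """Every organization `org` is part of, directly or transitively."""
    chain = []
    while org in part_of:
        org = part_of[org]
        chain.append(org)
    return chain

def memberships(student):
    """Property chain: enrolled in X, X part of Y => member of Y, at every level."""
    return ancestors(enrolled_in[student])

def memberships_buggy(student):
    """Keeps a single membership per student, so it undercounts."""
    return memberships(student)[:1]

print(memberships("Alice"))
print(memberships_buggy("Alice"))
```

Both functions return plausible-looking answers; only comparing the counts against an independent engine reveals that one of them is incomplete.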

Beyond OWL: one substrate for RDF, OWL, SHACL, and SUMO

The benchmarks above prove ReasoningLayer does the standard reasoning correctly and fast. But the reason we built a new engine from scratch goes further, and it's the part that matters most for fixing AI hallucination.

We've leaned on the phrase world model throughout; here's what the most complete one actually looks like. A real world model spells out what kinds of things exist, how they relate, what's possible and what's contradictory. The broadest publicly available one is arguably SUMO, the Suggested Upper Merged Ontology: a top-level model of agents, processes, objects, time, and causation, the commonsense scaffolding a domain ontology hangs from.

Here's the representational gap. SUMO is written in SUO-KIF, a language that is higher-order and n-ary: its relations take any number of arguments, and it states general rules. Real-world facts are like that: "John bought a used car from Mary for $5,000 on 3 March" is one fact with five roles (buyer, seller, item, price, date). RDF and OWL are built on binary relations: the native unit is subject → relation → value. RDF and OWL can certainly encode an n-ary purchase, but they do it with a pattern: introduce an artificial "Purchase #57" node, then attach the buyer, seller, item, price and date as separate triples. That works, and it is the standard W3C pattern for n-ary relations, but it means the n-ary relation is no longer the native unit the engine stores, queries and proves over. And when SUMO is translated into the auto-generated SUMO.owl export, the n-ary relations and general rules are necessarily approximated or dropped because OWL object properties are binary. You get a useful OWL shadow of the world model, not the full SUO-KIF model.

Every engine we benchmarked is an RDF/OWL triplestore (GraphDB, Virtuoso, Fuseki/Jena, RDF4J, Oxigraph, QLever). They can store n-ary relations through RDF/OWL encodings, but they do not hold them as native n-ary facts, and they do not load full SUO-KIF SUMO as the object they reason over. Their native substrate is binary by construction.

ReasoningLayer is built on a different substrate. Its core is a typed term, a Ψ-term: a thing of a type (drawn from a multiple-inheritance type lattice) carrying named roles, which is natively n-ary. That one formalism is expressive enough to carry all the standards at once, in one engine:

Standard	What it is	In ReasoningLayer
RDF	the triple data model	native (the binary case is just a 2-role term)
OWL / OWL 2 RL	ontology rules + reasoning	native + reasoned (the benchmarks above)
SHACL	data-shape validation	native (validate + repair)
SUMO / SUO-KIF	the n-ary, higher-order world model	imported directly, a relation `(rel a₁ … aₙ)` becomes one Ψ-term with roles `arg1 … argₙ`; its Horn rules fire in the same reasoner the benchmarks used; the higher-order/temporal tail is preserved, not dropped

So the n-ary purchase above is one Ψ-term, Buy(buyer: John, seller: Mary, item: car, price: 5000, date: 3-March), faithful and directly queryable, and SUMO's case-role commonsense (agent, patient, instrument, destination…) lands the same way.
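As a rough mental model only, a typed term with named roles and subtype-aware matching might look like the Python sketch below. The `Term` class, the tiny type hierarchy and `matches` are our own illustrative stand-ins; they are not ReasoningLayer's data structures or RLQL syntax, and real Ψ-terms also carry variables and coreference, which this sketch leaves out.

```python
from dataclasses import dataclass, field

# Illustrative type hierarchy (child -> parents); multiple parents are allowed.
PARENTS = {
    "Buy": ["Transaction"],
    "Transaction": ["Process"],
    "Process": [],
}

def is_subtype(t, ancestor):
    """True if t is `ancestor` or inherits from it."""
    return t == ancestor or any(is_subtype(p, ancestor) for p in PARENTS.get(t, []))

@dataclass
class Term:
    """A typed term with named roles: a rough stand-in for a Ψ-term."""
    type: str
    roles: dict = field(default_factory=dict)

def matches(pattern, term):
    """A pattern matches if its type subsumes the term's and its roles agree."""
    return is_subtype(term.type, pattern.type) and all(
        term.roles.get(name) == value for name, value in pattern.roles.items()
    )

buy = Term("Buy", {"buyer": "John", "seller": "Mary", "item": "car",
                   "price": 5000, "date": "3-March"})

# "Any transaction where Mary is the seller" finds the purchase.
print(matches(Term("Transaction", {"seller": "Mary"}), buy))
```

The query names a supertype and a single role, and the whole five-role fact comes back as one object, with no intermediate node to join through.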

To be clear: this is "and", not "instead of". SUMO does not replace OWL; you need both, as complementary layers of one stack:

SUMO (the foundational, n-ary world model: agents, processes, time, causation) → OWL / OWL 2 RL (the interoperable domain ontologies + fast, decidable, complete reasoning, the layer the world actually publishes data in, and the layer our benchmarks prove) → SHACL (validation) → your data.

OWL is essential and irreplaceable: it's the W3C standard every dataset, vocabulary and enterprise graph speaks, and OWL 2 RL is the rare sweet spot that is both expressive and complete at scale (that's what the 14/14 and 18/18 results above are). SUMO adds the breadth OWL does not natively carry, the n-ary upper model, but SUMO's full logic isn't decidable in general, so it's the grounding layer, not the fast-reasoning workhorse. The point isn't SUMO over OWL; it's that ReasoningLayer runs the whole stack on one substrate, so you never have to trade OWL's standards-and-speed against SUMO's world-model reach.

Why one substrate, and why from scratch in Rust. The usual stack bolts a reasoner onto a triplestore, or runs a different engine per standard, which is how you end up either expressive or fast, never both. ReasoningLayer puts RDF, OWL, SHACL and SUMO on one formalism: a type lattice, Ψ-terms with named roles, and residuation (the engine suspends on what it can't yet decide instead of guessing). That's what lets it be standards-complete and fast at the same time, and it's why the engine was written entirely from scratch in Rust rather than layered on an existing RDF store: an engine whose only native object is a triple is the wrong substrate for a native n-ary world model.
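Residuation, suspending on what can't yet be decided, is the least familiar idea in that list, so here is a toy Python sketch of the behaviour. The `Store` class and its `ask`/`tell` methods are hypothetical names for this example; the sketch shows the suspend-then-wake pattern in general and makes no claim about how ReasoningLayer implements it.

```python
class Store:
    """Toy knowledge base where a goal can wait for a missing value."""

    def __init__(self):
        self.values = {}      # key -> known value
        self.suspended = {}   # key -> callbacks waiting for that key

    def ask(self, key, on_known):
        """Run on_known(value) now if the value is known; otherwise suspend."""
        if key in self.values:
            on_known(self.values[key])
            return "answered"
        self.suspended.setdefault(key, []).append(on_known)
        return "suspended"    # neither a guess nor a failure

    def tell(self, key, value):
        """Record a new fact and wake every goal that was waiting on it."""
        self.values[key] = value
        for callback in self.suspended.pop(key, []):
            callback(value)

store = Store()
results = []
status = store.ask(("patient42", "blood_type"), results.append)
print(status, results)          # the goal waits; nothing was invented
store.tell(("patient42", "blood_type"), "O+")
print(results)                  # the waiting goal ran once the fact arrived
```

The design choice worth noticing is the third outcome: besides "yes" and "no" there is "not yet", and the question stays alive until the data can settle it.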

Here is that substrate with the real benchmark inside it, not a schematic, but ReasoningLayer's own visualisation, rendered live in your browser by nodus (our graph engine) from the LUBM dataset loaded into the reasoner. The type lattice is the university ontology it actually computes over, GraduateStudent under Student under Person, Department and University under Organization. The term hypergraph is one graduate student held as a single n-ary Ψ-term: every named role (advisor, takesCourse, memberOf, degreeFrom, undergraduateDegreeFrom…) is a spoke of the same fact, each polygon marks one Ψ-term, and where two roles point at the same thing the engine keeps a single shared node. RDF can encode that shape; here it is the native object the reasoner manipulates.

Interactive view (rendered by nodus): each Ψ-term is one n-ary fact; the labelled links are its named roles, and two roles landing on the same term is coreference.

Live, in your browser, ReasoningLayer's own visualisation (rendered by nodus) of the real LUBM benchmark loaded into the engine: one graduate student held as a single n-ary Ψ-term (its named roles drawn as a polygon), and the type lattice the engine reasons over.

That is the real pitch behind the numbers: to ground AI and break hallucination you need a true world model (SUMO + OWL + your data), and a layer that can both hold it faithfully and reason over it at speed, not a passive store, but the reasoning layer the AI stack is missing. That is what ReasoningLayer is built to be.

(Scope note, honestly: this section is about representational power, not a head-to-head benchmark. There is no standard multi-engine SUMO benchmark because the other engines can't load it natively. The OWL/RDF reasoning is benchmarked, above; native SUMO/n-ary support is an architectural capability of the substrate, with the Horn + n-ary fragment executing today and the higher-order fragment captured for further work.)

Beyond SPARQL: one language for all of it, RLQL

A world model you can hold is only half the job; you still have to ask it questions and state its rules. The Semantic Web hands you three separate languages for that: SPARQL to query, OWL to define ontology semantics, SHACL to validate shapes, and they're mature, standard, and worth keeping. But they are not one execution language for a fuzzy, temporal, constraint-solving, residuating proof engine. OWL is open-world and logically rich; SHACL is an RDF validation language; SPARQL is a graph query language. None of the three, by itself, is meant to express the whole operational surface real knowledge often needs: that a symptom is roughly like another, that a value is not known yet and should wake a suspended goal later, that one event came before another, that a roster must be solved under a web of constraints, or what would have happened under a different treatment.

Reality isn't crisp. Medicine, risk, similarity and policy are matters of degree; data arrives incomplete; facts are temporal; scheduling and allocation are constraints; and the decisions that matter are causal. So alongside the substrate we built one language to express all of it, RLQL, the ReasoningLayer Query Language, running over the same Ψ-term engine the benchmarks above use, in one coherent surface (62 statement families, not a pile of bolt-ons). It speaks RDF, OWL and SHACL, and then keeps going where they stop:

| What real knowledge needs to say | OWL / SHACL / SPARQL | RLQL |
| --- | --- | --- |
| Graded / fuzzy truth, "close enough", a degree in [0,1] | crisp only, a pattern matches or it doesn't | native graded match: `MATCH ~0.8 …`, `FUZZY TOP k` ranked nearest matches |
| *"I don't know, yet"*, suspend instead of failing | OWL is open-world, but it does not suspend a query and wake it when a missing value arrives | residuation: `AWAIT … THEN … WHEN …` suspends and wakes when the fact arrives |
| Rules as data, store, query and reason over the rules themselves | rules are an external layer (SWRL/SPIN), not unifiable with the data | homoiconic `DERIVE`, a rule is a Ψ-term; `CHAIN` runs them to a fixpoint |
| Time, before / during / overlaps | post-hoc: precompute intervals into extra triples, then filter | `… BEFORE …`, Allen interval algebra built in |
| Constraints, actually solve, not just declare a cardinality | OWL states cardinality; nothing solves | `CONSTRAINT … ; SOLVE`, a real finite-domain / CP solver |
| Cause & "what if?", interventions, counterfactuals | no equivalent | `COUNTERFACTUAL …`, `CAUSES … -> …`, Pearl-style |
| Subtype-aware matching, every pair of types has a meet | RDFS is a partial order; needs `UNION`/materialised closures | the type lattice: `person(…)` matches every subtype, GLB/LUB computed at query time |
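
To make the "Time" row concrete, here is a minimal Python sketch of the thirteen Allen interval relations that a test like `BEFORE` draws on. It is not ReasoningLayer's implementation: the `Interval` class and the `allen_relation` function are hypothetical names invented for illustration, and intervals are assumed to have integer endpoints with start strictly before end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int  # e.g. a year or a timestamp
    end: int    # must be strictly greater than start

    def __post_init__(self):
        if self.start >= self.end:
            raise ValueError("interval must have start < end")


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the single Allen relation (one of 13) that holds from a to b."""
    if a.end < b.start:
        return "before"
    if a.end == b.start:
        return "meets"
    if b.end < a.start:
        return "after"
    if b.end == a.start:
        return "met-by"
    # From here on the two intervals share interior points.
    if a.start == b.start and a.end == b.end:
        return "equals"
    if a.start == b.start:
        return "starts" if a.end < b.end else "started-by"
    if a.end == b.end:
        return "finishes" if a.start > b.start else "finished-by"
    if b.start < a.start and a.end < b.end:
        return "during"
    if a.start < b.start and b.end < a.end:
        return "contains"
    return "overlaps" if a.start < b.start else "overlapped-by"


worked_at_acme = Interval(2019, 2023)
worked_at_beta = Interval(2024, 2026)
print(allen_relation(worked_at_acme, worked_at_beta))  # before
```

The point of the sketch is that the relation is computed from the intervals at query time, rather than precomputed into extra triples and filtered afterwards.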

Because rules in RLQL are written in the very same language as the facts and the queries (they're homoiconic, as we noted earlier), teaching the engine something new is an edit to the knowledge base, not a new subsystem. Here is the part no triplestore can express, in one short tour:

-- One engine, one language. A tour of what crisp triples can't say:
MATCH ~0.8 symptom(of: ?Disease);                    -- fuzzy: "close enough", a degree, not yes/no
MATCH FUZZY TOP 5 ~0.7 concept(label: ?L);           -- graded k-nearest, ranked by similarity
AWAIT clearance(holder: ?P, level: ?L)               -- residuation: suspend instead of guessing...
  THEN grant(holder: ?P) WHEN ?L >= 3;               -- ...and wake when the missing fact arrives
DERIVE member(of: ?Org) WHEN enrolled(in: ?D),       -- a rule that IS data (homoiconic)...
       part_of(?D, ?Org) WITH CERTAINTY 0.9;
CHAIN MAX 100;                                        -- ...run forward to a fixpoint
MATCH ?S : event() BEFORE ?D : event();              -- temporal: Allen interval algebra
CONSTRAINT ALL_DIFFERENT(?X, ?Y, ?Z); SOLVE;         -- constraints, actually solved
COUNTERFACTUAL recovered(patient: ?P)                -- causal: "what would have happened if…?"
  GIVEN treatment(patient: ?P, drug: "A");

The one statement a triplestore literally cannot write is the third, the `AWAIT`. Residuation (suspend instead of fail) is the language-level form of the honesty this whole post turns on: when RLQL lacks the information to decide, it doesn't guess and it doesn't silently drop the row; it records the obligation and wakes when the missing fact finally arrives. That is "an engine that knows what it doesn't know," expressed in the query language itself, the structural opposite of an LLM confidently filling the gap.
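
For readers who think in code, the suspend-and-wake behaviour can be sketched in a few lines of Python. This is a toy illustration of the idea only, not how the engine implements residuation: `ResiduatingStore`, `await_fact` and `assert_fact` are hypothetical names, and facts are simplified to `(predicate, holder, level)` tuples.

```python
from typing import Callable, Dict, List, Tuple

Fact = Tuple[str, str, int]  # (predicate, holder, level), a toy shape


class ResiduatingStore:
    """Toy fact store: goals with missing inputs are suspended, not failed."""

    def __init__(self) -> None:
        self.facts: List[Fact] = []
        self.derived: List[Tuple[str, str]] = []
        # Suspended goals, keyed by the (predicate, holder) they wait on.
        self.suspended: Dict[Tuple[str, str], List[Callable[[Fact], None]]] = {}

    def await_fact(self, predicate: str, holder: str,
                   then: Callable[[Fact], None]) -> str:
        """Run `then` now if the fact is known; otherwise suspend it."""
        for fact in self.facts:
            if fact[0] == predicate and fact[1] == holder:
                then(fact)
                return "resolved"
        self.suspended.setdefault((predicate, holder), []).append(then)
        return "suspended"

    def assert_fact(self, fact: Fact) -> None:
        """Store a fact and wake every goal that was waiting for it."""
        self.facts.append(fact)
        for goal in self.suspended.pop((fact[0], fact[1]), []):
            goal(fact)


store = ResiduatingStore()


def grant_if_cleared(fact: Fact) -> None:
    # Mirrors the guard in the tour: grant only when the level is at least 3.
    if fact[2] >= 3:
        store.derived.append(("grant", fact[1]))


status = store.await_fact("clearance", "alice", grant_if_cleared)
print(status, store.derived)   # suspended []
store.assert_fact(("clearance", "alice", 4))
print(store.derived)           # [('grant', 'alice')]
```

The first call neither fails nor guesses; it returns "suspended" and leaves an obligation on record. The grant appears only once the clearance fact is asserted.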

Honest framing (and, again, "and", not "instead of"). RLQL is non-standard, and its ecosystem is nascent next to SPARQL's roughly two decades of tooling, federation and trained practitioners. That is exactly why ReasoningLayer never makes you abandon the standards: it also speaks SPARQL. Your query is parsed and translated onto the same Ψ-term engine (that's the W3C conformance pass from earlier), so your existing RDF, OWL and SPARQL come along untouched. RLQL is simply what's waiting the moment crisp triples run out: the one language for fuzzy degrees, suspended unknowns, time, constraints and cause that the Semantic Web stack was never built to carry. (One honest caveat: RLQL's reach is an architectural capability of the engine, not one of the head-to-head benchmarks above. There's no multi-engine standard to score it against, precisely because the other engines don't offer these constructs to compare.)

Everyone funded the context graph. The reasoning layer is still missing.

There's a reason this distinction is worth a whole post: the market has bet enormously on one half of it. The hot primitive of 2024–2026 is the context graph: a graph of a company's documents, people, projects, tools and events that an LLM agent retrieves from, so it answers with grounded, relevant context instead of guessing from memory. Glean builds its product on an "Enterprise Graph" it openly calls a context graph; Writer sells a "graph-based RAG" Knowledge Graph; Microsoft's GraphRAG extracts an entity graph to feed the prompt; agent-memory engines like Zep / Graphiti pitch a "memory layer service for AI agents." Anthropic published guidance on the practice in September 2025 under the name "context engineering".

And the money is real. The funding tells the story:

| Company | Round | Valuation | When |
| --- | --- | --- | --- |
| Glean | $150M Series F | $7.2B | June 2025 |
| Writer | $200M Series C | $1.9B | November 2024 |
| Hebbia | $130M Series B (led by a16z) | undisclosed | July 2024 |

This is good technology and we are not against it: a context graph is the right way to find the relevant facts and hand an agent traceable, up-to-date grounding instead of a vector-similarity guess. But look closely at what every one of these systems does: it retrieves and grounds. It surfaces facts that are written down somewhere and connects them by their relationships. What it does not do is derive the fact that logically follows but was never written: the GraduateStudent who is therefore a Student, the membership that propagates up an entire org chain, the contradiction that proves two records are the same person. That is reasoning, and a retrieval graph doesn't do it, however much context it pulls.

The research literature is blunt about the gap. The peer-reviewed NAACL 2024 survey on knowledge graphs and hallucination finds KG augmentation mitigates hallucination, never eliminates it. A 2025 survey on RAG, reasoning, and agentic systems is sharper: "relying on either approach alone remains insufficient. RAG effectively supplements factual knowledge but cannot guarantee logically consistent reasoning," and it names a whole class of "logic-based hallucinations", answers that are wrong even when every retrieved fact is correct, because the reasoning step is flawed. Retrieval can even introduce errors of its own. More context is necessary; it is demonstrably not sufficient. (In fairness to the field: those surveys mean "reasoning" broadly, chain-of-thought, tools, symbolic methods, not our particular engine. The shape of their finding is ours; the architecture below is our own argument.)

It doesn't help that the marketing blurs the line: Glean's graph is sold under the banner of "multi-hop reasoning". But multi-hop retrieval (walk the graph, collect the connected facts, let the LLM stitch them together) is not multi-hop inference (apply the ontology's rules until you've computed the complete, closed set of consequences, contradictions and all). One finds what's already there; the other works out what must be true. The benchmarks earlier in this post measure exactly the second thing, and most engines fail it.
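
The difference is easy to see in miniature. The Python sketch below contrasts the two on three triples; it is an illustration under stated assumptions, not the engine's algorithm. The `retrieve` and `close` functions are hypothetical names, and `close` applies just two RDFS-style subclass rules, a tiny fragment of what an OWL 2 RL reasoner covers.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

facts: Set[Triple] = {
    ("alice", "type", "GraduateStudent"),
    ("GraduateStudent", "subClassOf", "Student"),
    ("Student", "subClassOf", "Person"),
}


def retrieve(subject: str, store: Set[Triple]) -> Set[Triple]:
    """Retrieval: return only the triples that are present in the store."""
    return {t for t in store if t[0] == subject}


def close(store: Set[Triple]) -> Set[Triple]:
    """Inference: apply two subclass rules until nothing new is derived."""
    closed = set(store)
    while True:
        new: Set[Triple] = set()
        for (s, p, o) in closed:
            for (s2, p2, o2) in closed:
                if p2 != "subClassOf" or s2 != o:
                    continue
                if p == "type":          # x type C, C subClassOf D => x type D
                    new.add((s, "type", o2))
                elif p == "subClassOf":  # subClassOf is transitive
                    new.add((s, "subClassOf", o2))
        if new <= closed:                # fixpoint: no new consequences
            return closed
        closed |= new


written = retrieve("alice", facts)
derived = retrieve("alice", close(facts))
print(sorted(o for (_, _, o) in written))  # ['GraduateStudent']
print(sorted(o for (_, _, o) in derived))  # ['GraduateStudent', 'Person', 'Student']
```

Retrieval over the raw facts finds only what was written. The same lookup over the closed set also returns the two types nobody asserted, which is the gap between finding and deriving.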

This is the layer that's still missing, and it's "and", not "instead of". You want a context graph to find the relevant facts and a reasoning layer to derive and check their consequences. The usual answer is to wire two systems together: a context-graph product for retrieval, something else for logic. ReasoningLayer collapses them into one. It is a context graph at its core, the same web of entities and relationships those funded products capture (your documents, people, projects, tools and events), and because its terms are natively n-ary, temporal and versioned, it holds that web more faithfully than a binary triple ever could. It simply doesn't stop at retrieval.

On the very same substrate it adds everything a retrieval-only graph leaves out: it derives the facts that follow but were never written (the complete OWL 2 RL reasoning the benchmarks above measure), enforces the constraints the data must satisfy (SHACL), grounds it all in an n-ary world model (SUMO + OWL + your data), carries time, versioning and provenance so every answer is auditable as-of any moment, settles who may see and do what in the same engine (governance), and, by design, suspends on what it genuinely can't determine instead of guessing. A context graph makes an agent better-informed. The same graph, carrying all of that, is what lets it be right, and honest about when it isn't sure.

So yes: ReasoningLayer can be your context graph, but a context graph alone was never going to be enough. That's the whole point.

The reasoning layer AI has been missing

Standards and speed are table stakes. The reason ReasoningLayer's substrate matters for trustworthy AI is that the same engine handles the things the real world (and a real reasoning layer for AI) actually demands, none of which a plain triplestore was built for:

Learns without training, and personalises to your world. No fine-tune, no gradient run, no waiting for a retrain: you teach the engine by stating a fact or adding a rule, and the very next query reflects it. Because rules are data written in the same language as the facts (they're homoiconic), adapting it to your domain (your ontology), your needs (your rules) and your policy (your access and governance) is an edit to the knowledge base, not a training run: instant, auditable, reversible. Where an LLM bakes its knowledge in at training time and can only be nudged at inference, ReasoningLayer changes what's true the moment you tell it, per tenant, per domain.
Incomplete data: it knows what it doesn't know. ReasoningLayer can reason open-world: a missing fact is treated as unknown, never silently assumed false. Deeper still, it residuates: a rule or query that lacks information suspends and waits instead of failing or guessing. A claim lands in one of three states (entailed, refuted, or undetermined), and the engine won't quietly collapse undetermined into false (the very move that makes ordinary databases, and LLMs, sound confident when they shouldn't be). Which assumption applies is a deliberate choice, per query: open-world for honest reasoning under partial knowledge, or closed-world (negation-as-failure) when "absent means false" is exactly what you want, as in "flag every order with no approval on record." That's the opposite of an LLM, which just fills the gap regardless. An engine that abstains when it doesn't know, unless you explicitly tell it to assume otherwise, is the structural antidote to hallucination, baked into the formalism, not bolted on.
Graded truth: it reasons in degrees, not just true/false. The real world is fuzzy: a symptom is roughly like another, a risk is somewhat high, two records are probably the same person. ReasoningLayer matches and reasons with degrees in [0,1] (fuzzy unification, graded subsumption, similarity between otherwise-incomparable types, and ranked nearest-matches) instead of forcing every question into a crisp yes/no. Where a triplestore can only filter on scores you precomputed elsewhere, graded reasoning is native to the engine and addressable straight from RLQL (`MATCH ~0.8 …`, `FUZZY TOP k`, as shown above). For grounding AI that means it can weigh evidence and answer "close, and here's the confidence," rather than snapping to a brittle exact match. It also composes with everything else here: a fuzzy match still residuates when it lacks information, still carries provenance, still respects governance.
Catches contradictions: it refuses rather than narrates. This is the surgeon riddle from the top of this post, made mechanical: assert "this surgeon is the boy's father" next to "a person has exactly one father" and a second, different father, and the engine reports the inconsistency instead of smoothing it into fluent prose. Contradiction-checking is a first-class operation (disentailment, consistency checks over the asserted facts, and a description-logic satisfiability core), so conflicting records, violated constraints and impossible states surface as errors to resolve, not plausible-sounding answers. An LLM has no notion of "these two statements cannot both be true"; a reasoner does, and says so.
Temporality: facts that hold over time. "John worked at Acme from 2019 to 2023" is not a timeless triple; the truth depends on when you ask. ReasoningLayer models time as first-class, so the reasoner can answer "as of" a moment and reason about change, order and duration.
Versioning & provenance: answer "what did we know, and why, and when?" Facts and the derivations that produced them are versioned and auditable. For trustworthy AI that means a real audit trail: every answer can be traced to the facts and rules it came from, and reproduced as of any past state, not a black box.
Proof you can re-check, not "trust me". Every derivation can be exported as a machine-checkable proof, to the Coq, Lean 4 and Isabelle/HOL proof assistants, and to the SMT/SAT certificate formats (LFSC, DRAT, Alethe), so a third party can verify the engine's reasoning in an independent prover that knows nothing about ReasoningLayer. That is the literal form of this post's refrain: an answer you can check is an answer you can trust. Where an LLM's "reasoning" is unfalsifiable narration, a derivation here is a certificate anyone can audit.
Governance: policy and access control are reasoning, not plumbing. Who may see a fact, who may act on it, which rule applies in which tenant: these are derivations, not config flags bolted onto the edge. Because policy lives as data in the same engine, "can this agent read this?" is answered, and proved, by the same reasoner that answers everything else, with grants, denials and clearance modeled as rules instead of scattered through application code. (It's exactly the engine behind our explainable access control.)
Data sovereignty: your world model stays yours. The knowledge, the rules and the reasoning run where you put them (on-prem, in-region, air-gapped if you need it), not behind someone else's API. For regulated domains (health, finance, public sector, defence) that's the line between a system you can actually deploy and one you can't: the world model an agent reasons over never leaves your control, and every answer stays auditable on your own infrastructure. And "where you put it" can be inside your own application: the same engine also ships as an embedded library (Rust, Python and TypeScript, even WebAssembly in the browser), so it can run in-process, with no server and no network hop at all, not only as a service you host.
Scale & concurrency: built for production, not a demo. Multi-version concurrency (readers never block writers), replication watermarks for consistent reads across replicas, and the multi-million-triple performance you saw above (10.9 M triples; closures of 1.4 M derived facts in seconds). It's engineered to stay fast as the knowledge, and the load, grows.
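
The three-state behaviour described under "Incomplete data" in the list above can be sketched as follows. This is a deliberately small Python illustration, not the engine's API: `Verdict` and `evaluate` are hypothetical names, claims are simplified to `(predicate, subject)` pairs, and entailment is reduced to set membership so that only the open-world versus closed-world choice is on show.

```python
from enum import Enum
from typing import Set, Tuple

Fact = Tuple[str, str]  # (predicate, subject), a toy shape


class Verdict(Enum):
    ENTAILED = "entailed"
    REFUTED = "refuted"
    UNDETERMINED = "undetermined"


def evaluate(claim: Fact, known_true: Set[Fact], known_false: Set[Fact],
             closed_world: bool = False) -> Verdict:
    """Classify a claim; absence only counts as false if the caller opts in."""
    if claim in known_true:
        return Verdict.ENTAILED
    if claim in known_false:
        return Verdict.REFUTED
    # Nothing on record either way: abstain unless closed-world was requested.
    return Verdict.REFUTED if closed_world else Verdict.UNDETERMINED


known_true = {("approved", "order-1")}
known_false = {("approved", "order-2")}
claim = ("approved", "order-3")  # nothing on record for this order

print(evaluate(claim, known_true, known_false).value)                     # undetermined
print(evaluate(claim, known_true, known_false, closed_world=True).value)  # refuted
```

The default abstains. Treating "absent" as "false" happens only when the caller asks for it, which is the per-query choice the list describes.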

Put together with the rest of this post, that is a single engine that grounds AI in a real world model (SUMO + OWL + SHACL + your data, n-ary, on one substrate), reasons over it completely and faster than the reference engines, knows the limits of its own knowledge (open-world + residuation) and reasons in degrees where the world is fuzzy rather than forcing a crisp yes/no, flags contradictions instead of narrating them, remembers when things were true and why it concluded them, with proofs you can re-check in an independent prover, enforces who may see and do what, adapts to your domain and policy the moment you state a new fact or rule (no retraining), and keeps the whole world model under your control, at production scale. That combination, not any one feature, is the reasoning layer that grounded, scalable, trustworthy AI has been missing, and it's what ReasoningLayer is built to be.

How to read these numbers honestly

The speeds are a snapshot, one run on one laptop (an Apple M4 Max); millisecond figures jitter run-to-run. The correctness results (400/400, 14/14, 18/18) are the solid claims we stand behind.
"Correct" means independent engines agree, never just our own say-so.
The contest was fair. Same machine, same protocol; the speed test (WatDiv) has reasoning switched off on every engine; ReasoningLayer runs with nothing that would flatter it (auditing off, no extra caching).
What we did not claim. These are the standard published sizes for each benchmark; larger-scale speed runs are follow-up work. One commercial reasoner (RDFox) is left out entirely because its trial licence forbids publishing benchmark numbers, so we don't.
Benchmarked vs. architectural. The three head-to-head tests measure OWL/RDF reasoning and speed; that's the part with competitor numbers. The substrate capabilities (native SUMO/n-ary, open-world + residuation, temporality, versioning) are real in the engine but are architectural properties, not part of these head-to-head runs. There's no standard multi-engine benchmark for them, largely because the other engines don't offer them to compare against.

Bottom line

Ontologies and knowledge graphs are becoming the standard way to keep AI honest, but only if the engine underneath can compute the ontology's consequences completely and at scale. That's the bar these benchmarks set, and most engines clear only half of it:

Speed: ReasoningLayer is the fastest overall in a seven-engine field (ReasoningLayer plus six RDF engines), ~1 ms on complex questions where the others take 5–247 ms, while tying with Virtuoso (~0.7–1 ms) on trivial lookups.
Reasoning: it's one of only three engines that reason completely, and the fastest of them, all 14 basic-reasoning questions in 38 ms vs GraphDB's 78 ms and Fuseki's 9.9 s, and on the hard OWL 2 RL profile it matches the industry reference answer-for-answer at roughly twice the speed, where rival reasoners either fall short or never finish.
Trust: every correctness claim is cross-checked across independent engines; ReasoningLayer passes the official W3C suites in full, SPARQL 1.1 (630/630) and SPARQL 1.2 (267/267); and the one bug a benchmark found, we fixed and wrote up.
World model: on one substrate ReasoningLayer carries RDF, OWL, SHACL and SUMO, the n-ary, higher-order upper ontology that RDF/OWL stores can only encode or approximate rather than hold as native SUO-KIF-style n-ary rules. That's the full world model an AI needs to be grounded, not just a domain slice.

To ground AI and break hallucination you need a real world model, SUMO's commonsense + OWL domain rules + your data, and a layer that can both hold it faithfully (n-ary, not just triples) and reason over it at speed. That isn't a faster triplestore, and it isn't a bigger pile of retrieved context: it's the reasoning layer the AI stack is still missing, the piece that turns stored facts into derived, provable, trustworthy answers. That is the gap ReasoningLayer is built, from scratch, in Rust, to fill.

Step back from the benchmarks and the funding rounds, and it comes down to one arc: you need an engine that unifies your data into knowledge, turns that knowledge into insight, and from that insight generates answers you can trust: derived, not guessed; proven, not asserted; grounded in a model of the world, not a model of text. Data becomes knowledge, knowledge becomes insight, and insight becomes answers you can stake a decision on. That is the reasoning layer, and that is ReasoningLayer.

See it for yourself

You don't have to take our word for any of it. That's by design.

Watch it reason. The live playground runs the engine on real problems: ask "who can access this file, and why?" and the answer comes back with the derivation that produced it, not a bare yes-or-no. ▶ Open the playground, or see the same engine at work on explainable access control and interactive scheduling.
Bring us your hardest question, the policy riddled with exceptions, the ontology nobody can fully compute, the agent you can't yet trust to act on its own. 👉 Talk to us and we'll run it against your data.

The whole pitch in one line: an answer you can check is an answer you can trust, and that's what a reasoning layer is for.

Appendix: datasets & engine versions

Datasets: WatDiv scale-100 = 10,885,399 triples; LUBM-1 = 100,866 triples; OWL2Bench RL scale-1 = 54,815 triples. Engines (all native arm64, one M4 Max / 128 GB): ReasoningLayer origin/main; Virtuoso OS 7.2.17; GraphDB 10.8.0 Free; Eclipse RDF4J; Oxigraph latest; Apache Jena/Fuseki 5.5.0; QLever latest.
