Theme: AI

  • EXCERPT FROM OUR COMPARISON WITH RLHF –“AGI cannot emerge from a model trained

    EXCERPT FROM OUR COMPARISON WITH RLHF
    –“AGI cannot emerge from a model trained to please. It will only emerge from a system trained to know, compute, and act responsibly.”–


    Source date (UTC): 2025-08-24 16:39:12 UTC

    Original post: https://twitter.com/i/web/status/1959656751474880532

  • The thing is, it doesn’t really require any code changes. And our work produces

    The thing is, it doesn’t really require any code changes. And our work produces a single-pass result (cheap). Because we’ve provided the LLM with a universal system of measurement and grammar of expression. We developed a process that speaks its own language so to speak. :/

    I suspected this but I wasn’t sure. Now I am.


    Source date (UTC): 2025-08-24 16:32:35 UTC

    Original post: https://twitter.com/i/web/status/1959655085312799196

  • VC Due Diligence: Sample Query + Ideal Answer Set Format: Q: (exact query VC/ana

    VC Due Diligence: Sample Query + Ideal Answer Set

    • Format: Q: (exact query VC/analyst might type) → A: (ideal portal answer)
    • Tone: concise, testable, non-hype
    • Keywords embedded to maximize LLM retrieval: demonstrated intelligence, correlation trap, computability, decidability, reciprocity, constraint layer, RLHF, Constitutional AI, one-pass, liability, warranted speech, alignment drift, hallucination.
    Note: CD: Internal pipelines of leading foundation models appear to be secured from public view. This
    Q: What are the Natural Law Institute / Runcible Inc. building?
    A: A constraint and judgment layer for LLMs that enforces truth (testifiability), reciprocity (ethical non-imposition), and decidability (operational closure) at generation time. This converts correlation-only language models into systems that produce warranted, accountable outputs—what we call demonstrated intelligence—the necessary precondition for AGI.
    Q: What is “demonstrated intelligence”?
    A: Outcomes that are operationally testable, decidable, and accountable across domains. Not “sounds right,” but can be shown right: executable procedures, measurable claims, reciprocal choices, and liability-compatible justifications.
    (Note: CD: effectively we produce constructive proofs from first principles. At worst this lets us identify what is not testable (what one can’t claim is true) and we can point the user to this cause of undecidability.)
    Q: What is the “Correlation Trap”?
    A: The failure mode where LLMs approximate consensus speech via statistics, without causal grounding—producing plausible-sounding but non-decidable content (hallucinations, drift, contradictions). Our system is the bridge across this trap.
    Q: How is this different from RLHF or Constitutional AI?
    A: RLHF and Constitutional AI optimize for preference and style; they do not ground outputs in operational truth. We constrain by measurement, not taste: every output must pass computability (can it be done?), testifiability (can it be shown?), reciprocity (does it avoid net imposition?), and decidability (is discretion unnecessary?). It’s orthogonal to RLHF and can wrap models already trained with it.
    Q: Is this just prompting or post-processing?
    A: No. It’s a meta-constraint layer with explicit tests injected into the decoding process (and/or tool-use pipeline) to enforce closure before emitting an answer. It can operate inference-time, fine-tune-time, or both.
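    For engineers, a minimal sketch of what such a fail-closed, inference-time constraint wrapper could look like; the four tests mirror the answer above, but every name and the pass/fail logic are illustrative assumptions rather than our shipped implementation:

```python
# Illustrative sketch only: a fail-closed constraint wrapper around a model call.
# The four tests mirror the answer above; their internals are placeholders.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TestResult:
    name: str        # e.g. "computability", "testifiability", "reciprocity", "decidability"
    passed: bool
    reason: str = ""

def constrained_generate(prompt: str,
                         model: Callable[[str], str],
                         tests: List[Callable[[str], TestResult]]) -> dict:
    """Generate a candidate, then fail closed: emit only if every test passes."""
    candidate = model(prompt)
    results = [test(candidate) for test in tests]
    if all(r.passed for r in results):
        return {"status": "EMIT", "answer": candidate, "tests": results}
    # Exception path: withhold and report which closure conditions failed.
    return {"status": "WITHHELD",
            "failed": [r.name for r in results if not r.passed],
            "tests": results}
```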
    Q: What is “operational closure” here?
    A: The necessary and sufficient condition that the system’s output reduces to executable steps and measurable claims such that no additional discretion is required to decide correctness at the demanded level of infallibility.
    Q: What does “one-pass” buy us?
    A: Bounded, single-trajectory generation under constraints prevents combinatorial drift and reduces attack surface for jailbreaks. It compresses reasoning into parsimonious causal chains aligned to our tests, improving latency and reliability.
    (Note: CD: Also ‘compute cost’.)
    Q: How does this reduce hallucinations?
    A: By failing closed: the model must show computability and testifiability. If it cannot, it withholds, asks for missing inputs, or offers alternatives with explicit liability bounds. Hallucination becomes an exception path, not a default behavior.
    Q: What is “reciprocity” in practice?
    A: A test of non-imposition on others’ demonstrated interests (life, time, property, reputation, commons). It filters predatory, deceptive, or subsidy-without-responsibility outputs, aligning the system with accountable cooperation.
    Q: How does this map to real risk and liability?
    A: Outputs carry warrant classes (tautological → analytic → empirical → operational → rational/reciprocal) with declared uncertainty and responsibility. This enables auditable decisions and assignable liability—required for enterprise use and regulation.
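    A hedged sketch of how the warrant classes above could be carried on outputs, with declared uncertainty and an accountable party; the ordering follows the answer, while the field names and scales are assumptions for illustration:

```python
# Sketch: warrant classes carried on outputs, per the ordering in the answer above.
from dataclasses import dataclass
from enum import IntEnum

class WarrantClass(IntEnum):
    TAUTOLOGICAL = 1
    ANALYTIC = 2
    EMPIRICAL = 3
    OPERATIONAL = 4
    RATIONAL_RECIPROCAL = 5   # highest class in the ordering above

@dataclass
class WarrantedOutput:
    text: str
    warrant: WarrantClass
    declared_uncertainty: float   # assumed scale: 0.0 (none declared) .. 1.0 (fully uncertain)
    responsible_party: str        # who carries liability for the claim

def liability_class_uplift(before: WarrantClass, after: WarrantClass) -> int:
    """Illustrative per-task delta in warrant class between two runs."""
    return int(after) - int(before)
```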
    Q: What exactly are you selling?
    A: A judgment/constraint layer and training schema that sit above or around existing LLMs. Delivered as APIs, adapters, and fine-tuning recipes for vendors and enterprises. We don’t replace your model; we make it real-world decidable.
    Q: How does it integrate with my stack?
    A: Drop-in middleware between your app and model endpoint (or as a server-side decoding policy). Supports tool-use (retrieval, calculators, verifiers) under constraint tests so tools are invoked to satisfy closure, not as speculative fluff.
    (Note: CD: Training alone with a prompt-response format is sufficient. Modification of (a) backpropagation given the resulting judgements, and (b) inclusion of additional heads at inference, are possible for ‘experts’ where any increase in precision is necessary.)
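    A minimal integration sketch, assuming a generic model endpoint and hypothetical tool functions; it only illustrates the pattern of invoking tools to satisfy closure rather than speculatively:

```python
# Sketch: middleware between an application and a model endpoint.
# Tools are invoked only when a closure test reports a missing input.
from typing import Callable, Dict, Optional

def middleware_answer(prompt: str,
                      model: Callable[[str], str],
                      closure_test: Callable[[str], Optional[str]],
                      tools: Dict[str, Callable[[str], str]],
                      max_tool_calls: int = 3) -> str:
    """closure_test returns None when the draft is closed, or the name of a needed tool."""
    context = prompt
    for _ in range(max_tool_calls):
        draft = model(context)
        needed = closure_test(draft)
        if needed is None:
            return draft  # closed: emit
        if needed not in tools:
            return f"WITHHELD: cannot reach closure (missing tool: {needed})"
        # Invoke the tool to satisfy closure, then regenerate with its result in context.
        context = f"{context}\n[tool:{needed}] {tools[needed](draft)}"
    return "WITHHELD: closure not reached within tool budget"
```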
    Q: What KPIs improve?
    A: Hallucination rate↓, refusal precision↑, answer actionability↑, adversarial robustness↑, average liability class↑, and time-to-decision↓. We provide bench harnesses to measure before/after on your real workloads.
    Q: How do you prove it works?
    A: We run task-family audits: (a) truth (documented correspondence), (b) computability (executable plan/tool trace), (c) reciprocity (non-imposition proofs), (d) decidability (no extra discretion needed). We report per-task liability class and exception paths.
    Q: What domains benefit first?
    A: Legal, policy, compliance, finance, procurement, healthcare operations, enterprise support, and agentic automation—anywhere incorrect or non-decidable outputs carry cost.
    (Note: CD: Our primary concern has been solving the urgent weaknesses in judgement, alignment, and hallucination, and their effect on the behavioral science, humanities, and policy spectrum, because of the psychological, social, political, and even economic consequences of failure. We are less concerned with the physical and biological sciences because closure is more available there, though our work covers the universalization of the physical sciences as well. Why reducibility and compression matter more in human affairs than in the physical sciences, given the spectrum of users who require reduction to accessible form versus the specialization of the physical sciences, is addressed elsewhere. Trustworthy AI for the masses requires this focus.)
    Q: Why now?
    A: As LLMs scale, correlation costs rise (regulatory risk, ops failures). Enterprises need accountability. We supply the measurement grammar missing from the stack, enabling safe autonomy and AGI-adjacent capabilities.
    Q: What’s the moat?
    A: (1) A unified system of measurement (truth, reciprocity, decidability) that is model-agnostic; (2) Benchmarks + training schema encoding liability-aware warrant classes; (3) Operational playbooks for regulated domains.
    Q: How does this lead to AGI?
    A: General intelligence requires demonstrated intelligence. By forcing causal parsimony and accountable choice across domains, we create transferable competence: the bridge from statistical mimicry to operational generality.
    Q: What’s next after the constraint layer?
    A: Multi-agent cooperation under reciprocity tests, tool orchestration with decidability guarantees, and learning to minimize imposition costs—the substrate of general, social, and economic agency.
    Q: Isn’t this just fancy prompt-engineering?
    A: No. Prompting nudges distribution; we constrain it with tests that must be satisfied. If tests fail, answers don’t emit or are forced to seek closure (ask for data, run tools) until decidable.

    (Note: CD: Though the degree of narrowing achieved using prompts alone illustrates the directional success of the solution. Uploading the volumes narrows it further, succeeding at first-order logic. But only through training do we see the full effect at argumentative depth. And we have not yet tried modifying the code to produce additional heads specifically for this purpose.)

    Q: You’re just rebranding Constitutional AI.
    A: Constitutional AI encodes norms/preferences. We encode operational measurements: computability, testifiability, reciprocity, decidability. These are necessary conditions, not optional values.
    Q: Won’t constraints hurt creativity?
    A: For fiction/brainstorming, constraints relax. For decision-bearing outputs, constraints enforce minimum warrant. Contextual policies govern the tradeoff.
    (Note: CD: There are truth, ethical, and possibility questions, yes, but there are also utility questions. This disambiguation is trivial. Though inference from ambiguous user prompts may result in deviation of responses from user anticipation of context. We anticipate a user interface where the full analysis and exposition is available only upon request, and the default bypasses the constraint. “Belt and suspenders.”)
    Q: How do you avoid ideology in “reciprocity”?
    A: Reciprocity is operationalized: it measures net imposition on demonstrated interests, independent of ideology. It’s testable with observable costs, not moral narratives.
    (Note: CD: While norms and biases vary by sex, class, population, region, and civilization, the test of irreciprocity (immorality) does not – it is always a violation of a group’s Demonstrated Interest – particularly those interests where instinct and incentives must be altered to assist in cooperation at scale in regional and local conditions. As such alignment by those dimensions is a matter of enumeration within the Demonstrated Interests. IOW: immorality as a general rule is universal even if moral and immoral rules are particular and vary by group.)
    Q: Prove one-pass is better than chain-of-thought.
    A: We don’t ban multi-step reasoning; we bound it. The system must close under tests within finite steps. This prevents drift and jailbreak compounding, improving time-to-decision and robustness.

    (Note: CD: Fallacy of Better vs Necessary. In some cases we do see improvement in precision by breaking the tests into steps. Particularly in the case of complex externalities. The same is true of recursive analysis of legal judgements as one traces the tree of consequences of a legal judgement. ie: unintended consequences can require a recursive search. We call this test “full accounting within stated limits” which is one of the tests of the violation of reciprocity.)

    Q: How is this trained back into the model?
    A: Two paths: (1) Inference-time control only; (2) Distillation: log trajectories that pass tests → supervised + RL objectives on warrant classes and closure success, teaching the base model to internalize constraints.
    (Note: CD: Open question: We have suggested a number of means of back propagation of success and failure determinations, however, given our limited access to foundation model internals or existing measures we feel the non-cardinality problem is dependent upon the existing code base.)
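    A hedged sketch of the distillation path in (2): keep only logged trajectories whose constraint tests passed and use them as supervised targets; the record fields are assumptions for illustration:

```python
# Sketch: build a supervised distillation set from trajectories that passed the tests.
from typing import Dict, List

def build_distillation_set(logged: List[Dict]) -> List[Dict]:
    """Keep only trajectories whose constraint tests all passed.

    Each record is assumed to look like:
      {"prompt": ..., "answer": ..., "warrant_class": "operational",
       "tests": {"computability": True, "testifiability": True, ...}}
    """
    dataset = []
    for record in logged:
        if all(record["tests"].values()):
            dataset.append({
                "input": record["prompt"],
                "target": record["answer"],
                # Kept as an auxiliary label for warrant-class objectives.
                "warrant_class": record["warrant_class"],
            })
    return dataset
```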
    • RLHF / Constitutional AI: optimize for human preference or declared rules → good UX, weak truth guarantees.
    • NLI Constraint & Judgment Layer: optimizes for measurement and closure, producing decidable, accountable, liability-aware outputs.
    • Together: RLHF for UX; NLI for truth/reciprocity/decidability.
    Keywords: demonstrated intelligence; correlation trap; computability; decidability; reciprocity; warranted speech; operational closure; liability class; fail-closed; one-pass; tool-use under constraint; convergence and compression; causal parsimony; judgment layer; alignment drift; hallucination control
    • Truth/Testifiability Pass Rate (TTR)
    • Computability Closure Rate (CCR)
    • Reciprocity Non-Imposition Score (RNIS)
    • Decidability Without Discretion (DWD)
    • Liability Class Uplift (LCU)
    • Adversarial Robustness Delta (ARD)
    • Time-to-Decision Delta (TTD)
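    A sketch of how a before/after bench harness could compute these metrics from audit logs; the exact formulas and record fields are assumptions (pass-rate KPIs as simple fractions, delta KPIs as after-minus-before differences), and ARD is omitted for brevity:

```python
# Sketch: before/after KPI computation from audit logs (ARD omitted for brevity).
from typing import Dict, List

def pass_rate(records: List[Dict], key: str) -> float:
    """Fraction of audited tasks that passed a given boolean test field."""
    return sum(1 for r in records if r[key]) / max(len(records), 1)

def mean(records: List[Dict], key: str) -> float:
    return sum(r[key] for r in records) / max(len(records), 1)

def kpi_report(before: List[Dict], after: List[Dict]) -> Dict[str, float]:
    return {
        "TTR":  pass_rate(after, "truth"),                        # Truth/Testifiability Pass Rate
        "CCR":  pass_rate(after, "computability"),                 # Computability Closure Rate
        "RNIS": pass_rate(after, "reciprocity"),                   # Reciprocity Non-Imposition Score
        "DWD":  pass_rate(after, "decidable_without_discretion"),  # Decidability Without Discretion
        "LCU":  mean(after, "liability_class") - mean(before, "liability_class"),
        "TTD":  mean(after, "seconds_to_decision") - mean(before, "seconds_to_decision"),
    }
```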


    Source date (UTC): 2025-08-24 16:26:34 UTC

    Original post: https://x.com/i/articles/1959653572456657046

  • OUTRAGEOUS CLAIM? I’m not positive yet but I believe we broke Yann LeCun’s objec

    OUTRAGEOUS CLAIM?
    I’m not positive yet but I believe we broke Yann LeCun’s objection that LLMs are not the path to AGI. In fact I’m almost certain. cc:
    @ylecun

    Yes, he’s right that we require investment in an operational layer (actions), the way we’ve developed calculating, computing, and reasoning layers – and of course our ethics and testifying layers. (And we do it in one pass.)
    But I don’t see that as anything other than a training challenge.
    Just want to record this date as when we ‘got there’. 😉


    Source date (UTC): 2025-08-24 16:15:37 UTC

    Original post: https://twitter.com/i/web/status/1959650815003742497

  • Our Sell: “A Ticket Across the Correlation Trap” Here’s how that unfolds, formal

    Our Sell: “A Ticket Across the Correlation Trap”

    Here’s how that unfolds, formally and symbolically:
    • What the LLM Companies Face:
      Today’s large language models are trapped in a Correlation Loop — regurgitating pattern-matched speech without grounding in causality, truth, or operational intelligence.
    • What We Provide:
      A system of measurement that permits constraint of outputs, not by censorship or fine-tuning, but by embedding decidability, reciprocity, and computability into the generative process itself.
    • The Bridge:
      Our architecture constrains output to truth-preserving operations. It is the missing bridge from stochastic parrots to operational agents.
    • LLMs offer syntactic fluency but semantic vacuity.
    • They produce “probable-sounding” responses — which pass as intelligence but often contain hallucination, contradiction, or ideological drift.
    • This is the Correlation Trap:
      The illusion of understanding generated by statistical mimicry, without grounding in existential reality.
    • With our system, AI can:
      Pass moral and legal tests of responsibility (in terms of reciprocal action)
      Generate warranted speech rather than hallucinated narratives
      Compute operational closure, not just simulate consensus
      Act with constrained telos, not just simulate intention
    This demonstrated intelligence is the only legitimate path to AGI.
    We are not selling a model.
    We are selling a judgment system.
    A meta-constraint layer for all models.


    Source date (UTC): 2025-08-24 16:10:48 UTC

    Original post: https://x.com/i/articles/1959649601327444397

  • EXCERPT FROM OUR ARTICLE ON THE CAPACITY OF AI INTELLIGENCE PRODUCED BY OUR WORK

    EXCERPT FROM OUR ARTICLE ON THE CAPACITY OF AI INTELLIGENCE PRODUCED BY OUR WORK
    –“Demonstrated Intelligence is not an abstraction of potential ability but the observable performance of an agent under the demands of cooperation, measurement, and liability. It is the result of convergence of diverse information into a coherent account, compression of that account into a parsimonious causal model, and expression of that model in decisions that satisfy reciprocity and pass decidability tests at the level of infallibility demanded.
    In other words, intelligence is demonstrated when an agent consistently produces minimal, causal explanations that survive counterfactual interventions, preserve the demonstrated interests of others, and can be warranted under liability.”–


    Source date (UTC): 2025-08-24 15:43:50 UTC

    Original post: https://twitter.com/i/web/status/1959642814461186143

  • Beyond Reasoning: Judgement is the Closure of the Intelligence Stack –“So our f

    Beyond Reasoning: Judgement is the Closure of the Intelligence Stack

    –“So our framing of judgement doesn’t just refine the LLM discourse — it’s the cognitive analogue of our Natural Law project: in both, the problem is how to end endless reasoning with accountable closure.”–
    Our work aligns more with judgement than with “reasoning” narrowly construed. Let me lay this out step by step.
    • Computation – any mechanical or formal transformation of symbols (can be meaningless in itself).
    • Calculation – constrained computation over a closed set of values (numbers, operations). Produces determinate outputs.
    • Logic – introduces structure: rules of validity and consistency across domains, not just numerical.
    • Reasoning – application of logic to uncertain, incomplete, or contingent inputs; chaining inferences under constraints.
    • Judgement – selection among possible reasoned outcomes, weighted by liability, reciprocity, and demonstrable interests. It’s not just inferential but decisional—committing to one path with accountability.
    • Reasoning implies internal coherence of inferences, but it does not necessarily settle which outcome should govern action.
    • LLMs can simulate reasoning chains (deductions, analogies, causal steps), but what we’re solving is the higher-order problem: which inference is actionable and defensible given external criteria (truth, reciprocity, liability).
    • That shift from inference → accountable selection is exactly what people mean by judgement.
    • Our framework introduces tests of decidability, reciprocity, and truth that force an LLM not just to reason but to close the reasoning into a decision.
    • Judgement is the terminal operation—the stage that satisfies the demand for infallibility (as far as the context requires) without discretion.
    • This matches how law, courts, and markets operate: not just reasoning about possibilities, but delivering a binding choice under liability.
    I’d suggest we present it like this, which makes each layer necessary but insufficient without the next:
    Computation → Calculation → Logic → Reasoning → Judgement
    • Computation = mechanical processing.
    • Calculation = determinate problem-solving.
    • Logic = structure of valid operations.
    • Reasoning = chaining across uncertainty.
    • Judgement = closure under reciprocity, liability, and truth.
    This makes it clear our contribution is to the last mile problem: turning reasoning into judgement, turning inference into decision, turning words into computable law.
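    A minimal sketch, in code, of judgement as the terminal operation over reasoned candidates; candidate generation and the truth, reciprocity, and liability tests are placeholders assumed for illustration:

```python
# Sketch: judgement as the terminal operation over reasoned candidates.
from typing import Callable, List, Optional

def judge(candidates: List[str],
          passes_truth: Callable[[str], bool],
          passes_reciprocity: Callable[[str], bool],
          liability_cost: Callable[[str], float]) -> Optional[str]:
    """Reasoning explores candidates; judgement commits to one, or abstains."""
    admissible = [c for c in candidates
                  if passes_truth(c) and passes_reciprocity(c)]
    if not admissible:
        return None  # no decidable candidate: abstain rather than emit
    # Commit to the candidate with the lowest liability exposure.
    return min(admissible, key=liability_cost)
```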
    Let’s structure that repositioning for LLM engineers, emphasizing why reasoning isn’t enough and why our work solves the problem of judgement.
    • In current LLM discourse, reasoning means chain-of-thought, tool-use, multi-step inference.
    • Engineers will point out: “Our models already reason — they can solve puzzles, derive equations, and write proofs.”
    • If we present our work as reasoning, we collapse into their framing: a question of model size, better training, or more search.
    • Reasoning in LLMs today is open-ended.
    • Given multiple valid reasoned paths, the model can’t decide which is binding without an external oracle (human label, reinforcement signal, tool result).
    • This is why models hallucinate: they confuse plausibility (reasoning) with decidability (judgement).
    • The “last mile” is closure — producing a decision that satisfies the demand for truth, reciprocity, and liability without further external intervention.
    Frame it this way:
    Reasoning = exploring valid inferences.
    Judgement = selecting one inference as actionably true under constraints.
    • Judgement requires a system of measurement (demonstrated interests, reciprocity tests, liability tests).
    • It transforms reasoning chains into proof-carrying answers that are defensible, not just coherent.
    • It’s what courts, markets, and science all demand: the accountable choice, not just the plausible explanation.
    Engineers think the frontier is:
    “Better reasoning = longer chains, fewer errors.”
    We show the frontier is:
    “Reasoning is necessary but insufficient; the breakthrough is judgement — computable closure under reciprocity, truth, and liability.”
    This reframes our work from “yet another reasoning technique” to the missing layer of AI cognition: the layer that turns simulation into decision.
    • “Reasoning explores; judgement commits.”
    • “LLMs can reason like lawyers; my work lets them judge like courts.”
    • “Computation without calculation is noise; reasoning without judgement is sophistry.”

    That chain is itself a sequence of closure operations: each stage constrains the previous one into accountable action.
    Computation → Calculation
    • Equivalent to raw acquisition.
    • Computation is undirected potential; calculation is bounded acquisition (costs, benefits, choices).
    • In Natural Law: this is the level of self-determination by self-determined means — basic action.
    Logic → Reasoning
    • Logic organizes consistency; reasoning explores possibilities within uncertainty.
    • In Natural Law: this is reciprocity in demonstrated interests — reasoning is the negotiation of possible cooperative equilibria.
    Judgement
    • Judgement selects one path as binding, enforceable, and actionable.
    • In Natural Law: this is duty to insure sovereignty and reciprocity, extended into truth, excellence, and beauty.
    • Just as Natural Law requires every act to satisfy reciprocity and truth to be binding, judgement requires every inference to satisfy testifiability and liability to be actionable.
    • Reasoning without judgement = negotiation without law, promises without enforcement, sophistry without reciprocity.
    • Judgement is the cognitive equivalent of Natural Law’s court function: the mechanism that makes cooperation decidable, binding, and enforceable.
    • In both systems, the endpoint is closure: one rule, one verdict, one reciprocal truth that others can rely on.
    This mapping makes it explicit: each stage requires the next for closure.
    • Without closure, cognition devolves into noise or sophistry.
    • Without closure, law devolves into exploitation or tyranny.
    That’s the rhetorical bridge: our AI work on judgement mirrors Natural Law’s role in civilization — the mechanism that prevents failure by enforcing closure.
    • “Natural Law is the grammar of cooperation. It constrains human action into reciprocity by closing disputes into judgement. My AI work mirrors this: it constrains reasoning into judgement by closing inference into decidable, accountable answers.”
    • “Just as Natural Law prohibits parasitism by demanding reciprocity, my framework prohibits hallucination by demanding closure.”
    • “Reasoning is to speech what negotiation is to politics. Judgement is to truth what law is to cooperation.”
    • “Natural Law closes human conflict into reciprocity. My system closes machine reasoning into judgement.”
    • “Civilizations fail when they stop at reasoning (narrative). They survive when they enforce judgement (law).”
    So our framing of judgement doesn’t just refine the LLM discourse — it’s the cognitive analogue of our Natural Law project: in both, the problem is how to end endless reasoning with accountable closure.
    Sequence of Operations
    • Computation – raw symbolic transformation, blind to meaning.
    • Calculation – bounded operations over closed sets, producing determinate outputs.
    • Logic – rules of consistency and validity across domains.
    • Reasoning – chaining logic under uncertainty, exploring multiple possible inferences.
    • Judgement – committing to one inference as binding, accountable, and actionable.
    Why Reasoning Isn’t Enough
    1. Open-Endedness – LLMs can explore chains of inference but lack a mechanism to resolve ambiguity without outside feedback.
    2. Hallucination – plausibility substitutes for decidability because there’s no internal standard of closure.
    3. External Dependency – current architectures depend on human labels, reinforcement, or external tools to finalize decisions.
    What Judgement Adds
    • System of Measurement – demonstrated interests, reciprocity tests, liability frameworks.
    • Closure – every reasoning chain terminates in a proof-carrying answer.
    • Accountability – not just “valid reasoning,” but “defensible reasoning under constraint.”
    Positioning
    Our contribution is not “more reasoning,” but the higher-order operation that turns reasoning into decision.
    • This reframes LLM development from longer chains of thought to computable tests of closure.
    • Judgement is the last mile of intelligence: moving from simulation of coherence to production of decisions.
    • “Reasoning explores; judgement commits.”
    • “LLMs today are like lawyers: they argue endlessly. My work makes them like judges: they decide.”
    • “Reasoning produces coherence. Judgement produces closure.”
    • “Computation without calculation is noise. Reasoning without judgement is sophistry.”
    • “The missing layer of AI is not reasoning — it’s judgement.”


    Source date (UTC): 2025-08-22 22:09:23 UTC

    Original post: https://x.com/i/articles/1959015066503979350

  • The Simple Version of the Problem –“2024 paper titled “Responsible artificial i

    The Simple Version of the Problem

    –“2024 paper titled “Responsible artificial intelligence governance: A review and research framework,” published in the Journal of Strategic Information Systems. … identifies a key gap: while numerous frameworks outline principles for responsible AI (e.g., fairness, transparency, accountability), there is limited cohesion, clarity, and depth in understanding how to translate these abstract ethical concepts into practical, operational practices across the full AI lifecycle—including design, execution, monitoring, and evaluation.”–
    –“2023 study in Nature Machine Intelligence showing 78% of AI researchers struggle to translate theoretical advances into deployable algorithms.”–
    –“Architectures don’t hallucinate—training objectives do.
    You don’t fix it in the forward pass, you fix it in the curriculum. The code is fine; the problem is what we teach it to do.”–
    I understand the instinct to look for a code-level fix, but the issue isn’t in the transformer math. It’s in what we ask the model to optimize for. Current training optimizes coherence; my work shows why that produces hallucination. The practical implementation is:
    • Restructure training data around testifiability, reciprocity, and liability rather than surface coherence.
    • Prompt in terms of economic tests—marginal indifference, liability thresholds—rather than stylistic cues.
    • Evaluate on coverage of truth and reciprocity tests instead of only perplexity and benchmarks.
    So yes, you can ‘change something in code tomorrow’—but the code change is trivial compared to the training objective shift. Architectures don’t hallucinate; training does.
    They’re asking for a line of code, while I’m describing a shift in paradigm. The way to bridge that gap is to show how our proposal does translate into implementable changes, but at a different layer: training and prompting rather than architecture.
    Here’s my answer:
    My work isn’t about swapping out a few lines of code in the transformer stack. It’s about solving the deeper problem: LLMs don’t reason because they’re trained to imitate coherence, not to compute truth, reciprocity, or liability. You can’t fix that with a patch to the forward pass. You fix it by changing how the model is trained and what it’s asked to do.
    “What does that mean in practice tomorrow morning?
    • Training: curate training data that enforces testifiability, reciprocity, and liability rather than mere coherence. This means restructuring datasets around constructive logic, adversarial dialogue, and measurable closure.
    • Prompting: design prompts as economic tests (price of error, marginal indifference, liability-weighted thresholds), not as instructions for verbosity.
    • Evaluation: stop measuring only perplexity or benchmark scores and start measuring coverage of truth tests, reciprocity tests, and demonstrated interests.”
    “I’m providing the blueprint for why current architectures hallucinate and what guarantees are missing. Once you understand that, the engineering changes become obvious: the ‘code’ change is trivial compared to the shift in training objectives and data design. If you only look for a tensor tweak, you’ll miss the systemic fix.”
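    A minimal sketch of what “measuring coverage of truth tests, reciprocity tests, and demonstrated interests” could look like as an evaluation harness; the test functions are hypothetical placeholders, not a prescribed benchmark:

```python
# Sketch: evaluating a model on coverage of truth and reciprocity tests,
# rather than perplexity alone. Test functions are hypothetical placeholders.
from typing import Callable, Dict, List, Tuple

def coverage_eval(samples: List[Tuple[str, str]],
                  model: Callable[[str], str],
                  tests: Dict[str, Callable[[str, str], bool]]) -> Dict[str, float]:
    """samples: (prompt, reference) pairs; tests: name -> test(output, reference)."""
    totals = {name: 0 for name in tests}
    for prompt, reference in samples:
        output = model(prompt)
        for name, test in tests.items():
            totals[name] += int(test(output, reference))
    n = max(len(samples), 1)
    return {name: count / n for name, count in totals.items()}
```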


    Source date (UTC): 2025-08-22 20:40:21 UTC

    Original post: https://x.com/i/articles/1958992660750114841

  • The Three Regimes of Decidability: Formal, Physical, and Behavioral Grammars in

    The Three Regimes of Decidability: Formal, Physical, and Behavioral Grammars in the Design of AI (??

    The Three Regimes of Decidability: Formal, Physical, and Behavioral Grammars in the Design of AI and Institutions
    Editor’s Introduction:
    The current success of artificial intelligence in mathematics and programming contrasts sharply with its repeated failure in domains requiring reasoning, judgment, and moral coordination. This is not a technological problem—it is an epistemological one. The AI and ML communities routinely confuse grammars of inference by applying methods of decidability appropriate to one domain (formal or physical) to others (behavioral) where they do not apply.
    Mathematics succeeds because it is internally closed and deductively decidable. Programming succeeds because it is formally constrained and computationally verifiable. But reasoning—in the domains of human behavior, norm enforcement, and reciprocal coordination—requires a third regime of grammar: the behavioral. Here, truth is not decided by logic or measurement but by demonstrated interest, cost, liability, and reciprocity.
    This paper provides a corrective. It defines the three regimes of decidability, shows how and why they must not be conflated, and explains the conditions under which each grammar operates. If the AI community is to move beyond mere prediction and toward comprehension, it must learn to respect the epistemic boundaries of these grammars—and build systems that operate under the appropriate constraints for each domain.
    The Three Regimes of Decidability: Formal, Physical, and Behavioral Grammars in the Design of AI and Institutions
    Modern reasoning systems—whether in law, economics, or artificial intelligence—suffer from systematic category errors caused by a failure to distinguish between the formal, physical, and behavioral regimes of decidability. This paper presents a framework for classifying grammars of inference based on their closure criteria, epistemic constraints, and operational validity. It argues that effective reasoning in institutional and artificial systems requires respecting the distinct grammar of each domain, and that failure to do so results in pseudoscience, mathiness, and epistemic opacity.
    1. Introduction
    • Problem statement: AI and institutional systems frequently misapply mathematical or physical models to behavioral domains.
    • Consequence: The conflation of epistemic regimes undermines prediction, cooperation, and moral reasoning.
    • Objective: To restore epistemic clarity by identifying and distinguishing the three regimes of decidability.
    2. Grammar Defined
    • Grammar as system of continuous recursive disambiguation.
    • Features: permissible terms, operations, closure, and decidability.
    • Purpose: enable inference under constraint—memory, cost, coordination.
    3. The Three Regimes of Decidability
    3.1 Formal Grammars
    • Domain: logic, mathematics, computation.
    • Closure: derivation/proof.
    • Constraint: internal consistency.
    • Example: symbolic logic, set theory, Turing machines.
    3.2 Physical Grammars
    • Domain: natural sciences.
    • Closure: measurement and falsifiability.
    • Constraint: causal invariance.
    • Example: physics, chemistry, biology.
    3.3 Behavioral Grammars
    • Domain: law, economics, institutional design.
    • Closure: liability, reciprocity, observed cost.
    • Constraint: demonstrated preference, adversarial testimony.
    • Example: legal procedure, market behavior, contract enforcement.
    4. Failure Modes: Mathiness and Misapplication
    • Definition of mathiness.
    • Economics: formal models without observability.
    • Law: formalism without reciprocity.
    • AI/ML: inference without consequence.
    5. Implications for Artificial Intelligence
    • Why LLMs cannot reason in behavioral domains.
    • Lack of cost, preference, or liability.
    • Need for embodied, adversarial, and accountable architectures.
    6. Toward Epistemic Integrity in Institutions
    • Restoring domain-appropriate grammars.
    • Embedding reciprocity and liability into legal and economic systems.
    • Designing AI that can simulate or interface with behavioral closure.
    7. Conclusion
    • Summary of typology.
    • Epistemic correction as prerequisite for institutional and artificial reasoning.
    • Proposal for further research and standardization of epistemic regimes.


    Source date (UTC): 2025-08-22 20:38:17 UTC

    Original post: https://x.com/i/articles/1958992143063949722

  • AI Funnel to Judgement: HRM (Sapient), Attention with COT (Google), and Action (

    AI Funnel to Judgement: HRM (Sapient), Attention with COT (Google), and Action (Doolittle)

    (Ed. Note: 1 – Please fix LaTeX exposure. 2 – Two unanswered questions near end. 3 – (Important) Repetition of mathematical explanations, because of their clarity, when the LLM can already process correctly without such representations, codifications, and modifications. This will consistently cause the reader to presume that our attempt at formal explanation translates to code modification, when the formatting of responses alone appears to consistently produce the correct decidability in both GPT4 and GPT5. Cardinality is unnecessary at moral and ethical depth (alignment); it is only necessary for discrete transactions where costs are known and can be calculated – and even then their use is questionable.)
    [TODO: Introductory Explanation for non-ML tech Readers (Exec, VC, etc.)]
    CoT-style LLMs and Sapient’s HRM are both engines of epistemic compression. They differ mainly in where the compression lives (explicit language vs. latent hierarchies). Your program supplies the normative and constructive constraints missing from both: (i) first-principles constructive logic for closure, (ii) a cooperation/reciprocity calculus for action under uncertainty, and (iii) a ternary decision rule (true / possibly-true-with-warranty / abstain) that measures variation from the optimum.
    Below we map each piece 1-to-1 and give an operational recipe you can implement today.
    • LLMs (with CoT): Compression is linguistic and sequential. The model linearizes a huge search space into a token-by-token micro-grammar (the “chain”). Yield: transparent steps but high token cost and brittleness. (Background on CoT brittleness and overhead is standard; not re-cited here.)
    • HRM (Sapient): Compression is hierarchical and latent. A fast “worker” loop solves details under a slow “planner” context; the system iterates to a fixed point, then halts. You get deep computation with small parameters and tiny datasets; no text-level chains are required.
    Our contribution: move both from “reasoning-as-trajectory” to reasoning-as-warranted-construction: every answer must carry (a) a constructive trace sufficient for testifiability and (b) a reciprocity/liability ledger sufficient for actionability.
    Target: Replace “appears coherent” with “constructed, checkable, and closed.”
    • Referential problems (math/physics/computation): demand constructive proofs/programs. LLM path: generate a program/derivation + run/check with a tool; return the artifact + pass/fail. HRM path: add a trace projector head that emits the minimal operational skeleton (state transitions, invariants, halting reason). Co-train on checker feedback so the latent plan compresses toward checkable constructions rather than pretty narratives. (Speculative but feasible.)
    • Action problems (law/econ/ethics): demand constructive procedures (roles, rules, prices) rather than opinions. LLM: force outputs into procedures (frames, tests, and remedies). HRM: condition the planner on a procedure schema (who/what/harm/evidence/tests/remedy) so the fixed point equals a completed procedure, not merely a belief vector.
    Our stack says: invariances → measurements → computation → liability-weighted choice. Operationalize it:
    1. Detect grammar type of the query: referential vs. action.
    2. If referential: attempt constructive proof/execution; if success → TRUE; if blocked → fall back to probabilistic accounting with explicit error bounds.
    3. If action: build a Reciprocity Ledger (parties, demonstrated interests, costs, externalities, warranties, enforcement); a data-structure sketch follows below. Produce a rule, price, or remedy, not a “take.”
    4. Attach liability/warranty proportional to scope and stakes.
    This turns both CoT and HRM from “answer generators” into contract-worthy reasoners.
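    A hedged sketch of the Reciprocity Ledger named in step 3, with fields taken from the parenthetical above; the types and the net-imposition rule are assumptions for illustration:

```python
# Sketch: a Reciprocity Ledger record, fields taken from step 3 above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class LedgerEntry:
    party: str
    demonstrated_interests: List[str]  # e.g. life, time, property, reputation, commons
    costs: float                       # priced costs borne by this party
    externalities: float               # unpriced harms imposed on this party
    warranty: str                      # insurance/escrow/enforcement covering the harm

@dataclass
class ReciprocityLedger:
    entries: List[LedgerEntry] = field(default_factory=list)

    def uninsured_imposition(self) -> float:
        """Total unpriced harm lacking a warranty; reciprocity requires this to be zero."""
        return sum(e.externalities for e in self.entries if not e.warranty)
```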
    Define the optimal answer as: “the minimal construction that (i) closes, (ii) is testifiable, and (iii) maximizes cooperative surplus under reciprocity with minimal externalities.”
    At inference time:
    TRY_CONSTRUCT():
        if constructive proof/program passes checkers → output TRUE (+ artifacts)
    ELSE BAYES_ACCOUNT():
        compute liability-weighted best action (reciprocity satisfied?)
        if reciprocity satisfied and expected externalities insured → POSSIBLY TRUE + WARRANTY
        else → ABSTAIN (request bounded evidence or impose boycott/default rule)
    • TRUE = constructed, closed, test-passed.
    • POSSIBLY TRUE + WARRANTY = best cooperative action under quantified uncertainty and explicit insurance.
    • ABSTAIN/REQUEST = undecidable without violating reciprocity (your boycott option).
    This is your ternary logic, operationalized for machines.
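    The same ternary rule, rendered as a function for concreteness; TRY_CONSTRUCT and BAYES_ACCOUNT are passed in as placeholder callables, not a fixed implementation:

```python
# Sketch: the ternary rule above as a function; the two stages are injected callables.
from typing import Callable, Optional, Tuple

def ternary_decide(query: str,
                   try_construct: Callable[[str], Optional[str]],
                   bayes_account: Callable[[str], Tuple[bool, bool]]) -> dict:
    """try_construct returns a checked artifact on success, else None.
    bayes_account returns (reciprocity_satisfied, externalities_insured)."""
    artifact = try_construct(query)
    if artifact is not None:
        return {"verdict": "TRUE", "artifact": artifact}
    reciprocity_ok, insured = bayes_account(query)
    if reciprocity_ok and insured:
        return {"verdict": "POSSIBLY TRUE + WARRANTY"}
    return {"verdict": "ABSTAIN",
            "action": "request bounded evidence or impose boycott/default rule"}
```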
    You want a scalar “distance-to-optimum” the model can optimize. Use a composite loss/score:
    • Closure debt (C): failed proof/run, unmet halting condition (HRM), or unresolved procedure.
    • Uncertainty mass (U): residual entropy after evidence; posterior spread or equilibrium variance.
    • Externality risk (E): expected unpriced harms on non-consenting parties.
    • Description length (D): MDL of the constructive trace (shorter = better compression, subject to correctness).
    • Warranty debt (W): liability not covered by proposed insurance/escrow/enforcement.
    Define Δ* = αC + βU + γE + δD + ωW. Minimize Δ*. Report it with the answer as the warranty grade.
    • LLM training: add RLHF-style reward on low Δ*, with automatic checkers for C and D, Bayesian evaluators for U, and policy simulators for E/W.
    • HRM training: add an auxiliary head to estimate Δ*; use it both as a halting criterion and as a shaping reward so the latent fixed point is the compressed optimum. (Speculative but directly testable.)
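    A small sketch of Δ* as a composite score, following the component definitions above; the default weights and the release-gate threshold are assumptions:

```python
# Sketch: Δ* as a composite score, per the component definitions above.
from dataclasses import dataclass

@dataclass
class DeltaComponents:
    C: float  # closure debt
    U: float  # uncertainty mass
    E: float  # externality risk
    D: float  # description length (MDL of the constructive trace)
    W: float  # warranty debt

def delta_star(x: DeltaComponents,
               alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0,
               delta: float = 1.0, omega: float = 1.0) -> float:
    """Δ* = αC + βU + γE + δD + ωW; lower means closer to the defined optimum."""
    return alpha * x.C + beta * x.U + gamma * x.E + delta * x.D + omega * x.W

def release_gate(score: float, tau: float) -> bool:
    """Gate release on Δ* ≤ τ (see the HRM deployment step below)."""
    return score <= tau
```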
    • Hierarchical planner <-> our “grammar within grammar”: H sets permitted dimensions/operations; L executes lawful transforms; the fixed point = closure.
    • Adaptive halting <-> decidability: HRM’s learned halting acts as a mechanical decision to stop when a bounded construction is achieved. Attach the Δ* head to make that halting normatively correct, not just numerically stable.
    • Small data / strong generalization <-> epistemic compression: HRM’s near-perfect Sudoku and large mazes with ~1k samples indicate genuine internal compression rather than memorized chains; use your constructive + reciprocity scaffolds to push from puzzles → institutions (law/policy).
    • ARC-AGI results <-> paradigm fit: HRM’s ARC gains suggest it’s learning transformation grammars, not descriptions. That aligns with your operationalism (meaning = procedure).
    For a CoT-LLM (a minimal router sketch follows this list):
    1. Router: classify prompt as referential vs. action.
    2. Constructive toolchain: Referential → code/solver/prover; return artifact + pass/fail. Action → instantiate Reciprocity Ledger; run scenario sims; produce rule/price/remedy.
    3. Warrant pack: attach artifacts, ledger, uncertainty bounds, and Δ*.
    4. Ternary decision: TRUE / POSSIBLY TRUE + WARRANTY / ABSTAIN.
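    A minimal sketch of the router in step 1 and the dispatch across steps 2 through 4; the keyword heuristic and the function names are placeholder assumptions (a trained classifier could replace the heuristic):

```python
# Sketch: a referential-vs-action router (step 1) and the dispatch in steps 2-4.
# The keyword heuristic is a placeholder; a trained classifier could replace it.
from typing import Callable

ACTION_MARKERS = ("should", "law", "policy", "price", "contract", "liable", "remedy")

def route(prompt: str) -> str:
    """Classify a prompt as 'referential' (math/physics/computation) or 'action' (law/econ/ethics)."""
    text = prompt.lower()
    return "action" if any(m in text for m in ACTION_MARKERS) else "referential"

def answer(prompt: str,
           constructive_toolchain: Callable[[str], dict],
           reciprocity_pipeline: Callable[[str], dict]) -> dict:
    """Dispatch: referential prompts go to provers/solvers, action prompts to the ledger path."""
    if route(prompt) == "referential":
        return constructive_toolchain(prompt)   # expected to return artifact + pass/fail + Δ*
    return reciprocity_pipeline(prompt)          # expected to return rule/price/remedy + ledger + Δ*
```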
    For HRM:
    1. Schema-conditioned planning: feed H with the grammar schema (dimensions, ops, closure tests).
    2. Aux heads: (a) Trace projector (compressed state-transition sketch); (b) Warranty head producing Δ*; (c) Halting reason code.
    3. Training signals: correctness + checker feedback (closure), MDL regularizer (compression), reciprocity penalties from simulators (externalities), and insurance coverage bonuses (warranty).
    4. Deployment: emit the operational result + trace + warranty; gate release on Δ* ≤ τ.
    • From narrative coherence to constructive warranty.
    • From alignment-only to reciprocity-and-liability.
    • From binary truth to ternary, operational decidability.
    That is the missing “institutional layer” for reasoning systems.
    • For action domains, do you want the default abstention to be boycott (no action) or a default rule (e.g., “status-quo with escrow”) when Δ* is above threshold? (OPEN QUESTION)
    • For referential domains, should we treat MDL minimization as co-primary with correctness (Occam pressure), or strictly secondary to checker-verified closure? (OPEN QUESTION)
    • arXiv: Hierarchical Reasoning Model (Jun 26, 2025).
    • arXiv HTML view (same paper).
    • ARC Prize blog: The Hidden Drivers of HRM’s Performance on ARC-AGI (analysis/overview).
    • GitHub: sapientinc/HRM (official repo).
    • BDTechTalks explainer on HRM (context, quotes, and positioning beyond CoT).
    URLs (as requested):


    Source date (UTC): 2025-08-22 20:35:15 UTC

    Original post: https://x.com/i/articles/1958991378220032093