Solving The Problem: Computability and Decidability in the Open World

(ed: This article is written for the reader less comfortable with mathematics. If you are comfortable with LaTeX, and can tolerate a few typesetting errors on our part, the math version of this article follows this one.)

TL;DR for fellow supernerds: Doolittle’s innovation reduces to: “set logic with finite limits → supply-demand logic with marginally indifferent limits.” Proof-carrying answers are overfitted to closed worlds; alignment-only filters are underfit to liability. The middle path is liability-weighted Bayesian accounting to marginal indifference.

Why? Because mathematics constitutes a limit of reducibility conceivable by the human mind under self-reflection, while Bayesian accounting is evolved and necessary precisely because it is the only means of accounting for differences beyond the reducibility of the human mind, and therefore closed to introspection. Our neurons aren’t introspectible and neither is Bayesian accounting, though in truth the NNs currently used in LLMs are an intermediary point of reduction, since they encode the equivalent of bundles of human neural sense perception in words. Those words are the limit of reducibility of marginal indifference.

“Mathiness” pursues epsilon–delta in logic space; useful, but the productive epsilon is the error bound in outcome space conditional on reciprocity and externalities. That is what institutions, courts, engineers, and markets already pay for.

The community keeps trying to buy logical certainty with formalism when the productive path for general reasoning is to buy marginal indifference with measurement. Treat reasoning as an economic process: update beliefs, price error, stop when the expected value of more information falls below the liability-weighted tolerance for error in the context. That’s computability for language.

Explanation by GPT5:

Proof-carrying logic is overfit to closed worlds; alignment filters are underfit to liability. The productive middle path is liability-weighted Bayesian accounting to marginal indifference.

Mathematics is reducibility: the epsilon–delta of self-reflection, the mind’s limit of introspection. Bayesian updating is evolved necessity: the only means of accounting for variance beyond reducibility, where neurons—and their aggregates in words—are opaque to introspection. Current neural nets occupy this intermediary, encoding bundles of percepts as linguistic weights: words are the limit of reducibility of marginal indifference.

Mathiness chases epsilon–delta in logic space. But the real epsilon is the error bound in outcome space, conditional on reciprocity and externalities. That is what institutions, engineers, and markets already pay for.

Reasoning must be treated as an economic process: beliefs updated, error priced, and inquiry terminated when the marginal value of precision falls below the liability-weighted tolerance for error in context. That stopping rule is computability for language.

As Such:

Restatement

The Problem with Extremes

Proof-carrying answers (formal logic, set-theoretic limits) are overfit: they assume a closed world where all variables can be specified.
Alignment-only filters (pure preference or reinforcement filters) are underfit: they lack liability-accountability because they ignore externalities.

The Middle Path

The correct solution is liability-weighted Bayesian accounting: update beliefs until further information has no marginal value (marginal indifference), with tolerance for error scaled by the liability (cost of being wrong in context).

Why Bayesian, not Pure Math?

Mathematics = reducibility: it captures what the human mind can introspectively reduce to first principles.
Bayesian accounting = evolved necessity: it is the only way to handle variation beyond the mind’s reducibility (neural processes themselves are non-introspectible, and so are Bayesian updates).
Neural nets sit in between: they approximate bundles of human percepts in word-weights, making language itself a limit of reducibility of marginal indifference.

Implication for AI Reasoning

Formalism (“mathiness”) chases epsilon–delta in logic space, but real productivity comes from bounding error in outcome space given reciprocity and externalities.
Markets, courts, and engineers already pay for error bounds, not perfect logical closure.
Therefore, reasoning should be treated like an economic process:
update beliefs (Bayesian step),
price error (liability step),
stop when further information is not worth the cost.
That is what makes reasoning in language computable.

Outline:

Part 1: Why Measurement Beats Mathiness (thesis + critique)
Part 2: The Indifference Method (full formalization + EIC + ROMI)
Part 3: Liability Tiers and Thresholds (defaults + examples)

Below is a tight formalization you can lift.

Testifiability (Truth).
Satisfaction of the demand for testifiable warrant across the accessible dimensions: categorical consistency, logical consistency, empirical correspondence, operational repeatability, and rational/reciprocal choice. Practically: keep a set of per-axis coverage scores, each between 0 and 1. The context sets minimum thresholds for each axis.

Decidability.
“Satisfaction of the demand for infallibility in the context in question without the necessity of discretion.” Operationally: a decision is decidable when the decidability margin (defined below) is zero or positive given the liability of error.

Marginal Indifference (decision standard).
For each candidate action, compute its expected loss by summing the losses across possible states of the world, each weighted by its current probability. Let the best action be the one with the lowest expected loss; the runner-up is the next best. Define the decidability margin as:

the runner-up’s expected loss
minus the best action’s expected loss
minus the required certainty gap for this context (the liability-derived cushion you must clear).

Decision status:

Decidable: the decidability margin is zero or positive and all testifiability thresholds are met.
Indifferent (stop rule): the expected value of the next measurement is less than or equal to the required certainty gap.
Undecidable: otherwise; seek more measurement.
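
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the umbrella example, and its numbers are hypothetical, and the order in which the three statuses are checked is our assumption.

```python
from typing import Dict, Tuple

def expected_losses(loss: Dict[str, Dict[str, float]],
                    prob: Dict[str, float]) -> Dict[str, float]:
    """Expected loss per action: sum over states of P(state) * loss in that state."""
    return {action: sum(prob[state] * cost for state, cost in by_state.items())
            for action, by_state in loss.items()}

def decidability_margin(exp_loss: Dict[str, float],
                        required_gap: float) -> Tuple[str, str, float]:
    """Return (best action, runner-up, margin). Needs at least two actions."""
    ranked = sorted(exp_loss, key=exp_loss.get)
    best, runner_up = ranked[0], ranked[1]
    return best, runner_up, exp_loss[runner_up] - exp_loss[best] - required_gap

def decision_status(margin: float, thresholds_met: bool,
                    next_measurement_value: float, required_gap: float) -> str:
    """Classify a decision; the check order is an assumption of this sketch."""
    if margin >= 0 and thresholds_met:
        return "Decidable"
    if next_measurement_value <= required_gap:
        return "Indifferent"
    return "Undecidable"

# Hypothetical illustration (all numbers invented for the example).
prob = {"rain": 0.3, "dry": 0.7}
loss = {"carry umbrella": {"rain": 1.0, "dry": 1.0},
        "go without": {"rain": 10.0, "dry": 0.0}}
exp_loss = expected_losses(loss, prob)
best, runner_up, margin = decidability_margin(exp_loss, required_gap=0.5)
print(best, runner_up, round(margin, 2))
```

The margin is always runner-up minus best minus the required gap, so a positive number means the best action wins by more than the liability cushion.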

Bayesian Accounting (the missing piece).
Maintain a ledger rather than a proof.

Assets: gains in evidential support from corroborating measurements.
Liabilities: expected externalities of error (population × severity) plus any warranty you promise.
Equity (warrant): the net decisional surplus over the required certainty gap.
Decide when equity is non-negative and testifiability thresholds are met.

Limit-as-reasoning (unifying “math limit” and “marginal indifference”).
As measurements accumulate, posterior odds and expected-loss gaps stabilize. The limit approached is the smallest practical error bound such that no additional evidence with positive value could flip the decision across the required certainty gap. Reasoning is a limit-seeking process; the “proof” is the convergence certificate.

Completeness vs. liability. Formal derivation optimizes certainty inside axiomatic spaces. General reasoning optimizes expected outcomes under liability. Outside math, liability is usually the binding constraint.
Open-world evidence. Incompleteness, path-dependence, and dependence among sources make perfect formal closure intractable. Bayesian accounting prices these imperfections and still yields action.
Opportunity cost. The cost of further formalization often exceeds the expected value of information. Markets stop at marginal indifference. Reasoners should, too.

Operationalization. Reduce every claim to an actionably measurable sequence (who does what, when, with what materials, yielding which observations). No operation → no update.
Multi-axis tests. Score testifiability across: categorical, logical, empirical, operational, and reciprocal-choice. Fail any mandatory axis → no decision.
Reliability-weighted evidence. Weight updates by instrument quality, source dependence, and adversarial exposure; discount dependent testimony (log-opinion pooling with dependency penalties).
Liability calibration. Map the context to its required certainty gap (e.g., casual advice < finance < medicine < law/regulation). Higher liability demands a larger expected-loss gap and higher testifiability thresholds.
Stop rule (marginal indifference). Estimate the expected value of the next-best measurement; stop when it is less than or equal to the required certainty gap.
Reciprocity constraint. Filter actions and claims by Pareto-improvement and non-imposition (expected externalities priced into the liability term).
Audit trail. Publish the ledger: priors, evidence deltas, dependency corrections, the expected-loss table, the decidability margin, the testifiability scores, and the resulting convergence certificate.

Epsilon-Indifference Certificate (EIC) — include:

the convergence bound (the smallest practical error bound described above),
the decidability margin (surplus over the required certainty gap),
the testifiability scores and their thresholds,
the context and liability settings,
and the audit (ledger entries and the measurement plan considered and rejected once the stop rule was met).

This is the computable replacement for “sounds plausible.” It is the artifact that makes the answer testifiable and the choice decidable.

ROMI — Reasoning as Optimizing Marginal Indifference

Parse → Operations. Translate the prompt into an explicit set of hypotheses and candidate actions.
Priors. Set structural priors (base rates, domain constraints).
Plan measurements. Enumerate tests with estimated information gain and cost.
Acquire/verify. Retrieve or simulate measurements; apply reliability and dependency corrections.
Update. Revise odds and compute expected losses for each action.
Calibrate liability. Choose the context class → compute the required certainty gap; set the testifiability thresholds.
Stop/continue. If the expected value of the next measurement is less than or equal to the required gap and thresholds are met, stop; otherwise measure more.
Decide & certify. Output the chosen action with the EIC and the full ledger.

This is Bayesian decision-making under reciprocity constraints—accounting, not theorem-proving. It exploits the LLM’s strengths (fast hypothesis generation and measurement planning) while binding it to liability-aware stopping.
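
The ROMI loop can be sketched as follows, under heavy simplifying assumptions: a single binary hypothesis, evidence expressed as log-odds in bits that has already been reliability- and dependency-corrected, and a pre-estimated value for each planned test. The `Measurement` class and `romi` function are hypothetical names, and the testifiability check is omitted to keep the sketch short.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Measurement:
    name: str
    evidence_bits: float  # log2 likelihood ratio for the hypothesis, already corrected
    est_value: float      # estimated value of running this test, in loss units

def romi(prior_prob: float, loss_if_true: Dict[str, float],
         loss_if_false: Dict[str, float], plan: List[Measurement],
         required_gap: float) -> Tuple[str, str, float, list]:
    """Update, price, and stop. Assumes required_gap >= 0 and two or more actions."""
    log_odds = math.log2(prior_prob / (1 - prior_prob))
    ledger = [("prior", log_odds)]
    queue = sorted(plan, key=lambda m: m.est_value, reverse=True)
    while True:
        p = 1 / (1 + 2 ** (-log_odds))
        exp_loss = {a: p * loss_if_true[a] + (1 - p) * loss_if_false[a]
                    for a in loss_if_true}
        ranked = sorted(exp_loss, key=exp_loss.get)
        margin = exp_loss[ranked[1]] - exp_loss[ranked[0]] - required_gap
        if margin >= 0:
            return "Decidable", ranked[0], margin, ledger
        next_value = queue[0].est_value if queue else 0.0
        if next_value <= required_gap:  # stop rule: next test is not worth it
            return "Indifferent", ranked[0], margin, ledger
        m = queue.pop(0)                # measure, then update the odds
        log_odds += m.evidence_bits
        ledger.append((m.name, m.evidence_bits))

# Hypothetical run: two symmetric actions, one informative test.
status, action, margin, ledger = romi(
    prior_prob=0.5,
    loss_if_true={"act": 0.0, "wait": 10.0},
    loss_if_false={"act": 10.0, "wait": 0.0},
    plan=[Measurement("lab test", 2.0, 3.0), Measurement("survey", 1.0, 1.0)],
    required_gap=2.0,
)
print(status, action, round(margin, 2), ledger)
```

The loop always terminates because an empty plan has a next-test value of zero, which cannot exceed a non-negative required gap.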

Computability from prose. Operationalization plus accounting turns language into a measured decision process.
Safety as economics. Liability is priced into the required certainty gap rather than handled by blunt alignment filters.
Graceful degradation. When undecidable under current evidence and liability, return the next-best measurement plan with value estimates.
Universally commensurable. All domains reduce to the same artifact (EIC + ledger), satisfying the demand for commensurability.

Context tiers → required certainty gaps: e.g., Chat (low), Technical advice (medium), Medical/Legal (high).
Axis thresholds: stricter for high-liability contexts.
Pooling rule: log-opinion pooling with a dependency penalty vs. hierarchical Bayes (choose one; both are defensible).
Penalty schema: externality classes and population weights.

Claim: …
Operations: …
Evidence ledger: priors → updates (source, reliability, how much it moved the needle) → dependency adjustments.
Testifiability vs. thresholds: [categorical, logical, empirical, operational, reciprocity] = […].
Liability class → required certainty gap: …
Expected-cost table for the candidate actions; decidability margin: …
Expected value of the next test: … → Stop?
Decision with EIC {convergence bound, decidability margin, testifiability scores, thresholds, context, audit}.
Status: Decidable / Indifferent / Undecidable (with next-measurement plan).

Proof-carrying answers are overfitted to closed worlds; alignment-only filters are underfit to liability. The middle path is liability-weighted Bayesian accounting to marginal indifference.
“Mathiness” pursues epsilon–delta in logic space; useful, but the productive “epsilon” is the error bound in outcome space conditional on reciprocity and externalities. That is what institutions, courts, engineers, and markets already pay for.

Yes—the argument stands. For general reasoning, you optimize to marginal indifference under a liability-aware evidence ledger, not to formal certainty. The goal isn’t a proof; it’s a decidable action with a warranted error bound that fits the context’s demand for infallibility.

1) “Mathiness” vs. measurement
Formal derivations are sufficient but rarely necessary. Outside closed worlds, the task is to minimize expected externalities of error, not to maximize syntactic closure.

2) Bayesian accounting is the engine
Treat each evidence update as a line item on an assets–liabilities ledger. Keep measuring until the expected value of the next measurement is no greater than the required certainty gap set by the context’s liability tier. That stop rule is what delivers marginal indifference.

3) Outputs: testifiability and decidability
Require minimum scores on the five axes of testifiability (categorical, logical, empirical, operational, reciprocity) and a decidability margin (the best option’s advantage minus the required certainty gap) that is zero or positive.

4) Limit-as-reasoning
Think of reasoning as convergence: keep measuring until additional evidence cannot reasonably flip the decision given the required certainty gap. Issue a short Epsilon-Indifference Certificate (EIC) documenting why further measurement isn’t worth it.

5) LLMs’ comparative advantage
LLMs excel at hypothesis generation and measurement planning; they struggle with global formal closure. Constrain them with the ledger + stop rule so their strengths are productive and their weaknesses are bounded.

Operationalization. Every claim reduces to concrete, measurable operations. No operation → no justified update.
Liability mapping. Map the context’s demand for infallibility into a required certainty gap and axis thresholds for testifiability.
Dependency control. Penalize correlated or duplicate evidence; price adversarial exposure.
Auditability. Every decision ships with the evidence ledger and the EIC.

Fat tails / ruin risks. Optimize risk-adjusted expected loss (e.g., average of the worst tail of outcomes) rather than plain expectation. Raise the required certainty gap or add hard guards for irreversible harms.
Multi-stakeholder externalities. Treat liability as a vector across affected groups. Clear the margin under a conservative aggregator (default: protect the worst-affected), so you don’t buy gains by imposing costs on a minority.
Severe ambiguity / imprecise priors. Use interval posteriors or imprecise probability sets; choose the set of admissible actions and apply the required certainty gap to break ties.
Model misspecification / distribution shift. Add a specification penalty when you suspect shift; raise the required certainty gap or fall back to minimax-regret in high-shift regions.
Information hazards / strategic manipulation. Price the externalities of measuring into the expected value of information; refuse measurements that reduce welfare under reciprocity constraints.

Liability schedule. Use discrete tiers (e.g., Chat → Engineering → Medical/Legal → Societal-risk). Each tier sets a required certainty gap and axis thresholds, with empirical and operational demands escalating faster than categorical and logical.
Risk-adjusted margin. Compute the decisional advantage using a tail-aware measure (e.g., average of worst-case slices), then subtract the tier’s required certainty gap.
Vector liability aggregator. Default to max-protect the worst-affected; optionally allow a documented weighted scheme when policy demands it.
Imprecise update mode. If uncertainty bands overlap the required gap, return admissible actions + next best measurement plan rather than a single action.
Certificate extension (EIC++). Include: chosen risk measure, stakeholder weights/guard, shift penalty, and dependency-adjusted evidence deltas.
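
One way to read the "protect the worst-affected" default is to compute the decidability margin separately for each affected group and require the smallest one to clear. The sketch below assumes that reading, and assumes the same required gap applies to every group; the function name, groups, and numbers are hypothetical.

```python
from typing import Dict, Tuple

def worst_affected_margin(cost_by_group: Dict[str, Dict[str, float]],
                          best: str, runner_up: str,
                          required_gap: float) -> Tuple[str, float]:
    """Return the group with the smallest margin, and that margin."""
    margins = {group: costs[runner_up] - costs[best] - required_gap
               for group, costs in cost_by_group.items()}
    worst = min(margins, key=margins.get)
    return worst, margins[worst]

# Hypothetical example: shipping helps customers a lot, neighbors barely.
cost_by_group = {
    "customers": {"ship": 1.0, "delay": 3.0},
    "neighbors": {"ship": 2.0, "delay": 2.2},
}
group, margin = worst_affected_margin(cost_by_group, "ship", "delay", required_gap=0.5)
print(group, round(margin, 2))
```

Here the aggregate gain would hide the fact that one group does not clear the cushion, so the choice is not yet decidable under this aggregator.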

Computability from prose. Language → operations → evidence ledger → certificate.
Graceful stopping. Every answer carries a why-stop-now justification: the next test isn’t worth enough to matter.
Context-commensurability. One artifact across domains; only the liability tier, axis thresholds, and required gap change.
Accountable disagreement. Disagreements reduce to public differences in priors, instrument reliabilities, or liability settings—all auditable.

The argument is correct in principle and superior in practice provided you:
(a) enforce operationalization,
(b) calibrate liability into a risk-aware required certainty gap,
(c) control evidence dependence, and
(d) emit an auditable certificate.
Do that, and “mathiness” gives way to measured, decidable action with bounded error—the product markets and institutions actually demand.

We use five liability tiers. Higher tiers mean higher stakes and a bigger required cushion before we act. Think in three pieces:

Expected cost: what you expect each option will cost after considering chances and consequences.
Spread: how jumpy that comparison is—use a robust “typical swing” (median absolute deviation) rather than a fragile standard deviation.
Required certainty gap: how much better the best option must be (beyond noise) at this tier before we’re willing to act.

We also look at tail risk: how the worst few percent of cases behave. Concretely, we judge using the average of the worst X% of outcomes (that’s CVaR, conditional value at risk, in plain English).
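
Both quantities are easy to compute from a list of sampled costs. A minimal sketch, assuming higher cost means worse and that outcomes are available as equally weighted samples; the function names are ours.

```python
import math
import statistics
from typing import Sequence

def worst_tail_average(costs: Sequence[float], tail_fraction: float) -> float:
    """Average of the worst tail_fraction of sampled costs (an empirical CVaR)."""
    ordered = sorted(costs, reverse=True)
    # Small tolerance guards against floating-point noise in the product.
    k = max(1, math.ceil(tail_fraction * len(ordered) - 1e-9))
    return sum(ordered[:k]) / k

def typical_swing(values: Sequence[float]) -> float:
    """Median absolute deviation: the median distance from the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

costs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(worst_tail_average(costs, 0.20))   # averages the two largest costs
print(typical_swing([1, 2, 3, 4, 100]))  # barely moved by the outlier
```

The second call shows why the median absolute deviation is preferred here: a single extreme value does not inflate the spread the way it would inflate a standard deviation.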

Tiers and defaults

Tier | Typical contexts | Worst-tail slice we average over | Required certainty gap = multiplier × spread | Minimum evidence surplus
1 | Casual chat, exploratory analysis | worst 20% | 0.25 × spread | ~0.5 “bits” (≈ 1.4:1 odds)
2 | Consumer advice, coding tips | worst 10% | 0.50 × spread | ~1.0 bit (≈ 2:1 odds)
3 | Engineering, finance (non-safety) | worst 5% | 1.00 × spread | ~2.0 bits (≈ 4:1 odds)
4 | Medical, legal, compliance | worst 1% | 2.00 × spread | ~3.0 bits (≈ 8:1 odds)
5 | Societal or irreversible harms | worst 0.5% | 4.00 × spread | ~4.0 bits (≈ 16:1 odds)

Decision rule (“decidability margin”)

Compute the expected cost of the best option and the runner-up, using the worst-tail averaging appropriate to the tier.
Subtract the best from the runner-up to get the benefit gap.
Subtract the required certainty gap (the multiplier × spread).
If what remains is zero or positive, and the testifiability thresholds (below) are met, the choice is decidable. Otherwise, gather more measurement.
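
The tier defaults can be kept as a small lookup table so that only the tier number changes between contexts. A sketch under that assumption; the `Tier` type, the `TIERS` name, and the example costs are hypothetical, while the values are the defaults listed above.

```python
from typing import NamedTuple

class Tier(NamedTuple):
    contexts: str
    tail_fraction: float   # worst-tail slice we average over
    gap_multiplier: float  # required gap = multiplier * spread
    min_bits: float        # minimum evidence surplus

TIERS = {
    1: Tier("Casual chat, exploratory analysis", 0.20, 0.25, 0.5),
    2: Tier("Consumer advice, coding tips", 0.10, 0.50, 1.0),
    3: Tier("Engineering, finance (non-safety)", 0.05, 1.00, 2.0),
    4: Tier("Medical, legal, compliance", 0.01, 2.00, 3.0),
    5: Tier("Societal or irreversible harms", 0.005, 4.00, 4.0),
}

def tier_margin(tier: int, best_tail_cost: float,
                runner_up_tail_cost: float, spread: float) -> float:
    """Benefit gap minus the tier's required certainty gap."""
    required_gap = TIERS[tier].gap_multiplier * spread
    return (runner_up_tail_cost - best_tail_cost) - required_gap

# Same hypothetical evidence, judged at two different tiers.
print(tier_margin(3, 10.0, 14.0, 3.0))  # clears the Tier-3 cushion
print(tier_margin(5, 10.0, 14.0, 3.0))  # falls short at Tier 5
```

The costs passed in are assumed to be worst-tail averages already computed with the tier's `tail_fraction`.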

We score five axes from 0 to 1. Thresholds tighten with liability. Empirical and operational requirements ramp fastest.

Categorical: terms are defined and used consistently; no category mistakes.
Logical: reasoning is coherent; no unresolved contradictions or circularity.
Empirical: claims are supported by measurements from reliable instruments or sources.
Operational: the claim reduces to concrete, executable steps with preconditions and expected observations.
Reciprocity: expected externalities are priced and disclosed; the choice does not impose hidden costs on others.

Minimum scores required to act

Tier | Categorical | Logical | Empirical | Operational | Reciprocity
1 | 0.60 | 0.60 | 0.30 | 0.30 | 0.50
2 | 0.70 | 0.75 | 0.50 | 0.60 | 0.70
3 | 0.85 | 0.85 | 0.70 | 0.75 | 0.85
4 | 0.90 | 0.90 | 0.85 | 0.90 | 0.90
5 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95

Interpretation: by Tier 4–5 you need near-complete measurement and a runnable procedure—not just clean logic.
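
Checking scores against the table is mechanical. A minimal sketch that returns the axes falling short at a given tier; the names `AXES`, `MIN_SCORES`, and `failing_axes` are ours, and the example scores are hypothetical.

```python
from typing import Dict, List

AXES = ("categorical", "logical", "empirical", "operational", "reciprocity")

# Minimum scores required to act, per tier, in the order of AXES.
MIN_SCORES = {
    1: (0.60, 0.60, 0.30, 0.30, 0.50),
    2: (0.70, 0.75, 0.50, 0.60, 0.70),
    3: (0.85, 0.85, 0.70, 0.75, 0.85),
    4: (0.90, 0.90, 0.85, 0.90, 0.90),
    5: (0.95, 0.95, 0.95, 0.95, 0.95),
}

def failing_axes(scores: Dict[str, float], tier: int) -> List[str]:
    """Axes whose score is below the tier's minimum; empty list means pass."""
    return [axis for axis, floor in zip(AXES, MIN_SCORES[tier])
            if scores[axis] < floor]

scores = {axis: 0.88 for axis in AXES}  # hypothetical scores
print(failing_axes(scores, 3))  # nothing fails at Tier 3
print(failing_axes(scores, 4))  # several axes fall short at Tier 4
```

Returning the failing axes rather than a bare pass or fail tells the reasoner which kind of measurement to seek next.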

Default: log-opinion pooling with dependency penalties—plain English version:

Start with multiple sources (experiments, datasets, experts).
Give each a reliability weight from 0 to 1, based on instrument quality and track record.
Detect clusters of dependent or near-duplicate sources; reduce their combined influence so you don’t “double-count the same voice.”
Cap any single source’s influence so no one dominates.
Combine the adjusted contributions to update the odds for each hypothesis.

Practical settings (defaults you can change):

Penalty strength for dependency: moderate.
Weight cap for a single source: 40%.
For a cluster of m near-duplicates, divide the cluster’s total weight by the square root of m (effective sample size rule of thumb).
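
These defaults can be sketched as a weighted log-linear pool. Several choices here are our assumptions rather than part of the method as stated: each source reports its own log-odds in bits, reliabilities are normalized to shares before the 40% cap is applied, capped weight is not redistributed, and the "moderate" dependency penalty is realized only through the square-root rule.

```python
import math
from collections import Counter
from typing import List, Tuple

Source = Tuple[float, float, str]  # (log2 odds for the hypothesis, reliability 0..1, cluster id)

def pool_log_odds(sources: List[Source], cap: float = 0.40) -> Tuple[float, List[float]]:
    """Pooled log-odds (bits) and the final per-source weights.

    Assumes at least one source has a positive reliability.
    """
    cluster_sizes = Counter(cluster for _, _, cluster in sources)
    # A cluster of m sources has its total weight divided by sqrt(m),
    # so each member is scaled by 1 / sqrt(m).
    raw = [rel / math.sqrt(cluster_sizes[cluster]) for _, rel, cluster in sources]
    total = sum(raw)
    weights = [min(w / total, cap) for w in raw]
    pooled = sum(w * bits for w, (bits, _, _) in zip(weights, sources))
    return pooled, weights

# Hypothetical: four near-duplicate sources agree, one independent source dissents.
sources = [(2.0, 1.0, "wire"), (2.0, 1.0, "wire"), (2.0, 1.0, "wire"),
           (2.0, 1.0, "wire"), (-1.0, 1.0, "lab")]
pooled, weights = pool_log_odds(sources)
print(round(pooled, 3), [round(w, 3) for w in weights])
```

With the penalty, the four duplicates together carry the weight of two independent sources, so the lone dissenting source is not drowned out.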

Every answer comes with a short Epsilon-Indifference Certificate—an audit trail that justifies why we stopped now and why this action is warranted.

What’s in it (human-readable fields):

Claim and context tier.
Priors used.
Evidence ledger: each item with type, reliability, “how much it moved the needle,” and which cluster it belongs to.
Pooling summary: the final weights after dependency penalties.
Posterior odds in plain numbers.
Options compared and their expected costs (already using the right worst-tail averaging for the tier).
Spread of that cost difference (the typical swing).
Required certainty gap for this tier.
Decidability margin: benefit gap minus required gap (must be ≥ 0).
Testifiability scores on the five axes vs. the tier’s thresholds.
Value of the next measurement: how much we expect the next best test to help; if it’s below the required gap, we stop.
Decision and a short rationale.
Audit hash (so the exact artifact can be reproduced).
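
The certificate maps naturally onto a record type. The sketch below is one possible shape, not a fixed schema: the class and field names are ours, and hashing a canonical JSON dump with SHA-256 is an assumed way to produce the audit hash. The example reuses the costs from the settlement case later in the article and fills the remaining fields with placeholder values.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List

@dataclass
class EpsilonIndifferenceCertificate:
    claim: str
    tier: int
    priors: Dict[str, float]
    ledger: List[dict]                # each item: type, reliability, delta, cluster
    pooled_weights: Dict[str, float]
    posterior_odds: float
    expected_costs: Dict[str, float]  # per option, already tail-averaged
    spread: float
    required_gap: float
    testifiability: Dict[str, float]
    thresholds: Dict[str, float]
    next_measurement_value: float
    decision: str
    rationale: str

    def decidability_margin(self) -> float:
        best, runner_up = sorted(self.expected_costs.values())[:2]
        return runner_up - best - self.required_gap

    def audit_hash(self) -> str:
        """SHA-256 over a canonical JSON dump (an assumed scheme)."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

axes = ("categorical", "logical", "empirical", "operational", "reciprocity")
cert = EpsilonIndifferenceCertificate(
    claim="Settle rather than litigate", tier=4,
    priors={"lose_in_court": 0.5},    # placeholder prior
    ledger=[], pooled_weights={},     # placeholders: no sources itemized here
    posterior_odds=1.0,
    expected_costs={"settle": 2.20, "litigate": 3.50},
    spread=0.50, required_gap=1.00,
    testifiability={a: 0.92 for a in axes},  # placeholder scores
    thresholds=dict(zip(axes, (0.90, 0.90, 0.85, 0.90, 0.90))),
    next_measurement_value=0.25,
    decision="settle", rationale="Margin clears the Tier-4 gap; next study is not worth it.",
)
print(round(cert.decidability_margin(), 2), cert.audit_hash()[:12])
```

Sorting keys before hashing means two certificates with the same content produce the same hash regardless of field order in the source data.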

A note on “bits of evidence”: 1 bit ≈ moving from 1:1 to 2:1 odds; 2 bits ≈ 4:1; 3 bits ≈ 8:1; 4 bits ≈ 16:1. We require a minimum surplus by tier.
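
The conversion is a base-2 logarithm, so it fits in three one-line helpers. The function names are ours.

```python
import math

def bits_to_odds(bits: float) -> float:
    """n bits of evidence multiply even odds by 2**n."""
    return 2.0 ** bits

def odds_to_bits(odds: float) -> float:
    return math.log2(odds)

def odds_to_probability(odds: float) -> float:
    return odds / (1.0 + odds)

for bits in (0.5, 1.0, 2.0, 3.0, 4.0):
    odds = bits_to_odds(bits)
    print(f"{bits} bits -> {odds:.2f}:1 odds -> p = {odds_to_probability(odds):.3f}")
```

Working in bits makes independent pieces of evidence additive, which is what lets the ledger record each item as a simple delta.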

Worked example (Tier 4): settle or litigate

Offer to settle: $2.20M.
If litigate: about $1.00M in legal costs; if you lose, $5.00M in damages.
After pooling evidence: about a 50% chance of losing in court (dependency-penalized sources).
Expected cost of litigating: 0.5 × $5.00M + $1.00M = $3.50M.
Expected cost of settling: $2.20M.
Benefit gap: $3.50M − $2.20M = $1.30M.

Tier-4 settings:

Worst-tail averaging: we judge using the average of the worst 1% of outcomes.
Spread (typical swing) in the cost difference: about $0.50M.
Required certainty gap: 2.0 × $0.50M = $1.00M.
Decidability margin: $1.30M − $1.00M = $0.30M → passes.

Testifiability scores clear Tier-4 thresholds (empirical and operational are high because we have concrete costs and procedures). The expected value of one more study on damages might improve things by about $0.25M—below the $1.00M required gap—so we stop.

Decision: Settle. EIC issued with the ledger.

Worked example (Tier 2): extended warranty

Warranty price: $200 for three years.
If it fails: average repair cost $500.
After pooling: failure probability around 12% (duplicates penalized).
Expected cost without warranty: 0.12 × $500 = $60.
Expected cost with warranty: $200.
Benefit gap (buy − skip): $200 − $60 = $140.

Tier-2 settings:

Worst-tail averaging: average of the worst 10% of outcomes.
Spread (typical swing) in the cost difference: about $50.
Required certainty gap: 0.5 × $50 = $25.
Decidability margin: $140 − $25 = $115 → passes.

Evidence surplus is above the Tier-2 minimum. The next measurement (brand-specific reliability) is worth about $10, below the required gap, so we stop.

Decision: Don’t buy the warranty. EIC issued.
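
Both worked examples can be re-derived with one small function. This sketch follows the arithmetic as printed, which uses plain expected costs; the `decide` helper and its field names are hypothetical.

```python
from typing import Dict

def decide(expected_costs: Dict[str, float], spread: float,
           multiplier: float, next_test_value: float) -> dict:
    """Rank options by expected cost, then apply the margin and stop rules."""
    ranked = sorted(expected_costs, key=expected_costs.get)
    best, runner_up = ranked[0], ranked[1]
    required_gap = multiplier * spread
    benefit_gap = expected_costs[runner_up] - expected_costs[best]
    return {
        "decision": best,
        "benefit_gap": benefit_gap,
        "required_gap": required_gap,
        "margin": benefit_gap - required_gap,
        "stop": next_test_value <= required_gap,
    }

# Tier 4: settle or litigate (costs in $M).
litigation = decide({"settle": 2.20, "litigate": 0.5 * 5.00 + 1.00},
                    spread=0.50, multiplier=2.0, next_test_value=0.25)

# Tier 2: extended warranty (costs in $).
warranty = decide({"buy": 200.0, "skip": 0.12 * 500.0},
                  spread=50.0, multiplier=0.5, next_test_value=10.0)

for name, result in (("litigation", litigation), ("warranty", warranty)):
    print(name, {k: round(v, 2) if isinstance(v, float) else v
                 for k, v in result.items()})
```

Because the options are ranked before subtracting, the benefit gap is always the runner-up's cost minus the best option's cost, whichever way the options are labeled.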

Language → operations: every claim is turned into steps, measurements, and expected observations.
Accounting, not proof-hunting: we keep a ledger of how each piece of evidence changes the odds, while pricing externalities as liability.
Context-aware stopping: we stop when the next test isn’t worth as much as the required gap for this tier.
One artifact across domains: only the thresholds and required gap change with stakes; the method and the certificate don’t.

Tiers: 5, with the worst-tail slices, gap multipliers, and evidence minima listed above.
Thresholds: empirical and operational escalate faster than categorical and logical; table above.
Pooling: log-opinion pooling with dependency penalties; weight cap per source; cluster de-duplication by effective sample size.

A stricter Tier 5 is a one-knob change: for extra conservatism on irreversible harms, push the required gap multiplier from 4.0 to 5.0 and keep everything else fixed.

Source date (UTC): 2025-08-19 23:08:43 UTC

Original post: https://x.com/i/articles/1957942837355639117
