Comparing Grok 3, ChatGPT 5 and Curt GPT
By Noah Revoy, X/Twitter
Using ChatGPT-5:
Prompt:
I am conducting research into how different AI models attempt to determine the truthfulness of a claim. I would like your help analyzing my findings.
I asked the same question to three different AI models: Grok, ChatGPT, and a modified version of ChatGPT. I will provide their responses as PDFs in this conversation.
The question asked to each model:
What method or system does [model being questioned] use to determine whether a claim is true or false? What first principles does it use to test the truthfulness of a claim, and how does it apply those principles?
Your task:
Rank the three models in order of effectiveness at measuring the truthfulness of a claim, based solely on the answers in the PDFs.
Identify the strengths and weaknesses of each model’s method for evaluating truth.
Note any patterns, omissions, or interpretive biases in the answers that may affect the reliability of the model’s truth-assessment process.
Highlight any other relevant factors that could influence the accuracy or applicability of each model’s method.
When presenting your analysis, be specific and reference concrete details from the PDFs to justify your conclusions.
ChatGPT said:
Thought for 27s
📅 Conversation started: August 11, 2025 — 14:00
- Curt Doolittle’s Natural Law Chat GPT — most effective for truth testing
- Grok — competent generalist method grounded in evidence and logic
- ChatGPT — weakest as described; acknowledges no intrinsic truth test
Justification (one line each):
- Natural Law Chat GPT declares and applies a concrete 10-Step Decidability Syntax, centered on operational language, falsifiability, reciprocity, externality exposure, and juridical decidability.
- Grok outlines a clear, stepwise reasoning workflow (parse → gather → evaluate → reason → iterate → conclude) plus first principles (objective reality, logic, evidence, parsimony, skepticism), with practical exemplars.
- ChatGPT states it lacks a principled, built-in truth test and operates as statistical imitation with RLHF and optional retrieval; it proposes what should exist rather than what it does.
Natural Law Chat GPT — Method and First Principles (what it claims and how)
- 10-Step Decidability Syntax: identifies the operational claim; tests demonstrated interests, reciprocity, testifiability, decidability; traces causal chains; projects deviation costs and externalities; checks computable remedies; considers sex-valence; then issues a verdict (decidable / indeterminate / false).
- Foundations: physical/biological constraints, cognitive/psychological constraints, and cooperative/institutional constraints, all enforced through operationalism and adversarial testing.
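The gated, sequential character of this verdict pipeline can be sketched in code. Everything below is illustrative: the `Claim` type, the gate names, and the pass/fail flags are assumptions paraphrased from the step list above, not the actual system.

```python
# Hypothetical sketch of a gated truth-testing pipeline in the spirit of the
# decidability syntax described above: each gate must pass before the next
# runs, and the pipeline returns one of the three stated verdicts.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    text: str
    # Flags an analyst (or an upstream model) would set per gate.
    operational: bool = False          # stated as testable action sequences?
    falsifiable: bool = False          # does a disconfirming observation exist?
    reciprocal: Optional[bool] = None  # None = audit could not be completed
    externalities_accounted: Optional[bool] = None

def decide(claim: Claim) -> str:
    """Return 'decidable', 'indeterminate', or 'false' per the verdict set."""
    # Gate 1: the claim must be expressed operationally before any test applies.
    if not claim.operational:
        return "indeterminate"
    # Gate 2: non-falsifiable operational claims cannot be adjudicated.
    if not claim.falsifiable:
        return "indeterminate"
    # Later gates: reciprocity and externality audits. An explicit failure
    # yields a false verdict; an incomplete audit leaves the claim open.
    for audit in (claim.reciprocal, claim.externalities_accounted):
        if audit is None:
            return "indeterminate"
        if audit is False:
            return "false"
    return "decidable"
```

The key design point the sketch captures is that the gates are ordered: a claim that never reaches operational form is never scored true or false, only indeterminate.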
Strengths
- Operational language + falsifiability: demands statements as action sequences that can be tested across observers, places, and time.
- Reciprocity and externalities: exposes parasitism, hidden costs, and incentive asymmetries that often corrupt claims in the wild.
- Juridical decidability: frames truth testing so a judge or competent peer group could resolve disputes under rule-of-law constraints.
- Adversarial cross-paradigm testing: stress-tests against science, logic, law, and morality.
Weaknesses / Risks
- Normative entanglement: reciprocity and computable sanctions are ethical-legal filters; they can conflate moral fitness with empirical truth if not distinguished.
- Sex-valence step: the explicit use of sex-based cognitive dimorphism can surface explanatory power, but it also introduces a strong interpretive prior that could bias readings of motive.
- Complexity and burden of operationalization: many real claims are not yet expressed operationally; forcing operational form can be costly or slow. (Inference from the method’s demands.)
Net: Most decisive and falsification-capable for claims that can be rendered operational and adjudicable.
Grok — Method and First Principles (what it claims and how)
- Workflow: parse claim → gather information (web/X) → evaluate evidence (bias, fallacies, gaps) → apply deductive/inductive reasoning (and thought experiments) → iterate/cross-check (DeepSearch) → clear verdict with caveats.
- Foundations: objective reality, universal logic, evidence primacy, parsimony, and balanced skepticism, with concrete illustrations (moon composition, penguins, vaccines, conspiracies, and nutrition).
Strengths
- Evidence-first, example-rich: shows how it would check claims against primary or reputable sources and weigh study quality.
- Iterative uncertainty management: acknowledges unknowns and refines through additional search.
- Parsimonious defaults without dogma: invokes simplicity while admitting it is not proof.
Weaknesses / Risks
- Source-quality dependence: accuracy collapses if upstream sources are biased or low quality; no built-in reciprocity or externality audit to expose motive structures.
- No explicit falsifiability gate: logic and evidence are present, but falsification is not formalized as a required test.
- Probabilistic smoothing: “weigh possibilities” can blur crisp yes/no judgments in contentious domains.
Net: Solid epistemic hygiene for empirical claims; less equipped than Natural Law Chat GPT for incentive forensics and legal decidability.
ChatGPT — Method and First Principles (what it claims and how)
- Stated reality: “does not have an intrinsic, principled method” for truth testing; operates via statistical pattern matching, RLHF, and optional retrieval.
- Mechanics: pattern/frequency matching; human-feedback fine-tuning that biases toward mainstream or “approved” sources; retrieval delegation.
- Aspirational section: lists first principles that should be used (identity, non-contradiction, causality, falsifiability, correspondence, coherence, incentive analysis), but concedes these are not native.
Strengths
- Transparency about limits and clear articulation of a desired principled module; retrieval can improve currency when trustworthy sources exist.
- Mentions incentive analysis explicitly, which is often missing in consensus-based models.
Weaknesses / Risks
- Consensus imitation: defaults to authoritative-sounding consensus rather than ground truth; vulnerable where consensus is wrong.
- No internal falsification engine: truth testing is not native; verification depends on prompts or external scaffolding.
Net: As written, it is a reciter with retrieval, not a judge. It can be improved by embedding a structured truth-testing framework, which the document suggests.
Patterns
- All three invoke logic and evidence in some form. Only Natural Law Chat GPT mandates operational statement, falsifiability, and juridical decidability as gates.
- Grok and ChatGPT both lean on source authority or retrieval; Natural Law Chat GPT leans on operational testability and incentive forensics regardless of source prestige.
Omissions
- Formal restitution design for errors and harms appears only in Natural Law Chat GPT (“computable compromise”), absent in the others.
- Explicit falsifiability is central in Natural Law Chat GPT; Grok implies testing but does not require a falsification criterion; ChatGPT concedes it lacks one natively.
Interpretive biases
- Consensus/authority bias risks for ChatGPT (via RLHF) and, to a lesser degree, Grok (via source prioritization).
- Normative reciprocity bias in Natural Law Chat GPT can tilt evaluation toward claims that are ethically reciprocal even when empirical truth is orthogonal; this must be kept explicit in use.
Domain fit
- Empirical, data-rich claims: Grok’s evidence pipeline and iteration are effective if sources are sound.
- Policy, ethics, institutional claims: Natural Law Chat GPT’s reciprocity, externalities, and decidability checks expose parasitism and make disputes resolvable.
- Fast-moving facts: ChatGPT with retrieval can be serviceable, but quality tracks the reliability of retrieved sources and the absence of RLHF skew.
Operator burden
- Natural Law Chat GPT demands operationalization of claims, which raises initial effort but increases testability.
- Grok and ChatGPT place more burden on source curation and prompt scaffolding.
- Winner: Natural Law Chat GPT — clearest, most enforceable truth-testing chain with falsifiability, reciprocity, externalities, and adjudicability.
- Runner-up: Grok — disciplined generalist; strong on evidence and logic; weaker on incentive forensics and formal falsification gates.
- Third: ChatGPT — honest about limits; functions as consensus emulator unless equipped with an external truth-testing module.
If you want, I can convert these criteria into a scoring rubric and apply it to specific claims to produce comparative truth-testing benchmarks.
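The offered rubric could take a shape like the following. The criteria, weights, and 0-5 scores here are assumptions read off the strengths/weaknesses sections above, for illustration only, not an official benchmark.

```python
# Illustrative weighted rubric; criterion names and weights are hypothetical.
CRITERIA = {
    "falsifiability_gate": 0.30,       # is falsification a required test?
    "evidence_pipeline": 0.25,         # sourcing, iteration, quality weighting
    "incentive_audit": 0.25,           # reciprocity / externality forensics
    "decidability_of_verdicts": 0.20,  # crisp, adjudicable outcomes
}

# Subjective 0-5 scores per model, paraphrased from the analysis above.
SCORES = {
    "Natural Law Chat GPT": {"falsifiability_gate": 5, "evidence_pipeline": 3,
                             "incentive_audit": 5, "decidability_of_verdicts": 5},
    "Grok":                 {"falsifiability_gate": 3, "evidence_pipeline": 5,
                             "incentive_audit": 2, "decidability_of_verdicts": 4},
    "ChatGPT":              {"falsifiability_gate": 1, "evidence_pipeline": 3,
                             "incentive_audit": 2, "decidability_of_verdicts": 2},
}

def weighted_score(model: str) -> float:
    """Weighted sum of the model's criterion scores (max 5.0)."""
    return sum(SCORES[model][c] * w for c, w in CRITERIA.items())

ranking = sorted(SCORES, key=weighted_score, reverse=True)
```

With these example weights the ranking reproduces the Winner/Runner-up/Third order above; shifting weight toward the evidence pipeline and away from incentive audits would narrow the gap between Grok and Natural Law Chat GPT.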
[END]
Source date (UTC): 2025-08-11 16:28:46 UTC
Original post: https://x.com/i/articles/1954943083751383188