A reproducible, fully-local probe of Magnifica Humanitas (Leo XIV, 15 May 2026) using the open-source Binoculars detector — scored in both English and Italian to control for translation artifacts.
The viral claim leaned on stylistic fingerprints: em-dashes, "not-X-but-Y" constructions, and AI-favored vocabulary, most famously "127 em-dashes vs 0 in Dilexit Nos." But a fingerprint only means something against a fair baseline. Here is Magnifica (red) against seven indisputably human encyclicals spanning 1891–2024. Bars are occurrences per 1,000 words; the dotted line is the human mean.
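As a minimal sketch of how such per-1,000-word rates could be computed: the function name `rates_per_1000` and the regular expressions below are illustrative assumptions, not the patterns used to build the chart, and the "not-X-but-Y" pattern in particular is deliberately crude.

```python
import re


def rates_per_1000(text: str) -> dict[str, float]:
    """Return occurrences per 1,000 words for three stylistic markers."""
    # Words are runs of word characters, allowing internal apostrophes.
    words = re.findall(r"\w+(?:['\u2019]\w+)*", text)
    n_words = len(words)
    if n_words == 0:
        return {"em_dash": 0.0, "not_but": 0.0, "genuine": 0.0}

    counts = {
        # U+2014 is the em-dash character.
        "em_dash": text.count("\u2014"),
        # Assumed pattern: "not" followed by "but" within the same clause.
        "not_but": len(
            re.findall(r"\bnot\b[^.;!?]{1,80}?\bbut\b", text, flags=re.IGNORECASE)
        ),
        # Matches "genuine" and "genuinely".
        "genuine": len(re.findall(r"\bgenuine(?:ly)?\b", text, flags=re.IGNORECASE)),
    }
    return {name: 1000.0 * count / n_words for name, count in counts.items()}


sample = "This is not a machine \u2014 but a genuinely human text."
print(rates_per_1000(sample))
```

Normalizing by word count is what makes a long encyclical comparable with a short one; a raw count such as "127 vs 0" does not control for length.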
Each dot is one numbered paragraph: its machine-likeness percentile in English (x) vs Italian (y). If the AI signal were merely an artifact of the English translator, dots would scatter randomly. If it tracks the content, dots line up on the diagonal — and paragraphs in the top-right are machine-like regardless of language, the robust flags.
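One way to put a number on "dots line up on the diagonal" is a rank correlation between the English and Italian percentiles. The sketch below is an assumption about how the scatter could be summarized, not a statistic reported here; the names `spearman` and `robust_flags` and the 90th-percentile cutoff are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one column is constant
    return sxy / math.sqrt(sxx * syy)


def robust_flags(en_pct, it_pct, cutoff=90.0):
    """Indices of paragraphs at or above the cutoff in both languages."""
    return [i for i, (e, t) in enumerate(zip(en_pct, it_pct)) if e >= cutoff and t >= cutoff]
```

A correlation near zero would be what the translator-artifact explanation predicts; a clearly positive value would suggest the signal follows the content.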
The decisive test. We score a known-human encyclical, Dilexit Nos (Francis, 2024), which is also quotation-heavy, through the identical pipeline, then ask: are Magnifica Humanitas paragraphs measurably more machine-like than those of a human-written encyclical? Probability of superiority is the chance that a random MH paragraph scores more machine-like than a random Dilexit Nos paragraph (0.50 = indistinguishable). The permutation p-value tests whether the mean difference could arise by chance.
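Both statistics can be sketched in a few lines. This is an illustration under stated assumptions, not the exact procedure behind the reported numbers: inputs are machine-likeness scores where higher means more machine-like (for example, a negated Binoculars score), ties count as half a win, and the permutation test is two-sided with an add-one correction. The function names are hypothetical.

```python
import random


def prob_superiority(a, b):
    """Chance that a random item of a scores higher than a random item of b."""
    wins = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    return wins / (len(a) * len(b))


def permutation_p(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Shuffle group labels by shuffling the pooled scores.
        rng.shuffle(pooled)
        pa, pb = pooled[: len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        # Small tolerance guards against floating-point summation order.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction avoids reporting an impossible p-value of zero.
    return (hits + 1) / (n_perm + 1)
```

Probability of superiority answers "how often", while the permutation p-value answers "could this be chance"; a large sample can produce a tiny p-value even when the probability of superiority sits close to 0.50, so the two are best read together.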
The complete picture needs both ends of the scale. We add an AI-positive anchor, encyclical-style paragraphs written by Claude (the model the viral claim accused) and scored identically, and a same-author human anchor: Leo XIV's own speeches (his spoken presentation of this very encyclical, plus homilies and addresses). Each dot is one paragraph's Binoculars score B; the bar marks each group's mean. Left (lower B) = more machine-like. If Magnifica were AI-written, it should sit on the Claude line; if human, with the two human anchors.
Mean machine-likeness percentile per chapter (higher bar = more machine-like). Compare with the published Pangram run on the Italian, which flagged Chapter 1 highest.
Paragraphs sorted from most to least machine-like (English). Columns: EN/IT = machine-likeness percentile · — = em-dashes · n/b = "not-X-but-Y" · gen = "genuine(ly)" · B = raw Binoculars score.
| ¶ | chapter | EN | IT | — | n/b | gen | B |
|---|---|---|---|---|---|---|---|
Detector. Binoculars (Hans et al. 2024): the ratio of an observer model's log-perplexity to the cross-perplexity between observer and performer, where a lower ratio reads as more machine-like. Here observer = Qwen2.5-0.5B, performer = Qwen2.5-0.5B-Instruct, run on an M3 via MPS. No API, no cost, fully local.
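The ratio itself can be illustrated without loading any model. The sketch below assumes the log-ratio form of the score and takes the role assignment from the description above (perplexity from the observer, cross-perplexity between observer and performer); the reference implementation may assign the roles differently. The inputs are toy next-token logits, not real Qwen outputs, and `binoculars_score` is a hypothetical name.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def binoculars_score(observer_logits, performer_logits, next_tokens):
    """Log-perplexity divided by cross-perplexity; lower reads as more machine-like.

    observer_logits[i] and performer_logits[i] are vocabulary-sized logit
    lists predicting next_tokens[i], the token that actually follows.
    """
    if not next_tokens:
        raise ValueError("need at least one token")
    log_ppl = 0.0
    cross = 0.0
    for obs, perf, tok in zip(observer_logits, performer_logits, next_tokens):
        obs_lp = log_softmax(obs)
        perf_lp = log_softmax(perf)
        # Surprise of the observer at the token that actually occurred.
        log_ppl -= obs_lp[tok]
        # Cross-entropy of the performer's prediction under the observer's distribution.
        cross -= sum(math.exp(o) * p for o, p in zip(obs_lp, perf_lp))
    n = len(next_tokens)
    return (log_ppl / n) / (cross / n)


peaked = [[5.0, 0.0, 0.0, 0.0]]
print(binoculars_score(peaked, peaked, [0]))  # expected token: score below 1
print(binoculars_score(peaked, peaked, [1]))  # surprising token: score above 1
```

Dividing by the cross-perplexity is what separates this from a plain perplexity threshold: text that is merely predictable because the topic is predictable is also predictable to the second model, so the denominator shrinks along with the numerator.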
Sampling. The numbered paragraph is the unit. Paragraphs under ~40 words are dropped (short text scores unstably). Scores are converted to within-language percentiles so English and Italian are comparable despite the model's differing fluency per language.
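The sampling and normalization steps can be sketched as follows. The whitespace word count, the exact 40-word cutoff, and the percentile convention (share of same-language paragraphs whose raw score is at least as high, so the lowest B gets the highest machine-likeness percentile) are assumptions for illustration; the function names are hypothetical.

```python
def keep_long(paragraphs, min_words=40):
    """Drop paragraphs shorter than min_words (whitespace-separated words)."""
    return [p for p in paragraphs if len(p.split()) >= min_words]


def machine_percentiles(b_scores):
    """Within-language machine-likeness percentile for each raw B score.

    Lower B reads as more machine-like, so each paragraph's percentile is
    the share of paragraphs in the same language with B at least as high.
    """
    n = len(b_scores)
    return [100.0 * sum(other >= b for other in b_scores) / n for b in b_scores]


english_b = [0.7, 0.9, 0.8, 1.0]
print(machine_percentiles(english_b))  # [100.0, 50.0, 75.0, 25.0]
```

Because the percentiles are computed separately for each language, a paragraph's value depends only on its rank among paragraphs in that language, which is what allows English and Italian to share one scale.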
What this can show. Whether passages read machine-like relative to the rest of the document, and whether that signal survives translation. What it cannot show. A calibrated probability of AI authorship — that needs human-written control encyclicals (e.g. Dilexit Nos) scored identically, which is the next step.
Caveats. Detectors key on style; formal ecclesiastical prose has used em-dashes and parallel "not-X-but-Y" construction for centuries. The draft language is uncertain (Leo XIV is English-native, so English may be the original, not Italian). None of this is dispositive.