#86: Sometimes
Different format, same content, same stolen memes
Hey all! I kind of got bored of always writing in the same format, so this time I’ll try something new. I’m also really, really, really busy so there’s less snark.
Anyway, here’s a bunch of times.
This time (ayyy), the papers and news and memes will be all intermixed. Enjoy!!!
Unprecedented
Which really is the norm at this point
[2603.23951] From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents — I think this is REALLY cool. What does bother me, however, is that there is a reason why we hand-design these things, and it's called 'guarantees of convergence and correctness'. While there's a lot of press on lLmS dOiNg MaThS, I strongly doubt they'll be able to perform at the level we require.
[2603.22707] Detecting Non-Membership in LLM Training Data via Rank Correlations — honestly I’m gaining a lot of respect for JP Morgan’s research.1 This is solid work.
[2603.20079] Predicting States of Understanding in Explanatory Interactions Using Cognitive Load-Related Linguistic Cues — ‘we find that individual cues vary with the listener’s level of understanding’.
There’s also this paper presented at LREC working on Ancient Egyptian, which is awesome.
The Best
(or at least funny)
Wikipedia Bans AI-Generated Content, With 2 Exceptions | PCMag — the difficult bit is: how will they detect it? Decentralisation implies that there's less room for people to check others' work. And manipulation/sockpuppetry can still be performed under the guise of 'this was AI-generated content'.2
How AI Is Creeping Into The New York Times - The Atlantic — if you read my newsletter often, you'll realise how amused I am by this. Though I do think that AI-generated content detectors are slightly better than their reputation suggests (and should catch the low-hanging fruit3), they still do not make the cut.
OpenAI pulls the plug on its Sora AI video app - CBS News — I’m gonna go for this:
[2603.20730] Reasoning Topology Matters: Network-of-Thought for Complex Reasoning Tasks — kinda would like to see some stat tests on these results (it feels odd to just read 'accuracy', but ohkay); see the sketch right below for the kind of thing I mean.
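Even something as simple as a paired bootstrap over the test items would do. Here's a minimal sketch; the per-example outcomes are synthetic and none of this is from the paper:

```python
# Minimal paired bootstrap for an accuracy gap between two systems
# scored on the SAME test items. All data here is made up.
import numpy as np

rng = np.random.default_rng(0)
n = 500
sys_a = rng.integers(0, 2, size=n)          # 1 = correct, 0 = wrong
sys_b = sys_a | rng.integers(0, 2, size=n)  # deliberately a bit better

observed_gap = sys_b.mean() - sys_a.mean()

gaps = np.empty(10_000)
for i in range(gaps.size):
    idx = rng.integers(0, n, size=n)        # resample items with replacement
    gaps[i] = sys_b[idx].mean() - sys_a[idx].mean()

# How often does the gap vanish or reverse under resampling?
print(f"gap = {observed_gap:.3f}, bootstrap p ~ {(gaps <= 0).mean():.4f}")
```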
I also really liked this one about formalisers. The results are decidedly not what I expected, but I also liked how they eventually managed to break it. Great paper all around.
The Blurst
The ones which are frustrating to know about
New study says AI is giving bad advice to flatter its users | AP News — I shared the paper last week, but look how important this issue has become: literally the AP is covering it. It was also covered by US News, in a less subtle manner: Want A Bootlicking Yes Man? Ask An AI Chatbot For Advice, Study Warns
Number of AI chatbots ignoring human instructions increasing, study says | AI (artificial intelligence) | The Guardian — look, I keep writing about this over and over: agentic systems without STRONG guardrails are a recipe for disaster (a toy sketch of what I mean is below). Something something greed and stupidity being correlated at p < 0.0000000000000001
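And by 'strong guardrails' I don't mean a system prompt begging the model to behave; I mean checks that live outside the model. A bare-minimum sketch, with invented tool names:

```python
# A hard, code-level allowlist around tool execution, enforced outside
# the model. Tool names and the registry are made up for illustration.
ALLOWED_TOOLS = {"search_docs", "read_file"}

def run_tool(name: str, args: dict, registry: dict):
    """Run a model-requested tool only if it is explicitly allowlisted."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool '{name}' is not allowlisted; refusing")
    return registry[name](**args)

# Usage: run_tool("delete_database", {}, registry) raises instead of obeying.
```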
[2603.25187] Probing the Lack of Stable Internal Beliefs in LLMs — with the sheer volume of persona-based research, this is an unsettling finding to say the least.
Desperate
They require desperate measures
Tech company climate goals under pressure due to AI energy demand | AP News — I've always been worried about this. But my double worry is that tech companies have wayyyy too much power. I don't think they'll be stopped by another climate change report.
[2603.19574] AI Psychosis: Does Conversational AI Amplify Delusion-Related Language? — answer: yes. It's quite interesting though, since it shows variability across themes.
[2603.22015] Retrieving Climate Change Disinformation by Narrative — love the work from this group. Can I just say that I wish this paper had never needed to be written? It drives me nuts that disinformation regarding climate change is a thing. It's like Semmelweis, the doctor who decided to wash his hands before delivering babies and was demonised by the community for it lol. Like why would people be so dumb.

[2603.23292] LLM Olympiad: Why Model Evaluation Needs a Sealed Exam — I had a fascinating conversation within my org on the need for sealed datasets. For some reason, it does not compute for some people that knowing the answers to an exam defeats the purpose of the exam. But I will stop ranting, since I apparently made this issue rant-free.
[2601.11633] Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images — this paper is frustrating (all papers in this section are good!!!!) because its findings are not surprising (we all saw them coming) but still had to be written up. I also like their benchmark, jtbc.
In the same vein as the one above, there was one doing the same but with 3D models; again, I say we should stop acting surprised about that (but I'm thankful they wrote it).
There was also some evaluation of LLM internal safety collapse, which is SO interesting. Great paper tbh.
The Good Ones (That We Shall Let Roll)
'Good' ranks below 'best', but it's still good.
US jury verdicts against Meta, Google tee up fight over tech liability shield | Reuters — this is not really AI, but I do appreciate accountability and holes in armour
[2603.25222] Translation or Recitation? Calibrating Evaluation Scores for Machine Translation of Extremely Low-Resource Languages — I think this is an excellent idea. The findings on token fertility aren't that surprising (a toy illustration of fertility is below), but metrics like these keep the field honest (in a metaphorical sense).
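For anyone unfamiliar, token fertility is just the average number of subword tokens per word. A quick sketch of how I'd compute it; this is my code, not the paper's, and the model choice is arbitrary:

```python
# Token fertility: average subword tokens per whitespace-separated word.
# My own sketch; any Hugging Face subword tokenizer works here.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def fertility(sentence: str) -> float:
    words = sentence.split()
    return len(tok.tokenize(sentence)) / max(len(words), 1)

# Low-resource languages tend to fragment into more subwords per word
# (higher fertility), which skews length-sensitive MT metrics.
print(fertility("The cat sat on the mat."))
```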
[2603.23507] Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes — you know that if there's a DLM paper, I will share it.
Of note, there was also this parallel English-Tashlhiyt corpus (no, I didn't know of this language either; that's the beauty of these times we live in) and a catalogue of Basque dialects. Plus, I really liked that there was a lot of non-LLM work, like this analysis of Evgenij Onegin, and some topic/sentiment analysis for good, namely an evaluation of post-Roe discourse.4
New Roman
You know what this section is.
I saw A LOT of papers from random companies (consulting, analytics, data annotation, etc.), which had A LOT to do with what those companies do. I really like that they place an emphasis on publication, but it does make me sceptical of their intent (marketing, maybe? I can barely tolerate big tech's). So I did not link them anywhere.
[2603.21174] Explainable Semantic Textual Similarity via Dissimilar Span Detection — look, I think it is interesting for sure. But the claim (a single STS score → less interpretability) does not match the results (a new STS score!).
[2603.21350] Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles — it's not a bad paper, but I have strong reservations (as always) about epistemic experimentation plus this methodology.
[2603.23114] Between Rules and Reality: On the Context Sensitivity of LLM Moral Judgment — stop 👏 anthropomorphising 👏 LLMs 👏 (though I like the pragmatics stuff).
[2603.23146] Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy — the argument itself is not that interesting but the methods are absolutely amazing. Yes, I know I talk a lot about methods. That’s because it is the scientific method, not the scientific trustmebro.5
Ok, one more:
[2603.22312] The Efficiency Attenuation Phenomenon: A Computational Challenge to the Language of Thought Hypothesis — I am so not sure about this paper. It's quite interesting, but I'd argue that claiming 'thus confirming X hypothesis' requires way more than a percentage. I'm also unsure about using RL as an argument for emergence.
It's not that I had no respect for them lol; I just didn't know anything about them before!
I have myself noticed that the Wikipedia pages for two VERY prominent scientists have curiously had sections related to criticism of their work systematically erased.
You really should see my ICML reviews. It'd be really stupid not to flag them as AI when they call your paper 'intriguing'.
It is by far one of the worst things ever done by SCOTUS. But it gets even worse: less than six months later, the US govt moved to keep states from regulating AI.
Looking at you, Anthropic/OpenAI