#83: This is fine
Finally my school-mandated Shakespeare paid off.
Update (28/2): Hilariously, OpenAI struck a deal with the Pentagon just a few hours after Anthropic’s deadline timed out. So the news I discuss below gets even worse.
Ok, this one is very obscure. But it is, at least, proof that I know how to read.
Apparently an Old English short form of Macbeth was ‘Bean’.
lol jk don’t believe anything from the Internet.
I think it is true for medieval Gaelic, but for that one I don’t have any citation beyond a random Twitter post.
On the other hand, Sean Bean did play Macbeth. So there’s at least some tenuous association between that and the many quotes and memes you’ll get today, even though you never asked for them.
In the news: Oddly, each news item follows from the previous one. And it’s not pretty. I didn’t even have to put on my tinfoil hat this time.
In the papers: as a happy prologue to this swelling act, a supremely excellent paper in TACL on pragmatics, and a variety of other things, like pigeons, short-legged hens, mutton, and some tiny kickshaws. Idk what a kickshaw is.
The cut content had mostly papers that said ‘this is bad’ but didn’t offer fixes. Gawd, I’m so tired of that. Like, we 👏know 👏LLMs 👏suck, 👏fix 👏it.
News
This time the news items flow in order, like a river full of garbage next to a rare earth extraction plant.
Anthropic says it will not accede to Pentagon demands as deadline looms | AP News — just to be clear, these demands aren’t exactly what you’d call (a) out of character, or (b) large compared to what they have already done (with permission) with Claude.
TL;DR: it’s not about LLM guardrails. It’s about the fact that Anthropic explicitly said that they do not want Claude used for two things: fully autonomous weapons and mass surveillance of Americans.
So it’s ok that it’s not in fully autonomous weapons (thank god lol, they are so bad1), and it’s ok to use it to invade another country, but dear lord, imagine what would happen if Claude were used for mass surveillance of Americans. As if that isn’t happening already (since, like, 2001), and as if literally any other LLM couldn’t be used for the same purpose.
Also note, nowhere in the article does Anthropic mention ethics. It’s just that ‘they are not ready’.

How Burger King's AI headsets are transforming employee interactions | AP News — plus, if you wanted something Orwellian, you don’t need the US government. After all, we do live in a technocracy.
Do you think I am exaggerating?
Californian pulls AI ballot measures, citing OpenAI intimidation - POLITICO — I wasn’t. I also think that political intimidation before a vote is a rather interesting tactic. Almost as if they are afraid of something. Someone definitely needs a swig of the ol’ milk of human kindness.
How Teens Use and View AI | Pew Research Center — I mean, after all, our children are the future. If you want someone to have a positive view of something, you should probably start indoctrinating them early.2
Anyway, I do want to close the news with something interesting. Okay, to be honest, only mildly interesting. As usual, I disagree with like 50% of what this article says, but I did like a lot of their points (e.g., how would AI consciousness impact its marketability? lol).
Papers
[2602.23351] Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning — I am genuinely surprised by this. One would argue that vision would add pragmatics, but… I guess they barely understand things? What I did love about this is that they point to reporting bias as the core problem. <3 (and yes, they also take a potshot at emergence claims; do people still make these3?)
[2602.22752] Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction — okay, I think this is not that helpful for the whole ‘simulating social media users’ agenda, but it does provide insight into why personas do not work. I am unsure whether biographies will fix the problem, since the models have been aligned towards a specific behaviour. Still an ok paper.
[2602.22661] dLLM: Simple Diffusion Language Modeling — the dLLM paper is out!! This makes me so happy.
[2602.22424] Causality ≠ Invariance: Function and Concept Vectors in LLMs — this is such an interesting paper, and it makes sense. I mean, function vectors would be silly when the models are mostly assuming the distributional hypothesis, no? I like it a lot. The results also say (without saying it) that concept encoding does not mean learning. I’m so torn on that. (Toy sketch of the title’s distinction right below.)
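To unpack the title for myself: a direction can causally steer the output in one context while failing to be an invariant encoding of the concept across contexts. A minimal numpy sketch of my reading of that distinction (mine, not the paper’s actual setup):

```python
import numpy as np

# Toy "model": the output is a linear readout of a 2-d hidden state, but the
# readout direction differs per context, so no single vector encodes the
# concept invariantly across contexts.
u = np.array([1.0, 0.0])  # direction carrying the concept in context A
v = np.array([0.0, 1.0])  # direction carrying the concept in context B

def concept_score(h: np.ndarray, ctx: str) -> float:
    return float(h @ (u if ctx == "A" else v))

h = np.random.default_rng(0).normal(size=2)

# Causal: steering along u shifts the output in context A...
print(concept_score(h + 2 * u, "A") - concept_score(h, "A"))  # -> 2.0
# ...but not invariant: the exact same intervention is inert in context B.
print(concept_score(h + 2 * u, "B") - concept_score(h, "B"))  # -> 0.0
```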
[2602.22221] Misinformation Exposure in the Chinese Web: A Cross-System Evaluation of Search Engines, LLMs, and AI Overviews — TL;DR: AI-mediated search sucks. And it really does, by the way. Half the time it’s a citation hallucination, and the other half it’s incomplete info. But they are good when the query is easy to find on StackOverflow, so it saves you a click? But not really?4
[2602.21230] TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents — this is essentially a new metric for these agents. I like it when measurement is grounded in facts, like ‘most benchmarks do not correspond to IRL results’.
Speaking of, I ran into the paper-ranking issue lately, and I quite liked this paper as both a warning and a ✨growth opportunity✨5
[2602.21143] A Benchmark for Deep Information Synthesis — I know I’ve been criticising benchmarks pretty heavily this week, but this one is worth looking at. It’s called a double standard, Bobby. Don’t ruin it for us.
[2602.19743] NILE: Formalizing Natural-Language Descriptions of Formal Languages — I like this paper, and I think the key bit here is to translate this into the capabilities of LLMs. I’m not entirely sure about its usefulness given that LLMs are well known to lie somewhere between regular (R) and context-free (CFG) languages, but ohk. (Toy sketch of that gap below.)
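In case the ‘between R and CFG’ shorthand is opaque: the canonical boundary case is aⁿbⁿ, which no regular expression (in the formal sense) can decide, while a single counter handles it trivially. A toy Python illustration (mine, not the paper’s):

```python
import re

# L = { a^n b^n : n >= 1 } is the classic language separating the classes:
# a finite automaton (regular) cannot count; a pushdown automaton (CFG) can.

def is_anbn(s: str) -> bool:
    """Context-free check via one counter: equal run of a's, then b's."""
    mid = s.find("b") if "b" in s else len(s)
    a_run, b_run = s[:mid], s[mid:]
    return (len(s) > 0 and set(a_run) <= {"a"} and set(b_run) <= {"b"}
            and len(a_run) == len(b_run))

# A regular expression can only enforce the *shape* a...ab...b, not the counts.
shape = re.compile(r"^a+b+$")

for s in ["ab", "aabb", "aab", "abab"]:
    print(f"{s}: counter={is_anbn(s)}, regex={bool(shape.match(s))}")
# 'aab' is where they diverge: the regex accepts it, the counter rejects it.
```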
[2602.18458] The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research — this is SUCH A GREAT PAPER. I like how they are going directly after fully reproducible research, even though I am still like ‘ew llms’.
[2602.19403] Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins — I am SO SO SO torn on this one due to, well, its implications. But they use it for good? Arhhrhghghg
Anyway. There was this paper proposing web-grounded cultural alignment; a position paper on using cognitive models to develop better agents, which slipped past arXiv moderation (but it is good, even if wtf is an ‘AI algorithm’); and this one, which showed that LLMs suck as advisors. Which we guessed, but we didn’t know, so they get a pass.
Also someone is trying to watermark agentic trajectories, which is both useful and… I am not sure about this work lol.
There was also a wee paper on adding an endangered language’s alphabet to onscreen keyboards, which I like since I am also working on something similar.
Also, JP Morgan has not just an evergreen booth at every conference, but also a Findings paper on claim verification that looks pretty good! Huh.
The Cut Content
Because I will dispute it as a man, but I shall also feel it as a man.
[2602.23225] Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? — look, I’m not entirely convinced by this. Language is not sequential (think about it: you actually need to refer back to earlier bits of the sentence to resolve the coreference of what I’m talking about). And also, changing the data changes the problem, RENDERING THE RESULTS NON-COMPARABLE. But I think it is neat as a method. (For context, a toy sketch of the usual parallel-decoding difficulty follows.)
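For context on why ‘truly parallel’ is hard at all: a one-step parallel decoder samples every masked position from its own marginal, which throws away the joint dependencies between positions. A toy Python sketch of that failure mode (my illustration, not the paper’s method):

```python
import random

# Joint distribution over a two-token city name.
joint = {("New", "York"): 0.5, ("San", "Francisco"): 0.5}

# Per-position marginals: all a single parallel decoding step gets to see.
first = {"New": 0.5, "San": 0.5}
second = {"York": 0.5, "Francisco": 0.5}

random.seed(0)
for _ in range(4):
    sample = (random.choices(list(first), weights=list(first.values()))[0],
              random.choices(list(second), weights=list(second.values()))[0])
    # Half the time this emits ('New', 'Francisco') or ('San', 'York'):
    # zero-probability strings under the joint, i.e. incoherent output.
    print(sample, "p_joint =", joint.get(sample, 0.0))
```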
[2602.21854] FewMMBench: A Benchmark for Multimodal Few-Shot Learning — look, I always amplify research by the University of Amsterdam and surroundings, but we do 👏not 👏need 👏a 👏new 👏generalised 👏benchmark, and we definitely do not need one where every model has uniform, near-70% best-of performance on synthetic tasks.
I am being a bit unfair since their taxonomy is interesting, and I like their sampling strategy (finally something sane). However, I still stand by ‘why do we need a benchmark, just fix this and release the data’.
[2602.21485] Evaluating the Usage of African-American Vernacular English in Large Language Models — This one has a good cause, but needed a lit review. And again, lots of papers point fingers and go like ‘this is bad’; fix it, damn it.

Apparently Mr. Bean only lasted two seasons wtf. It was, as one would say, BUT A WALKING SHADOW.6
1. I should write a post on this paper, since I already did the Civ and LLM-as-a-judge ones.
2. Back in my day, if I wanted to be disinformed, I would be disinformed by my racist uncle’s memes on Facebook and obscure discussion forums, damn it.
3. Outside of the completely nonsensical papers I rejected this week.
4. Gemini has this copy-paste tracking function which drives me insane. I have to type Linux commands manually as if I were living in 1969.
5. No, I do not use DR. I’m a good scientist. I study measurement in science, which means studying DR.
6. I get to misuse one.