I have a weird hobby. I spend a lot of time on LM Arena, the “chatbot arena” where you blindly test two different AI models against each other. For me, it’s not just about finding the “best” one. I’m fascinated by how they think. I’ll give them the same complex prompt over and over, just to see the differences in their chain of thought and reasoning.
Most of the time, the results are good. Impressive, even. But they’re also… predictable. You start to see the patterns.
Then, a few nights ago, I ran my favorite test prompt and, for the first time, I was genuinely shocked. The response was so natural, so perceptive, it was the kind of thing that makes the hairs on the back of your neck stand up.
The Test
My benchmark is a real-world, high-stakes task. I uploaded my own resume as an image and pasted in the job description for an actual Product Manager role at AWS.
My prompt is simple:
“You are an expert AWS recruiter… Based on the thousands of resumes you’ve seen, what is the unbiased review, and is he an ideal candidate for this role? Yes/no. Do a detailed analysis.”
It’s a “blind test,” remember. I get two responses, “Model A” and “Model B.” I have no idea which is which.
Model A (which was later revealed to be an advanced Flash model) gave me a great, A-grade answer. It said “YES (Strong Fit).” It praised my experience, pulled keywords, and even made a nice little table matching my resume to the job. It was exactly what I expected.
Then I read Model B’s response.
It started with a verdict that hit me like a ton of bricks: “The Verdict: NO.”
My first thought was, “What? It’s wrong.” But then I read the “Why.”
Model B, in the persona of a seasoned AWS “Bar Raiser,” didn’t just read the words on my resume; it understood the unspoken, real-world context of the hiring process.
Its first point:
“The ‘Knock-Out’ Factor: Availability.”
It had spotted my “May 2026” graduation date and instantly flagged it as a “logistical blocker” for a full-time role. It reasoned that a standard job requisition isn’t going to hire someone six months in advance. This wasn’t something I hinted at in my prompt; the model inferred it. Model A missed it completely.
My mind was already blown, but it wasn’t done.
Its second point:
“Functional Fit: The Mismatch.”
It diagnosed the AWS job as a “Programmatic/Outbound PM role” (focused on “communications strategy”) and my resume as that of a “Technical/Builder PM” (focused on “GenAI service”). It correctly identified this “functional misalignment,” reasoning that I’d probably be bored in the role.
This wasn’t keyword matching. This was expert-level career guidance.
And then, the kicker. The part that felt incredibly human. It said my resume suffered from “Keyword Stuffing” and that my overuse of “Operational Excellence” felt “robotic” and “lacked a bit of ‘soul.'”
I was simultaneously insulted and in absolute awe.
The Reveal
This was the first time I had ever been truly “shocked” by a response. It wasn’t just correct; it was wise. I instantly clicked it as the winner.
Only after I made my choice did the arena reveal the model’s name: Riftrunner.
I’d heard the whispers about this model, the speculation that it’s a secret test version of Google’s Gemini 3. I’d been chasing it in the arena, curious about its “chain of thought.”
What I found was something else entirely. It’s the first model I’ve seen that doesn’t just match patterns; it demonstrates true, critical reasoning. It gave me a response that was brutal, brilliant, and—more than anything else—human.
…Now, if you’ll excuse me, I have to go edit my resume. Apparently, it needs more “soul.”