by simonw on 10/13/25, 11:14 PM
If you take a look at the system prompt for Claude 3.7 Sonnet on this page you'll see:
https://docs.claude.com/en/release-notes/system-prompts#clau...

> If Claude is asked to count words, letters, and characters, it thinks step by step before answering the person. It explicitly counts the words, letters, or characters by assigning a number to each. It only answers the person once it has performed this explicit counting step.
But... if you look at the system prompts on the same page for later models - Claude 4 and upwards - that text is gone.
Which suggests to me that Claude 4 was the first Anthropic model where they didn't feel the need to include that tip in the system prompt.
by jazzyjackson on 10/14/25, 2:51 AM
That's good. 1-800-ChatGPT really let me down today. I like calling it to explain acronyms and define words since I travel with a flip phone without Google. Today I saw the word "littoral" and tried over and over to spell it out, but the model could only give me the definition for "literal" (admittedly a homophone, hence spelling it out: Lima, India, Tango, Tango, Oscar, Romeo, Alpha, Lima, to no avail).
I said, "I know you're a robot and bad at spelling, but listen..." and got cut off with a "Sorry, my guidelines won't let me help with that request..."
Thankfully, the flip phone allows for some satisfaction when hanging up.
by malshe on 10/13/25, 11:56 PM
I play Quartiles in the Apple News app daily (https://support.apple.com/guide/iphone/solve-quartiles-puzzl...). Occasionally, when I get stuck, I use ChatGPT to find a word that uses four word fragments, or tiles. It never worked before GPT-5, and with GPT-5 it works only with reasoning enabled. Even then, there is no guarantee it will find the correct word; it may end up hallucinating badly.
by necovek on 10/14/25, 12:58 AM
I think the base64 decoding is interesting: in a sense, the model's training set likely had lots of base64-encoded data (imagine MIME data in emails, JSON, HTML...), but to decode successfully it had to learn the decode mapping for every group of 4 base64 characters (which turn into 3 bytes). That mapping could easily have been generated as synthetic training data, and I only wonder whether each and every group appeared often enough to end up in the weights.
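For the curious, here is a minimal sketch of that 4-character-to-3-byte mapping (Python, standard library only; the sample string is just an illustration):

    import base64

    # Each 4-character base64 group encodes exactly 24 bits, i.e. 3 bytes.
    chunk = "aGV5"
    print(base64.b64decode(chunk))          # b'hey' -- three bytes

    # The same group decoded by hand, the way a lookup table would do it:
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    bits = 0
    for c in chunk:
        bits = (bits << 6) | alphabet.index(c)   # 4 x 6 bits = 24 bits
    print(bits.to_bytes(3, "big"))               # b'hey' again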
by zeroq on 10/14/25, 2:24 PM
- How many letters R are in the word `strawberry`?
- There are seven letters R in the word `strawberry`.
Would you like me to rearrange them?
by atleastoptimal on 10/14/25, 6:32 AM
I rearry rove a ripe strawberry
by throw-10-13 on 10/15/25, 8:10 AM
"AI are getting better at search and replace, something that every text editor has been able to do for 40 years."
by hansonkd on 10/14/25, 12:41 AM
ChatGPT 5 is still pathetically bad at Roman numerals. I asked it to find the longest Roman numeral in a range. Its first guess was the highest number in the range, despite that being a short numeral. The second guess, after some help, was a longer numeral but outside the range. The last guess was the correct longest numeral, but it miscounted how many characters it contained.
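For comparison, a few lines of Python settle that question deterministically; the range below (1 to 100) is only an illustrative assumption, not the one from the original chat:

    # Hypothetical version of the task: find the number in a range whose
    # Roman-numeral spelling is longest.
    def to_roman(n: int) -> str:
        values = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                  (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                  (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
        out = []
        for value, symbol in values:
            while n >= value:
                out.append(symbol)
                n -= value
        return "".join(out)

    longest = max(range(1, 101), key=lambda n: len(to_roman(n)))
    print(longest, to_roman(longest), len(to_roman(longest)))   # 88 LXXXVIII 8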
by viraptor on 10/14/25, 1:03 AM
Why bother testing, though? I was hoping this topic had finally died, but no. Someone is still interested in testing LLMs on something they're explicitly not designed for and that nobody uses them for in practice. I really hope one day OpenAI will just add "when asked about character-level changes, insights, and encodings, generate and run a program to answer it" to their system prompt, so we never have to hear about it again...
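Something like this throwaway script is presumably all such a tool call would need; the strawberry example just picks up the joke upthread:

    # Hypothetical tool-call program for a character-level question:
    # count the letter directly instead of guessing from tokens.
    word = "strawberry"
    letter = "r"
    print(f"'{letter}' appears {word.count(letter)} time(s) in '{word}'")   # 3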