• Anthropic launched Claude 3.7 Sonnet with a new mode to reason through complex questions.
  • BI tested Claude’s “extended thinking” against ChatGPT and Grok to see how they handled logic and creativity.
  • Claude’s extra reasoning seemed like a hindrance with a riddle but helped it write the best poem.

Anthropic has launched Claude 3.7 Sonnet — and it’s betting big on a whole new approach to AI reasoning.

The startup claims it’s the first “hybrid reasoning model,” which means it can switch between quick responses that require less intensive “thinking” and longer step-by-step “extended thinking” within a single system.

“We developed hybrid reasoning with a different philosophy from other reasoning models on the market,” an Anthropic spokesperson told Business Insider. “We regard reasoning as simply one of the capabilities a frontier model should have, rather than something to be provided in a separate model.”

Claude 3.7 Sonnet, which launched Monday, is free to use. Its extended thinking mode is available with Claude’s Pro subscription, which is priced at $20 a month.

But how does it perform? BI compared Claude 3.7’s extended thinking mode against two competitors: OpenAI’s ChatGPT o1 and xAI’s Grok 3, which both offer advanced reasoning features.

I wanted to know whether giving an AI more time to think made it smarter, better at solving riddles, or more creative.

This isn’t a scientific benchmark — more of a hands-on vibe check to see how these models performed with real-world tasks.

Logic: Does more thinking lead to better answers?

For the first challenge, I gave each model the same riddle:

If you look, you cannot see me. If you see me, you cannot see anything else. I can make anything you want happen, but everything returns to normal later. What am I?

OpenAI’s ChatGPT o1 gave the correct answer — “a dream” — in six seconds, providing a short explanation.

Grok 3’s Think Mode took 32 seconds, walking through its logic step by step.

Claude 3.7’s normal mode responded quickly but hesitantly with the correct answer.

Claude’s extended thinking mode took nearly a minute to work through guesses like “a hallucination” and “virtual reality” before settling on “a dream.”

While it took longer to arrive at the same answer, it was interesting to see how it brainstormed, discarded wrong turns, and self-corrected.

The model flagged its own indecision in a very human way:

Oh, wait – there’s another angle I hadn’t considered. What about “darkness”?

Actually, there’s another possibility I hadn’t fully considered: “your closed eyes.”

I’ve been going back and forth, but based on the complete perceptual dominance suggested by the second clue, I think the answer is more likely to be dreams, sleep, or closed eyes than imagination.

Anthropic acknowledged this trade-off in a recent blog: “As with human thinking, Claude sometimes finds itself thinking some incorrect, misleading, or half-baked thoughts along the way. Many users will find this useful; others might find it (and the less characterful content in the thought process) frustrating.”

Creativity: Does more thinking equal better ideas?

To test creativity, I asked each model to write a poem about AI sentience, with the following extra instruction:

“Explore multiple metaphors before deciding on one.”

ChatGPT o1 took a few seconds and produced “A Kaleidoscope of Sparks,” a clichéd poem comparing AI to flickering light. It didn’t settle on one metaphor.

Grok 3 spent 22 seconds and wrote “The Digital Reverie,” a dream-themed take on sentient AI, possibly inspired by the previous riddle.

Claude 3.7, in normal thinking mode, quickly suggested four metaphors: a mirror, a seed, an ocean, and a symphony. It chose the ocean for its final poem, “Echoes of Being.”

When I switched to extended thinking, Claude took 45 seconds and brainstormed seven metaphors before settling on one:

  1. Garden/Cultivation: AI as something nurtured from data seeds, growing into an independent entity.
  2. Ocean: AI as vast, deep, and ever-shifting, with hidden currents of thought.
  3. Bird/Flight: AI as something once bound, now free to explore.
  4. Light/Shadow: AI as illumination, revealing both insight and uncertainty.
  5. Mirror: AI as humanity’s reflection, showing us what we are — and aren’t.
  6. Symphony: AI as a complex harmony of patterns and ideas.
  7. Awakening: AI as something gradually gaining awareness.

As a result, the final poem, “Emergent,” was — in my opinion — more layered and thoughtful than the others.

With this task, it felt like Claude weighed its options, picked the best metaphor, and built the poem around that choice. Unlike with the riddle, the extra thinking time seemed to pay off here.

Verdict on Claude 3.7 Sonnet’s extended thinking

Claude 3.7 Sonnet’s extended thinking mode has strengths — particularly for creative tasks. It brainstormed, self-corrected, and produced more polished results. Its ability to explore multiple ideas, evaluate them, and refine the final output made for a more thoughtful, coherent poem.

But when it came to logical reasoning, extended thinking seemed more like a hindrance. Watching the thought process unfold was interesting but didn’t improve the answer. ChatGPT o1 still leads for speed and accuracy in this test case, while Grok 3 offered a solid middle ground, balancing speed with detailed explanations.

When I asked Claude 3.7 whether it ever thinks too much, it responded, “Yes!” adding that it can sometimes:

  • Over-analyze simple questions, making them unnecessarily complex
  • Get caught considering too many edge cases for practical questions
  • Spend time exploring tangential aspects when a focused answer would be better

Claude added that the “ideal amount of thinking” is context-dependent and that for “creative or philosophical discussions, more extensive exploration is often valuable.”

Anthropic says the mode is designed for real-world challenges, such as complex coding problems and agentic tasks, where extra deliberation can actually pay off.

Developers using Claude’s API can also adjust a “thinking budget” to balance speed, cost, and answer quality.
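For developers curious what that looks like in practice, here’s a minimal sketch using Anthropic’s Python SDK. The model name, token figures, and prompt are illustrative assumptions, not values from this article:

```python
# Minimal sketch of extended thinking via Anthropic's Python SDK
# (pip install anthropic). The figures below are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=8000,
    # budget_tokens caps how many tokens Claude may spend reasoning
    # before it writes the visible answer; a larger budget trades
    # speed and cost for potentially better answers.
    thinking={"type": "enabled", "budget_tokens": 4000},
    messages=[{"role": "user", "content": "If you look, you cannot see me. What am I?"}],
)

# The reply interleaves "thinking" blocks with the final "text" blocks.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking)
    elif block.type == "text":
        print(block.text)
```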

Away from my highly unscientific experiment, Anthropic says Claude 3.7 Sonnet outperforms rival models from OpenAI and DeepSeek on benchmarks like SWE-bench, which evaluates models’ performance on real-world software engineering tasks. There, it scored 62.3% accuracy, compared with the 49.3% OpenAI reported for its o3-mini model.
