Exploring Reasoning LLMs and Their Real-World Applications

LLMs have excelled in writing, coding, and problem-solving tasks and prompts based on the data sets they were trained with. However, these models fall short when used to solve complex puzzles because they respond with the information they were trained with and lack the ability to self-correct. Recent LLMs, like OpenAI's o1 and o3 models, DeepSeek R1, Gemini 2.0 Flash Thinking, and Grok 3 have proven language models can use reasoning to solve complex mathematical, logical, science, and visual reasoning tasks and problems than previous models like gpt-4o.

This article focuses on LLMs' thinking techniques, reasoning tasks, and comparison with non-reasoning LLMs. You will also discover the usefulness of reasoning LLMs in application areas, such as integrating them as multi-agent models for seamless question-and-answer in customer support-based AI Chat Bot, Marketplace, Education, and Healthcare apps.

What is Reasoning in an LLM?

Large Language Models (LLMs) trained to think before executing complex tasks are known as reasoners. Reasoning is the actual thinking of an AI model that usually consists of a long Chain of Thought (CoT) before giving answers to user prompts. Reasoning LLMs’ CoT allows them to write their intermediate reasoning steps (thought process) before writing final solutions for a particular prompt. The hidden chain of thought can be helpful when the model is implemented in, for example, an AI Chatbot for planning, an agentic AI app that schedules and books appointments for doctors and patients, or an AI system responsible for meeting summarization. LLMs that generate output from patterns cannot reason out of their trained datasets. Reasoning in LLMs can be categorized into natural and formal language reasoning. Each category has its application areas, as stated below.

Formal Language Reasoning: Formal language reasoning can be applied to theorem proving, software verification, expert systems, and more.
Natural Language Reasoning: This category usually comprises dialogue systems, question and answering systems, recommendation systems, text summarization apps, and sentiment analysis.

Characteristics and Capabilities of Reasoning Models

Thinking models have unique features that distinguish them from others like GPT 3.5. Instead of responding right away, they tend to think deeper, making them excellent as planning and research companions. They use less context and a more helpful response approach to tedious problems. In addition, it is easy to follow along and verify the responses these models provide because they explain them step-by-step.

Problems Associated With Non-reasoning LLMs

A non-reasoning LLM is defined as any model with shallow reasoning and usually struggles with multi-step tasks. It is incapable of self-correction and iterative refinement. Example modes in this category include gpt-4o, Claude 3.5 Sonnet, and many more.

Whenever you prompt non-reasoning LLMs, they use the data they were trained on to generate the responses. For example, prompting these LLMs to create a SwiftUI animation using new Swift animation frameworks like Phase Animator and Keyframe Animation will give a result with an older SwiftUI animation technique.

One reason is that non-reasoning LLMs need to be more familiar with the latest technologies, but they are sometimes limited to training data from previous years. Non-reasoning models often memorize answers from their training data sets instead of thoroughly understanding them. They are also not good at multi-step tasks since they generate solutions from patterns instead of thinking.

How Reasoning Works in LLMs

Currently, when writing this post, the OpenAI o1 and o3 series models, DeepSeek R1, Gemini 2.0 Flash Thinking, and Grok 3 are the only AI models that can do actual thinking.

o1-mini: It is trained to be good at coding and math problem-solving. The following sections show how this model solves math and coding tasks.
o3-mini: The most cost-effective and capable model of the OpenAI’s reasoning models.
deepseek-r1: A cheaper thinking model comparable with the o1 family of models.
gemini-2.0-flash-thinking: One of Google’s most advanced reasoning models. It can show its thoughts and provide accurate results.
grok-3: xAI claims this model to be the smartest in the world. However, its benchmarks do not show performance with the OpenAI’s most advanced o3 models.

According to OpenAI’s blog post and the models’ system card, they were trained using reinforcement learning. When users assign these o1 families tasks involving logic and math, they decompose them into smaller tasks with reasoning tokens and use their hidden thoughts to solve the problems step-by-step. Using reasoning tokens to break down complex tasks into smaller ones helps LLMs to think methodically and prevent ambiguity, deviation, and hallucination.

However, for reasoning models like o1-mini and o1-preview, OpenAI does not explain the internal architecture and how the reasoning tokens work. Additionally, no research papers or source codes are accompanying the models. Here are some resources to read and learn more about the thinking processes behind these models.

Teaching an LLM To Reason

Several techniques are used in training LLMs to think, including Reinforcement Learning, which is like training dogs with treats in the real world. Like OpenAI o1 models, reinforcement learning uses data-efficient training processes to train LLMs to think and empower them with a long chain of thought for complex problem-solving. To teach LLMs to think, the following fundamental characteristics must be considered.

Refinement of strategies: Ability to refine reasoning strategies.
Correction of own mistakes: Ability to recognize, accept, and correct an LLM's errors.
Decomposition of tasks: Ability to break down complex multi-steps into more straightforward steps.
Ability to switch a problem-solving approach: When the currently selected approach does not work correctly, the LLM should be able to switch and adopt another strategy.

To learn more, check out the proposed method in this research paper, which aims to teach LLMs how to think through complex tasks without direct supervision.

Use Cases of Reasoning LLMs: Reasoning Tasks

Reasoning in LLMs can be applied in the following areas to answer complex questions.

Customer service solutions: Resolve help-center tasks involving multi-step processes, decision trees, and routine tasks. Implementing these routine tasks in multi-agent AI systems can help address customer service issues.
Data validation in synthetical medical dataset: Implementing reasoning LLMs in healthcare applications can help spot hidden data errors in medical datasets.
Maths Reasoning: Mathematical problem solving and theorem proving.

Solving Math Problems: Comparing Non-Reasoning and Reasoning Models

In this section, we will solve a math problem using the gpt-4o and o1-mini models, compare their approaches to the given prompt, and final results.

Reasoning-based LLMs can perform better when solving complex logic tasks than those not designed to reason. For example the following prompt was used purposely to evaluate the performance of OpenAI’s o1-mini and gpt-4o models. The preview above shows a visible difference in response speed between the gpt-4o and o1-mini models. `o1's deliberately slower response speed allows it to think deeply before generating answers.

bash

1
The sum of the first twenty - one terms of a linear sequence is 28, and the sum of the first twenty - eight terms is 21. Find which term of the sequence is 0 and the sum of the terms preceding it.

The above prompt was taken from a college elective mathematics book. As the preview illustrates, the gpt-4o model solves the question faster than the o1-mini model with a good answer for one part. However, the result from the gpt-4o model was incomplete compared with the solution from the book. On the other hand, the o1-mini model began thinking for about 13 seconds before outlining a step-by-step solution to solve the problem and came up with a similar chain of thought and exact answers from the book. In this case, it provided robust and more accurate results.

Note: The reasoning LLM's thinking duration is directly proportional to the complexity of the task.

Programming and Coding: Comparing `o1-mini`, `gpt-4o` and `Claude 3.5 Sonnet`

In this section, we test a coding prompt against o1-mini, gpt-4o, and Claude 3.5 Sonnet using Composer (cmd + l) in the Cursor AI Code Editor to build a functioning SwiftUI freeform drawing app with PencilKit. Download o1-mini version of the app from GitHub.

Note: The generated Swift files and code snippets from the o1-mini model produced a working app. However, the code is not perfect. So you can use it as a starting point and modify it further.

Prompt for `o1-mini`, `gpt-4o`, and `Claude 3.5 Sonnet`

Create an iOS/SwiftUI drawing app using Apple's PencilKit framework: https://developer.apple.com/documentation/pencilkit. Implement the following functions.

1. Drawing tools:

Pencil
Pen
Monoline
Fountain
Marker
Crayon
Watercolor

2.  Color picker: To allow users to change the colors of drawing tools before using them.
3. Correction tools: Undo, redo, eraser, cut drawing.
4. Ruler

Use appropriate SF Symbols: https://developer.apple.com/sf-symbols/ for each tool and functionalities above.

Building your own app? Get early access to our Livestream or Video Calling API and launch in days!

`o1-mini`’s Response

The following sections represent the steps the o1-mini model took to build the SwiftUI drawing app.

Note: The overview, project structure, and step-by-step implementation sections below were copied directly from Cursor.

Overview

We'll build a SwiftUI application with PencilKit to provide a rich drawing experience. The app will include:

Drawing Tools: Pencil, Pen, Monoline, Fountain, Marker, Crayon, Watercolor
Color Picker: To select colors for the drawing tools.
Correction Tools: Undo, redo, eraser, cut drawing.
Ruler: For precise drawing.
SF Symbols: Using appropriate icons for each tool and functionality.

Project Structure

ContentView.swift: The main view containing the drawing canvas and toolbars.
DrawingCanvasView.swift: A UIViewRepresentable that wraps PKCanvasView.
ToolBar.swift: A custom toolbar for selecting drawing tools and colors.
Tools.swift: An enumeration of available drawing tools.
ColorPickerView.swift: A view for selecting colors.

Step-by-Step Implementation

Create a New SwiftUI Project
- Open Xcode and create a new SwiftUI project named PencilDrawingApp.
- Ensure that you select SwiftUI as the interface and Swift as the language.
Add PencilKit Framework
Define the Drawing Tools: Create an enumeration for the various drawing tools.

The o1-mini model used the steps above to generate the necessary Swift files and code snippets to create the SwiftUI drawing app. As demonstrated in the preview below, the model used about 38 seconds to think before writing the output code and files.

After copying the generated file into a blank Xcode (SwiftUI) project, a minor error was fixed using the Xcode's suggested fix option. When you download and run the app in Xcode, you should see a simple but working SwiftUI freeform drawing app.

`gpt-4o`’s Response

The gpt-4o model also generated a simple working SwiftUI drawing app with ContentView.swift, DrawingView.swift that had an error fixed with a follow-up prompt.

Fix Cannot convert value of type 'any PKTool' to expected argument type 'Binding<any PKTool>'

You can download and try the gpt-4o version of the app in Xcode.

`Claude 3.5 Sonnet`’s Response

The Swift files generated by Claude's 3.5 Sonnet were similar to those of o1-mini but without a chain of thought.

The Claude 3.5 Sonnet output is also a simple, functional SwiftUI drawing app. You can download the app from GitHub and test it in Xcode.
In this section, you have noticed that all three language models produced a successful outcome, although the prompt is not simple and straightforward. It involved implementing PencilKit in a SwiftUI app. PencilKit is a UIKit framework and it does not support SwiftUI out of the box. For prompts similar to the one in this section, gpt-4o and Claude 3.5 Sonnet will produce a successful outcome most of the time.

So, based on the results in this section, when is it appropriate to use a reasoning model like o1-mini or o1-preview? Reasoning models always spend some time thinking before responding to questions, so they may be optional for simple tasks that have direct answers. In coding, reasoning models can help fix bugs, complex errors, and implementation of external frameworks and libraries. They can also be used to brainstorm highly technical concepts and more.

Prompting Techniques for Reasoning LLMs

Reasoning LLMs always go through a lengthy internal chain of thought before responding to user prompts. For this reason, using some prompting techniques for instructing non-reasoning-based LLMs may not be necessary for reasoning ones. Let's look at the best practices you can adopt to prompt reasoning LLMs and evaluate an example prompt against OpenAI's o1-mini reasoning model, gpt-4o, and Claude 3.5 Sonnet.

Create distinct sections for prompts: When working with reasoning LLMs, always keep prompts concise and clear to allow the model to respond with the best output. For clarity, you should use sequences of characters (comma, colon, dash, quotation marks, headings, titles, and more) to separate the sections of your prompt.
Unnecessary chain-of-thought: Avoid using a phrase like think step by step in your prompt, as this may hinder the model's performance and ability to give the best results. Note: There is nothing wrong with using a phrase like this for a non-reasoning model.
The few-shot prompting technique (providing the AI model with a few task examples), usually used in non-reasoning LLMs, may not be helpful for reasoning LLMs due to their thinking abilities.
Prefer using fewer prompts: Avoid giving extensive guidance in your prompts when instructing reasoning LLMs. Give them less context for more useful and comprehensive answers.
Avoid using extensive examples: Feeding the model with overwhelming examples, documents, and context can hinder its performance. Doing so may confuse and complicate the reasoning model's ability to respond correctly because it is trained to understand and follow instructions well.
Use combined prompts: Always include all task requirements in a single prompt instead of aiming for chained prompts (piece by piece). Doing so helps the reasoning model produce better results. For example, OpenAI's o1-mini reasoning model could write 1500 to 2000 lines of code in a single prompt. However, GPT-4o and Claude 3.5 Sonnet struggle to write more than 500 lines of code.

Note: Some of the reasoning LLMs prompting techniques and best practices above may need to be revised for non-thinking models like GPT-4o or Claude 3.5 Sonnet. For example, GPT-4o does not work well for combined prompts for complex tasks so you may need to split complex tasks into subtasks.

Benefits of Reasoning LLMs

Reasoning in LLMs can be used across several domains to solve advanced and complex problems related to programming, planning (strategy), and general research. You can use reasoning-based language models together with non-reasoning ones in problem-solving. However, the reasoning capabilities make these models excel in specific tasks.

Reasoning adaptation: They can quickly revise their thinking process in real time during a solution-generation phase.
Tree of Thought: LLMs that can reason are designed to generate responses by exploring multiple reasoning paths simultaneously to achieve the best solution.
Thinking LLMs like the o3 and DeepSeek R1 are trained to understand and follow instructions methodically. Their ability to understand and adhere to instructions helps reduce the effort required to engineer prompts for their usage.
Solving reasoning-heavy tasks: OpenAI's reasoning model, DeepSeek R1, and others have proven to be the first AI models that solve complex multi-step and coding tasks well enough. Reasoning LLMs can solve complex workflows that are harder and impossible for other AI models to do. This is because they are trained to think deeper.
Reasoning models do away with quick and predictive answer generation and facilitate slower and more logical problem-solving.

Access and Test the Reasoning Models

The o3 families, DeepSeek R1, Gemini 2.0 Flash Thinking, and Grok 3, are the world's most advanced and smartest thinking AI models. You can access them through the ChatGPT, OpenAI API, Cursor, WindSurf, Zed AI, VS Code, etc.

ChatGPT Plus account: A subscription to ChatGPT Plus is required to access the o1 and o3 models through OpenAI.
Cursor AI Code Editor: You can test the o1 models through Cursor Pro or Business with limited usage per day.
GitHub Copilot in VS Code: Another way to try out and test the o1 models is via GitHub Copilot in VS Code.
Google AI Studio: You can access and use Gemini 2.0 Flash Thinking for Free.
Local and reasoning LLM inference hosting: Using local LLM tools like LM Studio or Ollama, you can access and use the DeepSeek R1 thinking model entirely offline. Fast AI inference providers like Cerebras and Groq also allow you to use the open-source distilled versions of R1 on their platforms at even higher response speeds.
Grok for Web and iOS: You can test Grok 3's thinking feature through the web version on X or the iOS app.

Challenges of Reasoning LLMs

Although reasoning LLMs are helpful in many aspects, they are in their early research stages and have some challenges. Let's look at a breakdown of some of these challenges.

Tool support: The o1 models do not support tools like their previous GPT models. For example, custom GPTs and document uploads are not supported (this may change in the future).
Context windows: The o1 models have shorter context windows.
Self-judgment: These LLMs may sometimes struggle with assessing the correctness or quality of their response, which makes them prone to confidently providing incorrect information without self-correction.
High latency in getting results: Reasoning through complex problems, especially step-by-step tasks, can result in longer response times and processing delays.
It is also generally difficult to understand how these LLMs make decisions and solve problems.

What’s Next in Reasoning LLms

In this article, we covered reasoning models such as the o1 and o3 series, DeepSeek R1, Gemini 2.0 Flash Thinking, and Grok 3. You now know what reasoning LLM is all about. We discussed these models' benefits, application areas, and prompting best practices.

In the future, Meta AI, Mistral, and Anthropic may have their equivalent models with long chains of thought and logical reasoning abilities.