Build an AI Meeting Summary Tool Using Ollama and Gemma

9 min read
Stefan B.
Stefan B.
Published March 13, 2024

Over the past years, AI has become more and more of a mainstream topic, specifically with the rise in popularity of ChatGPT. As developers, there has also been a rise in tools and SDKs to build AI applications. Today, we want to look at how to build a tool in this ecosystem. The topic we chose for our application is meeting summaries.

Because if there are useful outcomes from meetings, they are far from a waste of time. Many meetings nowadays happen online, and many tools, such as Google Meet and the Stream Video SDK, offer transcription services. What if we could save time by automatically summarizing everything that was said in a meeting, saving time and offering helpful assistance?

In this post, we want to explore how to build such a system. We challenge ourselves to make it run fully locally on our machines so that we can deploy them anywhere we want (e.g. on our servers).

We will use different tools to make that happen, specifically Python, Ollama, and Gemma, But throughout this article, we will see and mention many alternatives to these. We invite you to view this article as one of many ways to achieve this goal.

There is a repository where you can look at the result, but you can also follow this article and see for yourself.

Using Gemma as Our Machine Learning Model

AI is a very broad term that consists of many different technologies, architectures, and approaches in general. In the last years - partially due to the rise of ChatGPT - large language models (LLMs) have become one of the most popular AI tools.

There is a wide range of LLMs available. The most famous one is the latest OpenAI one, GPT-4. However, the landscape has become more vast over the last few years.

Other popular companies share models, such as Mistral, Stability, etc. One that has seen recent interest is Google sharing their Gemma models.

Quoting from the announcement on their site:

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.

We want to use one of the models in this article. There are two options: Gemma 2B and Gemma 7B. The B stands for Billion and describes the number of parameters these models have. The simplified explanation is that more parameters mean better quality, a larger memory footprint, and slower inference times.

For this article, we choose the Gemma 2B model. We also experimented with larger models but haven’t seen much quality improvement for our narrow case. Feel free to experiment here, though.

Downloading and Installing Ollama

Our goal is to run the models locally on our machine. The option we are going with here is Ollama. It’s a great tool with a wide range of LLMs to run by a simple command.

It has official libraries for Javascript and Python, and many community packages are available for other languages (see an excessive list here).

While we will use the Python library in our use-case, they even offer a REST API, meaning we can use any other language we like to interact with Ollama.

To install Ollama, we can use their installation tool directly on their website. When writing this article, there are official macOS and Linux support, with Windows being in preview.

After installing and running Ollama, we only need to run the model we want. We can select from the wide range of Gemma models available. We choose the most basic 2B model and can run it from the terminal with the following command:

bash
ollama run gemma:2b

With that, we have the model running locally and are ready to interact with it. As we discussed, libraries are available for many platforms, but we’ll go with Python in this article as it’s the most common language for ML applications.

Setting Up the Project

We have the LLM running, so we can start implementing our meeting summarization tool now.

First, we set up the project using a virtual environment to have a clean slate. This is not only a good practice for projects on our machines but is also really helpful if we want to package things up to run on a server at some point in the future.

Python has a convenient tool called venv. It creates an isolated Python installation for our project and allows us only to install the packages we need. Create a project folder, and inside of it, run the following commands (on macOS/Linux):

bash
python3 -m venv summarizer
source venv/bin/activate

With that, the virtual environment is set up and activated. Now we can install the dependencies we need, in our case only the ollama package:

bash
pip install ollama

We want to have a list of requirements to install all of them in one piece. This is helpful if we want to run our script e.g. on a server, and need to set up the environment from scratch:

bash
pip freeze > requirements.txt

Now that we have dependencies set up, we can start writing the code. We create a file called [main.py](http://main.py) and open it up in our editor.

Loading the Conversation Data

Building your own app? Get early access to our Livestream or Video Calling API and launch in days!

The first step for us is to import the conversation data. It can come in a different format depending on where we get the data. We’re using the transcription service from Stream, and here we get a JSON file that is an array of objects of the following form:

json
{
  "type": string,
  "speaker_id": string,
  "text": string,
  "start_time": string,
  "stop_time": string
}

So, we want to load the file in and convert it into a Python array where we only have the speaker and the text that was spoken. We create a function to do this. Here’s the code, and let’s discuss what it does afterward:

python
import json

def load_conversation_data():
  with open("transcription.json") as f:
      json_file = json.load(f)
      conversation = list(map(lambda x: f"{x['speaker_id']}: {x['text']}", json_file))
      conversation_string = "\n".join(conversation)
      return conversation_string

We open the file (in our case, transcription.json) and load it using Python’s built-in json package. Then, we map each of the objects to be a string in the form of ": " using a lambda expression. We convert this to a list and combine all our elements into a single string divided by newline characters.

Note that this can look different for other vendors, but the principle will always be the same. We want a single text to hand that information to the LLM.

For simplicity, we are working with a local file here. A common approach is hosting transcription files in the cloud, e.g., an S3 bucket. The approach is essentially the same, except we need to download the file from there first. However, this is equally simple, e.g. using the boto3 package from AWS (see detailed instructions here).

Creating the Meeting Summary

We can finally get to the summary with the data preparation done. This is where we can use the power of the ollama package we installed earlier. We use the .chat endpoint to start a conversation with our LLM (gemma:2b).

First, we need to give it a system prompt with detailed instructions on what we want it to do for us. Now, while this might sound simple, it is a science in and of itself called prompt engineering. We won’t cover that too much here. If you’re interested, there are a lot of materials around it (and even a recommended free course from deeplearning.ai).

The prompt we came up with for this one is the following (again, feel free to change it up and experiment with it. One of the advantages of running an open model locally is that we can do a lot of trial-and-error without hitting rate limits or being charged per request):

“Your goal is to summarize the text given to you in roughly 300 words. It is from a meeting between one or more people. Only output the summary without any additional text. Focus on providing a summary in freeform text with what people said and the action items coming out of it.”

The prompt is the first message we send to the LLM, and the second will be the meeting data.

With this ready, we can finish our script. We first call the load_conversation_data function we defined earlier and then use the ollama package to request a response. For now, we’ll print the response and see the outcome:

python
conversation_string = load_conversation_data()

response = ollama.chat(model='gemma:2b', messages=[
    {
    'role': 'system',
    'content': 'Your goal is to summarize the text given to you in roughly 300 words. It is from a meeting between one or more people. Only output the summary without any additional text. Focus on providing a summary in freeform text with what people said and the action items coming out of it.'
    },
  {
    'role': 'user',
    'content': conversation_string,
  },
])

print(response['message']['content'])

We can run the script and see how well it performs. This is what we get (for the meeting notes that can be found in the repository):

python
The meeting discussed the marketing strategy for an upcoming product launch. Key points included:

- Development team progress: on track for launch in 3 months.
- Marketing channels: social media, email marketing, influencer partnerships, online advertising.
- Content creation: blog posts, videos, case studies.
- Paid advertising: budget allocated for maximum visibility.
- Influencer and partner collaboration: research and engagement strategies outlined.
- Messaging and branding refinement: aligned with the product roadmap.
- Next meeting scheduled for further planning and action items.

It picks up what was discussed well. It creates a summary first and then even adds bullet points of the most important topics. For only a few lines of code, the result is quite impressive.

Using Ollama’s REST API

We are using the ollama package for now. While this works perfectly, we are bound to be using Python like this. However, Ollama also offers a REST API. This allows us to use any language that we like and doesn’t require us to rely on a library being available.

The documentation states that we can access the API on port 11434, and through a simple POST request to the /api/generate endpoint, we can achieve the same result we did earlier.

We want to quickly demonstrate how to do this using Python again so that you can port this to any platform you like.

First, we install the requests library to make it easier for us to hit the API.

bash
pip install requests

We’ll first define the endpoint and the data we need to send. This is the same as before: the model and a prompt (where we combine the system prompt from before and the conversation_string we’ve created from our transcript). Here’s the preparation code:

python
OLLAMA_ENDPOINT = "http://localhost:11434/api/generate"
  system_prompt = 'Your goal is to summarize the text given to you in roughly 300 words. It is from a meeting between one or more people. Only output the summary without any additional text. Focus on providing a summary in freeform text with what people said and the action items coming out of it.'
  conversation_string = load_conversation_data()

  OLLAMA_PROMPT = f"{system_prompt}: {conversation_string}"
  OLLAMA_DATA = {
     "model": "gemma:2b",
     "prompt": OLLAMA_PROMPT,
     "stream": False,
     "keep_alive": "1m",
  }

Finally, we must execute the POST request and extract its response. This is done using two lines of code:

python
import requests

response = requests.post(OLLAMA_ENDPOINT, json=OLLAMA_DATA)
print(response.json()["response"])

We get the same result from this and have created a more generic solution, not relying on any packages and ready to be ported to any other language.

Summary

In this article, we didn’t write much code. However, we’ve achieved a lot using the powers of AI (and the surrounding ecosystem). We got started setting up and running a local LLM using Ollama. Then, we have loaded and preprocessed the transcription data from a meeting. Finally, we explore two ways of handing that data to Gemma with a custom prompt to give us a summary.

It’s fascinating that we can achieve an incredible feature like this in such a quick way. It requires broader skills, including AI, data preprocessing, LLM knowledge, and prompt engineering. But we were able to set this up quickly in a way that can also be applied to similar use cases with minor adjustments.

We’d love to hear what you’re building with it and which platforms you run this on. Thanks for following the article, and have a great rest of your day.

decorative lines
Integrating Video With Your App?
We've built an audio and video solution just for you. Launch in days with our new APIs & SDKs!
Check out the BETA!