Willem's Fizzy Logic

Are open source large language models the future? Let's find out with an example


Since the introduction of GPT-3.5, there’s been a race in the open-source community to come up with models that match the performance of the models produced by OpenAI. The main reason behind this race is that many people feel it’s not a good idea to use OpenAI’s models. There are three main arguments as to why:

The answer to all these arguments is to go open source. In many cases you get to see what data is used by the team that trained the model, and how the model is trained. Because the models aren’t too large, you can fine-tune the model for specific tasks more easily. Most of the open-source models today can be reasonably hosted on a single GPU. In some cases, you’re going to need two GPUs.

Using desktop hardware reduces the carbon footprint and makes the whole use case much more reasonably priced for many organizations. Also, it allows you to keep everything in house if you want.

While you can use desktop hardware, it’s going to be quite a heavy setup.

System requirements

Hosting a language model on your own hardware requires serious equipment. I’ve tested various large language models on my machine, which has the following setup:

With this setup I was able to run all sorts of models. The GPU has the biggest impact on what you can run: most models work best when hosted entirely on the GPU. You can sometimes split a model between your GPU and CPU, but performance will drop significantly.
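For example, the transformers library that we’ll use later in this post can do this splitting automatically. This is a minimal sketch, assuming you have the transformers and accelerate packages installed:

import torch
from transformers import AutoModelForCausalLM

# device_map="auto" places as many layers as fit on the GPU and spills the
# rest over to CPU memory. Expect a noticeable slowdown when that happens.
model = AutoModelForCausalLM.from_pretrained(
    "argilla/notus-7b-v1",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)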

Before you start hosting your own LLM, make sure to check what you can host in the LLM database: https://www.hardware-corner.net/llm-database/

LLM database

The LLM database has been invaluable for me to evaluate the types of models I can run. In the list you select the type of model you want to check out. The database will then show the available variants of the selected model architecture.

Details of the Mistral model architecture in the LLM database

To see what you can host, select your GPU from the dropdown menu at the top of the table. The list will be filtered to what your GPU is capable of hosting. Make sure to also set the available system RAM.

It’s worth noting that not all open-source models are included in the database. The set of available models is growing by the minute, but most models are a derivative of one of the models in the database.

Configuring your first language model

As with most machine-learning applications, open-source language models are hosted with Python. I’ve tested everything here on Arch Linux, because most of this stuff doesn’t work well on Windows.

Installing required packages

To set up a model, you’ll need a Python environment with PyTorch (torch) and the Hugging Face Transformers library (transformers) installed; these are the packages used in the code below.
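As a quick sanity check before loading anything, you can verify that the packages are installed and that PyTorch can see your GPU. This is a minimal sketch; the exact versions aren’t important here:

import torch
import transformers

# Print the library versions and confirm the GPU is visible to PyTorch
# before trying to load a multi-gigabyte model onto it.
print(f"transformers version: {transformers.__version__}")
print(f"torch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")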

Loading the language model

After installing these packages, you can get access to a language model using the following Python code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "argilla/notus-7b-v1"
device = torch.device("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device,
    torch_dtype=torch.bfloat16,
)

This code performs the following steps:

- Point model_id at the argilla/notus-7b-v1 model on the Hugging Face Hub.
- Select the GPU (cuda) as the device to host the model on.
- Load the tokenizer that belongs to the model.
- Load the model weights onto the GPU in bfloat16 (half precision).

Note that we’re using half precision floating points to load the model. The quality of the responses is a little lower than with full precision (float32), but it’s faster to generate responses with half precision.

Working with half precision floating points only works on recent graphics cards. You can leave the torch_dtype argument out if your graphics card doesn’t support bfloat16 instructions.
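If you’re not sure whether your card supports it, you can check before loading the model. A minimal sketch:

import torch

# bfloat16 requires a relatively recent GPU (for NVIDIA, Ampere or newer).
# Fall back to full precision when it isn't supported.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float32

print(f"Loading the model with dtype: {dtype}")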

Generating a response

Now that we’ve loaded the model, we can generate a response. The following code shows how to generate a response with the model:

messages = [
    {
        "role": "system",
        "content": "You're a friendly assistant. You help people generate content for their blog.",
    },
    {
        "role": "user",
        "content": "Please generate an outline for a blogpost about using open source lanugage models",
    },
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(device)

output = model.generate(
    inputs=inputs,
    max_new_tokens=2500,
    do_sample=True,
    temperature=0.7,
    top_p=0.98,
)

response_text = tokenizer.decode(output[0], skip_special_tokens=True)

This code performs the following steps:

- Build a list of messages containing a system instruction and a user prompt.
- Use the tokenizer’s chat template to turn the messages into a tensor of input tokens and move it to the GPU.
- Generate up to 2500 new tokens, sampling with a temperature of 0.7 and a top_p of 0.98.
- Decode the generated tokens back into text.

The output looks like this:

<|system|>
You're a friendly assistant. You help people generate content for their blog.
<|user|>
Please generate an outline for a blogpost about using open source language models
<|assistant|>
Title: Unleashing the Power of Natural Language Processing: A Comprehensive Guide to Using Open Source Language Models in Your Project

1. Introduction
- Briefly introduce natural language processing (NLP) and its applications
- Highlight the benefits of using open source language models (OSLMs)

2. Understanding Language Models
- Define what language models are and their role in NLP
- Explain the different types of language models (statistical, neural, rule-based)
- Discuss the advantages and disadvantages of each type

3. Popular Open Source Language Models
- Present a list of commonly used OSLMs (BERT, GPT, XLM, T5, etc.)
- Explain their features, use cases, and applications
- Provide examples of how each language model has been implemented in real-world scenarios

4. Getting Started with Open Source Language Models
- Walk through the process of setting up an environment for OSLMs
- Discuss the different programming languages that can be used (Python, Java, etc.)
- Explain the different tools and libraries that can be used for training and evaluating models (Hugging Face Transformers, PyTorch, TensorFlow, etc.)

5. Fine-Tuning and Customizing Language Models
- Explore how OSLMs can be fine-tuned and customized for specific tasks and datasets
- Provide step-by-step instructions on how to fine-tune and evaluate models
- Discuss the importance of data preprocessing and cleaning

6. Training and Evaluating Language Models
- Discuss the different metrics that can be used to evaluate the performance of NLP models
- Present a list of evaluation datasets and benchmarks (GLUE, WMT, etc.)
- Provide guidelines on how to train and evaluate language models effectively

7. Deploying Language Models in Production
- Explore the different deployment options for NLP models (on-premises, cloud, etc.)
- Discuss the importance of monitoring and maintaining models
- Provide best practices for deploying and maintaining models in production

8. Conclusion
- Summarize the key takeaways from the guide
- Highlight the potential future developments in OSLMs and NLP
- Provide resources for further learning and exploration.

Note that the output contains special markers that you need to parse to turn the sequence into a structured conversation. The model doesn’t know about JSON or other data formats, so you’ll have to build that parsing yourself.
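As an illustration, here’s a minimal sketch of such a parser. The marker string matches the chat template of this particular model; other models use different markers, so treat the extract_assistant_reply helper as a hypothetical example:

# The chat template wraps each turn in markers such as <|system|>, <|user|>
# and <|assistant|>. The generated answer is everything after the last
# assistant marker.
ASSISTANT_MARKER = "<|assistant|>"

def extract_assistant_reply(decoded_output: str) -> str:
    _, _, reply = decoded_output.rpartition(ASSISTANT_MARKER)
    return reply.strip()

print(extract_assistant_reply(response_text))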

It can take anywhere from several seconds up to a minute to get a response. This is normal for large language models, but it’s not the greatest experience. To get a better experience, you’ll need to stream the tokens as they’re generated.

Streaming language model responses

Language models are slow, as you could tell at the end of the previous section. This is why OpenAI and our internal chat agent stream responses. Streaming doesn’t make the model faster, but it greatly improves the user experience.

To stream responses from the model, you’re going to need to use a Streamer component. You can configure one using the following code:

from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "argilla/notus-7b-v1"
device = torch.device("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextIteratorStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device,
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "system",
        "content": "You're a friendly assistant. You help people generate content for their blog.",
    },
    {
        "role": "user",
        "content": "Please generate an outline for a blogpost about using open source lanugage models",
    },
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(device)

generator_thread = Thread(
    target=model.generate,
    kwargs=dict(
        inputs=inputs,
        max_new_tokens=2500,
        do_sample=True,
        temperature=0.7,
        streamer=streamer,
        top_p=0.98,
    ),
)

generator_thread.start()

for fragment in streamer:
    print(fragment, end="")

This code does the following:

- Create a TextIteratorStreamer that uses the model’s tokenizer to turn generated tokens back into text fragments.
- Load the tokenizer and model and build the prompt, just like before.
- Start model.generate on a background thread, passing in the streamer so that generated tokens are pushed into it.
- Iterate over the streamer on the main thread and print each fragment as soon as it arrives.

By moving the generation code to a separate thread, we let the model produce tokens in the background while the main thread collects the fragments as they’re produced.

Printing fragments by itself isn’t very useful, though. You’re going to have to feed the tokens to the client via a streaming connection such as gRPC, server-sent events, or a websocket. I’ll leave a full implementation out of this post, but the sketch below gives an impression of what it could look like.
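For completeness, here’s a rough, hypothetical sketch of streaming the fragments over server-sent events with FastAPI. FastAPI isn’t part of the setup described above, and a real endpoint would build the prompt and streamer per request instead of reusing the module-level variables:

from threading import Thread

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/generate")
def generate_endpoint():
    # Run generation on a background thread, exactly like before.
    Thread(
        target=model.generate,
        kwargs=dict(
            inputs=inputs,
            max_new_tokens=2500,
            do_sample=True,
            temperature=0.7,
            streamer=streamer,
            top_p=0.98,
        ),
    ).start()

    def event_stream():
        # Wrap every fragment in the server-sent events wire format.
        for fragment in streamer:
            yield f"data: {fragment}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")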

Summary

In this post we covered how to use open-source large language models on your own hardware. We covered most of the basics: loading a model, generating a response, and streaming the response to improve perceived performance.

There’s a lot more to using large language models though. It’s important to note that open-source models lack many of the features that the OpenAI and Microsoft offerings include out of the box.

Open-source language models don’t come with guardrails; you need to add those yourself. This can require fine-tuning if the makers of the model didn’t train it with guardrails in the first place.

It’s also good to add a separate content moderation filter to your model implementation. The guardrails, even when they’re included, aren’t enough to make your application safe for public-facing use cases.

Finally, make sure you’re prepared to spend extra time setting up your environment. While the code in this blog post produces a working implementation, it’s not complete. Aside from guardrails, you’ll also need to tune the prediction performance and write code to integrate the model further into your solution.

Having said that, I hope this post gave you useful insights into using open source large language models!

About

Willem Meints is a Chief AI Architect at Aigency/Info Support, bringing a wealth of experience in artificial intelligence and software development. As a Microsoft AI Most Valuable Professional (MVP), he has been recognized for his significant contributions to the AI community.

In addition to his professional work, Willem is an author dedicated to advancing the field of AI. His latest work focuses on effective applications of large language models, providing valuable guidance for professionals and enthusiasts alike.

Willem is a fan of good BBQ, reflecting his appreciation for craftsmanship both in and out of the digital realm.
