On July 23, 2024, Meta released Llama 3.1, its state-of-the-art large language model. The model comes in three sizes: 8 billion, 70 billion, and 405 billion parameters.

Meta claimed that the Llama 3.1 405B model is better than GPT-4 Turbo and on par with GPT-4o on some benchmarks. According to LiveBench and Chat.lmsys, Llama 3.1 405B ranks just below Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro. However, the Llama 3.1 models are open-source, meaning you can use them for free.

In this article, we’ll show you how to use the Llama 3.1 8B model from Hugging Face to develop a chatbot using LangChain. The approach to developing a chatbot using Llama 3.1 70B or 405B remains the same.

By the end of this article, you’ll understand:

  1. How to import an open-source model such as Llama 3.1 8B from Hugging Face into LangChain.
  2. How to call an open-source LLM in LangChain.
  3. How to add memory to your chatbot in LangChain.

So, let’s get right to it!

Installing and Importing Required Libraries

We will use the transformers library from Hugging Face to import the Llama 3.1 model. Additionally, we’ll use the Python LangChain library as the orchestration framework to develop our chatbot. The following script installs the required libraries.

!pip install --upgrade transformers accelerate
!pip install -q -U sentence-transformers
!pip install -q -U huggingface_hub
!pip install -q -U langchain
!pip install langchain-community

The script below imports the required libraries into our Python application and suppresses all warnings below the critical level.

from transformers import AutoModelForCausalLM, AutoTokenizer, logging, pipeline
from sentence_transformers import SentenceTransformer
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from huggingface_hub import login
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain.memory import ChatMessageHistory
import torch

# Suppress all warnings below the CRITICAL level
logging.set_verbosity(logging.CRITICAL)

Importing the Meta Llama 3.1 Model in LangChain

Before importing the Llama 3.1 model into LangChain, download it from the Hugging Face Hub. To do so, use the AutoModelForCausalLM class and pass your model ID to it. You will also need the tokenizer for the Llama 3.1 model, which you can download using the AutoTokenizer class, as shown in the script below. Notice the placeholder where you should place your own Hugging Face token, YOUR_HUGGING_FACE_TOKEN. If you’re new to Hugging Face, you can create a new token under Settings > Access Tokens.

login(token="YOUR_HUGGING_FACE_TOKEN")

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# Load the entire model onto GPU 0 and fetch the matching tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(model_id)

If you run this and get a 403 Client Error (“You are trying to access a gated repo”), you’ll need to click the link in the error and agree to the terms to access the model. After accepting the agreement, your information is reviewed. This review process could take a few days, so be patient.
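
If you want to confirm whether your access request has been approved before attempting a full download, one minimal check (a sketch, assuming you have already called login() with your token) is to ask the Hub for the model's metadata, which fails for gated repos until access is granted:

from huggingface_hub import model_info

# Illustrative check: fetching the model metadata raises a gated-repo/403 error
# until Meta approves your access request
try:
    info = model_info("meta-llama/Meta-Llama-3.1-8B-Instruct")
    print("Access granted:", info.id)
except Exception as error:
    print("No access yet:", error)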

Next, to import the downloaded Hugging Face model into LangChain, create a pipeline object and pass the model and tokenizer you downloaded to its constructor. Since we are developing a chatbot, the first argument to the pipeline must be text-generation. Since I want to restrict the model’s output to 300 new tokens, I set the max_new_tokens value to 300.

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                max_new_tokens=300)

# Wrap the Hugging Face pipeline so LangChain can use it as an LLM
llm = HuggingFacePipeline(pipeline=pipe)

The rest of the process involves creating a LangChain prompt, building the chain, and calling the chain’s invoke() method with the input values for the prompt.

To create the chain, you can use the LangChain Expression Language (LCEL), which lets you pipe the prompt, the LLM, and the output parser together in a single expression.

Finally, call the invoke() method on the chain object and pass it the query.

The following script shows how you can get a response from the Meta Llama 3.1 model you downloaded.

template = """Answer the following question precisely and to the best of your abilities in 200 words.
Question: {Question}"""

prompt = ChatPromptTemplate.from_template(template)

output_parser = StrOutputParser()

chain = prompt | llm | output_parser

question = "How to prepare lemon juice by hand?"
response = chain.invoke({"Question": question})
print(f"\n{response}")

Output:

[Image: a single response from the Llama 3.1 model]

The above script returns a response to isolated queries and does not remember anything from the previous conversation. For chatbot creation, you’ll want an LLM to remember previous interactions so it can build on prior discussions. In the next section, we’ll show you how to create a chatbot with memory using LangChain.

Creating a Chatbot with Memory Using Open-Source LLMs in LangChain

LangChain provides the ChatMessageHistory class, which you can use in conjunction with the RunnableWithMessageHistory class to implement a chain with memory.

To do so, you need to modify your prompt template so that it contains the system prompt along with a MessagesPlaceholder object holding the name of the variable that stores your chat history; in the following script, we name this variable chat_history. Additionally, you need a variable for the human query, which the following script names input.

Next, you need to create an object of the ChatMessageHistory() class that will store all present and past interactions with the chatbot.

Finally, instead of calling the invoke() method directly on the chain, you need to create an object of the RunnableWithMessageHistory class and pass it the original chain, a function that returns the ChatMessageHistory object, the input_messages_key, and the history_messages_key. The last two are set to input and chat_history, respectively, matching the variable names used in the prompt.

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant. Answer human questions to the best of your ability in 30 words",
        ),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{input}"),
    ]
)

output_parser = StrOutputParser()

chain = prompt | llm | output_parser

chat_history_memory = ChatMessageHistory()

chain_with_message_history = RunnableWithMessageHistory(
    chain,
    lambda session_id: chat_history_memory,
    input_messages_key="input",
    history_messages_key="chat_history",
)
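
Note that the lambda passed above returns the same chat_history_memory object for every session_id. If you later want a separate conversation history per session, one possible variation (a sketch, not part of the original setup) is to keep a dictionary of histories keyed by session ID:

# Sketch: maintain one ChatMessageHistory per session_id instead of a shared one
session_store = {}

def get_session_history(session_id):
    if session_id not in session_store:
        session_store[session_id] = ChatMessageHistory()
    return session_store[session_id]

chain_with_message_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
)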

Next, we will define the return_response() function, which accepts a user query and passes it to the invoke() method of the RunnableWithMessageHistory class object (chain_with_message_history in the script below).

def return_response(user_query):

  response = chain_with_message_history.invoke(
      {"input": user_query},
      {"configurable": {"session_id": "session_1"}},
  )

  return response
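
Before wiring this function into a chat loop, you can quickly verify that the history is being passed along by asking a follow-up question that only makes sense in context (the queries below are purely illustrative):

# Illustrative check: the second question relies on the chat history from the first
print(return_response("Who wrote the novel 1984?"))
print(return_response("When was it published?"))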

To create the chatbot functionality, add a while loop that executes until a user enters bye. Inside the loop, the user enters a query, which we send to the return_response() function. Finally, we print the response on the console.

print("=======================================================================")
print("Welcome to your own Llama 3.1 Chatbot")
print("=======================================================================")

query = ""
while query != "bye":
  query = input("\033[1m User >>:\033[0m")
  response = return_response(query)
  print(f"\033[1m Chatbot>>:\033[0m {response}")

Output:

[Image: a response with memory from Llama 3.1]

From the above output, you can see that the user question and past history are passed to the LLM in every call and echoed back in the generated text. You can do some post-processing to remove the question text and past history from the response, as sketched below.
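
As a minimal sketch (assuming, as in the output above, that the generated answer simply follows the user's latest question in the returned text), you could keep only the text after the last occurrence of the query. The clean_response() helper below is hypothetical and not part of the original code:

def clean_response(full_text, user_query):
    # Keep only the text generated after the user's latest question;
    # fall back to the full text if the query string isn't found
    if user_query in full_text:
        return full_text.rsplit(user_query, 1)[-1].strip()
    return full_text.strip()

# Inside the chat loop, print clean_response(response, query) instead of the raw response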

I suggest you try Meta Llama 3.1 70B or even 405B to get better responses. However, those models require far more GPU memory. In this article, I used an Nvidia A100 with 40 GB of GPU memory to get responses from the Meta Llama 3.1 8B model.
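
If you're unsure whether your own hardware is up to the task, a quick, illustrative check of the GPU's total and currently allocated memory looks like this:

import torch

# Illustrative check of GPU memory before loading a larger Llama 3.1 variant
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    allocated_gb = torch.cuda.memory_allocated(0) / 1024**3
    print(f"{props.name}: {total_gb:.1f} GB total, {allocated_gb:.1f} GB allocated")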

Conclusion

Open-source models are quickly catching up with proprietary models like ChatGPT and Claude. What makes open-source models attractive is that you have full access to the model weights and don’t have to pay per request to get responses from them.

In this article, you saw how to create a simple chatbot, free of cost, using the Meta Llama 3.1 8B model. If you have the hardware to support them, try the Llama 3.1 70B and 405B models and see if you get better results.