
The Open-Source Dream Team: HuggingFace, Qdrant, Streamlit, and Llama 2 in Your Toolkit

Buzzwords, baby.
June 10, 2024

What are these anyway?


I've been exploring various open-source large language models and vector databases lately, both for client projects and personal experiments. Recently, I decided to dive into Llama 2 to reduce my dependency on the big player in the field, OpenAI. This case study is a quick guide to getting started with Llama 2 and implementing semantic search. Stick around, it gets interesting!

First, the model


Most of you are probably familiar with GPT-4 and GPT-3.5 Turbo, as well as the costs associated with using these models. While these costs aren't prohibitive for personal projects, scaling to production can get pricey. This is where Llama comes in. Developed by Meta, Llama and its more powerful successor, Llama 2, are among the leading open-source large language models. Being open-source means you can run these models locally and deploy them on your chosen infrastructure, avoiding reliance on OpenAI's APIs.

To get started, you can use HuggingFace (after requesting model access from Meta) or a great tool called Ollama (note: I have no affiliation with them; I just think they have a fantastic tool). Ollama lets you run Llama 2 and other models using Docker or natively on macOS and Linux. It's as simple as downloading the app, choosing a model, and running a couple of commands in your terminal.

Download Ollama here
Choose a Model (Llama 2, Llama 2 uncensored, or a variant). Make sure your machine meets the model's requirements; for example, the 7b model generally requires at least 8GB of RAM.

Then, you can either:

Interact with it directly on the CLI by running ollama run llama2 in your terminal, or call it like an API from your app or program (a quick sketch of the API option follows, and I'll show how I wired it into the app further below).
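For reference, here's a minimal sketch of what the API option can look like, assuming Ollama is running locally on its default port (11434) and you've already pulled the llama2 model; the prompt is just a placeholder:

import requests

# Ollama exposes a simple HTTP API; /api/generate returns the model's completion
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Explain semantic search in one sentence.",
        "stream": False,  # return a single JSON object instead of a token stream
    },
)
print(resp.json()["response"])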

Second, the database

With Llama 2 set up, the next step is to configure the database and insert the data. For this, we’ll use a vector database.

What is a vector database? A vector database stores data as vectors, which are lists of numbers representing objects like words or images in a multi-dimensional space. These vectors capture the semantic meaning and relationships in data, enabling fast similarity searches.
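To make that a bit more concrete, here's a tiny sketch using the same all-MiniLM-L6-v2 encoder this post relies on later: semantically related product descriptions land close together in vector space, while unrelated ones don't.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')  # produces 384-dimensional embeddings
vectors = model.encode([
    "wireless bluetooth headphones",
    "cordless earbuds",
    "stainless steel frying pan",
])

# Cosine similarity: the two headphone-like items score much higher than the pan
print(util.cos_sim(vectors[0], vectors[1]))
print(util.cos_sim(vectors[0], vectors[2]))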

I chose Qdrant for this project because it’s new, written in Rust, and works locally using Docker. Here’s how to set it up:

Pull the Qdrant Docker image: docker pull qdrant/qdrant
Run Qdrant: docker run -p 6333:6333 qdrant/qdrant
Access the Qdrant UI at localhost:6333/dashboard.
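Once the container is up, a quick sanity check from Python (assuming the default port) confirms the client can reach it:

from qdrant_client import QdrantClient

client = QdrantClient('http://localhost:6333')
print(client.get_collections())  # a fresh instance reports no collections yet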
Next, we need to insert our data. I used an Amazon products dataset from Kaggle for this demo, assuming an e-commerce scenario where users might search through a product catalog.
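If you want to take a quick look at the dataset before wiring it into the app, something like the sketch below works; the filename is whatever you saved the Kaggle download as, and the columns shown are the ones the demo leans on later for payloads.

import pandas as pd

# Hypothetical filename; substitute whatever you named the Kaggle CSV
df = pd.read_csv("amazon_products.csv")
print(df[["Product Name", "Selling Price", "About Product", "Product Specification"]].head())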

Third, the application

We’ll create an application using Streamlit to handle data uploading, vectorizing, and searching.
Set up the Streamlit app:
import streamlit as st



st.set_page_config(
    page_title="Home Page & Data Loading"
)

st.sidebar.success("Each page is another stage in the demo, starting with the data loading phase.")

st.markdown(
    """
This application is a semantic search demo complete with data uploading and querying.


You can start on the db upload page, where you will be uploading a CSV file of data. 
This uploaded data will be vectorized and the resulting file of embeddings will be saved to your local folder. 
From there, your data will be uploaded to the vector db.

Once you have uploaded your data to the db, you can go to the search page and look up results semantically similar to your query.
"""
)
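A quick note on running it: each snippet in this post is its own page of a multipage Streamlit app (this home page plus the others in a pages/ folder), and you launch the whole thing with streamlit run followed by the home page's filename.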

Data Transformation and Upload:
The code for that step looks something like this:
import streamlit as st
from functions import calculate_embeddings, clean_textfiled
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
from qdrant_client.http import models as rest
import pandas as pd
from qdrant_client import QdrantClient, models
import json

import numpy as np

st.set_page_config(
    page_title="Data Transformed"
)



st.markdown(
"""
Start by uploading a CSV file of data. Your uploaded data will be transformed 
and vectorized and the resulting file of embeddings will be saved to your local folder.
You will need this file for the next phase.** 
"""
)

TEXT_FIELD_NAME = st.text_input("Enter the field name that you will use for the embeddings")
data_file = st.file_uploader("Please upload a CSV file", type="csv")
if data_file is not None:
    df = pd.read_csv(data_file)
    df = clean_textfiled(df, TEXT_FIELD_NAME)
    # The embeddings file will be saved to your local folder
    # (np.save appends ".npy", so it ends up named after the uploaded CSV)
    npy_file_path = data_file.name

    # Load the SentenceTransformer model
    model = SentenceTransformer('all-MiniLM-L6-v2')


    # Split the data into chunks to save RAM
    batch_size = 1000
    num_chunks = (len(df) + batch_size - 1) // batch_size  # ceiling division, no empty trailing chunk

    embeddings_list = []

    # Iterate over chunks and calculate embeddings
    for i in tqdm(range(num_chunks), desc="Calculating Embeddings"):
        start_idx = i * batch_size
        end_idx = (i + 1) * batch_size
        batch_texts = df[TEXT_FIELD_NAME].iloc[start_idx:end_idx].tolist()
        batch_embeddings = calculate_embeddings(batch_texts, model)
        embeddings_list.extend(batch_embeddings)
    
    # Convert embeddings list to a numpy array
    embeddings_array = np.array(embeddings_list)

    # Save the embeddings to an NPY file
    np.save(npy_file_path, embeddings_array)

    print(f"Embeddings saved to {npy_file_path}")


else:
    st.warning("you need to upload a csv file")
After encoding our data and saving it to a .npy file, we need to upload it into our vector db. As a reminder, this case study uses Qdrant as the db. Qdrant supports payloads, which are basically JSON content representing any additional information we want to store along with the vectors. So at this point in the process, we submit both the vector data and the associated payload data to Qdrant. To do this, we first create the collection, then upload the data.
The code for that bit is:
import streamlit as st
from functions import calculate_embeddings, clean_textfiled
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
from qdrant_client.http import models as rest
import pandas as pd
from qdrant_client import QdrantClient, models
import json

import numpy as np

st.set_page_config(
    page_title="Data Upload to DB"
)

VECTOR_FIELD_NAME = st.text_input("Enter the field name that you will use for the embeddings field in the db")


st.markdown(
    """
   **Upload your embeddings file (should be the .npy file in your local folder). This file's vector data will be uploaded to your local Qdrant db instance.**
   **You will also need to upload the csv file that was used to create the embeddings file. This is so we can upload the appropriate payload along with the vectors**
    """
    )
embed_data_file = st.file_uploader("Please upload the corresponding vectors file", type="npy")
data_file = st.file_uploader("Please upload the appropriate CSV file", type="csv")
TEXT_FIELD_NAME = st.text_input("Enter the field name that you will use for the embeddings")
if embed_data_file is not None and data_file is not None:
    client = QdrantClient('http://localhost:6333')
    df = pd.read_csv(data_file)
    df = clean_textfiled(df, TEXT_FIELD_NAME)
    payload = df.to_json(orient='records')
    payload = json.loads(payload)
    vectors = np.load(embed_data_file)
    client.recreate_collection(
        collection_name="amazon-products",
        vectors_config={
            VECTOR_FIELD_NAME: models.VectorParams(
                size=384,  # all-MiniLM-L6-v2 produces 384-dimensional vectors
                distance=models.Distance.COSINE,
                on_disk=True,
            )
        },
        # Quantization is optional, but it can significantly reduce the memory usage
        quantization_config=models.ScalarQuantization(
            scalar=models.ScalarQuantizationConfig(
                type=models.ScalarType.INT8,
                quantile=0.99,
                always_ram=True
            )
        )
    )
    client.upload_collection(
        collection_name="amazon-products",
        vectors={
            VECTOR_FIELD_NAME: vectors
        },
        payload=payload,
        ids=None,  # Vector ids will be assigned automatically
        batch_size=256  # How many vectors will be uploaded in a single request
    )



else:
    st.warning("you need to upload an npy file")
We can also chat over our data using Llama 2 uncensored. The uncensored version basically has the guardrails off compared to the standard one, which lets us get creative with our system prompts and avoid some of the more, uh, HR-speak responses the other version tends to throw out ("As an AI I cannot..."). The fun twist I'm throwing in here is that I'm setting up the system prompt to force the model to respond in Spanish. This obviously gives less consistent results than the more typical English scenarios, but I thought it was an interesting wrinkle to add to the case study and a way to explore the differences in response quality from the model.
Some of the better results can be seen in this chat:
And the code + prompts to get us to that stage can be found here:
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
import streamlit as st
from langchain.llms import Ollama

# from functions import conversational_chat
from streamlit_chat import message
from langchain.vectorstores import Qdrant
from qdrant_client.http import models as qdrant_models




st.set_page_config(
    page_title="Data Chat"
)


st.text("ollama initialize...")
ollama = Ollama(base_url='http://localhost:11434',
model="llama2")


st.markdown(
"""
**The final feature I'm including in this demo is a Q&A bot aka the hello world of LLMs. This particular bot has custom prompting to talk to users in Spanish within the context of e-commerce product data** 
"""
)
# search qdrant
collection_name = "amazon-products"

client = QdrantClient('http://localhost:6333')
# Initialize encoder model
model = SentenceTransformer('all-MiniLM-L6-v2')




if 'history' not in st.session_state:
    st.session_state['history'] = []

if 'generated' not in st.session_state:
    st.session_state['generated'] = ["¡Hola! Estoy aquí para responder cualquier pregunta que tengas sobre: " + collection_name + " 🤗"]

if 'past' not in st.session_state:
    st.session_state['past'] = ["Hola ! 👋"]
    


def conversational_chat(query):
    vector = model.encode(query).tolist()
    hits = client.search(
        collection_name="amazon-products",
        # If the collection was created with a named vector, pass (vector_name, vector) here instead
        query_vector=vector,
        limit=3
    )
    # The payload that comes back from the db has a lot of extra data, so at this point
    # we clean it up to reduce the noise and help the model focus on the important bits.
    # These fields can change depending on what we deem as relevant data.
    context = {}
    for idx, hit in enumerate(hits, start=1):
        context[f"Product {idx}"] = {
            "Product Name": hit.payload["Product Name"],
            "Price": hit.payload["Selling Price"],
            "About Product": hit.payload["About Product"],
            "Product Specification": hit.payload["Product Specification"],
        }
    input_prompt = f"""[INST] <<SYS>>
    You are a customer service agent for a Latin American e-commerce store. As such you must always respond in the Spanish language. Using the search results for context: {context}, do your best to answer any customer questions. If you do not have enough data to reply, make sure to tell the user that they should contact a salesperson. Every time you don't reply in Spanish, you will be punished.
    <</SYS>>

    {query} [/INST]"""
    output = ollama(input_prompt)
    return output

#container for the chat history
response_container = st.container()
#container for the user's text input
container = st.container()

with container:
    with st.form(key='my_form', clear_on_submit=True):
        
        user_input = st.text_input("Query:", placeholder="Puedes hablar sobre los productos de la tienda de e-commerce aqui (:", key='input')
        submit_button = st.form_submit_button(label='Send')
        
    if submit_button and user_input:
        output = conversational_chat(user_input)
        
        st.session_state['past'].append(user_input)
        st.session_state['generated'].append(output)

if st.session_state['generated']:
    with response_container:
        for i in range(len(st.session_state['generated'])):
            message(st.session_state["past"][i], is_user=True, key=str(i) + '_user', avatar_style="big-smile")
            message(st.session_state["generated"][i], key=str(i), avatar_style="thumbs")
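One practical note: for this page to work, both the Qdrant container (localhost:6333) and Ollama (localhost:11434) need to be running, since the script connects to both as soon as it loads.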
Also before I forget, if you are having trouble with some of the functions referenced in the gists above, you can find them here:
import streamlit as st
from typing import List
# Define a function to calculate embeddings
def calculate_embeddings(texts, model):
    embeddings = model.encode(texts, show_progress_bar=False)
    return embeddings

# Define a function to clean up the text field used for embeddings
def clean_textfiled(df, TEXT_FIELD_NAME):
    # Handle missing or non-string values in the TEXT_FIELD_NAME column
    df[TEXT_FIELD_NAME] = df[TEXT_FIELD_NAME].fillna('')  # Replace NaN with empty string
    df[TEXT_FIELD_NAME] = df[TEXT_FIELD_NAME].astype(str)  # Ensure all values are strings

    # Strip the Amazon boilerplate prefix from product descriptions.
    # Note: str.lstrip() takes a set of characters rather than a prefix,
    # so we remove the exact boilerplate string with str.replace() instead.
    boilerplate = 'Make sure this fits by entering your model number. |'
    df[TEXT_FIELD_NAME] = df[TEXT_FIELD_NAME].map(lambda x: x.replace(boilerplate, '').strip())
    return df

Conclusion

Okay, congrats if you made it this far. You now have a (scaled-down, local) version of an LLM-based data pipeline. I've been getting really deep into this space lately (well, deep for anyone without a PhD) and find it incredibly interesting, commercial value aside. The part of this little case study I found most valuable was seeing how the different open-source components fit together into an end-to-end pipeline. Most of the documentation and examples out there for systems like these use OpenAI from top to bottom, which can definitely start running up costs. Exploring these open-source components has alleviated a little of my anxiety around vendor dependency and lock-in, and I highly recommend that anyone attempting to launch LLM-based systems consider them instead. You can find the full source code here: https://github.com/dverasc/semantic-search-app.
