
Deep dive into open-source large language models

June 10, 2024,
11:59 pm


I have been tinkering with different large language models and vector dbs recently for some client work and some personal projects. Recently, I started working with Llama 2 in order to stop relying on the elephant in the room (OpenAI). The following case study is basically a quick start on working with Llama 2 and semantic search as well. Read on if you aren't bored at this point.

First, the model

Okay so most of you reading this are probably already familiar with GPT-4 and GPT-3.5 Turbo. You are probably also familiar with the billing associated with these models. While not insane for personal projects, if you're looking to get anything LLM based into production, you should probably at least attempt to do so without having to pay a tithe to Sam Altman every time your user has a question. This is where Llama comes in. Developed at Meta, Llama and its younger (and more powerful) brother Llama 2 are some of the leading open source large language models. Rare Zuck W. Anyway, being open source means that these models can be run locally and deployed on your chosen infrastructure, as opposed to the OpenAI APIs.
How is this done? Well, you can use HuggingFace (and request model access from Meta), or you can use an awesome tool like Ollama (note: I have zero affiliation with them, they just have a great tool). Ollama allows you to run Llama 2 and other models using Docker or your local machine's native capabilities (macOS and Linux). It's as simple as downloading the app, pulling a model, and then running a couple of commands in your terminal.
  1. Download Ollama

  2. Choose a model (Llama 2, Llama 2 uncensored, or even a variant). NOTE: Be conscious of your machine's capabilities and the model's spec requirements, i.e. the 7b model generally requires at least 8GB of RAM, etc.

  3. A) Interact on the CLI directly by running "ollama run llama2" in your terminal


    B) Call it like you would an API in your app or program (you will see how I did it further below)
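For option B, here is a minimal sketch of calling the local Ollama server over HTTP using only the standard library. It assumes Ollama is listening on its default port (11434) and that the llama2 model has already been pulled. The helper names build_generate_request and generate are made up for this illustration and are not part of the project code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(prompt, model="llama2"):
    """Build (but do not send) a non-streaming generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama2"):
    """Send the prompt to the local model and return its text reply."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("Why is the sky blue?") should return the model's answer as a plain string.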

Second, the database (and the data I guess)

At this point, we've got our local model set up and running. Our next step is getting our database sorted and the data inserted. This is where things start picking up a bit in complexity.
First, what even is a vector db? If you aren't familiar, a vector database is used for storing data in the vector format (simple enough so far). What's a vector? While that could be a whole article in and of itself, at a high level, a vector is basically a list of numbers that represents an object (like a word, image, etc) as a point in multi-dimensional space. For example, the word "cat" could be represented as [0.1, 0.3, 0.8] in a 3D vector space. Vectors that capture the semantic meaning and relationships in data are called embeddings. A vector db, then, is a specialized database optimized for storing and querying large collections of vectors/embeddings. It allows fast similarity searches to find vectors close to a query vector in the semantic embedding space. This enables applications like recommender systems, which retrieve items similar to a user's interests.
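To make "close to a query vector" a bit more concrete, here is a toy sketch of a similarity search using cosine similarity in plain Python. Only the "cat" vector comes from the example above; the other words, their vectors, and the query are invented for illustration. A real vector db uses indexes to avoid comparing against every stored vector, but the idea is the same.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3D "embeddings"; only the "cat" vector comes from the text above.
vectors = {
    "cat": [0.1, 0.3, 0.8],
    "kitten": [0.15, 0.35, 0.75],
    "truck": [0.9, 0.2, 0.05],
}


def most_similar(query, vectors):
    """Return the stored word whose vector is closest to the query vector."""
    return max(vectors, key=lambda word: cosine_similarity(query, vectors[word]))


query = [0.12, 0.3, 0.8]  # pretend this is the embedding of a user's search
print(most_similar(query, vectors))
```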

If you're interested in learning about this component in more detail, I highly recommend the following resources:

  • Vicki Boykis' "What are embeddings" book

  • Prashanth Rao's "Vector Databases" series

In this particular case, I decided to work with Qdrant's vector db. The resources above go into more detail regarding the pros and cons of some of the major vector dbs out there, but to keep it short for this piece, I basically chose it because it is one of the newer ones on the scene, it was written in Rust, it isn't postgres, and it works locally using Docker.

With that said, assuming you have Docker itself already set up, here's how to get Qdrant up locally:

  1. Run 'docker pull qdrant/qdrant' in your terminal

  2. Then, run 'docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant'

You should now be able to see the db UI at localhost:6333/dashboard, and you'll be interfacing with it programmatically at localhost:6333.
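If you want a quick programmatic sanity check that the container is up, here is a small sketch that asks Qdrant's REST API for its list of collections. It assumes the default port mapping from the docker command above. The function names are hypothetical helpers, not part of the project code.

```python
import json
import urllib.error
import urllib.request

QDRANT_URL = "http://localhost:6333"


def collection_names(response_body):
    """Pull collection names out of the JSON returned by GET /collections."""
    return [c["name"] for c in response_body["result"]["collections"]]


def list_collections(base_url=QDRANT_URL):
    """Return collection names, or None if Qdrant is not reachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/collections", timeout=5) as resp:
            return collection_names(json.loads(resp.read()))
    except (urllib.error.URLError, OSError):
        return None
```

On a fresh instance, list_collections() should come back as an empty list; None means nothing answered on that port.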

Okay so the DB is stood up and sorted at this point. The other thing I want to briefly discuss in this section is the data and the scenario this case study presents. I found an Amazon products data set on Kaggle in .csv format. Given this data, the use case assumes an e-commerce scenario, where some entity has a product catalog that users may want to search through or ask questions about. I included the data set in the project repo, but you can also find it on Kaggle.

Third, the thing

The thing here is an application layer + a couple different features that I added to it. The application will be done with Streamlit (shocker I know) and the features are basically some basic CRUD + semantic search and chat using the model and db running locally (see earlier sections).
First, let's get the application started with creating some data. In this case, we will be uploading CSV files and turning that data into vectors, before uploading those to the vector database.
This post shows snippets of the code, but if you're trying to replicate this demo like for like, check out the actual GitHub repo to see the full source code. With that said, here's how we start the application.
This first snippet shows you how to set up the Streamlit multipage app:
import streamlit as st

st.set_page_config(
    page_title="Home Page & Data Loading"
)

st.sidebar.success("Each page is another stage in the demo, starting with the data loading phase.")

st.markdown(
    """
This application is a semantic search demo complete with data uploading and querying.

You can start on the db upload page, where you will be uploading a CSV file of data.
This uploaded data will be vectorized and the resulting file of embeddings will be saved to your local folder.
From there, your data will be uploaded to the vector db.

Once you have uploaded your data to the db, you can go on the search page and look up results semantically similar to your query.

Each phase of this process (data transformation, data upload to db, and then search and chat) has its own dedicated page within this app.
"""
)
This second snippet shows you how we handle our data transformation. At this stage, we are turning our CSV file data into vectors using a Sentence Transformers model from HuggingFace, which is one of the more popular (and open source) ways to create embeddings. This process requires us to chunk the data in the CSV file, encode each chunk, then turn the encodings into a numpy array and save that to a .npy file. You will need this file for the next stage.
Anyway, the code for that step looks something like:
import streamlit as st
from functions import calculate_embeddings, clean_textfiled
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
from qdrant_client.http import models as rest
import pandas as pd
from qdrant_client import QdrantClient, models
import json

import numpy as np

st.set_page_config(
    page_title="Data Transformed"
)

st.markdown(
    """
**Start by uploading a CSV file of data. Your uploaded data will be transformed
and vectorized and the resulting file of embeddings will be saved to your local folder.
You will need this file for the next phase.**
"""
)

TEXT_FIELD_NAME = st.text_input("Enter the field name that you will use for the embeddings")
data_file = st.file_uploader("Please upload a CSV file", type="csv")
if data_file is not None:
    df = pd.read_csv(data_file)
    df = clean_textfiled(df, TEXT_FIELD_NAME)
    # vectors file will save to your local folder
    npy_file_path =

    # Load the SentenceTransformer model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Split the data into chunks to save RAM
    batch_size = 1000
    num_chunks = (len(df) + batch_size - 1) // batch_size  # ceiling division, so no empty trailing chunk

    embeddings_list = []

    # Iterate over chunks and calculate embeddings
    for i in tqdm(range(num_chunks), desc="Calculating Embeddings"):
        start_idx = i * batch_size
        end_idx = (i + 1) * batch_size
        batch_texts = df[TEXT_FIELD_NAME].iloc[start_idx:end_idx].tolist()
        batch_embeddings = calculate_embeddings(batch_texts, model)
        # Collect this chunk's vectors; without this the final array would be empty
        embeddings_list.extend(batch_embeddings)

    # Convert embeddings list to a numpy array
    embeddings_array = np.array(embeddings_list)

    # Save the embeddings to an NPY file
    np.save(npy_file_path, embeddings_array)

    print(f"Embeddings saved to {npy_file_path}")
else:
    st.warning("you need to upload a csv file")
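The chunking step is easy to get subtly wrong, so here is the same idea sketched in isolation with plain Python. Stepping through the data by batch size means the last chunk is simply whatever is left over. fake_encode is a hypothetical stand-in for the real model so the sketch runs without downloading anything.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_encode(texts):
    """Stand-in for model.encode: one tiny 'vector' per text (its length)."""
    return [[float(len(t))] for t in texts]


texts = [f"product {i}" for i in range(2500)]
embeddings = []
for batch in iter_batches(texts, batch_size=1000):
    # Extend (not append) so we end up with one flat list of vectors
    embeddings.extend(fake_encode(batch))

print(len(embeddings))
```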
After encoding our data and saving it to a .npy file, we need to upload this data into our vector db. If you remember, this case study uses Qdrant as the db. Qdrant supports payloads, which are basically JSON content representing any additional information we want to store along with the vectors. So at this point in the process, we are looking to submit the vector data and associated payload data to Qdrant. To do this, we first need to create the collection, then submit the data.
The code for that bit is:
import streamlit as st
from functions import calculate_embeddings, clean_textfiled
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
from qdrant_client.http import models as rest
import pandas as pd
from qdrant_client import QdrantClient, models
import json

import numpy as np

st.set_page_config(
    page_title="Data Upload to DB"
)

VECTOR_FIELD_NAME = st.text_input("Enter the field name that you will use for the embeddings field in the db")

st.markdown(
    """
**Upload your embeddings file (should be the .npy file in your local folder). This file's vector data will be uploaded to your local Qdrant db instance.**

**You will also need to upload the csv file that was used to create the embeddings file. This is so we can upload the appropriate payload along with the vectors**
"""
)
embed_data_file = st.file_uploader("Please upload the corresponding vectors file", type="npy")
data_file = st.file_uploader("Please upload the appropriate CSV file", type="csv")
TEXT_FIELD_NAME = st.text_input("Enter the field name that you will use for the embeddings")
if embed_data_file is not None and data_file is not None:
    client = QdrantClient('http://localhost:6333')
    df = pd.read_csv(data_file)
    df = clean_textfiled(df, TEXT_FIELD_NAME)
    payload = df.to_json(orient='records')
    payload = json.loads(payload)
    vectors = np.load(embed_data_file)
        VECTOR_FIELD_NAME: models.VectorParams(
    # Quantization is optional, but it can significantly reduce the memory usage
        VECTOR_FIELD_NAME: vectors
    ids=None,  # Vector ids will be assigned automatically
    batch_size=256  # How many vectors will be uploaded in a single request?

else:
    st.warning("you need to upload an npy file")
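To make the vector + payload pairing concrete, here is a sketch that builds the kind of JSON body Qdrant's REST API accepts when upserting points into a collection. This is not the code from the repo, which goes through the Python client. The rows, the 3D vectors, and the vector name "description" are all invented for illustration, and build_points is a hypothetical helper.

```python
import json


def build_points(rows, vectors, vector_name):
    """Pair each CSV row (payload) with its embedding under a named vector."""
    if len(rows) != len(vectors):
        raise ValueError("need exactly one vector per row")
    return [
        {"id": idx, "vector": {vector_name: list(vec)}, "payload": row}
        for idx, (row, vec) in enumerate(zip(rows, vectors))
    ]


# Hypothetical rows and 3D vectors; real ones come from the CSV and the .npy file.
rows = [
    {"Product Name": "Toy Car", "Selling Price": "$9.99"},
    {"Product Name": "Puzzle", "Selling Price": "$14.50"},
]
vectors = [[0.1, 0.3, 0.8], [0.7, 0.2, 0.1]]

body = {"points": build_points(rows, vectors, vector_name="description")}
print(json.dumps(body, indent=2))
```

The important bit is that row order in the CSV and vector order in the .npy file must match, which is why the app asks you to upload the same CSV that produced the embeddings.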
We can also chat over our data using Llama 2 uncensored. This uncensored version basically has the guardrails off compared to the standard one. This allows us to get creative with our system prompts and avoid some of the more uhhh HR speak responses that the other version tends to throw out ("As an AI I cannot blah blah blah"). The fun twist I'm throwing in here is that I am setting up the system prompt to force the model to respond in Spanish. This obviously has inconsistent results compared to the more typical English scenarios, but I thought it was an interesting wrinkle to add to our case study, and a way to explore the differences in response quality from the model in more depth.
Some of the better results can be seen in this chat:
And the code + prompts to get us to that stage can be found here:
import streamlit as st
from langchain.llms import Ollama
from langchain.vectorstores import Qdrant
from qdrant_client import QdrantClient
from qdrant_client.http import models as qdrant_models
from sentence_transformers import SentenceTransformer
# from functions import conversational_chat
from streamlit_chat import message

st.set_page_config(
    page_title="Data Chat"
)

st.text("ollama initialize...")
ollama = Ollama(base_url='http://localhost:11434',

st.markdown("**The final feature I'm including in this demo is a Q&A bot aka the hello world of LLMs. This particular bot has custom prompting to talk to users in Spanish within the context of e-commerce product data**")
# search qdrant
collection_name = "amazon-products"

client = QdrantClient('http://localhost:6333')
# Initialize encoder model
model = SentenceTransformer('all-MiniLM-L6-v2')

if 'history' not in st.session_state:
    st.session_state['history'] = []

if 'generated' not in st.session_state:
    st.session_state['generated'] = ["Hola ! Estoy aqui para responder sobre cualquier preguntas que tengas sobre:  " + collection_name + " 🤗"]

if 'past' not in st.session_state:
    st.session_state['past'] = ["Hola ! 👋"]

def conversational_chat(query):
    vector = model.encode(query).tolist()
    hits =
    # The payload that comes back from the db has a lot of extra data, so at this point we
    # clean it up to reduce the noise and help the model focus on the important bits.
    # These fields can change depending on what we deem as relevant data.
    toplevelkeys = ['Product 1', 'Product 2', 'Product 3', 'Product 4', 'Product 5']
    context = {key: [] for key in toplevelkeys}
    # Pair each search hit with its own "Product N" slot
    for key, hit in zip(toplevelkeys, hits):
        context[key].append({
            "Price": hit.payload["Selling Price"],
            "About Product": hit.payload["About Product"],
            "Product Name": hit.payload["Product Name"],
            "Product Specification": hit.payload["Product Specification"],
        })
    input_prompt = f"""[INST] <<SYS>>
    You are a customer service agent for a Latin American e-commerce store. As such you must always respond in the Spanish language. Using the search results for context: {context}, do your best to answer any customer questions. If you do not have enough data to reply, make sure to tell the user that they should contact a salesperson. Every time you don't reply in Spanish, you will be punished.
    <</SYS>>

    {query} [/INST]"""
    output = ollama(input_prompt)
    return output

#container for the chat history
response_container = st.container()
#container for the user's text input
container = st.container()

with container:
    with st.form(key='my_form', clear_on_submit=True):
        user_input = st.text_input("Query:", placeholder="Puedes hablar sobre los productos de la tienda de e-commerce aqui (:", key='input')
        submit_button = st.form_submit_button(label='Send')
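    if submit_button and user_input:
        output = conversational_chat(user_input)
        # Store both sides of the exchange so the chat history below can render them
        st.session_state['past'].append(user_input)
        st.session_state['generated'].append(output)
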
if st.session_state['generated']:
    with response_container:
        for i in range(len(st.session_state['generated'])):
            message(st.session_state["past"][i], is_user=True, key=str(i) + '_user', avatar_style="big-smile")
            message(st.session_state["generated"][i], key=str(i), avatar_style="thumbs")
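Since the prompt assembly is the part you'll probably tweak the most, here is that logic pulled out into a standalone sketch you can run without Streamlit, Qdrant, or Ollama. It trims each payload down to a few fields and wraps everything in the Llama 2 chat template (system message between the SYS markers, user query before the closing INST tag). The function names, the sample payload, and the shortened system message are all made up for illustration.

```python
def build_context(payloads, fields):
    """Keep only the fields we care about, keyed as 'Product 1', 'Product 2', ..."""
    return {
        f"Product {i}": {name: payload.get(name) for name in fields}
        for i, payload in enumerate(payloads, start=1)
    }


def build_prompt(system, query):
    """Wrap a system message and a user query in the Llama 2 chat template."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{query} [/INST]"


# Hypothetical search result payload; real ones come back from the vector db.
payloads = [{"Product Name": "Toy Car", "Selling Price": "$9.99", "Uniq Id": "abc"}]
context = build_context(payloads, fields=["Product Name", "Selling Price"])
system = f"Always respond in Spanish. Use these search results as context: {context}"
prompt = build_prompt(system, "¿Cuánto cuesta el carro de juguete?")
print(prompt)
```

Keeping the context small matters here: every extra payload field is more tokens for the model to wade through before it gets to the question.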
Also before I forget, if you are having trouble with some of the functions referenced in the gists above, you can find them here:
import streamlit as st
from typing import List
# Define a function to calculate embeddings
def calculate_embeddings(texts, model):
    embeddings = model.encode(texts, show_progress_bar=False)
    return embeddings

#define a function to clean up data
def clean_textfiled(df, TEXT_FIELD_NAME):
    # Handle missing or non-string values in the TEXT_FIELD_NAME column
    df[TEXT_FIELD_NAME] = df[TEXT_FIELD_NAME].fillna('')  # Replace NaN with empty string
    df[TEXT_FIELD_NAME] = df[TEXT_FIELD_NAME].astype(str)  # Ensure all values are strings

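    # Drop the boilerplate prefix as an exact string (str.lstrip would strip a set of characters instead)
    prefix = 'Make sure this fits by entering your model number. |'
    df[TEXT_FIELD_NAME] = df[TEXT_FIELD_NAME].map(lambda x: x.removeprefix(prefix).lstrip().rstrip('aAbBcC'))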
    return df
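One thing worth knowing when cleaning text like this: str.lstrip takes a set of characters, not a prefix, so it can chew through real content. Here is a tiny demonstration of the difference next to str.removeprefix (Python 3.9+). The sample product text is invented.

```python
PREFIX = "Make sure this fits by entering your model number. |"

text = PREFIX + " Makes a great gift"

# str.lstrip treats its argument as a set of characters, so it keeps eating
# into the real text as long as each character appears somewhere in PREFIX.
stripped_chars = text.lstrip(PREFIX)

# str.removeprefix removes the exact leading string only.
stripped_prefix = text.removeprefix(PREFIX).lstrip()

print(repr(stripped_chars))
print(repr(stripped_prefix))
```

In this case every character of the product text also appears in the prefix, so the lstrip version wipes out the whole string, while the removeprefix version keeps the description intact.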


Okay so uh congrats if you made it this far. You now have a (scaled-down, local) version of an LLM based data pipeline. I've been getting really deep into this space lately (well, deep for anyone without a PhD) and find it incredibly interesting, commercial value aside. The aspect of this little case study that I found really valuable was all the different open source components that make up the whole end to end pipeline. Most of the documentation and examples out there for systems like these tend to use OpenAI from top to bottom, which can definitely start running up costs. Exploring these open source components has alleviated a little of my anxiety around vendor dependency and lock-in, and I highly recommend that anyone attempting to launch LLM based systems consider using these options instead. The full source code is in the project's GitHub repo.

Things to extend:

- As you may have noticed, this entire project is built to run locally. So the next step would be to take this to a production type of config.

- Qdrant does offer some more enterprisey deployment options, including Qdrant Cloud, a SaaS version that basically gives you a managed instance of the db (note: I have not gone past reading their docs on this product, so I'm not sure how it performs. Proceed accordingly)

- Ollama allowed us to run these open source models locally. This obviously does not work at any sort of scale. With that in mind, HuggingFace has offerings around open source models like Llama 2, and AWS Bedrock and the equivalent offerings at the other hyperscalers also have Llama 2 options for deployment (note: again, I have not gotten past the reading docs stage for these services, so I can't speak too much about how good they are in reality)
