Day 7: Building a Proof of Concept Using OpenAI Embeddings
Exploring OpenAI Embeddings for AI-Powered Tooltips: Building a Node.js proof of concept to answer questions about the Vacation Tracker app features using OpenAI Embeddings and GPT API
My assignment from the end of last week was to investigate the OpenAI Embeddings feature and understand its capabilities and requirements better, with the idea of using it to build the AI-powered tooltips MVP for Knowlo.
Learning is more fun when you build a proof of concept, so I decided to create a simple Node.js script that can answer questions about Vacation Tracker's features.
For this proof of concept, I’ll need to do the following steps:
- Export the articles from the Vacation Tracker’s knowledge base.
- Understand how the OpenAI Embeddings feature works.
- Create a Node.js script to create embeddings from the knowledge base articles and store them somewhere.
- Create a Node.js script to ask questions and get the answers.
This blog post will be a bit more technical than the previous ones. It’ll have some code examples, but I’ll try to add clear explanations for each step.
This was a long week, so let’s get straight to the fun part.
Getting the Knowledge Base Data
There are multiple ways to get the data from a knowledge base. Some knowledge bases have an API, so you can send a few API requests and pull the data. Others offer different mechanisms for data export, such as CSV files.
If a knowledge base supports neither of these methods (which is highly unlikely), there's always the option to crawl it and get the content. Crawling means sending an HTTP request (or opening the home page in a headless browser), saving the content, and following the links to discover other pages.
Vacation Tracker uses Crisp Chat for support chat and helpdesk. Crisp Chat offers an API with many methods, but helpdesk management is still not supported in version 1 of their REST API. However, I can export the helpdesk content in CSV format from the Helpdesk page in the admin panel.
The exported CSV includes all helpdesk articles (both draft and public ones), and it contains the following fields:
- id – the ID of the helpdesk article
- slug – the URL slug (https://example.com/this-is-slug)
- title – the title of the helpdesk article
- description – the description of the helpdesk article (if it exists)
- content – the content of the helpdesk article in markdown format
- published_at – the publish date or an empty string if the article is not published
- created_at – the date of the helpdesk article creation
- updated_at – the date of the latest update
That sounds like the data we need for Knowlo!
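To make the structure concrete, here's what one parsed row could look like as a plain JavaScript object. The field names match the export described above, but the values are invented for illustration:

```javascript
// A hypothetical parsed row from the Crisp helpdesk CSV export.
// Field names match the export; the values here are made up.
const exampleArticle = {
  id: "article_123",
  slug: "how-to-request-a-vacation",
  title: "How to Request a Vacation",
  description: "A short overview of the vacation request flow.",
  content: "# How to Request a Vacation\n\nOpen the app and...",
  published_at: "2023-01-15", // empty string ("") for unpublished drafts
  created_at: "2022-12-01",
  updated_at: "2023-02-10",
};

// Drafts can be filtered out by checking for a non-empty published_at value.
const isPublished = (article) => article.published_at !== "";

console.log(isPublished(exampleArticle)); // true
```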
How OpenAI Embeddings Work
The next thing on my to-do list was understanding OpenAI Embeddings. I had looked at the OpenAI Embeddings documentation a few times but never spent enough time there to really understand the feature. The documentation says that text embeddings measure the relatedness of text strings and that you can use them for search, clustering, recommendations, and more. This sounds useful for our use case, and CofounderGPT said we should be able to use them. But how do they work?
Data classification in machine learning can be complicated when the data has many features (high-dimensional data). Let's say you have a box of fruit and want to organize it by its characteristics: color, shape, size, taste, and texture. This task is hard to visualize if your box contains a banana, a pineapple, a mango, a dragon fruit, grapes, plums, and a quince. However, if you pick only two of these characteristics, such as color and size, visualization becomes much easier (low-dimensional data).
An embedding is simply a list of floating-point numbers (a vector) representing a low-dimensional space into which you can translate high-dimensional vectors (such as words). This low-dimensional representation of data makes it easier for machine learning algorithms to perform search, clustering, recommendations, and so on.
So, we should be able to create an embedding from each of our helpdesk articles (transform each one into a long list of numbers), do the same for the user's question, and then calculate which helpdesk article is closest to that question. The closest helpdesk article should contain the answer, or at least be somehow related to the question.
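The whole idea can be illustrated with a toy example. The vectors below are made up and only two-dimensional (real embeddings have far more dimensions), but the mechanics of finding the closest article are the same:

```javascript
// Toy illustration of embedding-based search. The 2D vectors are invented;
// real embeddings returned by the API have many more dimensions.
const articles = [
  { title: "Requesting a vacation", embedding: [0.9, 0.1] },
  { title: "Connecting Slack",      embedding: [0.1, 0.9] },
];

// Cosine similarity: dot(a, b) / (|a| * |b|). Values closer to 1 mean
// the two vectors point in more similar directions.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Pretend this is the embedding of the question "How do I request time off?"
const questionEmbedding = [0.8, 0.2];

// Pick the article whose embedding is most similar to the question's.
const closest = articles.reduce((best, article) =>
  cosineSimilarity(article.embedding, questionEmbedding) >
  cosineSimilarity(best.embedding, questionEmbedding) ? article : best);

console.log(closest.title); // "Requesting a vacation"
```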
Creating Embeddings from Knowledge Base Articles
Creating an embedding should be easy using OpenAI’s Embeddings API. Ideally, I should be able to read the knowledge base CSV file, create an embedding from each row, and store the data somewhere for further processing.
Let’s write a script to test that. If you want to try this out, you’ll need Node.js (version 18 or newer) installed on your computer, a terminal, a code editor, and an OpenAI API key.
Open your terminal and create a new folder. On Mac and Linux, you can do that by running the following command: mkdir embeddings-poc && cd $_. Then run the npm init -y command to initialize a new Node.js project and create a package.json file with default values. This file tracks the project’s Node.js dependencies (npm is a package manager that comes with Node.js).
While you are still in your terminal and in the “embeddings-poc” folder, run the following command to install the CSV parser Node.js module: npm install csv-parse -S.
Then create a new file in the “embeddings-poc” folder, name it create-embeddings.js, and open it in your favorite code editor (Visual Studio Code, for example).
Paste the following code inside the create-embeddings.js file:
This code snippet is a Node.js script that reads a CSV file containing helpdesk articles, filters out the unpublished ones, and then creates embeddings for each of the published articles using OpenAI’s API. The script saves the article data along with their embeddings in a JSON file, and logs the total number of tokens used during the process.
Once the file was ready, I opened my terminal again and ran the file with the following command: node create-embeddings.js knowledge-base-export.csv sk-jRXXX25p, where:
- node create-embeddings.js runs the file using Node.js
- knowledge-base-export.csv represents the name of the CSV file with the knowledge base export.
- sk-jRXXX25p represents an OpenAI API key.
The result looked similar to the following screenshot:
A few seconds later, the script created embeddings for 105 pages in the Vacation Tracker knowledge base, and it saved the output to the “helpdesk-embeddings.json” file in the “embeddings-poc” folder.
The embedding of each helpdesk article contains many numbers. But now that we have these numbers, we can visualize the helpdesk. For example, we can use the t-SNE technique to reduce each embedding to just two dimensions and then render a 2D visualization of our helpdesk that looks similar to the following image:
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique that is particularly well-suited for the visualization of high-dimensional datasets. It was developed by Laurens van der Maaten and Geoffrey Hinton in 2008.
Now that we have embeddings for each of the helpdesk articles, let’s create another script that allows us to ask questions and get answers. This script should do the following:
- Create an embedding from the question.
- Calculate the cosine similarity between the question embedding and each helpdesk article’s embedding.
- Choose the article with the highest similarity score.
- Use the GPT API to generate an answer using the selected article’s content as context.
Open your terminal and navigate to the “embeddings-poc” folder again. Create a new file, name it generate-answer.js, and open it in your favorite code editor. Then paste the following code inside the new file and save it:
This code is a Node.js script that processes a user’s question and returns an answer using the GPT API, based on a given set of helpdesk articles. As mentioned above, the main steps it follows are:
- Read the precomputed embeddings of helpdesk articles from a JSON file (helpdesk-embeddings.json).
- Accept a user’s question from the command-line arguments.
- Create an embedding for the user’s question using the createEmbedding() function.
- Calculate the cosine similarity between the question embedding and the helpdesk article embeddings using the cosineSimilarity() function.
- Find the article with the highest similarity score.
- Generate the answer using the GPT API with the selected article’s content as context. This is done using the generateAnswer() function, which sends a request to the GPT API with a formatted message containing the context and the user’s question.
- Print the generated answer to the console.
The script takes the OpenAI API token and the user’s question as command-line arguments, and it requires the ‘fs’ and ‘path’ modules to read the helpdesk embeddings file.
The most complicated part of this script was the cosine similarity. Luckily, CofounderGPT created it in a few seconds.
Now let’s run this script with the following command from the terminal: node generate-answer.js sk-jRXXX25p “Some question?”, where:
- node generate-answer.js runs the file using Node.js
- sk-jRXXX25p represents an OpenAI API key
- “Some question?” represents the question we want to be answered
Seconds later, the script answered with something similar to the following:
Woohoo, it worked!
Here’s another visualization of the helpdesk articles (blue dots) and the asked question (red dot):
Summary and scoreboard
It seems that we can use OpenAI embeddings for the Knowlo AI-driven tooltips. This script was just a quick proof of concept, and there’s still a lot of work to deliver the MVP version.
OpenAI Embeddings work faster than I expected. However, the most concerning thing about this proof of concept is the slow speed of GPT-4’s answers. Luckily, the GPT-3.5-turbo model answers much faster, and its answers are not much worse than GPT-4’s.
Another excellent benefit of this approach is built-in multi-language support. Our helpdesk is in English, but if you ask a question in another language, you’ll get an answer in that language!
Here’s the scoreboard for today:
Time spent today: 8h
Total time spent: 41h
Investment today: $0.5 (for OpenAI API usage)
Total investment: $207.5
Now that we’ve established that this works, the next step is to build an actual prototype of the product where users can sign up, log in, import data, and create and manage tooltips. We’ll figure out which pages we need and create mockups for them. Then we’ll find a template that fits our mockups.