Introducing pgvector for Heroku Postgres
Over the past few weeks, we worked on adding pgvector as an extension on Heroku Postgres. We're excited to release this feature, and based on the feedback on our public roadmap, many of you are too. We want to share a bit more about how you can use it and how it may be helpful to you.
All Standard-tier or higher databases running Postgres 15 now support the pgvector extension. Postgres 15 has been the default version on Heroku Postgres since March 2023; if you're on an older version and want to use pgvector, upgrade to Postgres 15. You can get started by running CREATE EXTENSION vector; in a client session.
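For example, from a client session (heroku pg:psql works well for this), you can enable the extension and then confirm it's installed; querying the pg_extension catalog is just one way to check:
CREATE EXTENSION vector;
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';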
The extension adds the vector data type to Heroku Postgres along with additional functions to work with it. Vectors are important for working with large language models and other machine learning applications, as the embeddings generated by these models are often output in vector format. Working with vectors in the database lets you implement things like similarity search across these embeddings. See our launch blog for more background on what pgvector is, its significance, and ideas for how to use this new data type.
An Example: Word Vector Similarity Search
To show a simple example of how to generate and save vector data to your Heroku database, I'm using the Wikipedia2Vec pretrained embeddings. However, you can train your own embeddings or use other models that provide embeddings via API, like HuggingFace or OpenAI. The model you want to use depends on the type of data you're working with. There are models for tasks like computing sentence similarities, searching large texts, or performing image classification. Wikipedia2Vec uses the Word2vec algorithm to generate vectors for individual words, mapping similar words close to each other in a continuous vector space.
I like animals, so I want to use Wikipedia2Vec to group similar animals. I’m using the vector embeddings of each animal and the distance between them to find animals that are alike.
If I want to get the embedding for a word from Wikipedia2Vec, I need to use a model. I downloaded one from the pretrained embeddings on their website. Then I can use their Python module and the get_word_vector function as follows:
from wikipedia2vec import Wikipedia2Vec

# Load the pretrained 100-dimensional English model and look up a single word's embedding
wiki2vec = Wikipedia2Vec.load('enwiki_20180420_100d.pkl')
wiki2vec.get_word_vector('llama')
The output vector looks like this:
memmap([-0.15647224, 0.04055957, 0.48439676, -0.22689971, -0.04544162,
-0.06538601, 0.22609918, -0.26075622, -0.7195759 , -0.24022003,
0.1050799 , -0.5550985 , 0.4054564 , 0.14180332, 0.19856507,
0.09962048, 0.38372937, -1.1912689 , -0.93939453, -0.28067762,
0.04410955, 0.43394643, -0.3429818 , 0.22209083, -0.46317756,
-0.18109794, 0.2775289 , -0.21939017, -0.27015808, 0.72002393,
-0.01586861, -0.23480305, 0.365697 , 0.61743397, -0.07460125,
-0.10441436, -0.6537417 , 0.01339269, 0.06189647, -0.17747395,
0.2669941 , -0.03428648, -0.8533792 , -0.09588563, -0.7616592 ,
-0.11528812, -0.07127796, 0.28456485, -0.12986512, -0.8063386 ,
-0.04875885, -0.27353695, -0.32921 , -0.03807172, 0.10544889,
0.49989182, -0.03783042, -0.37752548, -0.19257008, 0.06255971,
0.25994852, -0.81092316, -0.15077794, 0.00658835, 0.02033841,
-0.32411653, -0.03033727, -0.64633304, -0.43443972, -0.30764043,
-0.11036412, 0.04134453, -0.26934972, -0.0289086 , -0.50319433,
-0.0204528 , -0.00278326, 0.36589545, 0.5446438 , -0.10852882,
0.09699931, -0.01168614, 0.08618425, -0.28925297, -0.25445923,
0.63120073, 0.52186656, 0.3439454 , 0.6686451 , 0.1076297 ,
-0.34688494, 0.05976971, -0.3720558 , 0.20328045, -0.485623 ,
-0.2222396 , -0.22480975, 0.4386788 , -0.7506131 , 0.14270408],
dtype=float32)
To get your vector data into your database:
- Generate the embeddings.
- Add a column to your database to store your embeddings.
- Save the embeddings to the database.
I already have the embeddings from Wikipedia2Vec, so let’s walk through preparing my database and saving them. When creating a vector column, you must declare its length, so check the length of the embeddings your model outputs. In my case, the embeddings are 100 numbers long, so I add a VECTOR(100) column to my table.
CREATE TABLE animals(id serial PRIMARY KEY, name VARCHAR(100), embedding VECTOR(100));
From there, save the items you're interested in to your database. You can do it directly in SQL:
INSERT INTO animals(name, embedding) VALUES ('llama', '[-0.15647223591804504,
…
-0.7506130933761597, 0.1427040845155716]');
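Once a row is saved, one quick sanity check is to confirm that the stored embedding has the dimensionality you declared for the column; pgvector's vector_dims function returns the number of dimensions:
SELECT name, vector_dims(embedding) FROM animals WHERE name = 'llama';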
But you can also use your favorite programming language along with a Postgres client and a pgvector library. For this example, I used Python, psycopg, and pgvector-python. Here I'm using the pretrained embedding file to generate embeddings for a list of animals I made, valeries-animals.txt, and save them to my database.
import os
import psycopg
from pathlib import Path
from pgvector.psycopg import register_vector
from wikipedia2vec import Wikipedia2Vec

# Heroku Postgres provides the connection string in the DATABASE_URL config var
DATABASE_URL = os.environ['DATABASE_URL']

wiki2vec = Wikipedia2Vec.load('enwiki_20180420_100d.pkl')
animals = Path('valeries-animals.txt').read_text().split('\n')

with psycopg.connect(DATABASE_URL, sslmode='require', autocommit=True) as conn:
    # Register the vector type so psycopg can adapt the embedding arrays
    register_vector(conn)
    cur = conn.cursor()
    for animal in animals:
        cur.execute("INSERT INTO animals(name, embedding) VALUES (%s, %s)", (animal, wiki2vec.get_word_vector(animal)))
Now that I have the embeddings in my database, I can use pgvector's operators to query them. The extension includes operators to calculate Euclidean distance (<->), cosine distance (<=>), and inner product (<#>, which technically returns the negative inner product). You can use all three for calculating similarity between vectors. Which one you use depends on your data as well as your use case.
Here I'm using Euclidean distance to find the five animals closest to a shark:
=> SELECT name FROM animals WHERE name != 'shark' ORDER BY embedding <-> (SELECT embedding FROM animals WHERE name = 'shark') LIMIT 5;
name
-----------
crocodile
dolphin
whale
turtle
alligator
(5 rows)
It works! It's worth noting that the model we used is based on words appearing together in Wikipedia articles, and different models or source corpora would likely yield different results. The results here are also limited to the hundred or so animals that I added to my database.
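The same query shape works with the other distance operators; for example, here's the cosine distance version (the ranking may differ slightly from the Euclidean results, since cosine distance ignores vector magnitude):
SELECT name FROM animals WHERE name != 'shark' ORDER BY embedding <=> (SELECT embedding FROM animals WHERE name = 'shark') LIMIT 5;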
pgvector Optimization and Performance Considerations
As you add more vector data to your database, you may notice queries slowing down. You can index vector data like other columns in Postgres, and pgvector provides a few ways to do so, but there are some important considerations to keep in mind:
- Adding an index causes pgvector to switch to using approximate nearest neighbor search instead of exact nearest neighbor search, possibly causing a difference in query results.
- Indexing functions are based on distance calculations, so create one based on the calculation you plan to rely on the most in your application.
- There are two index types supported, IVFFlat and HNSW; the sketch after this list shows the syntax for each. Before you add an IVFFlat index, make sure you have some data in your table for better recall.
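As a rough sketch of the indexing syntax, with illustrative rather than tuned parameters, you could index the animals table for Euclidean distance queries like this (the operator class should match the operator you query with, for example vector_l2_ops for <->, vector_cosine_ops for <=>, and vector_ip_ops for <#>):
CREATE INDEX ON animals USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
-- or, using HNSW instead:
CREATE INDEX ON animals USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);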
Check out the pgvector documentation for more information on indexing and other performance considerations.
Collaborate and Share Your pgvector Projects
Now that pgvector for Heroku Postgres is out in the world, we're really excited to hear what you do with it! One of pgvector's great advantages is that it lets vector data live alongside all the other data you might already have in Postgres. You can add an embedding column to your existing tables and start experimenting. Our launch blog for this feature includes a lot of ideas and possible use cases for how to use this new tool, and I'm sure you can come up with many more. If you have questions, our Support team is available to assist. Don't forget you can share your solutions using the Heroku Button on your repo. If you feel like blogging on your success, tag us on social media and we would love to read about it!