In the previous chapter, PluginManager (pm), you learned how to extend llm using plugins. Now, let's explore how to store and retrieve embeddings using a Collection.
Imagine you have a massive library of books. To find a book similar to one you're reading, you could manually compare it to every other book – a very slow process! A Collection in llm is like a smart librarian: it organizes "embeddings" (numerical representations of text or data) so you can quickly find similar items.
Why do we need a Collection?
Let's say you want to build a question-answering system. You have a database of documents, and you want to find the document most relevant to a user's question. You can embed the documents and the question, then use a Collection to quickly find the document with the most similar embedding to the question. Without a Collection, you'd have to compare the question embedding to every document embedding, which would be very slow.
Core Concepts: What is a Collection?
A Collection is essentially a digital filing cabinet for embeddings. Here are the key things to understand:
-
Embeddings: These are numerical representations of text or other data, created by an EmbeddingModel (a special type of Model). Think of them as coordinates that indicate how similar two pieces of data are.
-
Storing Embeddings: The
Collectionstores these embeddings, along with the original text (or data) and any associated metadata. -
Similarity Search: The main purpose of a
Collectionis to efficiently find embeddings that are similar to a given query embedding. "Similar" means the embeddings are "close" to each other in the multi-dimensional space, which can be calculated using cosine similarity. -
Metadata: You can store extra information (like author, date, or category) along with each embedding. This metadata can be used to filter search results.
Solving the Use Case: Finding Similar Articles
Let's say you want to create a Collection of news articles and then find articles similar to a given article. Here's how you can do it:
-
Create a
Collection: First, you need to create aCollectionobject. This tellsllmwhere to store the embeddings. -
Embed Articles: Use an EmbeddingModel to embed each article.
-
Store Embeddings in the
Collection: Add each article's embedding, along with its text and metadata, to theCollection. -
Find Similar Articles: To find articles similar to a given article, embed that article and then use the
Collectionto find the most similar embeddings.
Here's some simplified Python code (note that to run this code as-is, you'd need to make sure you have sqlite-utils installed, e.g. pip install sqlite-utils):
import llm
from sqlite_utils import Database
# Create a database to store the collection
db = Database("articles.db")
# Create a collection called "news_articles"
collection = llm.Collection(name="news_articles", db=db, model_id="clip")
print(f"Collection '{collection.name}' created using model '{collection.model_id}'.")Explanation:
- First, we import the
llmlibrary andDatabasefromsqlite-utils. - We create an SQLite database called
"articles.db"to store ourCollection. - We create a
Collectionobject named"news_articles", associating it with the database and the"clip"model (an embedding model that you'll need to have available in yourllmsetup). If the collection doesn't exist, it will be created.
Now, let's add some articles to the Collection:
articles = [
("article1", "Climate change is a serious threat."),
("article2", "The economy is growing rapidly."),
("article3", "New technology is transforming society."),
("article4", "Global warming is impacting weather patterns."),
]
collection.embed_multi(articles)
print(f"Added {len(articles)} articles to the collection.")Explanation:
- We have a list of articles, where each article is a tuple containing an ID and the article text.
- We use the
embed_multimethod to embed all of the articles and store them in theCollection.
Finally, let's find articles similar to a given article:
query_text = "What are the effects of climate change?"
similar_articles = collection.similar(query_text, number=2)
print(f"Articles similar to '{query_text}':")
for article in similar_articles:
print(f"- {article.id}: {article.content} (score: {article.score})")Explanation:
- We define the text we want to find similar articles for.
- We use the
similarmethod to find the 2 most similar articles in theCollection. - We print the ID, text, and similarity score for each similar article.
Example output (will vary slightly depending on the model and content):
Articles similar to 'What are the effects of climate change?':
- article4: Global warming is impacting weather patterns. (score: 0.85)
- article1: Climate change is a serious threat. (score: 0.80)
Internal Implementation Walkthrough
Let's see what happens under the hood when you use a Collection to find similar items:
sequenceDiagram
participant User
participant Collection
participant EmbeddingModel as EM
participant Database as DB
User->>Collection: similar("query text", number=2)
Collection->>EM: embed("query text")
EM-->>Collection: embedding_vector
Collection->>Collection: similar_by_vector(embedding_vector, number=2)
Collection->>DB: Query embeddings table for similar vectors
DB-->>Collection: List of similar embeddings
Collection->>User: List of Entry objects
This diagram shows:
- The user calls the
similarmethod on aCollectionobject with the text to search for and the desired number of results. - The
Collectionuses the EmbeddingModel associated with theCollectionto embed the query text. - The
Collectioncalls thesimilar_by_vectorfunction using the generated vector. - The
Collectionqueries the database for embeddings similar to the query embedding. - The database returns a list of similar embeddings.
- The
Collectionreturns a list ofEntryobjects, each containing the ID, text, and similarity score for a similar item.
Diving into the Code
Let's look at some key parts of the Collection class definition in llm/embeddings.py:
from dataclasses import dataclass
from typing import Optional, Dict, Any, Union
import time
import hashlib
@dataclass
class Entry: # Represents a single entry in the collection
id: str
score: Optional[float]
content: Optional[str] = None
metadata: Optional[Dict[str, Any]] = None
class Collection:
def __init__(self, name: str, db, model_id: Optional[str] = None) -> None:
self.name = name
self.db = db
self.model_id = model_id # ID of the embedding model to use
def embed(self, id: str, value: Union[str, bytes], metadata: Optional[Dict[str, Any]] = None):
# Embed the value and store in the database
embedding = self.model().embed(value)
self.db["embeddings"].insert({ # Insert or replace
"collection_id": self.id,
"id": id,
"embedding": self.encode(embedding),
"content": value if isinstance(value, str) else None, # only if store==True
"metadata": self.encode_metadata(metadata),
"updated": int(time.time()),
}, replace=True) # Deduplicate based on content hash
def similar(self, value: Union[str, bytes], number: int = 10) -> list[Entry]:
# Find similar entries
comparison_vector = self.model().embed(value)
return self.similar_by_vector(comparison_vector, number)
@staticmethod
def content_hash(input: Union[str, bytes]) -> bytes:
"Hash content for deduplication. Override to change hashing behavior."
if isinstance(input, str):
input = input.encode("utf8")
return hashlib.md5(input).digest()Explanation:
- The
Entrydataclass defines the structure for storing information about each item in theCollection. - The
Collectionclass stores items along with the metadata and can find the similar items based on the embedding vectors. - The
embedmethod embeds the given text, calculates hash and stores it in the "embeddings" table (replace if existing). - The
similarmethod finds the similar items using the embeddings.
Conclusion
The Collection provides a way to efficiently store and retrieve embeddings, allowing you to quickly find similar items in a large dataset. This is useful for building question-answering systems, recommendation engines, and other applications that rely on semantic similarity.
Congratulations! You have completed the core tutorial for llm. You can now use llm to interact with Large Language Models, customize their behavior, and extend their functionality.
Generated by AI Codebase Knowledge Builder