Local Embeddings

Embeddings are used for semantic similarity search. Natural-language strings are converted into numerical vectors called embeddings. The more conceptually related two strings are, the closer their vectors are.

While you can use an external AI service to compute embeddings, in many cases you can simply compute them locally on your server (no need for a GPU - the CPU will work fine). SmartComponents.LocalEmbeddings is a library to simplify doing this.

With SmartComponents.LocalEmbeddings, you can compute embeddings in under a millisecond, and perform semantic search over hundreds of thousands of candidates in single-digit milliseconds. However, there are limits. To understand the performance characteristics and when you might benefit from moving to an external vector database, see Performance below.

Relationship to Semantic Kernel

Originally, SmartComponents.LocalEmbeddings was a standalone library, but it has since been changed to be a wrapper around Semantic Kernel's own ability to compute embeddings locally using the ONNX runtime.

As such, SmartComponents.LocalEmbeddings is now equivalent to using Semantic Kernel's BertOnnxTextEmbeddingGenerationService, with the following additional features:

  • Acquiring the embeddings model automatically at build time. If you use SK directly, you need to take care of downloading a suitable .onnx file for the embeddings model and making it available at runtime. LocalEmbeddings handles this for you - see below for details of how to customize it.
  • Helper methods for finding the closest match from a set of candidates. If you use SK directly, you can use TensorPrimitives.CosineSimilarity and similar methods to compute similarity between two embeddings, or SemanticTextMemory.SearchAsync to find the closest match from a precomputed set of embeddings. In comparison, LocalEmbeddings provides LocalEmbedder.FindClosest (described below) as an alternative way to search through a set of candidates. Both approaches will perform the same, but are convenient in different circumstances. If you're using SK, it's best to stick with the SK APIs, but if you're not using SK, the LocalEmbedder.FindClosest helper may be easier to use.
  • Alternative representations for embeddings. With Semantic Kernel, the convention is to represent embeddings as Span<float> or ReadOnlyMemory<float>, which are equivalent in space/accuracy to EmbeddingF32. Beyond this, SmartComponents.LocalEmbeddings offers two other representations, EmbeddingI8 and EmbeddingI1 (described below), which give you different space/accuracy tradeoffs. For example, EmbeddingI1 takes up only 1/32 of the memory of EmbeddingF32 or Span<float> and can be used in nearest-neighbour searches considerably faster, at the cost of reduced accuracy. This is described in detail below.

Recommendation: SmartComponents.LocalEmbeddings is now a set of samples of ways you can build further capabilities and conveniences on top of Semantic Kernel's BertOnnxTextEmbeddingGenerationService. If you find these useful, you can use them in your own applications. But if SK's APIs are sufficient for your use cases, you should simply use them directly without using SmartComponents.LocalEmbeddings.

Getting started

Add the SmartComponents.LocalEmbeddings project from this repo to your solution and reference it from your app.

To acquire the local model needed for calculating local embeddings as part of your build, import build/SmartComponents.LocalEmbeddings.targets from the SmartComponents.LocalEmbeddings project into your app project:

<Import Project="<REPO PATH>\src\SmartComponents.LocalEmbeddings\build\SmartComponents.LocalEmbeddings.targets" />

You can now compute embeddings of strings:

using var embedder = new LocalEmbedder();
var cat = embedder.Embed("Cats can be blue");
var dog = embedder.Embed("Dogs can be red");
var snooker = embedder.Embed("Snooker world champion Stephen Hendry");

... and assess their semantic similarity:

var kitten = embedder.Embed("Kittens!!!");
Console.WriteLine(kitten.Similarity(kitten));  // 1.00
Console.WriteLine(kitten.Similarity(cat));     // 0.65
Console.WriteLine(kitten.Similarity(dog));     // 0.53
Console.WriteLine(kitten.Similarity(snooker)); // 0.37

As you can see, "Kittens!!!" is:

  • ... perfectly related to itself
  • ... fairly related to the statement about cats
  • ... less related to the statement about dogs
  • ... very unrelated to the statement about snooker
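
To make the scores above less abstract, here is a minimal, library-independent sketch of cosine similarity, the measure the quantization table later in this document lists for EmbeddingF32. The CosineSimilarity helper and the three-dimensional toy vectors are hypothetical illustrations, not real model output and not the library's own code.

```csharp
using System;

// Cosine similarity: dot(a, b) / (|a| * |b|). The result lies between -1 and 1.
float CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    float dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

// Toy vectors with made-up values (a real embedding has hundreds of dimensions)
float[] kitten = { 0.9f, 0.1f, 0.0f };
float[] cat = { 0.8f, 0.3f, 0.1f };
float[] snooker = { 0.0f, 0.2f, 0.9f };

Console.WriteLine(CosineSimilarity(kitten, kitten));  // a vector is maximally similar to itself
Console.WriteLine(CosineSimilarity(kitten, cat));     // nearby direction, high score
Console.WriteLine(CosineSimilarity(kitten, snooker)); // nearly orthogonal, low score
```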

Performing similarity search

If you want, you can find the closest matches from a set of candidate embeddings simply using candidates.OrderByDescending(x => x.Similarity(target)).Take(count).

However, it's a little more efficient to use LocalEmbedder.FindClosest, because it only sorts the best N matches, instead of sorting all the candidates. FindClosest accepts the following parameters:

  • target: An embedding previously returned by embedder.Embed or embedder.EmbedRange
  • candidates: An enumerable of tuples of the form (item, embedding). The item can be the string that you embedded, or it can be any other object of generic type T.
  • maxResults: The maximum number of results
  • minSimilarity: Optional. If set, candidates with a similarity below this threshold won't be included.

The return value is an array of T values, ordered most-similar-first. If you want the similarity scores too, use LocalEmbedder.FindClosestWithScore instead, which returns an array of SimilarityScore<T> giving both the T and its score.
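
To illustrate why keeping only the best N matches is cheaper than sorting every candidate, here is a rough sketch of top-N selection using a min-heap. This is one common way to do it and is not necessarily how FindClosest is implemented; the TopN helper and the scores are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Keeps at most maxResults items in a min-heap, so the full
// candidate list is never sorted.
T[] TopN<T>(IEnumerable<(T Item, float Score)> scored, int maxResults)
{
    var heap = new PriorityQueue<T, float>(); // the lowest score is dequeued first
    foreach (var (item, score) in scored)
    {
        if (heap.Count < maxResults)
        {
            heap.Enqueue(item, score);
        }
        else if (heap.TryPeek(out _, out var lowest) && score > lowest)
        {
            heap.EnqueueDequeue(item, score); // evicts the current worst item
        }
    }

    // Dequeue yields worst-first, so fill the result array from the end
    var result = new T[heap.Count];
    for (var i = result.Length - 1; i >= 0; i--)
        result[i] = heap.Dequeue();
    return result;
}

// Made-up similarity scores for illustration
var scored = new[]
{
    ("Soccer", 0.71f), ("Tennis", 0.60f), ("Swimming", 0.31f),
    ("Golf", 0.64f), ("Gymnastics", 0.35f),
};
var best = TopN(scored, 3);
Console.WriteLine(string.Join(", ", best)); // Soccer, Golf, Tennis
```

With N results out of M candidates, this does O(M log N) work instead of the O(M log M) of a full sort, which matters when M is large and N is small.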

For example, given this class:

class Sport
{
    public required string Name { get; init; }
    public EmbeddingF32 Embedding { get; init; }
}

... and this data:

var sportNames = new[] { "Soccer", "Tennis", "Swimming", "Horse riding", "Golf", "Gymnastics" };

var sports = sportNames.Select(name => new Sport
{
    Name = name,
    Embedding = embedder.Embed(name)
}).ToArray();

You can find the closest 3 Sport instances for the string "ball game":

var candidates = sports.Select(a => (a, a.Embedding));
var target = embedder.Embed("ball game");
Sport[] closest = LocalEmbedder.FindClosest(target, candidates, maxResults: 3);

// Displays: Soccer, Golf, Tennis
Console.WriteLine(string.Join(", ", closest.Select(x => x.Name)));

While at first it might feel cumbersome to pass an enumerable of tuples for candidates, this allows you to get back any data type (e.g., the strings that were embedded, or entity objects holding those embeddings, or just their int ID values), and allows you to prefilter by applying a .Where(...) clause, all without any extra memory allocations.

Alternatively, as shorthand, you can use EmbedRange to produce the tuples over many inputs at once:

var candidates = embedder.EmbedRange(sports, x => x.Name);
Sport[] closest = LocalEmbedder.FindClosest(
  embedder.Embed("ball game"),
  candidates,
  maxResults: 3);

Reusing LocalEmbedder instances

LocalEmbedder instances are:

  • Thread-safe. You can share a singleton instance across many threads.
  • Disposable. Each instance holds unmanaged resources, since it uses the ONNX runtime internally to run the embeddings ML model. Remember to dispose it.
  • Expensive to create. Each instance has to load the ML model and set up a session with ONNX.
    • Where possible, retain an instance as a singleton and reuse it. For example, register it as a DI service using builder.Services.AddSingleton<LocalEmbedder>(). In that case, you won't dispose it because the DI container will take care of that.

Shrinking embeddings (quantization)

By default, SmartComponents.LocalEmbeddings uses an embeddings model that returns 384-dimensional embedding vectors. Each component is represented by a single-precision float value (4 bytes), so the memory required for a raw, unquantized embedding is 384*4 = 1536 bytes.

In many scenarios this is too much memory. A million embeddings would take about 1.5 GB, which is a lot to hold in memory, and a lot to add to your database.

A common technique for reducing the space needed to store vector data is quantization. There are many forms of quantization. LocalEmbeddings has three built-in storage formats for embeddings, offering different quantizations:

| Type | Size (bytes) | Similarity | Info |
| --- | --- | --- | --- |
| EmbeddingF32 | 1536 | Cosine | Raw, unquantized data. Each component is stored as a float. Maximum accuracy. |
| EmbeddingI8 | 388 | Cosine | Each component is stored as a sbyte (signed byte), plus there's 4 bytes to hold a scale factor. This cuts storage significantly, while retaining good accuracy. It's similar to SQ8 quantization in Faiss. |
| EmbeddingI1 | 48 | Hamming | Each component is stored as a single bit, equivalent to LSH quantization in Faiss. This is a massive reduction in storage, at the cost of moderate reduction in accuracy. |
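
The sizes in the table follow directly from the 384 dimensions of the default model. As a quick check of the arithmetic (the variable names are just for illustration):

```csharp
using System;

const int dimensions = 384;
const long count = 1_000_000;

long f32Bytes = dimensions * sizeof(float);                // one float per component
long i8Bytes = dimensions * sizeof(sbyte) + sizeof(float); // one sbyte per component, plus a 4-byte scale factor
long i1Bytes = dimensions / 8;                             // one bit per component

foreach (var (name, bytes) in new[]
{
    ("EmbeddingF32", f32Bytes), ("EmbeddingI8", i8Bytes), ("EmbeddingI1", i1Bytes),
})
{
    Console.WriteLine($"{name}: {bytes} bytes each, {bytes * count / 1_000_000.0:F1} MB per million embeddings");
}
```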

When evaluating similarity, the scores are computed directly from the quantized representations, without expanding back to a 1536-byte representation. As such, similarity search works faster on the smaller quantizations, because the CPU is processing far fewer bytes.
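
As a rough illustration of the 8-bit idea, the sketch below scales each component so that the largest magnitude maps to 127, then computes cosine similarity directly on the sbyte values. This is an assumed, simplified scheme for explanation only. The QuantizeI8 and CosineI8 helpers are hypothetical, and the library's actual EmbeddingI8 encoding may differ.

```csharp
using System;

// Maps the largest absolute component to 127 and rounds the rest.
// The scale factor lets you approximately recover the original values.
sbyte[] QuantizeI8(float[] vector, out float scale)
{
    var maxAbs = 0f;
    foreach (var v in vector)
        maxAbs = MathF.Max(maxAbs, MathF.Abs(v));

    scale = maxAbs == 0 ? 1f : maxAbs / 127f;
    var values = new sbyte[vector.Length];
    for (var i = 0; i < vector.Length; i++)
        values[i] = (sbyte)MathF.Round(vector[i] / scale);
    return values;
}

// Cosine similarity computed directly on the quantized values.
// A per-vector scale factor cancels out of the cosine formula, so it isn't needed here.
double CosineI8(sbyte[] a, sbyte[] b)
{
    long dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Made-up vectors for illustration
var first = QuantizeI8(new[] { 0.5f, -1.0f, 0.25f, 0.1f }, out var firstScale);
var second = QuantizeI8(new[] { 0.4f, -0.9f, 0.3f, 0.0f }, out _);

Console.WriteLine(string.Join(", ", first)); // small integers in the range -127..127
Console.WriteLine(firstScale);               // multiply by this to approximate the originals
Console.WriteLine(CosineI8(first, second));  // close to the unquantized cosine similarity
```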

You can only compute similarity within a type. That is, an EmbeddingI1 can be compared to another EmbeddingI1, but not to an EmbeddingF32.
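
The 1-bit format can be pictured in a similar way. The sketch below keeps only the sign of each component and scores two embeddings by the fraction of matching bits, which is a Hamming-based similarity. It assumes a sign-based encoding and a particular bit order, neither of which is confirmed by this document; QuantizeI1 and HammingSimilarity are hypothetical helpers. It also shows why mixing types makes no sense: packed bits and floats are not comparable quantities.

```csharp
using System;
using System.Numerics;

// One bit per component: 1 if the component is positive, otherwise 0.
// Assumes the dimension count is a multiple of 8 (384 is).
byte[] QuantizeI1(float[] vector)
{
    var bits = new byte[(vector.Length + 7) / 8];
    for (var i = 0; i < vector.Length; i++)
    {
        if (vector[i] > 0)
            bits[i / 8] |= (byte)(1 << (i % 8));
    }
    return bits;
}

// Similarity = fraction of bits that match = 1 - (Hamming distance / total bits)
float HammingSimilarity(byte[] a, byte[] b)
{
    var differing = 0;
    for (var i = 0; i < a.Length; i++)
        differing += BitOperations.PopCount((uint)(a[i] ^ b[i]));
    return 1f - differing / (float)(a.Length * 8);
}

// Made-up 8-dimensional vectors for illustration
var x = QuantizeI1(new[] { 0.3f, -0.2f, 0.8f, 0.1f, -0.5f, 0.4f, -0.1f, 0.9f });
var y = QuantizeI1(new[] { 0.2f, -0.4f, 0.7f, -0.1f, -0.6f, 0.5f, -0.3f, 0.8f });

Console.WriteLine(HammingSimilarity(x, x)); // 1
Console.WriteLine(HammingSimilarity(x, y)); // 0.875 (7 of the 8 signs agree)
```

XOR and popcount operate on whole machine words, which is one reason searches over bit-packed embeddings can be fast.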

To get an embedding in a chosen format, pass it as a generic parameter to Embed or EmbedRange. Examples:

// To produce a single embedding:
var embedding = embedder.Embed<EmbeddingI1>(someString);

// Or to produce a set of (item, embedding) pairs:
var candidates = embedder.EmbedRange<Sport, EmbeddingI1>(sports, x => x.Name);

Persisting embeddings

When you want to save embeddings to a file or database, you can use the Buffer property to access the raw memory as a ReadOnlyMemory<byte>. This property is available on any embedding type. Example:

// Normally you'd store embeddings in a database, not a file on disk,
// but for simplicity let's use a file
var originalEmbedding = embedder.Embed<EmbeddingF32>("The chickens are here to see you");
// File.Create truncates any existing file, so no stale bytes are left behind
using (var file = File.Create("somefile"))
{
    await file.WriteAsync(originalEmbedding.Buffer);
}

// Now load it back from disk. Be sure to use the same embedding type.
var loadedBuffer = File.ReadAllBytes("somefile");
var loadedEmbedding = new EmbeddingF32(loadedBuffer);

// Displays "1" (the embeddings are identical)
Console.WriteLine(originalEmbedding.Similarity(loadedEmbedding));
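
If you are curious what a raw byte buffer of floats looks like, the following library-independent sketch round-trips a float[] through a byte[] using MemoryMarshal. It assumes the buffer is simply the floats' in-memory bytes, which depends on the machine's byte order. Whether EmbeddingF32.Buffer uses exactly this layout is an assumption here, so treat it as an illustration of the idea and not as a specification of the format.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;

// Made-up values standing in for an embedding's components
float[] original = { 0.25f, -0.5f, 0.75f };

// View the floats as raw bytes (4 bytes each) and write them to a temp file
var path = Path.GetTempFileName();
File.WriteAllBytes(path, MemoryMarshal.AsBytes(original.AsSpan()).ToArray());

// Read the bytes back and reinterpret them as floats
byte[] loadedBytes = File.ReadAllBytes(path);
float[] loaded = MemoryMarshal.Cast<byte, float>(loadedBytes.AsSpan()).ToArray();
File.Delete(path);

Console.WriteLine(loadedBytes.Length);                           // 12
Console.WriteLine(Enumerable.SequenceEqual(original, loaded));   // True
```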

If you want to access the numerical values of the vector components (e.g., to store them in an external vector database), you can use the following properties:

| Embedding type | Values properties | Values type |
| --- | --- | --- |
| EmbeddingF32 | Values | ReadOnlyMemory<float> |
| EmbeddingI8 | Values and Magnitude | ReadOnlyMemory<sbyte> and float |
| EmbeddingI1 | Not available... | ... because the packed bits are simply what you find in Buffer |

Storing and querying using Entity Framework

With Entity Framework, you can add a byte[] property onto an entity class to hold the raw data for an embedding. For example:

public class Document
{
    public int DocumentId { get; set; }
    public int OwnerId { get; set; }
    public required string Title { get; set; }
    public required string Body { get; set; }

    // It's helpful to use the property name to keep track of which
    // format of embedding is being used
    public required byte[] EmbeddingI8Buffer { get; set; }
}

You can populate the byte[] property using ToArray():

var doc = new Document
{
   // ... set other properties here ...
   EmbeddingI8Buffer = embedder.Embed<EmbeddingI8>(title).Buffer.ToArray()
};

You might want to recompute this embedding each time the user edits whatever text is used to compute it.

Next, if you need to search over a small number of entities (e.g., just the records created by the current user), it may be sufficient to load the data on demand and then run a similarity search:

using var dbContext = new MyDbContext();

// Load whatever subset of the data you want to consider
// No need to fetch all the columns - only need ID/embedding pairs
var currentUserDocs = await dbContext.Documents
    .Where(x => x.OwnerId == currentUserId)
    .Select(x => new { x.DocumentId, x.EmbeddingI8Buffer })
    .ToListAsync();

// Perform the similarity search
int[] matchingDocIds = LocalEmbedder.FindClosest(
    embedder.Embed<EmbeddingI8>(searchText),
    currentUserDocs.Select(x => (x.DocumentId, new EmbeddingI8(x.EmbeddingI8Buffer))),
    maxResults: 5);

// Load the complete entities for the matching documents
var matchingDocs = await dbContext.Documents
    .Where(x => matchingDocIds.Contains(x.DocumentId))
    .ToDictionaryAsync(x => x.DocumentId);
var matchingDocsInOrder = matchingDocIds.Select(x => matchingDocs[x]);

In many cases you'll want to search over a large number of entities, e.g., tens of thousands of entities shared across all users. You would not want to retrieve them all from the database for every search (especially for each keystroke in a Smart ComboBox). Instead, it would make sense to have the server cache the list of ID/embedding pairs in memory. An (int Id, EmbeddingI1 Embedding) pair would be only 52 bytes, so holding a million of them would not be problematic (about 52 MB). You could cache them in a MemoryCache that will expire at regular intervals, offering a tradeoff between database load and freshness of results.
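
The caching idea can be sketched without any extra packages. The CachedFor helper below is hypothetical and stands in for MemoryCache: it holds a loaded value and reloads it once it is older than a chosen age. The loader here returns placeholder data where a real app would query the database for ID/embedding pairs.

```csharp
using System;
using System.Collections.Generic;

// Returns a function that serves a cached value and reloads it
// once the value is older than maxAge. Safe to call from many threads.
Func<T> CachedFor<T>(TimeSpan maxAge, Func<T> load)
{
    var gate = new object();
    var hasValue = false;
    var loadedAt = DateTime.MinValue;
    T value = default!;

    return () =>
    {
        lock (gate)
        {
            var now = DateTime.UtcNow;
            if (!hasValue || now - loadedAt > maxAge)
            {
                value = load();
                loadedAt = now;
                hasValue = true;
            }
            return value;
        }
    };
}

var loads = 0;
var getPairs = CachedFor(TimeSpan.FromMinutes(5), () =>
{
    loads++; // stands in for a database query
    return new List<(int Id, byte[] Bits)> { (1, new byte[48]), (2, new byte[48]) };
});

var firstCall = getPairs();
var secondCall = getPairs(); // served from memory, no reload
Console.WriteLine(loads);    // 1
```

A shorter maxAge gives fresher results at the cost of more database queries, which is the tradeoff described above.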

Customizing the underlying embeddings model

LocalEmbedder works by using the ONNX runtime, which can execute many different embeddings models on CPU or GPU (and often, CPU works faster for such small models).

The SmartComponents.LocalEmbeddings library does not actually contain any ML model, but it is configured to download a model when you first build your application. You can configure which model is downloaded.

The default model that gets downloaded on build is bge-micro-v2, an MIT-licensed BERT embedding model, which has been quantized down to just 22.9 MiB, runs efficiently on CPU, and scores well on benchmarks - outperforming many gigabyte-sized models.

If you want to use a different model, specify the URL to its .onnx file and the vocabulary that should be used for tokenization. For example, to use gte-tiny, set the following in your .csproj:

<PropertyGroup>
  <LocalEmbeddingsModelUrl>https://huggingface.co/TaylorAI/gte-tiny/resolve/main/onnx/model_quantized.onnx</LocalEmbeddingsModelUrl>
  <LocalEmbeddingsVocabUrl>https://huggingface.co/TaylorAI/gte-tiny/resolve/main/vocab.txt</LocalEmbeddingsVocabUrl>
</PropertyGroup>

Requirements: The model must be in ONNX format, accept BERT-tokenized text, accept inputs labelled input_ids, attention_mask, token_type_ids, and return an output tensor suitable for mean pooling. Many sentence transformer models on Hugging Face follow these patterns. These are often 384-dimensional embeddings.
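
For readers unfamiliar with mean pooling: a BERT-style model outputs one vector per token, and mean pooling averages those vectors (skipping padding tokens) to get a single embedding for the whole text. The sketch below shows the idea with tiny made-up numbers. The MeanPool helper is hypothetical and is not the code that the library or Semantic Kernel runs.

```csharp
using System;

// tokenEmbeddings: one vector per token. attentionMask: 1 for real tokens, 0 for padding.
float[] MeanPool(float[][] tokenEmbeddings, int[] attentionMask)
{
    var dims = tokenEmbeddings[0].Length;
    var pooled = new float[dims];
    var counted = 0;

    for (var t = 0; t < tokenEmbeddings.Length; t++)
    {
        if (attentionMask[t] == 0)
            continue; // ignore padding tokens

        counted++;
        for (var d = 0; d < dims; d++)
            pooled[d] += tokenEmbeddings[t][d];
    }

    // Average over the real tokens (guarding against an all-padding input)
    for (var d = 0; d < dims; d++)
        pooled[d] /= Math.Max(counted, 1);
    return pooled;
}

// Two real tokens and one padding token, with 2-dimensional toy vectors
var tokens = new[]
{
    new[] { 1f, 2f },
    new[] { 3f, 4f },
    new[] { 100f, 100f }, // padding, masked out below
};
var pooledEmbedding = MeanPool(tokens, new[] { 1, 1, 0 });
Console.WriteLine(string.Join(", ", pooledEmbedding)); // 2, 3
```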

Performance

As a rough approximation, based on an Intel i9-11950H CPU:

  • Using embedder.Embed for a 50-character string may take around 0.5ms of CPU time (shorter text is quicker).
    • So, if you're computing embeddings over many thousands of strings (or very long strings), it's worth storing the computed embeddings in your existing database (e.g., each time a user saves changes to the corresponding text) instead of recomputing them all from scratch each time the app restarts.
  • An in-memory, single-threaded similarity search using LocalEmbedder.FindClosest with EmbeddingF32 can search through 1,000 candidates in around 0.06ms, or 100,000 candidates in around 6ms (it's linear in the number of candidates, independent of the text length). The 100,000-candidate search drops to ~2.8ms if you use EmbeddingI1.
    • So, if you need to search through tens of millions of candidates, you should consider more advanced similarity search options such as using Faiss or an external vector database.
    • From benchmarks, LocalEmbedder.FindClosest performance is equivalent to Faiss using its Flat index type. You'll only get better speeds from Faiss using its more powerful indexes such as HNSW or IVF, which requires training on your data.

Recommendations for scaling up

The overall goal for SmartComponents.LocalEmbeddings is to make semantic search easy to get started with. It may be sufficient for many applications. But if you outgrow it:

  • You can use an external service to compute embeddings, e.g., OpenAI embeddings or Azure OpenAI embeddings
  • You can perform similarity search using Faiss on your server (e.g., in .NET via FaissSharp, faissmask, or FaissNet). This allows you to set up much more powerful indexes that can be trained on your own data. It's a lot more to learn.
  • Or instead of Faiss, you can use an external vector database such as pgvector or cloud-based vector database services.

Usage with Semantic Kernel

As mentioned in the introduction to this document, SmartComponents.LocalEmbeddings is simply a wrapper around Semantic Kernel's BertOnnxTextEmbeddingGenerationService, showing ways to add further conveniences and capabilities.

The LocalEmbedder type implements SK's ITextEmbeddingGenerationService interface, so it can be used directly with any Semantic Kernel APIs that need to generate embeddings. For example, when constructing a SemanticTextMemory, you can pass an instance of LocalEmbedder as the embeddingGenerator constructor argument:

var storage = new VolatileMemoryStore(); // Requires a reference to Microsoft.SemanticKernel.Plugins.Memory
using var embedder = new LocalEmbedder();
var semanticTextMemory = new SemanticTextMemory(storage, embedder);

// ... and now use semanticTextMemory to store and search for items