Project author: ankane

Project description:
Nearest neighbor search for Rails and Postgres
Language: Ruby
Repository: git://github.com/ankane/neighbor.git
Created: 2021-02-16T04:36:33Z
Project community: https://github.com/ankane/neighbor

License: MIT License


Neighbor

Nearest neighbor search for Rails

Supports:

  • Postgres (cube and pgvector)
  • SQLite (sqlite-vec) - experimental
  • MariaDB 11.7 - experimental
  • MySQL 9 (searching requires HeatWave) - experimental

Installation

Add this line to your application’s Gemfile:

  gem "neighbor"

For Postgres

Neighbor supports two extensions: cube and pgvector. cube ships with Postgres, while pgvector supports more dimensions and approximate nearest neighbor search.

For cube, run:

  rails generate neighbor:cube
  rails db:migrate

For pgvector, install the extension and run:

  rails generate neighbor:vector
  rails db:migrate

For SQLite

Add this line to your application’s Gemfile:

  gem "sqlite-vec"

And run:

  rails generate neighbor:sqlite

Getting Started

Create a migration

  class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
    def change
      # cube
      add_column :items, :embedding, :cube

      # pgvector, MariaDB, and MySQL
      add_column :items, :embedding, :vector, limit: 3 # dimensions

      # sqlite-vec
      add_column :items, :embedding, :binary
    end
  end

Add to your model

  class Item < ApplicationRecord
    has_neighbors :embedding
  end

Update the vectors

  item.update(embedding: [1.0, 1.2, 0.5])

Get the nearest neighbors to a record

  item.nearest_neighbors(:embedding, distance: "euclidean").first(5)

Get the nearest neighbors to a vector

  Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)

Records returned from nearest_neighbors will have a neighbor_distance attribute

  nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
  nearest_item.neighbor_distance

See the additional docs below for cube, pgvector, sqlite-vec, MariaDB, and MySQL, or check out some examples

cube

Distance

Supported values are:

  • euclidean
  • cosine
  • taxicab
  • chebyshev
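
For intuition, these are the formulas behind the four metrics, sketched as plain Ruby helpers (illustrative only, not part of the gem's API):

```ruby
# Illustrative distance formulas (not part of Neighbor's API)
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  1 - dot / (norm.(a) * norm.(b))
end

def taxicab(a, b)
  a.zip(b).sum { |x, y| (x - y).abs }
end

def chebyshev(a, b)
  a.zip(b).map { |x, y| (x - y).abs }.max
end
```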

For cosine distance with cube, vectors must be normalized before being stored.

  class Item < ApplicationRecord
    has_neighbors :embedding, normalize: true
  end
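
Normalizing means dividing each component by the vector's Euclidean length so the stored vector has unit norm; with normalize: true this happens for you, but the underlying operation is roughly:

```ruby
# L2 normalization: scale a vector to unit length (illustrative sketch)
def l2_normalize(vector)
  norm = Math.sqrt(vector.sum { |v| v * v })
  vector.map { |v| v / norm }
end
```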

For inner product with cube, see this example.

Dimensions

The cube type can have up to 100 dimensions by default. See the Postgres docs for how to increase this.

For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.

  class Item < ApplicationRecord
    has_neighbors :embedding, dimensions: 3
  end

pgvector

Distance

Supported values are:

  • euclidean
  • inner_product
  • cosine
  • taxicab
  • hamming
  • jaccard
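
hamming and jaccard apply to bit vectors; in plain Ruby terms (illustrative helpers, assuming bit strings like "101"):

```ruby
# Hamming distance: number of positions where the bits differ (illustrative)
def hamming(a, b)
  a.chars.zip(b.chars).count { |x, y| x != y }
end

# Jaccard distance: 1 - |intersection| / |union| over the set bits (illustrative)
def jaccard(a, b)
  pairs = a.chars.zip(b.chars)
  both = pairs.count { |x, y| x == "1" && y == "1" }
  either = pairs.count { |x, y| x == "1" || y == "1" }
  1.0 - both.to_f / either
end
```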

Dimensions

The vector type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.

The halfvec type can have up to 16,000 dimensions, and half vectors with up to 4,000 dimensions can be indexed.

The bit type can have up to 83 million dimensions, and bit vectors with up to 64,000 dimensions can be indexed.

The sparsevec type can have up to 16,000 non-zero elements, and sparse vectors with up to 1,000 non-zero elements can be indexed.

Indexing

Add an approximate index to speed up queries. Create a migration with:

  class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.0]
    def change
      add_index :items, :embedding, using: :hnsw, opclass: :vector_l2_ops
      # or
      add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
    end
  end

Use :vector_cosine_ops for cosine distance and :vector_ip_ops for inner product.

Set the size of the dynamic candidate list with HNSW

  Item.connection.execute("SET hnsw.ef_search = 100")

Or the number of probes with IVFFlat

  Item.connection.execute("SET ivfflat.probes = 3")

Half-Precision Vectors

Use the halfvec type to store half-precision vectors

  class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
    def change
      add_column :items, :embedding, :halfvec, limit: 3 # dimensions
    end
  end

Half-Precision Indexing

Index vectors at half precision for smaller indexes

  class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.0]
    def change
      add_index :items, "(embedding::halfvec(3)) halfvec_l2_ops", using: :hnsw
    end
  end

Get the nearest neighbors

  Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)

Binary Vectors

Use the bit type to store binary vectors

  class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
    def change
      add_column :items, :embedding, :bit, limit: 3 # dimensions
    end
  end

Get the nearest neighbors by Hamming distance

  Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)

Binary Quantization

Use expression indexing for binary quantization

  class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.0]
    def change
      add_index :items, "(binary_quantize(embedding)::bit(3)) bit_hamming_ops", using: :hnsw
    end
  end
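
Binary quantization keeps only the sign of each component (1 for positive, 0 otherwise), shrinking each dimension to a single bit; a plain-Ruby sketch of the idea:

```ruby
# Sketch of binary quantization: reduce each component to its sign bit
def binary_quantize(vector)
  vector.map { |v| v > 0 ? "1" : "0" }.join
end
```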

Sparse Vectors

Use the sparsevec type to store sparse vectors

  class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
    def change
      add_column :items, :embedding, :sparsevec, limit: 3 # dimensions
    end
  end

Get the nearest neighbors

  embedding = Neighbor::SparseVector.new({0 => 0.9, 1 => 1.3, 2 => 1.1}, 3)
  Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)
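
Neighbor::SparseVector takes an index => value hash plus the total number of dimensions. If you start from a dense array, one way to derive that hash (a hypothetical helper, not part of the gem) is to keep only the non-zero entries:

```ruby
# Build an index => value hash from a dense array, dropping zeros (hypothetical helper)
def to_sparse_hash(dense)
  dense.each_with_index.reject { |v, _| v.zero? }.to_h { |v, i| [i, v] }
end
```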

sqlite-vec

Distance

Supported values are:

  • euclidean
  • cosine
  • taxicab
  • hamming

Dimensions

For sqlite-vec, it’s a good idea to specify the number of dimensions to ensure all records have the same number.

  class Item < ApplicationRecord
    has_neighbors :embedding, dimensions: 3
  end

Virtual Tables

You can also use virtual tables

  class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
    def change
      # Rails 8+
      create_virtual_table :items, :vec0, [
        "id integer PRIMARY KEY AUTOINCREMENT NOT NULL",
        "embedding float[3] distance_metric=L2"
      ]

      # Rails < 8
      execute <<~SQL
        CREATE VIRTUAL TABLE items USING vec0(
          id integer PRIMARY KEY AUTOINCREMENT NOT NULL,
          embedding float[3] distance_metric=L2
        )
      SQL
    end
  end

Use distance_metric=cosine for cosine distance

You can optionally ignore any shadow tables that are created

  ActiveRecord::SchemaDumper.ignore_tables += [
    "items_chunks", "items_rowids", "items_vector_chunks00"
  ]

Get the k nearest neighbors

  Item.where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)

Filter by primary key

  Item.where(id: [2, 3]).where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)

Int8 Vectors

Use the type option for int8 vectors

  class Item < ApplicationRecord
    has_neighbors :embedding, dimensions: 3, type: :int8
  end

Binary Vectors

Use the type option for binary vectors

  class Item < ApplicationRecord
    has_neighbors :embedding, dimensions: 8, type: :bit
  end

Get the nearest neighbors by Hamming distance

  Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)

MariaDB

Distance

Supported values are:

  • euclidean
  • cosine
  • hamming

Indexing

Vector columns must use null: false to add a vector index

  class CreateItems < ActiveRecord::Migration[8.0]
    def change
      create_table :items do |t|
        t.vector :embedding, limit: 3, null: false
        t.index :embedding, type: :vector
      end
    end
  end

Binary Vectors

Use the bigint type to store binary vectors

  class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
    def change
      add_column :items, :embedding, :bigint
    end
  end

Note: Binary vectors can have up to 64 dimensions

Get the nearest neighbors by Hamming distance

  Item.nearest_neighbors(:embedding, 5, distance: "hamming").first(5)
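
Here the integer 5 corresponds to the bit pattern 101 stored in a bigint. A helper like this (hypothetical, and assuming the first dimension maps to the most significant bit) shows the correspondence:

```ruby
# Pack an array of bits into an integer, first bit most significant
# (hypothetical helper illustrating the bigint encoding above)
def bits_to_bigint(bits)
  bits.join.to_i(2)
end
```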

MySQL

Distance

Supported values are:

  • euclidean
  • cosine
  • hamming

Note: The DISTANCE() function is only available on HeatWave

Binary Vectors

Use the binary type to store binary vectors

  class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
    def change
      add_column :items, :embedding, :binary
    end
  end

Get the nearest neighbors by Hamming distance

  Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)

Examples

OpenAI Embeddings

Generate a model

  rails generate model Document content:text embedding:vector{1536}
  rails db:migrate

And add has_neighbors

  class Document < ApplicationRecord
    has_neighbors :embedding
  end

Create a method to call the embeddings API

  def embed(input)
    url = "https://api.openai.com/v1/embeddings"
    headers = {
      "Authorization" => "Bearer #{ENV.fetch("OPENAI_API_KEY")}",
      "Content-Type" => "application/json"
    }
    data = {
      input: input,
      model: "text-embedding-3-small"
    }
    response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
    JSON.parse(response.body)["data"].map { |v| v["embedding"] }
  end

Pass your input

  input = [
    "The dog is barking",
    "The cat is purring",
    "The bear is growling"
  ]
  embeddings = embed(input)

Store the embeddings

  documents = []
  input.zip(embeddings) do |content, embedding|
    documents << {content: content, embedding: embedding}
  end
  Document.insert_all!(documents)

And get similar documents

  document = Document.first
  document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)

See the complete code

Cohere Embeddings

Generate a model

  rails generate model Document content:text embedding:bit{1536}
  rails db:migrate

And add has_neighbors

  class Document < ApplicationRecord
    has_neighbors :embedding
  end

Create a method to call the embed API

  def embed(input, input_type)
    url = "https://api.cohere.com/v2/embed"
    headers = {
      "Authorization" => "Bearer #{ENV.fetch("CO_API_KEY")}",
      "Content-Type" => "application/json"
    }
    data = {
      texts: input,
      model: "embed-v4.0",
      input_type: input_type,
      embedding_types: ["ubinary"]
    }
    response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
    JSON.parse(response.body)["embeddings"]["ubinary"].map { |e| e.map { |v| v.chr.unpack1("B*") }.join }
  end

Pass your input

  input = [
    "The dog is barking",
    "The cat is purring",
    "The bear is growling"
  ]
  embeddings = embed(input, "search_document")

Store the embeddings

  documents = []
  input.zip(embeddings) do |content, embedding|
    documents << {content: content, embedding: embedding}
  end
  Document.insert_all!(documents)

Embed the search query

  query = "forest"
  query_embedding = embed([query], "search_query")[0]

And search the documents

  Document.nearest_neighbors(:embedding, query_embedding, distance: "hamming").first(5).map(&:content)

See the complete code

Sentence Embeddings

You can generate embeddings locally with Informers.

Generate a model

  rails generate model Document content:text embedding:vector{384}
  rails db:migrate

And add has_neighbors

  class Document < ApplicationRecord
    has_neighbors :embedding
  end

Load a model

  model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")

Pass your input

  input = [
    "The dog is barking",
    "The cat is purring",
    "The bear is growling"
  ]
  embeddings = model.(input)

Store the embeddings

  documents = []
  input.zip(embeddings) do |content, embedding|
    documents << {content: content, embedding: embedding}
  end
  Document.insert_all!(documents)

And get similar documents

  document = Document.first
  document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)

See the complete code

You can use Neighbor for hybrid search with Informers.

Generate a model

  rails generate model Document content:text embedding:vector{768}
  rails db:migrate

And add has_neighbors and a scope for keyword search

  class Document < ApplicationRecord
    has_neighbors :embedding
    scope :search, ->(query) {
      where("to_tsvector(content) @@ plainto_tsquery(?)", query)
        .order(Arel.sql("ts_rank_cd(to_tsvector(content), plainto_tsquery(?)) DESC", query))
    }
  end

Create some documents

  Document.create!(content: "The dog is barking")
  Document.create!(content: "The cat is purring")
  Document.create!(content: "The bear is growling")

Generate an embedding for each document

  embed = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
  embed_options = {model_output: "sentence_embedding", pooling: "none"} # specific to embedding model
  Document.find_each do |document|
    embedding = embed.(document.content, **embed_options)
    document.update!(embedding: embedding)
  end

Perform keyword search

  query = "growling bear"
  keyword_results = Document.search(query).limit(20).load_async

And semantic search in parallel (the query prefix is specific to the embedding model)

  query_prefix = "Represent this sentence for searching relevant passages: "
  query_embedding = embed.(query_prefix + query, **embed_options)
  semantic_results =
    Document.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(20).load_async

To combine the results, use Reciprocal Rank Fusion (RRF)

  Neighbor::Reranking.rrf(keyword_results, semantic_results).first(5)
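
RRF scores each record by summing 1 / (k + rank) over the result lists it appears in (k is commonly 60), so records ranked well by both searches rise to the top. A sketch of the computation on plain arrays (illustrative; the gem's Neighbor::Reranking.rrf operates on the result records themselves):

```ruby
# Reciprocal Rank Fusion over ranked lists (illustrative, k = 60)
def rrf(*lists, k: 60)
  scores = Hash.new(0.0)
  lists.each do |list|
    # rank is 0-based here, so the top item contributes 1 / (k + 1)
    list.each_with_index { |item, rank| scores[item] += 1.0 / (k + rank + 1) }
  end
  scores.sort_by { |_, score| -score }.map(&:first)
end
```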

Or a reranking model

  rerank = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-xsmall-v1")
  results = (keyword_results + semantic_results).uniq
  rerank.(query, results.map(&:content)).first(5).map { |v| results[v[:doc_id]] }

See the complete code

You can generate sparse embeddings locally with Transformers.rb.

Generate a model

  rails generate model Document content:text embedding:sparsevec{30522}
  rails db:migrate

And add has_neighbors

  class Document < ApplicationRecord
    has_neighbors :embedding
  end

Load a model to generate embeddings

  class EmbeddingModel
    def initialize(model_id)
      @model = Transformers::AutoModelForMaskedLM.from_pretrained(model_id)
      @tokenizer = Transformers::AutoTokenizer.from_pretrained(model_id)
      @special_token_ids = @tokenizer.special_tokens_map.map { |_, token| @tokenizer.vocab[token] }
    end

    def embed(input)
      feature = @tokenizer.(input, padding: true, truncation: true, return_tensors: "pt", return_token_type_ids: false)
      output = @model.(**feature)[0]
      values = Torch.max(output * feature[:attention_mask].unsqueeze(-1), dim: 1)[0]
      values = Torch.log(1 + Torch.relu(values))
      values[0.., @special_token_ids] = 0
      values.to_a
    end
  end

  model = EmbeddingModel.new("opensearch-project/opensearch-neural-sparse-encoding-v1")

Pass your input

  input = [
    "The dog is barking",
    "The cat is purring",
    "The bear is growling"
  ]
  embeddings = model.embed(input)

Store the embeddings

  documents = []
  input.zip(embeddings) do |content, embedding|
    documents << {content: content, embedding: Neighbor::SparseVector.new(embedding)}
  end
  Document.insert_all!(documents)

Embed the search query

  query = "forest"
  query_embedding = model.embed([query])[0]

And search the documents

  Document.nearest_neighbors(:embedding, Neighbor::SparseVector.new(query_embedding), distance: "inner_product").first(5).map(&:content)

See the complete code

Disco Recommendations

You can use Neighbor for online item-based recommendations with Disco. We’ll use MovieLens data for this example.

Generate a model

  rails generate model Movie name:string factors:cube
  rails db:migrate

And add has_neighbors

  class Movie < ApplicationRecord
    has_neighbors :factors, dimensions: 20, normalize: true
  end

Fit the recommender

  data = Disco.load_movielens
  recommender = Disco::Recommender.new(factors: 20)
  recommender.fit(data)

Store the item factors

  movies = []
  recommender.item_ids.each do |item_id|
    movies << {name: item_id, factors: recommender.item_factors(item_id)}
  end
  Movie.create!(movies)

And get similar movies

  movie = Movie.find_by(name: "Star Wars (1977)")
  movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)

See the complete code for cube and pgvector

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help: report bugs, fix bugs and submit pull requests, write or clarify documentation, and suggest or build new features.

To get started with development:

  git clone https://github.com/ankane/neighbor.git
  cd neighbor
  bundle install

  # Postgres
  createdb neighbor_test
  bundle exec rake test:postgresql

  # SQLite
  bundle exec rake test:sqlite

  # MariaDB
  docker run -e MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1 -e MARIADB_DATABASE=neighbor_test -p 3307:3306 mariadb:11.7
  bundle exec rake test:mariadb

  # MySQL
  docker run -e MYSQL_ALLOW_EMPTY_PASSWORD=1 -e MYSQL_DATABASE=neighbor_test -p 3306:3306 mysql:9
  bundle exec rake test:mysql