Merged
15 changes: 9 additions & 6 deletions README.md
@@ -35,8 +35,7 @@ Vicinity is a light-weight, low-dependency vector store. It provides a simple an

There are many nearest neighbors packages and methods out there. However, we found it difficult to compare them. Every package has its own interface, quirks, and limitations, and learning a new package can be time-consuming. In addition to that, how do you effectively evaluate different packages? How do you know which one is the best for your use case?

This is where Vicinity comes in. Instead of learning a new interface for each new package or backend, Vicinity provides a unified interface for all backends. This allows you to easily experiment with different indexing methods and distance metrics and choose the best one for your use case. Vicinity also provides a simple way to evaluate the performance of different backends, allowing you to measure the queries per second and recall.

## Quickstart

@@ -49,13 +48,13 @@ Optionally, [install any of the supported backends](#installation), or simply in
pip install vicinity[all]
```


The following code snippet demonstrates how to use Vicinity for nearest neighbor search:

```python
import numpy as np
from vicinity import Vicinity, Backend, Metric

-# Create some dummy data
+# Create some dummy data as strings or other serializable objects
items = ["triforce", "master sword", "hylian shield", "boomerang", "hookshot"]
vectors = np.random.rand(len(items), 128)

@@ -82,12 +81,14 @@ results = vicinity.query(query_vectors, k=3)
```

Saving and loading a vector store:

```python
vicinity.save('my_vector_store')
vicinity = Vicinity.load('my_vector_store')
```

Evaluating a backend:

```python
# Use the first 1000 vectors as query vectors
query_vectors = vectors[:1000]
@@ -100,6 +101,7 @@ qps, recall = vicinity.evaluate(
```

## Main Features

Vicinity provides the following features:
- Lightweight: Minimal dependencies and fast performance.
- Flexible Backend Support: Use different backends for vector storage and search.
@@ -108,6 +110,7 @@ Vicinity provides the following features:
- Easy to Use: Simple and intuitive API.

## Supported Backends

The following backends are supported:
- `BASIC`: A simple (exact matching) flat index for vector storage and search.
- [HNSW](https://github.com/nmslib/hnswlib): Hierarchical Navigable Small World Graph (HNSW) for ANN search using hnswlib.
@@ -126,8 +129,6 @@ The following backends are supported:
- `ivfpqr`: Inverted file search with product quantizer and refinement.
- [VOYAGER](https://github.com/spotify/voyager): Voyager is a library for performing fast approximate nearest-neighbor searches on an in-memory collection of vectors.



NOTE: the ANN backends do not support dynamic deletion. To delete items, you need to recreate the index. Insertion is supported in the following backends: `FAISS`, `HNSW`, and `Usearch`. The `BASIC` backend supports both insertion and deletion.

### Backend Parameters
@@ -159,7 +160,9 @@ NOTE: the ANN backends do not support dynamic deletion. To delete items, you nee
| | `m` | The number of connections between nodes in the tree’s internal data structure. | `16` |

## Installation

The following installation options are available:

```bash
# Install the base package
pip install vicinity
14 changes: 13 additions & 1 deletion tests/conftest.py
@@ -24,7 +24,19 @@
 @pytest.fixture(scope="session")
 def items() -> list[str]:
     """Fixture providing a list of item names."""
-    return [f"item{i}" for i in range(1, 10001)]
+    return [f"item{i}" if i % 2 == 0 else {"name": f"item{i}", "id": i} for i in range(1, 10001)]
+
+
+@pytest.fixture(scope="session")
+def non_serializable_items() -> list[str]:
+    """Fixture providing a list of non-serializable items."""
+
+    class NonSerializable:
+        def __init__(self, name: str, id: int) -> None:
+            self.name = name
+            self.id = id
+
+    return [NonSerializable(f"item{i}", i) for i in range(1, 10001)]


 @pytest.fixture(scope="session")
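As a rough illustration of what "serializable" means for these fixtures, here is a minimal sketch. It uses the standard-library `json` module as a stand-in for `orjson` (an assumption, made so the example has no third-party dependency; `json` raises a plain `TypeError` where the PR's tests expect `orjson`'s `JSONEncodeError`). The `NonSerializable` class mirrors the fixture above, and `is_json_serializable` is a hypothetical helper that is not part of Vicinity.

```python
import json
from typing import Any


class NonSerializable:
    """Plain Python object with no JSON representation (mirrors the fixture)."""

    def __init__(self, name: str, id: int) -> None:
        self.name = name
        self.id = id


def is_json_serializable(item: Any) -> bool:
    """Return True if the standard-library encoder can encode `item`."""
    try:
        json.dumps(item)
    except (TypeError, ValueError):
        return False
    return True


# Mixed items like those in the updated `items` fixture: strings and dicts.
mixed_items = ["item2", {"name": "item1", "id": 1}]
serializable_flags = [is_json_serializable(item) for item in mixed_items]

# A custom class instance cannot be encoded without a custom encoder.
custom_flag = is_json_serializable(NonSerializable("item1", 1))

# Strings and dicts survive a dump/load round trip unchanged.
round_tripped = json.loads(json.dumps(mixed_items))
```

Strings and dicts of JSON-native values round-trip cleanly, which is why the mixed `items` fixture can be saved and loaded, while the `NonSerializable` instances cannot.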
30 changes: 23 additions & 7 deletions tests/test_vicinity.py
@@ -4,6 +4,7 @@

 import numpy as np
 import pytest
+from orjson import JSONEncodeError

 from vicinity import Vicinity
 from vicinity.datatypes import Backend
@@ -162,6 +163,21 @@ def test_vicinity_save_and_load_vector_store(tmp_path: Path, vicinity_instance_w
     assert v.vector_store is not None


+def test_vicinity_save_and_load_non_serializable_items(
+    tmp_path: Path, non_serializable_items: list[str], vectors: np.ndarray
+) -> None:
+    """
+    Test Vicinity.save and Vicinity.load with non-serializable items.
+
+    :param tmp_path: Temporary directory provided by pytest.
+    :param non_serializable_items: A list of non-serializable items.
+    """
+    vicinity = Vicinity.from_vectors_and_items(vectors=vectors, items=non_serializable_items)
+    save_path = tmp_path / "vicinity_data"
+    with pytest.raises(JSONEncodeError):
+        vicinity.save(save_path)
+
+
 def test_index_vector_store(vicinity_with_basic_backend_and_store: Vicinity, vectors: np.ndarray) -> None:
     """
     Index vectors in the Vicinity instance.
@@ -183,18 +199,17 @@ def test_index_vector_store(vicinity_with_basic_backend_and_store: Vicinity, vec
         vicinity_with_basic_backend_and_store.get_vector_by_index([-1])


-def test_vicinity_insert_duplicate(vicinity_instance: Vicinity, query_vector: np.ndarray) -> None:
+def test_vicinity_insert_duplicate(items: list[str], vicinity_instance: Vicinity, query_vector: np.ndarray) -> None:
     """
     Test that Vicinity.insert raises ValueError when inserting duplicate items.

     :param vicinity_instance: A Vicinity instance.
     :raises ValueError: If inserting items that already exist.
     """
-    new_items = ["item1"]
     new_vector = query_vector

     with pytest.raises(ValueError):
-        vicinity_instance.insert(new_items, new_vector[None, :])
+        vicinity_instance.insert(items[0], new_vector[None, :])


 def test_vicinity_delete_nonexistent(vicinity_instance: Vicinity) -> None:
@@ -281,22 +296,23 @@ def test_vicinity_delete_and_query(vicinity_instance: Vicinity, items: list[str]
         return

     # Delete some items from the Vicinity instance
-    items_to_delete = ["item2", "item4", "item6"]
+    non_existing_items_indices = [0, 1, 2]
+    items_to_delete = [items[i] for i in non_existing_items_indices]
     vicinity_instance.delete(items_to_delete)

     # Ensure the items are no longer in the items list
     for item in items_to_delete:
         assert item not in vicinity_instance.items

     # Query using a vector of an item that wasn't deleted
-    item3_index = items.index("item3")
-    item3_vector = vectors[item3_index]
+    existing_item_index = 3
+    item3_vector = vectors[existing_item_index]

     results = vicinity_instance.query(item3_vector, k=10)
     returned_items = [item for item, _ in results[0]]

     # Check that the queried item is in the results
-    assert "item3" in returned_items
+    assert items[existing_item_index] in returned_items


 def test_vicinity_evaluate(vicinity_instance: Vicinity, vectors: np.ndarray) -> None:
43 changes: 25 additions & 18 deletions vicinity/vicinity.py
@@ -11,6 +11,7 @@
 import numpy as np
 import orjson
 from numpy import typing as npt
+from orjson import JSONEncodeError

 from vicinity import Metric
 from vicinity.backends import AbstractBackend, BasicBackend, BasicVectorStore, get_backend_class
@@ -29,7 +30,7 @@ class Vicinity:

     def __init__(
         self,
-        items: Sequence[str],
+        items: Sequence[Any],
         backend: AbstractBackend,
         metadata: Union[dict[str, Any], None] = None,
         vector_store: BasicVectorStore | None = None,
@@ -49,7 +50,7 @@ def __init__(
             raise ValueError(
                 "Your vector space and list of items are not the same length: " f"{len(backend)} != {len(items)}"
             )
-        self.items: list[str] = list(items)
+        self.items: list[Any] = list(items)
         self.backend: AbstractBackend = backend
         self.metadata = metadata or {}
         self.vector_store = vector_store
@@ -74,7 +75,7 @@ def __len__(self) -> int:
     def from_vectors_and_items(
         cls: type[Vicinity],
         vectors: npt.NDArray,
-        items: Sequence[str],
+        items: Sequence[Any],
         backend_type: Backend | str = Backend.BASIC,
         store_vectors: bool = False,
         **kwargs: Any,
@@ -177,6 +178,7 @@ def save(
         :param folder: The path to which to save the JSON file. The vectors are saved separately. The JSON contains a path to the numpy file.
         :param overwrite: Whether to overwrite the JSON and numpy files if they already exist.
         :raises ValueError: If the path is not a directory.
+        :raises JSONEncodeError: If the items are not serializable.
         """
         path = Path(folder)
         path.mkdir(parents=True, exist_ok=overwrite)
@@ -185,9 +187,11 @@ def save(
             raise ValueError(f"Path {path} should be a directory.")

         items_dict = {"items": self.items, "metadata": self.metadata, "backend_type": self.backend.backend_type.value}
-
-        with open(path / "data.json", "wb") as file_handle:
-            file_handle.write(orjson.dumps(items_dict))
+        try:
+            with open(path / "data.json", "wb") as file_handle:
+                file_handle.write(orjson.dumps(items_dict))
+        except JSONEncodeError as e:
+            raise JSONEncodeError(f"Items could not be encoded to JSON because they are not serializable: {e}")

         self.backend.save(path)
         if self.vector_store is not None:
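The re-raise in this hunk embeds the original message via `{e}` but does not use `from e`, so Python records the original error only as implicit context rather than as the explicit cause. The sketch below shows one possible variant with explicit chaining. It is an illustration only, not the PR's code: it assumes the standard-library `json` module and `TypeError` as stand-ins for `orjson` and `JSONEncodeError`, and `dump_items` is a hypothetical helper name.

```python
import json
from typing import Any


def dump_items(items: list[Any]) -> bytes:
    """Encode items to JSON bytes, re-raising encode failures with a clearer message."""
    try:
        return json.dumps({"items": items}).encode("utf-8")
    except TypeError as e:
        # `from e` sets __cause__, so the original encoder error stays attached.
        raise TypeError(
            f"Items could not be encoded to JSON because they are not serializable: {e}"
        ) from e


# Trigger the failure path with an object the encoder cannot handle.
try:
    dump_items([object()])
except TypeError as err:
    caught = err
```

With `from e`, the traceback reads "The above exception was the direct cause of the following exception", which makes the origin of the failure easier to follow.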
@@ -211,7 +215,7 @@ def load(cls, filename: PathLike) -> Vicinity:

         with open(folder_path / "data.json", "rb") as file_handle:
             data: dict[str, Any] = orjson.loads(file_handle.read())
-        items: Sequence[str] = data["items"]
+        items: Sequence[Any] = data["items"]

         metadata: dict[str, Any] = data["metadata"]
         backend_type = Backend(data["backend_type"])
@@ -227,7 +231,7 @@ def load(cls, filename: PathLike) -> Vicinity:

         return instance

-    def insert(self, tokens: Sequence[str], vectors: npt.NDArray) -> None:
+    def insert(self, tokens: Sequence[Any], vectors: npt.NDArray) -> None:
         """
         Insert new items into the vector space.

@@ -241,16 +245,12 @@ def insert(self, tokens: Sequence[str], vectors: npt.NDArray) -> None:
         if vectors.shape[1] != self.dim:
             raise ValueError("The inserted vectors must have the same dimension as the backend.")

-        item_set = set(self.items)
-        for token in tokens:
-            if token in item_set:
-                raise ValueError(f"Token {token} is already in the vector space.")
-            self.items.append(token)
+        self.items.extend(tokens)
         self.backend.insert(vectors)
         if self.vector_store is not None:
             self.vector_store.insert(vectors)
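The removed duplicate check built `set(self.items)`, which only works when every item is hashable. Now that items are typed as `Sequence[Any]`, a dict item (as in the updated test fixture) would make that call fail. A minimal illustration in plain Python, independent of Vicinity:

```python
# Strings are hashable, so a set-based membership check works.
string_items = ["item2", "item4"]
string_set = set(string_items)

# Dicts are unhashable: building a set from them raises TypeError.
mixed_items = ["item2", {"name": "item1", "id": 1}]
try:
    set(mixed_items)
    set_failed = False
except TypeError:
    set_failed = True

# Equality-based membership on a list still works for unhashable items,
# at the cost of a linear scan per lookup.
dict_is_present = {"name": "item1", "id": 1} in mixed_items
```

Note that in the lines shown in this hunk, `insert` no longer rejects tokens that are already present.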

-    def delete(self, tokens: Sequence[str]) -> None:
+    def delete(self, tokens: Sequence[Any]) -> None:
         """
         Delete tokens from the vector space.

@@ -260,10 +260,17 @@ def delete(self, tokens: Sequence[str]) -> None:
         :param tokens: A list of tokens to remove from the vector space.
         :raises ValueError: If any passed tokens are not in the vector space.
         """
-        try:
-            curr_indices = [self.items.index(token) for token in tokens]
-        except ValueError as exc:
-            raise ValueError(f"Token {exc} was not in the vector space.") from exc
+        tokens_to_find = list(tokens)
+        curr_indices = []
+        for idx, elem in enumerate(self.items):
+            matching_tokens = [t for t in tokens_to_find if t == elem]
+            if matching_tokens:
+                curr_indices.append(idx)
+                for t in matching_tokens:
+                    tokens_to_find.remove(t)
+
+        if tokens_to_find:
+            raise ValueError(f"Tokens {tokens_to_find} were not in the vector space.")

         self.backend.delete(curr_indices)
         if self.vector_store is not None:
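The new lookup in `delete` can be read in isolation. The sketch below lifts the added lines into a free function so the behavior is easy to try out; `find_indices` is a hypothetical name used only for this illustration and is not part of Vicinity.

```python
from typing import Any, Sequence


def find_indices(items: Sequence[Any], tokens: Sequence[Any]) -> list[int]:
    """Return the index of the first item equal to each token; raise if any token is missing."""
    tokens_to_find = list(tokens)
    indices: list[int] = []
    for idx, elem in enumerate(items):
        # Compare by equality, so unhashable items such as dicts are supported.
        matching = [t for t in tokens_to_find if t == elem]
        if matching:
            indices.append(idx)
            for t in matching:
                tokens_to_find.remove(t)
    if tokens_to_find:
        raise ValueError(f"Tokens {tokens_to_find} were not in the vector space.")
    return indices


items = [{"name": "item1", "id": 1}, "item2", {"name": "item3", "id": 3}, "item4"]
found = find_indices(items, ["item4", {"name": "item1", "id": 1}])

# A token that matches no item triggers the ValueError path.
try:
    find_indices(items, ["missing"])
    missing_raised = False
except ValueError:
    missing_raised = True
```

Two properties follow from this shape: the indices come back in item order rather than token order, and each item is compared against every token still outstanding, so the cost grows with `len(items) * len(tokens)` in the worst case.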