LSM-Vec — Documentation

Quick Start (C++)

Prerequisites

Compiler & Tools

C++17 compiler (GCC 8+ or Clang 10+)
CMake >= 3.10
GNU Make
Boost (headers only)

System Libraries

Ubuntu / Debian

sudo apt-get update
sudo apt-get install -y \
  build-essential cmake libboost-dev \
  libzstd-dev libsnappy-dev liblz4-dev libbz2-dev zlib1g-dev

macOS (Homebrew)

brew install cmake boost zstd snappy lz4 bzip2 jemalloc

Step 1 -- Build Aster

Aster is included as a Git submodule. Initialize it and build the static library:

git submodule update --init --recursive
make aster

This produces lib/aster/librocksdb.a. The build typically takes a few minutes and uses all available cores automatically.

Step 2 -- Build the LSM-Vec Libraries

make lib

Outputs:

File	Description
`build/lib/liblsmvec.a`	Static library
`build/lib/liblsmvec.so` (Linux) / `liblsmvec.dylib` (macOS)	Shared library

Step 3 -- Build the Test Binary

make bin

The executable lsm_vec is placed in build/bin/. It reads a vector dataset, builds the HNSW index, runs k-NN queries, and compares results against a ground truth file.
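A comparison against ground truth is commonly summarized as recall@k. The sketch below is not the code of the lsm_vec binary; it is a minimal pure-Python illustration that assumes L2 distance and an in-memory dict of vectors in place of a ground truth file. The names brute_force_knn and recall_at_k are hypothetical.

```python
import heapq
import math


def brute_force_knn(vectors, query, k):
    """Return the ids of the k vectors closest to query (exact, L2 distance).

    vectors is a dict mapping id -> list of floats; it stands in for a
    ground truth file and is an assumption of this sketch.
    """
    nearest = heapq.nsmallest(
        k, vectors.items(), key=lambda item: math.dist(item[1], query)
    )
    return [vector_id for vector_id, _ in nearest]


def recall_at_k(returned_ids, ground_truth_ids, k):
    """Fraction of the true top-k ids that appear in the returned top-k."""
    truth = set(ground_truth_ids[:k])
    if not truth:
        return 0.0
    found = sum(1 for vector_id in returned_ids[:k] if vector_id in truth)
    return found / len(truth)


vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [5.0, 5.0]}
truth = brute_force_knn(vectors, [0.1, 0.0], k=2)
print(truth)                          # [0, 1]
print(recall_at_k([0, 2], truth, 2))  # 0.5: the approximate answer missed id 1
```

A recall of 1.0 means the approximate search returned exactly the true neighbors; raising ef_search is the usual way to move toward that.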

One-Shot Build

make          # equivalent to: make lib bin

Cleaning

make clean    # removes the build/ directory

Quick Start (Python)

Prerequisites: Aster must be built first (see Step 1 above), and Ninja must be installed (the ninja-build package on Ubuntu / Debian).

git submodule update --init --recursive   # if not done already
make aster                                 # builds lib/aster/librocksdb.a
python -m pip install .                    # builds and installs the lsm_vec module

python -m pip install . handles the entire compilation internally via scikit-build-core. You do not need to run make lib beforehand.

To verify:

python -c "import lsm_vec; print('OK')"

C++ API: Open / Close

All types live in the lsm_vec namespace. Every mutating or querying method returns a Status object (aliased from RocksDB). Call .ok() to check for success and .ToString() for an error message.

#include "lsm_vec_db.h"
using namespace lsm_vec;

// Configure
LSMVecDBOptions opts;
opts.dim = 128;                              // required
opts.vector_file_path = "./db/vectors.bin";  // required
opts.reinit = true;                          // true = start fresh

// Open
std::unique_ptr<LSMVecDB> db;
Status s = LSMVecDB::Open("./db", opts, &db);
if (!s.ok()) { /* handle error */ }

// ... use the database ...

// Close
db->Close();

LSMVecDB::Open

static Status Open(const std::string& path,
                   const LSMVecDBOptions& opts,
                   std::unique_ptr<LSMVecDB>* db);

Creates or opens a database at the given directory.

Parameter	Type	Description
`path`	`const std::string&`	Directory for database files. Created automatically if it does not exist.
`opts`	`const LSMVecDBOptions&`	Configuration. `opts.dim` must be > 0.
`db`	`std::unique_ptr<LSMVecDB>*`	On success, `*db` holds the opened database handle.

LSMVecDB::Close

Status Close();

Flushes pending writes and releases all resources. The handle is unusable after this call.

C++ API: Insert

std::vector<float> vec(128, 0.5f);
db->Insert(42, Span<float>(vec));

LSMVecDB::Insert

Status Insert(node_id_t id, Span<float> vec);

Parameter	Type	Description
`id`	`node_id_t` (`uint64_t`)	Unique identifier for this vector.
`vec`	`Span<float>`	Vector data. Length must equal `opts.dim`. A `std::vector<float>` converts implicitly.

Inserts a new vector and builds its HNSW graph connections. Returns InvalidArgument if the vector length does not equal opts.dim.

C++ API: Search

std::vector<float> query(128, 0.1f);

SearchOptions search_opts;
search_opts.k = 10;           // number of neighbors
search_opts.ef_search = 128;  // candidate pool size (higher = better recall)

std::vector<SearchResult> results;
db->SearchKnn(Span<float>(query), search_opts, &results);

for (const auto& r : results) {
    std::cout << "id=" << r.id << " dist=" << r.distance << "\n";
}

LSMVecDB::SearchKnn

Status SearchKnn(Span<float> query,
                 const SearchOptions& options,
                 std::vector<SearchResult>* out);

Parameter	Type	Description
`query`	`Span<float>`	Query vector. Length must equal `opts.dim`.
`options`	`const SearchOptions&`	Search parameters (see below).
`out`	`std::vector<SearchResult>*`	Results sorted by ascending distance, up to `k` entries.

SearchOptions

Field	Type	Default	Description
`k`	`int`	`1`	Number of nearest neighbors to return.
`ef_search`	`int`	`64`	Candidate pool size during search. Must be >= `k`. Higher values improve recall at the cost of latency.

SearchResult

Field	Type	Description
`id`	`node_id_t`	Identifier of the matched vector.
`distance`	`float`	Distance from the query vector.

C++ API: Get / Update / Delete

LSMVecDB::Get

Status Get(node_id_t id, std::vector<float>* vec);

Retrieves the vector for a given ID. vec is resized to opts.dim on success. Returns NotFound if the ID does not exist.

LSMVecDB::Update

Status Update(node_id_t id, Span<float> vec);

Replaces the vector data for an existing ID and rebuilds its graph connections. Returns NotFound if the ID does not exist.

LSMVecDB::Delete

Status Delete(node_id_t id);

Marks a vector as deleted. Deleted vectors are excluded from future search results.

LSMVecDB::printStatistics

void printStatistics() const;

Prints I/O and timing statistics to stdout. Only meaningful when opts.enable_stats = true.

C++ API: Configuration

LSMVecDBOptions

Pass this struct to LSMVecDB::Open().

LSMVecDBOptions opts;
opts.dim = 128;
opts.metric = DistanceMetric::kL2;
// ... set other fields as needed ...

Field	Type	Default	Description
`dim`	`int`	`0`	Required. Dimensionality of vectors.
`metric`	`DistanceMetric`	`kL2`	Distance metric (`kL2` or `kCosine`).
`m`	`int`	`8`	HNSW: bi-directional links created per node at layer 0.
`m_max`	`int`	`16`	HNSW: max neighbors per node at upper layers.
`ef_construction`	`float`	`32.0`	HNSW: candidate pool size during index construction.
`vec_file_capacity`	`size_t`	`100000`	Initial vector file capacity. Auto-expands with PagedVectorStorage.
`paged_max_cached_pages`	`size_t`	`4096`	Number of 4 KB pages in the user-space page cache.
`vector_storage_type`	`int`	`1`	`0` = BasicVectorStorage (flat file), `1` = PagedVectorStorage (paged + cached).
`db_target_size`	`uint64_t`	~100 GiB	Target file size hint for Aster (RocksDB).
`random_seed`	`int`	`12345`	RNG seed for HNSW level generation.
`enable_stats`	`bool`	`false`	Collect I/O and timing statistics.
`enable_batch_read`	`bool`	`true`	Group vector reads by page during search.
`reinit`	`bool`	`false`	`true` = wipe existing data; `false` = reopen.
`vector_file_path`	`string`	`""`	Path for vector storage file.
`log_file_path`	`string`	`""`	Path for log file (empty = no file logging).

Tuning tips:

Higher m and ef_construction → better recall, slower indexing.
Higher ef_search → better recall, slower queries.
Higher paged_max_cached_pages → more RAM, fewer disk reads.
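To put the page cache tip in numbers, the arithmetic can be sketched in Python. This assumes float32 components (4 bytes each) and, for the per-page estimate, that a page holds only raw vector data with no header and that vectors do not span pages. Both are assumptions of this sketch, not documented properties of PagedVectorStorage, and the function names are hypothetical.

```python
PAGE_SIZE_BYTES = 4 * 1024   # 4 KB pages, as stated in the options table
FLOAT32_BYTES = 4


def page_cache_bytes(paged_max_cached_pages):
    """RAM used by the cached page payloads alone (bookkeeping excluded)."""
    return paged_max_cached_pages * PAGE_SIZE_BYTES


def vectors_per_page(dim):
    """Upper bound on vectors per page under the assumptions stated above."""
    return PAGE_SIZE_BYTES // (dim * FLOAT32_BYTES)


print(page_cache_bytes(4096) / 2**20)  # 16.0 (MiB) for the default setting
print(vectors_per_page(128))           # 8
```

Under these assumptions the default cache covers at most 4096 * 8 = 32768 vectors of dimension 128, which gives a rough starting point when sizing paged_max_cached_pages for a larger working set.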

C++ API: Types

DistanceMetric

enum class DistanceMetric {
    kL2,      // Euclidean distance
    kCosine,  // 1 - cosine_similarity
};

node_id_t

using node_id_t = std::uint64_t;

Unique 64-bit vector identifier.

Span<T>

A lightweight, non-owning view over contiguous memory (similar to C++20 std::span). You rarely need to construct one explicitly because std::vector<float> converts to Span<float> implicitly.

std::vector<float> vec(128);
db->Insert(0, vec);  // implicit Span<float>(vec)

Python API: Installation

git submodule update --init --recursive
make aster
python -m pip install .

Verify:

python -c "import lsm_vec; print('OK')"

Python API: Open / Close

import os
import lsm_vec

opts = lsm_vec.LSMVecDBOptions()
opts.dim = 128
opts.vector_file_path = "./db/vectors.bin"
opts.reinit = True

os.makedirs("./db", exist_ok=True)
db = lsm_vec.LSMVecDB.open("./db", opts)

# ... use the database ...

db.close()

LSMVecDB.open

lsm_vec.LSMVecDB.open(path: str, opts: LSMVecDBOptions) -> LSMVecDB

Opens or creates a database. Raises ValueError on invalid arguments, RuntimeError on I/O errors.

db.close

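db.close() -> None

Mirrors the C++ Close(): flushes pending writes and releases all resources. Do not use the handle after this call.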

Python API: Insert

Accepts both Python lists and NumPy arrays.

db.insert(0, [0.1] * 128)

import numpy as np
db.insert(1, np.random.rand(128).astype(np.float32))

db.insert

db.insert(id: int, vector: list[float] | numpy.ndarray) -> None

Parameter	Type	Description
`id`	`int`	Unique vector identifier.
`vector`	`list[float]` or `numpy.ndarray` (float32, 1-D)	Must have length `opts.dim`.

Raises ValueError on dimension mismatch.
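When vectors come from untrusted input, it can help to check them before calling db.insert so that the error points at the offending record. The helper below is a hypothetical sketch, not part of the lsm_vec module. The NaN and infinity check is an extra precaution added here; nothing above says the library performs it.

```python
import math


def validate_vector(vector, dim):
    """Return the vector as a list of floats, or raise ValueError.

    Mirrors the documented length rule (length must equal opts.dim) and
    additionally rejects non-finite values.
    """
    values = [float(x) for x in vector]
    if len(values) != dim:
        raise ValueError(f"expected {dim} components, got {len(values)}")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("vector contains NaN or infinity")
    return values


print(validate_vector([1, 2, 3], dim=3))  # [1.0, 2.0, 3.0]
```

A call such as db.insert(vector_id, validate_vector(raw, opts.dim)) then fails early with a message that names the expected dimension.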

Python API: Search

# With SearchOptions
search_opts = lsm_vec.SearchOptions()
search_opts.k = 10
search_opts.ef_search = 128
results = db.search_knn([0.1] * 128, search_opts)

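# Or with k and ef_search directly; the query may also be a NumPy float32 array
import numpy as np
query_array = np.random.rand(128).astype(np.float32)
results = db.search_knn(query_array, k=10, ef_search=128)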

for r in results:
    print(f"id={r.id}  distance={r.distance:.4f}")

db.search_knn

# Option A: with SearchOptions
db.search_knn(query, opts: SearchOptions) -> list[SearchResult]

# Option B: with explicit parameters
db.search_knn(query, k: int, ef_search: int) -> list[SearchResult]

query can be a list[float] or a 1-D numpy.ndarray of float32.

SearchOptions

Property	Type	Default	Description
`k`	`int`	`1`	Number of nearest neighbors.
`ef_search`	`int`	`64`	Candidate pool size. Must be >= `k`.
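The constraint that ef_search must be at least k is easy to violate when k is chosen at runtime. A small guard like the following can catch that before the query is issued. The function name and the choice to return a dict are hypothetical; the defaults copy the table above.

```python
def make_search_params(k=1, ef_search=64):
    """Validate k and ef_search and return them as keyword arguments."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if ef_search < k:
        raise ValueError(f"ef_search ({ef_search}) must be >= k ({k})")
    return {"k": k, "ef_search": ef_search}


params = make_search_params(k=10, ef_search=128)
print(params)  # {'k': 10, 'ef_search': 128}
```

The result can be passed to the explicit-parameter form as db.search_knn(query, **params).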

SearchResult

Property	Type	Description
`id`	`int`	Vector identifier (read-only).
`distance`	`float`	Distance from query (read-only).

Python API: Get / Update / Delete

db.get

db.get(id: int) -> numpy.ndarray  # float32, 1-D

Raises KeyError if the ID does not exist.

db.update

db.update(id: int, vector: list[float] | numpy.ndarray) -> None

Raises KeyError if the ID does not exist.

db.delete

db.delete(id: int) -> None

Marks the vector as deleted. Deleted vectors are excluded from future search results.

Python API: Configuration

LSMVecDBOptions

All properties are read-write.

opts = lsm_vec.LSMVecDBOptions()
opts.dim = 128
opts.metric = lsm_vec.DistanceMetric.Cosine

Property	Type	Default	Description
`dim`	`int`	`0`	Required. Vector dimensionality.
`metric`	`DistanceMetric`	`L2`	`L2` or `Cosine`.
`m`	`int`	`8`	HNSW connections per node.
`m_max`	`int`	`16`	Max neighbors at upper layers.
`ef_construction`	`float`	`32.0`	Construction-time candidate pool.
`vec_file_capacity`	`int`	`100000`	Initial vector file capacity.
`paged_max_cached_pages`	`int`	`4096`	Page cache size (4 KB pages).
`vector_storage_type`	`int`	`1`	`0` = basic, `1` = paged.
`db_target_size`	`int`	~100 GiB	Aster target file size.
`random_seed`	`int`	`12345`	RNG seed.
`enable_stats`	`bool`	`False`	Collect statistics.
`enable_batch_read`	`bool`	`True`	Batch vector reads by page.
`reinit`	`bool`	`False`	Wipe existing data on open.
`vector_file_path`	`str`	`""`	Vector storage file path.
`log_file_path`	`str`	`""`	Log file path.

DistanceMetric

lsm_vec.DistanceMetric.L2      # Euclidean distance
lsm_vec.DistanceMetric.Cosine  # 1 - cosine similarity
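For reference, the two metrics can be written out in plain Python. This is a sketch of the textbook definitions, not the library's internal code. In particular, the text above does not say whether LSM-Vec reports the square-rooted or the squared form of the L2 distance, so treat the square root here as an assumption.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 for same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


print(l2_distance([0.0, 0.0], [3.0, 4.0]))      # 5.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal vectors)
```

Cosine distance ignores vector length, so [1, 0] and [2, 0] are at distance 0 under Cosine but at distance 1 under L2.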

Python API: Error Handling

C++ Status	Python Exception	When
`InvalidArgument`	`ValueError`	Dimension mismatch, bad parameters.
`NotFound`	`KeyError`	Vector ID not found.
Other errors	`RuntimeError`	I/O failures, database errors.
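The mapping above suggests a simple calling pattern: catch KeyError for lookups that may miss, and ValueError for inputs that may be malformed. The sketch below shows that pattern. InMemoryStandIn is a hypothetical stand-in that exists only so the example runs without the compiled lsm_vec module; get_or_none and insert_all are likewise illustrative names.

```python
class InMemoryStandIn:
    """Hypothetical stand-in for a database handle (not part of lsm_vec).

    It raises the exception types documented in the table above.
    """

    def __init__(self, dim):
        self.dim = dim
        self._vectors = {}

    def insert(self, vector_id, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} components, got {len(vector)}")
        self._vectors[vector_id] = list(vector)

    def get(self, vector_id):
        return self._vectors[vector_id]  # raises KeyError for unknown ids


def get_or_none(db, vector_id):
    """Return the stored vector, or None if the id is not found."""
    try:
        return db.get(vector_id)
    except KeyError:
        return None


def insert_all(db, items):
    """Insert (id, vector) pairs; return (id, message) for each rejected pair."""
    rejected = []
    for vector_id, vector in items:
        try:
            db.insert(vector_id, vector)
        except ValueError as exc:
            rejected.append((vector_id, str(exc)))
    return rejected


db = InMemoryStandIn(dim=3)
rejected = insert_all(db, [(1, [0.1, 0.2, 0.3]), (2, [0.1, 0.2])])
print(rejected)                                  # id 2 has the wrong length
print(get_or_none(db, 1), get_or_none(db, 99))   # stored vector, then None
```

RuntimeError is deliberately left uncaught here: I/O and database failures are usually better handled by the caller than swallowed in a helper.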

Example: C++ Embedding

Include headers from include/ and link against liblsmvec.a (static) or liblsmvec.so / liblsmvec.dylib (shared). Transitive link dependencies: rocksdb (Aster), zstd, snappy, lz4, bz2, z, pthread, dl. On macOS, jemalloc is also required.

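#include <memory>
#include <vector>

#include "lsm_vec_db.h"

lsm_vec::LSMVecDBOptions opts;
opts.dim = 128;
opts.vector_file_path = "./db/vectors.bin";

std::unique_ptr<lsm_vec::LSMVecDB> db;
auto s = lsm_vec::LSMVecDB::Open("./db", opts, &db);
if (!s.ok()) { /* handle error, e.g. log s.ToString() */ }

// Insert
std::vector<float> vec(128, 0.1f);
s = db->Insert(0, vec);

// Search (k and ef_search are set on SearchOptions, not on opts)
lsm_vec::SearchOptions search_opts;
search_opts.k = 1;
search_opts.ef_search = 64;
std::vector<lsm_vec::SearchResult> results;
s = db->SearchKnn(vec, search_opts, &results);

// Close
db->Close();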

Example: Python Quick Start

import lsm_vec
import os

opts = lsm_vec.LSMVecDBOptions()
opts.dim = 128
db_dir = "./run/db/"
opts.vector_file_path = os.path.join(db_dir, "vectors.bin")
opts.reinit = True

os.makedirs(db_dir, exist_ok=True)
db = lsm_vec.LSMVecDB.open(db_dir, opts)

db.insert(1, [0.1] * 128)

# Search (k and ef_search are passed per query)
results = db.search_knn([0.1] * 128, k=1, ef_search=64)
print(results[0].id, results[0].distance)

db.close()

Example: NumPy

import os

import numpy as np
import lsm_vec

opts = lsm_vec.LSMVecDBOptions()
opts.dim = 128
opts.vector_file_path = "./run/db/vectors.bin"
opts.reinit = True

os.makedirs("./run/db/", exist_ok=True)
db = lsm_vec.LSMVecDB.open("./run/db/", opts)

vec = np.random.rand(128).astype(np.float32)
db.insert(42, vec)

# k and ef_search are search parameters, not LSMVecDBOptions fields
query = np.random.rand(128).astype(np.float32)
results = db.search_knn(query, k=5, ef_search=100)
for r in results:
    print(f"id={r.id}  distance={r.distance:.4f}")

db.close()

Implementation Details

LSMVecDB is the public API layer, handling database lifecycle (Open/Close), input validation, metadata serialization, and deleted-ID tracking. It delegates all indexing operations to LSMVec.
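One common way to realize deleted-ID tracking is a set of tombstoned ids that is consulted when search results are assembled. The sketch below illustrates that idea only; it is an assumption about the general technique, not a description of how LSMVecDB implements it, and drop_deleted is a hypothetical name.

```python
def drop_deleted(candidates, deleted_ids, k):
    """Keep the first k candidates whose id is not marked as deleted.

    candidates: list of (id, distance) pairs sorted by ascending distance.
    deleted_ids: set of ids that have been marked as deleted.
    """
    live = [(vid, dist) for vid, dist in candidates if vid not in deleted_ids]
    return live[:k]


candidates = [(7, 0.10), (3, 0.25), (9, 0.40), (1, 0.90)]
print(drop_deleted(candidates, deleted_ids={3}, k=2))  # [(7, 0.1), (9, 0.4)]
```

With this approach the candidate list needs more than k entries if k live results are wanted after filtering, which is one reason a candidate pool larger than k is useful.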

LSMVec implements the HNSW algorithm. Layer-0 graph edges are persisted in Aster's RocksGraph (an LSM-tree backed graph store), while upper-layer edges are kept in an in-memory map for fast hierarchical navigation.
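The split between a persisted layer 0 and in-memory upper layers works because, in HNSW, only a small fraction of nodes ever reach an upper layer. The sketch below draws levels with the standard HNSW rule, level = floor(-ln(u) / ln(m)) for u uniform in (0, 1]. Whether LSM-Vec uses exactly this multiplier is an assumption; the defaults m=8 and seed=12345 are taken from the options table, and the function names are hypothetical.

```python
import math
import random


def sample_level(rng, m):
    """Draw one node level from the exponentially decaying HNSW distribution."""
    level_mult = 1.0 / math.log(m)
    u = 1.0 - rng.random()  # in (0, 1], so log(u) is always defined
    return int(-math.log(u) * level_mult)


def level_histogram(num_nodes, m=8, seed=12345):
    """Count how many of num_nodes land on each level for a fixed seed."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(num_nodes):
        level = sample_level(rng, m)
        counts[level] = counts.get(level, 0) + 1
    return counts


histogram = level_histogram(1000)
for level in sorted(histogram):
    print(level, histogram[level])
```

With m = 8 the expected share of nodes that stay on layer 0 is 7/8, so the upper-layer edge map stays small relative to the layer-0 graph. Fixing the seed makes the level assignment reproducible across runs.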

Raw vector data is managed by IVectorStorage, which offers two backends: BasicVectorStorage, a contiguous flat file addressed by ID offset, and PagedVectorStorage, a layout managed in 4 KB pages with a user-space FIFO page cache. The paged layout co-locates vectors sharing the same HNSW entry point for better spatial locality.
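The FIFO eviction policy mentioned above can be illustrated with a short sketch. This is a generic FIFO cache written for illustration, not the PagedVectorStorage implementation; the class name, the load_page callback, and the hit and miss counters are all hypothetical.

```python
from collections import OrderedDict


class FifoPageCache:
    """Fixed-capacity cache that evicts the page that was loaded first."""

    def __init__(self, capacity, load_page):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._load_page = load_page   # callback: page_id -> page bytes
        self._pages = OrderedDict()   # insertion order doubles as FIFO order
        self.hits = 0
        self.misses = 0

    def get(self, page_id):
        if page_id in self._pages:
            self.hits += 1
            return self._pages[page_id]  # FIFO: a hit does not refresh the entry
        self.misses += 1
        if len(self._pages) >= self._capacity:
            self._pages.popitem(last=False)  # evict the oldest loaded page
        page = self._load_page(page_id)
        self._pages[page_id] = page
        return page


cache = FifoPageCache(capacity=2, load_page=lambda page_id: bytes([page_id]) * 4)
for page_id in (0, 1, 0, 2, 0):
    cache.get(page_id)
print(cache.hits, cache.misses)  # 1 4
```

In the usage above, page 0 is evicted when page 2 arrives even though it was just read, which is the difference between FIFO and LRU. FIFO needs no bookkeeping on hits, which keeps the read path cheap.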