Quick Start (C++)

Prerequisites

Compiler & Tools

System Libraries

Ubuntu / Debian

sudo apt-get update
sudo apt-get install -y \
  build-essential cmake libboost-dev \
  libzstd-dev libsnappy-dev liblz4-dev libbz2-dev zlib1g-dev

macOS (Homebrew)

brew install cmake boost zstd snappy lz4 bzip2

Step 1 -- Build Aster

Aster is included as a Git submodule. Initialize it and build the static library:

git submodule update --init --recursive
make aster

This produces lib/aster/librocksdb.a. The build typically takes a few minutes and uses all available cores automatically.

Step 2 -- Build the LSM-Vec Libraries

make lib

Outputs:

| File | Description |
| --- | --- |
| build/lib/liblsmvec.a | Static library |
| build/lib/liblsmvec.so (Linux) / liblsmvec.dylib (macOS) | Shared library |

Step 3 -- Build the Test Binary

make bin

The executable lsm_vec is placed in build/bin/. It reads a vector dataset, builds the HNSW index, runs k-NN queries, and compares results against a ground truth file.

One-Shot Build

make          # equivalent to: make lib bin

Cleaning

make clean    # removes the build/ directory

Quick Start (Python)

Prerequisites: Aster must be built first (see Step 1 above), and ninja-build must be installed.

git submodule update --init --recursive   # if not done already
make aster                                 # builds lib/aster/librocksdb.a
python -m pip install .                    # builds and installs the lsm_vec module

python -m pip install . handles the entire compilation internally via scikit-build-core. You do not need to run make lib beforehand.

To verify:

python -c "import lsm_vec; print('OK')"

C++ API: Open / Close

All types live in the lsm_vec namespace. Every mutating or querying method returns a Status object (aliased from RocksDB). Call .ok() to check for success and .ToString() for an error message.

#include "lsm_vec_db.h"
using namespace lsm_vec;

// Configure
LSMVecDBOptions opts;
opts.dim = 128;                              // required
opts.vector_file_path = "./db/vectors.bin";  // required
opts.reinit = true;                          // true = start fresh

// Open
std::unique_ptr<LSMVecDB> db;
Status s = LSMVecDB::Open("./db", opts, &db);
if (!s.ok()) { /* handle error */ }

// ... use the database ...

// Close
db->Close();

LSMVecDB::Open

static Status Open(const std::string& path,
                   const LSMVecDBOptions& opts,
                   std::unique_ptr<LSMVecDB>* db);

Creates or opens a database at the given directory.

| Parameter | Type | Description |
| --- | --- | --- |
| path | const std::string& | Directory for database files. Created automatically if it does not exist. |
| opts | const LSMVecDBOptions& | Configuration. opts.dim must be > 0. |
| db | std::unique_ptr<LSMVecDB>* | On success, *db holds the opened database handle. |

LSMVecDB::Close

Status Close();

Flushes pending writes and releases all resources. The handle is unusable after this call.

C++ API: Insert

std::vector<float> vec(128, 0.5f);
db->Insert(42, Span<float>(vec));

LSMVecDB::Insert

Status Insert(node_id_t id, Span<float> vec);

| Parameter | Type | Description |
| --- | --- | --- |
| id | node_id_t (uint64_t) | Unique identifier for this vector. |
| vec | Span<float> | Vector data. Length must equal opts.dim. A std::vector<float> converts implicitly. |

Inserts a new vector and builds its HNSW graph connections. Returns InvalidArgument if the dimension mismatches.

C++ API: Search

std::vector<float> query(128, 0.1f);

SearchOptions search_opts;
search_opts.k = 10;           // number of neighbors
search_opts.ef_search = 128;  // candidate pool size (higher = better recall)

std::vector<SearchResult> results;
db->SearchKnn(Span<float>(query), search_opts, &results);

for (const auto& r : results) {
    std::cout << "id=" << r.id << " dist=" << r.distance << "\n";
}

LSMVecDB::SearchKnn

Status SearchKnn(Span<float> query,
                 const SearchOptions& options,
                 std::vector<SearchResult>* out);

| Parameter | Type | Description |
| --- | --- | --- |
| query | Span<float> | Query vector. Length must equal opts.dim. |
| options | const SearchOptions& | Search parameters (see below). |
| out | std::vector<SearchResult>* | Results sorted by ascending distance, up to k entries. |

SearchOptions

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| k | int | 1 | Number of nearest neighbors to return. |
| ef_search | int | 64 | Candidate pool size during search. Must be >= k. Higher values improve recall at the cost of latency. |

SearchResult

| Field | Type | Description |
| --- | --- | --- |
| id | node_id_t | Identifier of the matched vector. |
| distance | float | Distance from the query vector. |

C++ API: Get / Update / Delete

LSMVecDB::Get

Status Get(node_id_t id, std::vector<float>* vec);

Retrieves the vector for a given ID. vec is resized to opts.dim on success. Returns NotFound if the ID does not exist.

LSMVecDB::Update

Status Update(node_id_t id, Span<float> vec);

Replaces the vector data for an existing ID and rebuilds its graph connections. Returns NotFound if the ID does not exist.

LSMVecDB::Delete

Status Delete(node_id_t id);

Marks a vector as deleted. Deleted vectors are excluded from future search results.

LSMVecDB::printStatistics

void printStatistics() const;

Prints I/O and timing statistics to stdout. Only meaningful when opts.enable_stats = true.

C++ API: Configuration

LSMVecDBOptions

Pass this struct to LSMVecDB::Open().

LSMVecDBOptions opts;
opts.dim = 128;
opts.metric = DistanceMetric::kL2;
// ... set other fields as needed ...

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| dim | int | 0 | Required. Dimensionality of vectors. |
| metric | DistanceMetric | kL2 | Distance metric (kL2 or kCosine). |
| m | int | 8 | HNSW: bi-directional links created per node at layer 0. |
| m_max | int | 16 | HNSW: max neighbors per node at upper layers. |
| ef_construction | float | 32.0 | HNSW: candidate pool size during index construction. |
| vec_file_capacity | size_t | 100000 | Initial vector file capacity. Auto-expands with PagedVectorStorage. |
| paged_max_cached_pages | size_t | 4096 | Number of 4 KB pages in the user-space page cache. |
| vector_storage_type | int | 1 | 0 = BasicVectorStorage (flat file), 1 = PagedVectorStorage (paged + cached). |
| db_target_size | uint64_t | ~100 GiB | Target file size hint for Aster (RocksDB). |
| random_seed | int | 12345 | RNG seed for HNSW level generation. |
| enable_stats | bool | false | Collect I/O and timing statistics. |
| enable_batch_read | bool | true | Group vector reads by page during search. |
| reinit | bool | false | true = wipe existing data; false = reopen. |
| vector_file_path | string | "" | Path for vector storage file. |
| log_file_path | string | "" | Path for log file (empty = no file logging). |
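
For a rough sense of what paged_max_cached_pages means in memory terms, the arithmetic below multiplies out the defaults from the table. It is only a sketch: it assumes vector components are stored as 4-byte floats and that pages carry no header, neither of which the table confirms, and the helper names are illustrative rather than part of the library.

```python
PAGE_SIZE_BYTES = 4 * 1024   # 4 KB pages, as stated for paged_max_cached_pages
BYTES_PER_FLOAT = 4          # assumption: float32 components, no per-page header


def cache_size_bytes(max_cached_pages: int) -> int:
    """Upper bound on the memory held by the user-space page cache."""
    return max_cached_pages * PAGE_SIZE_BYTES


def vectors_per_page(dim: int) -> int:
    """How many whole vectors of the given dimensionality fit in one page."""
    return PAGE_SIZE_BYTES // (dim * BYTES_PER_FLOAT)


print(cache_size_bytes(4096) // (1024 * 1024), "MiB")  # 16 MiB
print(vectors_per_page(128))                            # 8
```

Under these assumptions the default cache holds at most 16 MiB, or roughly 32,768 vectors at dim = 128.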

Tuning tips:

C++ API: Types

DistanceMetric

enum class DistanceMetric {
    kL2,      // Euclidean distance
    kCosine,  // 1 - cosine_similarity
};

node_id_t

using node_id_t = std::uint64_t;

Unique 64-bit vector identifier.

Span<T>

A lightweight, non-owning view over contiguous memory (similar to C++20 std::span). You rarely need to construct one explicitly because std::vector<float> converts to Span<float> implicitly.

std::vector<float> vec(128);
db->Insert(0, vec);  // implicit Span<float>(vec)

Python API: Installation

git submodule update --init --recursive
make aster
python -m pip install .

Verify:

python -c "import lsm_vec; print('OK')"

Python API: Open / Close

import os
import lsm_vec

opts = lsm_vec.LSMVecDBOptions()
opts.dim = 128
opts.vector_file_path = "./db/vectors.bin"
opts.reinit = True

os.makedirs("./db", exist_ok=True)
db = lsm_vec.LSMVecDB.open("./db", opts)

# ... use the database ...

db.close()

LSMVecDB.open

lsm_vec.LSMVecDB.open(path: str, opts: LSMVecDBOptions) -> LSMVecDB

Opens or creates a database. Raises ValueError on invalid arguments, RuntimeError on I/O errors.

db.close

db.close() -> None

Flushes pending writes and releases all resources. The handle is unusable after this call.

Python API: Insert

Accepts both Python lists and NumPy arrays.

db.insert(0, [0.1] * 128)

import numpy as np
db.insert(1, np.random.rand(128).astype(np.float32))

db.insert

db.insert(id: int, vector: list[float] | numpy.ndarray) -> None

| Parameter | Type | Description |
| --- | --- | --- |
| id | int | Unique vector identifier. |
| vector | list[float] or numpy.ndarray (float32, 1-D) | Must have length opts.dim. |

Raises ValueError on dimension mismatch.

Python API: Search

import numpy as np

# With SearchOptions
search_opts = lsm_vec.SearchOptions()
search_opts.k = 10
search_opts.ef_search = 128
results = db.search_knn([0.1] * 128, search_opts)

# Or with k and ef_search directly
query_array = np.random.rand(128).astype(np.float32)
results = db.search_knn(query_array, k=10, ef_search=128)

for r in results:
    print(f"id={r.id}  distance={r.distance:.4f}")

db.search_knn

# Option A: with SearchOptions
db.search_knn(query, opts: SearchOptions) -> list[SearchResult]

# Option B: with explicit parameters
db.search_knn(query, k: int, ef_search: int) -> list[SearchResult]

query can be a list[float] or a 1-D numpy.ndarray of float32.

SearchOptions

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| k | int | 1 | Number of nearest neighbors. |
| ef_search | int | 64 | Candidate pool size. Must be >= k. |
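
Because ef_search must be at least k, callers that compute k at runtime may want to derive a valid ef_search before building the search options. The helper below is a hypothetical convenience, not part of lsm_vec; it only encodes the constraint stated in the table and uses the documented default of 64.

```python
def effective_ef_search(k: int, ef_search: int = 64) -> int:
    """Return an ef_search value that satisfies ef_search >= k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return max(k, ef_search)


print(effective_ef_search(10))        # 64
print(effective_ef_search(200))       # 200
print(effective_ef_search(10, 128))   # 128
```

The result can be assigned to search_opts.ef_search or passed as the ef_search argument of db.search_knn.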

SearchResult

| Property | Type | Description |
| --- | --- | --- |
| id | int | Vector identifier (read-only). |
| distance | float | Distance from query (read-only). |
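
Since a larger ef_search trades latency for recall, it can be useful to measure recall against exact neighbors on a small sample. The following is a minimal sketch using a brute-force Euclidean scan in pure Python; brute_force_knn and recall_at_k are illustrative names, and the approx list stands in for the ids that db.search_knn would return.

```python
import math


def brute_force_knn(vectors, query, k):
    """Exact k nearest ids by Euclidean distance; vectors maps id -> list of floats."""
    ranked = sorted((math.dist(vec, query), vid) for vid, vec in vectors.items())
    return [vid for _, vid in ranked[:k]]


def recall_at_k(found_ids, true_ids):
    """Fraction of the exact neighbors that the approximate search also returned."""
    if not true_ids:
        return 1.0
    return len(set(found_ids) & set(true_ids)) / len(true_ids)


vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 2.0], 3: [5.0, 5.0]}
truth = brute_force_knn(vectors, [0.1, 0.0], k=2)   # [0, 1]
approx = [0, 2]   # stand-in for [r.id for r in db.search_knn(...)]
print(recall_at_k(approx, truth))                   # 0.5
```

Averaging this value over many queries gives a recall figure that can be tracked while adjusting ef_search.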

Python API: Get / Update / Delete

db.get

db.get(id: int) -> numpy.ndarray  # float32, 1-D

Raises KeyError if the ID does not exist.

db.update

db.update(id: int, vector: list[float] | numpy.ndarray) -> None

Raises KeyError if the ID does not exist.

db.delete

db.delete(id: int) -> None

Marks a vector as deleted. Deleted vectors are excluded from future search results.
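
Because db.update raises KeyError for an unknown ID, an insert-or-update helper can be layered on top of the documented calls. The sketch below is one possible approach; upsert is a hypothetical helper, and FakeDB is an in-memory stand-in that only mimics the documented exceptions so the example runs without the lsm_vec module.

```python
def upsert(db, vector_id, vector):
    """Update the vector if the id exists, otherwise insert it."""
    try:
        db.update(vector_id, vector)
        return "updated"
    except KeyError:
        db.insert(vector_id, vector)
        return "inserted"


class FakeDB:
    """Hypothetical in-memory stand-in for an opened database handle."""

    def __init__(self):
        self.rows = {}

    def insert(self, vector_id, vector):
        self.rows[vector_id] = list(vector)

    def update(self, vector_id, vector):
        if vector_id not in self.rows:
            raise KeyError(vector_id)
        self.rows[vector_id] = list(vector)


db = FakeDB()
print(upsert(db, 7, [0.1, 0.2]))  # inserted
print(upsert(db, 7, [0.3, 0.4]))  # updated
```

With a real handle, the same function would be called as upsert(db, id, vector) on the object returned by lsm_vec.LSMVecDB.open.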

Python API: Configuration

LSMVecDBOptions

All properties are read-write.

opts = lsm_vec.LSMVecDBOptions()
opts.dim = 128
opts.metric = lsm_vec.DistanceMetric.Cosine

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| dim | int | 0 | Required. Vector dimensionality. |
| metric | DistanceMetric | L2 | L2 or Cosine. |
| m | int | 8 | HNSW connections per node. |
| m_max | int | 16 | Max neighbors at upper layers. |
| ef_construction | float | 32.0 | Construction-time candidate pool. |
| vec_file_capacity | int | 100000 | Initial vector file capacity. |
| paged_max_cached_pages | int | 4096 | Page cache size (4 KB pages). |
| vector_storage_type | int | 1 | 0 = basic, 1 = paged. |
| db_target_size | int | ~100 GiB | Aster target file size. |
| random_seed | int | 12345 | RNG seed. |
| enable_stats | bool | False | Collect statistics. |
| enable_batch_read | bool | True | Batch vector reads by page. |
| reinit | bool | False | Wipe existing data on open. |
| vector_file_path | str | "" | Vector storage file path. |
| log_file_path | str | "" | Log file path. |

DistanceMetric

lsm_vec.DistanceMetric.L2      # Euclidean distance
lsm_vec.DistanceMetric.Cosine  # 1 - cosine similarity
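
To make the two metrics concrete, here is a small pure-Python sketch of what they compute. It assumes L2 means the plain (non-squared) Euclidean distance, which this document does not spell out, and it does not guard against zero-length vectors in the cosine case; the function names are illustrative.

```python
import math


def l2_distance(a, b):
    """Euclidean distance (assumed non-squared)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; undefined if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a, b = [1.0, 0.0], [0.0, 1.0]
print(l2_distance(a, b))                # sqrt(2), about 1.4142
print(cosine_distance(a, b))            # 1.0 (orthogonal)
print(cosine_distance(a, [2.0, 0.0]))   # 0.0 (same direction)
```

Note that cosine distance ignores vector length, so [1, 0] and [2, 0] are at distance 0 under Cosine but at distance 1 under L2.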

Python API: Error Handling

| C++ Status | Python Exception | When |
| --- | --- | --- |
| InvalidArgument | ValueError | Dimension mismatch, bad parameters. |
| NotFound | KeyError | Vector ID not found. |
| Other errors | RuntimeError | I/O failures, database errors. |
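
The mapping in the table can be pictured as a simple dispatch on the status code. The sketch below is an illustration of that rule only, not the binding's actual code; raise_for_status and the string status names are hypothetical.

```python
def raise_for_status(code: str, message: str) -> None:
    """Raise the Python exception that the table associates with a status code."""
    if code == "OK":
        return
    if code == "InvalidArgument":
        raise ValueError(message)
    if code == "NotFound":
        raise KeyError(message)
    raise RuntimeError(message)   # every other non-OK status


for code in ("InvalidArgument", "NotFound", "IOError"):
    try:
        raise_for_status(code, "demo")
    except (ValueError, KeyError, RuntimeError) as exc:
        print(code, "->", type(exc).__name__)
```

In calling code this means catching ValueError for bad input, KeyError for missing IDs, and RuntimeError as the fallback for storage-level failures.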

Example: C++ Embedding

Include headers from include/ and link against liblsmvec.a (static) or liblsmvec.so / liblsmvec.dylib (shared). Transitive link dependencies: rocksdb (Aster), zstd, snappy, lz4, bz2, z, pthread, dl. On macOS, jemalloc is also required.

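#include <memory>
#include <vector>

#include "lsm_vec_db.h"

int main() {
    lsm_vec::LSMVecDBOptions opts;
    opts.dim = 128;
    opts.vector_file_path = "./db/vectors.bin";

    std::unique_ptr<lsm_vec::LSMVecDB> db;
    auto s = lsm_vec::LSMVecDB::Open("./db", opts, &db);
    if (!s.ok()) return 1;  // s.ToString() describes the failure

    // Insert
    std::vector<float> vec(128, 0.1f);
    s = db->Insert(0, vec);
    if (!s.ok()) return 1;

    // Search: k and ef_search are passed through SearchOptions
    lsm_vec::SearchOptions search_opts;
    search_opts.k = 1;
    search_opts.ef_search = 64;
    std::vector<lsm_vec::SearchResult> results;
    s = db->SearchKnn(vec, search_opts, &results);

    // Close
    db->Close();
    return s.ok() ? 0 : 1;
}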

Example: Python Quick Start

import os

import lsm_vec

db_dir = "./run/db/"
os.makedirs(db_dir, exist_ok=True)

opts = lsm_vec.LSMVecDBOptions()
opts.dim = 128
opts.vector_file_path = os.path.join(db_dir, "vectors.bin")
opts.reinit = True

db = lsm_vec.LSMVecDB.open(db_dir, opts)

db.insert(1, [0.1] * 128)

# Search: k and ef_search are passed per query
results = db.search_knn([0.1] * 128, k=1, ef_search=64)
print(results[0].id, results[0].distance)

db.close()

Example: NumPy

import os

import numpy as np
import lsm_vec

opts = lsm_vec.LSMVecDBOptions()
opts.dim = 128
opts.vector_file_path = "./run/db/vectors.bin"
opts.reinit = True

os.makedirs("./run/db/", exist_ok=True)
db = lsm_vec.LSMVecDB.open("./run/db/", opts)

vec = np.random.rand(128).astype(np.float32)
db.insert(42, vec)

# k and ef_search are search parameters, not LSMVecDBOptions fields
query = np.random.rand(128).astype(np.float32)
results = db.search_knn(query, k=5, ef_search=100)
for r in results:
    print(f"id={r.id}  distance={r.distance:.4f}")

db.close()

Implementation Details

LSMVecDB is the public API layer, handling database lifecycle (Open/Close), input validation, metadata serialization, and deleted-ID tracking. It delegates all indexing operations to LSMVec.

LSMVec implements the HNSW algorithm. Layer-0 graph edges are persisted in Aster's RocksGraph (an LSM-tree backed graph store), while upper-layer edges are kept in an in-memory map for fast hierarchical navigation.
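
Keeping only the upper layers in memory is workable because, in standard HNSW, most nodes never leave layer 0. The sketch below draws levels with the textbook rule floor(-ln(U) / ln(m)); whether LSM-Vec uses exactly this formula is an assumption, and draw_level is an illustrative name. The seed and m are the documented defaults.

```python
import math
import random


def draw_level(rng: random.Random, m: int) -> int:
    """Textbook HNSW level draw: floor(-ln(U) / ln(m)) with U uniform in (0, 1]."""
    u = 1.0 - rng.random()   # random() is in [0, 1), so u is in (0, 1]
    return int(-math.log(u) / math.log(m))


rng = random.Random(12345)   # default random_seed from the options table
levels = [draw_level(rng, m=8) for _ in range(100_000)]
layer0_only = sum(1 for lv in levels if lv == 0) / len(levels)
print(f"{layer0_only:.3f}")  # close to 1 - 1/8 = 0.875
```

With m = 8, about one node in eight reaches layer 1 or above under this rule, so the in-memory map covers a small fraction of the graph.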

Raw vector data is managed by IVectorStorage, which offers two backends: BasicVectorStorage (a contiguous flat file addressed by ID offset) and PagedVectorStorage (4 KB page-managed layout with a user-space FIFO page cache that co-locates vectors sharing the same HNSW entry point for better spatial locality).
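
The FIFO eviction policy mentioned for PagedVectorStorage can be illustrated with a few lines of Python. This is a simplified sketch, not the library's implementation: FifoPageCache and load_page are hypothetical names, and the co-location of vectors by HNSW entry point is not modeled.

```python
from collections import OrderedDict


class FifoPageCache:
    """Fixed-capacity page cache that evicts pages in the order they were loaded."""

    def __init__(self, max_pages, load_page):
        self.max_pages = max_pages
        self.load_page = load_page   # callable: page_id -> bytes
        self.pages = OrderedDict()   # insertion order doubles as eviction order
        self.hits = 0
        self.misses = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.hits += 1           # FIFO: a hit does not refresh the page's position
            return self.pages[page_id]
        self.misses += 1
        if len(self.pages) >= self.max_pages:
            self.pages.popitem(last=False)   # evict the oldest loaded page
        self.pages[page_id] = self.load_page(page_id)
        return self.pages[page_id]


cache = FifoPageCache(max_pages=2, load_page=lambda pid: bytes(4096))
for pid in (0, 1, 0, 2, 0):
    cache.get(pid)
print(cache.hits, cache.misses)  # 1 4
```

An LRU cache would score 2 hits on the same access pattern, because the second access to page 0 would protect it from eviction; FIFO gives up that benefit in exchange for not updating any ordering on a hit.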

LSMVecDB (Public API)
  -> LSMVec (HNSW Index)
       -> RocksGraph (Aster)
       -> IVectorStorage (Basic / Paged)