Quick Start (C++)
Prerequisites
Compiler & Tools
- C++17 compiler (GCC 8+ or Clang 10+)
- CMake >= 3.10
- GNU Make
- Boost (headers only)
System Libraries
Ubuntu / Debian
sudo apt-get update
sudo apt-get install -y \
build-essential cmake libboost-dev \
libzstd-dev libsnappy-dev liblz4-dev libbz2-dev zlib1g-dev
macOS (Homebrew)
brew install cmake boost zstd snappy lz4 bzip2
Step 1 -- Build Aster
Aster is included as a Git submodule. Initialize it and build the static library:
git submodule update --init --recursive
make aster
This produces lib/aster/librocksdb.a. The build typically takes a few minutes and uses all available cores automatically.
Step 2 -- Build the LSM-Vec Libraries
make lib
Outputs:
| File | Description |
|---|---|
| build/lib/liblsmvec.a | Static library |
| build/lib/liblsmvec.so (Linux) / liblsmvec.dylib (macOS) | Shared library |
Step 3 -- Build the Test Binary
make bin
The executable lsm_vec is placed in build/bin/. It reads a vector dataset, builds the HNSW index, runs k-NN queries, and compares results against a ground truth file.
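A comparison against ground truth is commonly summarized as recall@k. The sketch below shows that calculation in Python; it is not the lsm_vec binary's actual code, and the name recall_at_k and the list-of-ID-lists layout are assumptions made for illustration.

```python
def recall_at_k(results, ground_truth, k):
    """Average fraction of the true k nearest neighbors that were returned.

    results[i] and ground_truth[i] are lists of vector IDs for query i
    (a hypothetical layout; the real file formats are not described here).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(results) != len(ground_truth):
        raise ValueError("results and ground_truth must cover the same queries")
    if not results:
        return 0.0
    total = 0.0
    for found, truth in zip(results, ground_truth):
        true_ids = set(truth[:k])
        hits = len(true_ids.intersection(found[:k]))
        total += hits / k
    return total / len(results)


found = [[1, 2, 3], [7, 8, 9]]
truth = [[1, 2, 4], [7, 8, 9]]
print(f"recall@3 = {recall_at_k(found, truth, k=3):.3f}")
```

The first query recovers two of its three true neighbors and the second recovers all three, so the average is 5/6.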
One-Shot Build
make # equivalent to: make lib bin
Cleaning
make clean # removes the build/ directory
Quick Start (Python)
Prerequisites: build Aster first (see Step 1 above) and install Ninja (the ninja-build package on Ubuntu / Debian).
git submodule update --init --recursive # if not done already
make aster # builds lib/aster/librocksdb.a
python -m pip install . # builds and installs the lsm_vec module
python -m pip install . handles the entire compilation internally via scikit-build-core. You do not need to run make lib beforehand.
To verify:
python -c "import lsm_vec; print('OK')"
C++ API: Open / Close
All types live in the lsm_vec namespace. Every mutating or querying method returns a Status object (aliased from RocksDB). Call .ok() to check for success and .ToString() for an error message.
#include "lsm_vec_db.h"
using namespace lsm_vec;
// Configure
LSMVecDBOptions opts;
opts.dim = 128; // required
opts.vector_file_path = "./db/vectors.bin"; // required
opts.reinit = true; // true = start fresh
// Open
std::unique_ptr<LSMVecDB> db;
Status s = LSMVecDB::Open("./db", opts, &db);
if (!s.ok()) { /* handle error */ }
// ... use the database ...
// Close
db->Close();
LSMVecDB::Open
static Status Open(const std::string& path,
const LSMVecDBOptions& opts,
std::unique_ptr<LSMVecDB>* db);
Creates or opens a database at the given directory.
| Parameter | Type | Description |
|---|---|---|
| path | const std::string& | Directory for database files. Created automatically if it does not exist. |
| opts | const LSMVecDBOptions& | Configuration. opts.dim must be > 0. |
| db | std::unique_ptr<LSMVecDB>* | On success, *db holds the opened database handle. |
LSMVecDB::Close
Status Close();
Flushes pending writes and releases all resources. The handle is unusable after this call.
C++ API: Insert
std::vector<float> vec(128, 0.5f);
db->Insert(42, Span<float>(vec));
LSMVecDB::Insert
Status Insert(node_id_t id, Span<float> vec);
| Parameter | Type | Description |
|---|---|---|
| id | node_id_t (uint64_t) | Unique identifier for this vector. |
| vec | Span<float> | Vector data. Length must equal opts.dim. A std::vector<float> converts implicitly. |
Inserts a new vector and builds its HNSW graph connections. Returns InvalidArgument if the dimension mismatches.
C++ API: Search
std::vector<float> query(128, 0.1f);
SearchOptions search_opts;
search_opts.k = 10; // number of neighbors
search_opts.ef_search = 128; // candidate pool size (higher = better recall)
std::vector<SearchResult> results;
db->SearchKnn(Span<float>(query), search_opts, &results);
for (const auto& r : results) {
std::cout << "id=" << r.id << " dist=" << r.distance << "\n";
}
LSMVecDB::SearchKnn
Status SearchKnn(Span<float> query,
const SearchOptions& options,
std::vector<SearchResult>* out);
| Parameter | Type | Description |
|---|---|---|
| query | Span<float> | Query vector. Length must equal opts.dim. |
| options | const SearchOptions& | Search parameters (see below). |
| out | std::vector<SearchResult>* | Results sorted by ascending distance, up to k entries. |
SearchOptions
| Field | Type | Default | Description |
|---|---|---|---|
| k | int | 1 | Number of nearest neighbors to return. |
| ef_search | int | 64 | Candidate pool size during search. Must be >= k. Higher values improve recall at the cost of latency. |
SearchResult
| Field | Type | Description |
|---|---|---|
| id | node_id_t | Identifier of the matched vector. |
| distance | float | Distance from the query vector. |
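To make the result contract concrete (up to k entries, sorted by ascending distance), here is a minimal exact k-NN sketch in Python. It is a brute-force reference, not the HNSW search the library performs; brute_force_knn and the dict-of-lists input are hypothetical names chosen for illustration, and L2 distance is assumed.

```python
import heapq
import math


def brute_force_knn(vectors, query, k):
    """Return up to k (id, distance) pairs sorted by ascending L2 distance.

    vectors maps an integer ID to a list of floats (a hypothetical in-memory
    stand-in for the database contents).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Pair each distance with its ID so ties are broken by the smaller ID.
    scored = ((math.dist(vec, query), vec_id) for vec_id, vec in vectors.items())
    nearest = heapq.nsmallest(k, scored)
    return [(vec_id, dist) for dist, vec_id in nearest]


data = {1: [0.0, 0.0], 2: [3.0, 4.0], 3: [1.0, 0.0]}
for vec_id, dist in brute_force_knn(data, [0.0, 0.0], k=2):
    print(f"id={vec_id} dist={dist}")
```

A reference like this is useful for generating ground truth on small datasets when checking the recall of an approximate index.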
C++ API: Get / Update / Delete
LSMVecDB::Get
Status Get(node_id_t id, std::vector<float>* vec);
Retrieves the vector for a given ID. vec is resized to opts.dim on success. Returns NotFound if the ID does not exist.
LSMVecDB::Update
Status Update(node_id_t id, Span<float> vec);
Replaces the vector data for an existing ID and rebuilds its graph connections. Returns NotFound if the ID does not exist.
LSMVecDB::Delete
Status Delete(node_id_t id);
Marks a vector as deleted. Deleted vectors are excluded from future search results.
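One common way to get this behavior is to keep a set of deleted IDs and filter them out of candidate results. The sketch below illustrates that idea only; TombstoneFilter is a hypothetical class, and how LSM-Vec tracks deleted IDs internally is not specified in this section.

```python
class TombstoneFilter:
    """Tracks deleted IDs and hides them from result lists (illustrative only)."""

    def __init__(self):
        self._deleted = set()

    def delete(self, vec_id):
        # Marking is cheap: nothing is physically removed here.
        self._deleted.add(vec_id)

    def visible(self, results, k):
        """Keep the first k (id, distance) pairs whose ID is not deleted."""
        kept = [(vec_id, dist) for vec_id, dist in results
                if vec_id not in self._deleted]
        return kept[:k]


tombstones = TombstoneFilter()
tombstones.delete(2)
candidates = [(1, 0.1), (2, 0.2), (3, 0.3)]
print(tombstones.visible(candidates, k=2))
```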
LSMVecDB::printStatistics
void printStatistics() const;
Prints I/O and timing statistics to stdout. Only meaningful when opts.enable_stats = true.
C++ API: Configuration
LSMVecDBOptions
Pass this struct to LSMVecDB::Open().
LSMVecDBOptions opts;
opts.dim = 128;
opts.metric = DistanceMetric::kL2;
// ... set other fields as needed ...
| Field | Type | Default | Description |
|---|---|---|---|
| dim | int | 0 | Required. Dimensionality of vectors. |
| metric | DistanceMetric | kL2 | Distance metric (kL2 or kCosine). |
| m | int | 8 | HNSW: bi-directional links created per node at layer 0. |
| m_max | int | 16 | HNSW: max neighbors per node at upper layers. |
| ef_construction | float | 32.0 | HNSW: candidate pool size during index construction. |
| vec_file_capacity | size_t | 100000 | Initial vector file capacity. Auto-expands with PagedVectorStorage. |
| paged_max_cached_pages | size_t | 4096 | Number of 4 KB pages in the user-space page cache. |
| vector_storage_type | int | 1 | 0 = BasicVectorStorage (flat file), 1 = PagedVectorStorage (paged + cached). |
| db_target_size | uint64_t | ~100 GiB | Target file size hint for Aster (RocksDB). |
| random_seed | int | 12345 | RNG seed for HNSW level generation. |
| enable_stats | bool | false | Collect I/O and timing statistics. |
| enable_batch_read | bool | true | Group vector reads by page during search. |
| reinit | bool | false | true = wipe existing data; false = reopen. |
| vector_file_path | string | "" | Path for vector storage file. |
| log_file_path | string | "" | Path for log file (empty = no file logging). |
Tuning tips:
- Higher m and ef_construction → better recall, slower indexing.
- Higher ef_search → better recall, slower queries.
- Higher paged_max_cached_pages → more RAM, fewer disk reads.
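To put a number on the RAM side of the page-cache trade-off, the arithmetic can be sketched as follows. The 4 KB page size comes from the options table; the vectors-per-page figure is an upper bound that assumes raw float32 payloads with no per-page header, which is an assumption rather than a documented layout.

```python
PAGE_SIZE_BYTES = 4096  # "4 KB pages" from the options table
FLOAT32_BYTES = 4


def cache_bytes(paged_max_cached_pages):
    """Memory held by a full page cache, ignoring bookkeeping overhead."""
    return paged_max_cached_pages * PAGE_SIZE_BYTES


def vectors_per_page(dim):
    """Upper bound, assuming raw float32 payloads and no per-page header."""
    if dim <= 0:
        raise ValueError("dim must be positive")
    return PAGE_SIZE_BYTES // (dim * FLOAT32_BYTES)


print(cache_bytes(4096) / 2**20, "MiB for the default cache size")
print(vectors_per_page(128), "vectors per page at dim=128")
```

Under these assumptions the default of 4096 pages costs about 16 MiB, so raising it by an order of magnitude is still modest on most machines.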
C++ API: Types
DistanceMetric
enum class DistanceMetric {
kL2, // Euclidean distance
kCosine, // 1 - cosine_similarity
};
node_id_t
using node_id_t = std::uint64_t;
Unique 64-bit vector identifier.
Span<T>
A lightweight, non-owning view over contiguous memory (similar to C++20 std::span). You rarely need to construct one explicitly because std::vector<float> converts to Span<float> implicitly.
std::vector<float> vec(128);
db->Insert(0, vec); // implicit Span<float>(vec)
Python API: Installation
git submodule update --init --recursive
make aster
python -m pip install .
Verify:
python -c "import lsm_vec; print('OK')"
Python API: Open / Close
import os
import lsm_vec
opts = lsm_vec.LSMVecDBOptions()
opts.dim = 128
opts.vector_file_path = "./db/vectors.bin"
opts.reinit = True
os.makedirs("./db", exist_ok=True)
db = lsm_vec.LSMVecDB.open("./db", opts)
# ... use the database ...
db.close()
LSMVecDB.open
db = lsm_vec.LSMVecDB.open(path: str, opts: LSMVecDBOptions) -> LSMVecDB
Opens or creates a database. Raises ValueError on invalid arguments, RuntimeError on I/O errors.
db.close
db.close() -> None
Python API: Insert
Accepts both Python lists and NumPy arrays.
db.insert(0, [0.1] * 128)
import numpy as np
db.insert(1, np.random.rand(128).astype(np.float32))
db.insert
db.insert(id: int, vector: list[float] | numpy.ndarray) -> None
| Parameter | Type | Description |
|---|---|---|
| id | int | Unique vector identifier. |
| vector | list[float] or numpy.ndarray (float32, 1-D) | Must have length opts.dim. |
Raises ValueError on dimension mismatch.
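If you want to catch a dimension mismatch before calling into the database, a caller-side check can look like the sketch below. to_float32 is a hypothetical helper, not part of the lsm_vec module; it uses the standard-library array type as a stand-in for a float32 buffer.

```python
from array import array


def to_float32(values, dim):
    """Pack a sequence of numbers as float32 and enforce the expected length."""
    packed = array("f", values)  # 'f' is a C float (4 bytes on common platforms)
    if len(packed) != dim:
        raise ValueError(f"expected {dim} values, got {len(packed)}")
    return packed


vec = to_float32([0.5] * 4, dim=4)
print(len(vec), vec.itemsize)
```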
Python API: Search
# With SearchOptions
search_opts = lsm_vec.SearchOptions()
search_opts.k = 10
search_opts.ef_search = 128
results = db.search_knn([0.1] * 128, search_opts)
# Or with k and ef_search directly
results = db.search_knn(query_array, k=10, ef_search=128)
for r in results:
print(f"id={r.id} distance={r.distance:.4f}")
db.search_knn
# Option A: with SearchOptions
db.search_knn(query, opts: SearchOptions) -> list[SearchResult]
# Option B: with explicit parameters
db.search_knn(query, k: int, ef_search: int) -> list[SearchResult]
query can be a list[float] or a 1-D numpy.ndarray of float32.
SearchOptions
| Property | Type | Default | Description |
|---|---|---|---|
| k | int | 1 | Number of nearest neighbors. |
| ef_search | int | 64 | Candidate pool size. Must be >= k. |
SearchResult
| Property | Type | Description |
|---|---|---|
| id | int | Vector identifier (read-only). |
| distance | float | Distance from query (read-only). |
Python API: Get / Update / Delete
db.get
vec = db.get(id: int) -> numpy.ndarray # float32, 1-D
Raises KeyError if the ID does not exist.
db.update
db.update(id: int, vector: list[float] | numpy.ndarray) -> None
Raises KeyError if the ID does not exist.
db.delete
db.delete(id: int) -> None
Python API: Configuration
LSMVecDBOptions
All properties are read-write.
opts = lsm_vec.LSMVecDBOptions()
opts.dim = 128
opts.metric = lsm_vec.DistanceMetric.Cosine
| Property | Type | Default | Description |
|---|---|---|---|
| dim | int | 0 | Required. Vector dimensionality. |
| metric | DistanceMetric | L2 | L2 or Cosine. |
| m | int | 8 | HNSW connections per node. |
| m_max | int | 16 | Max neighbors at upper layers. |
| ef_construction | float | 32.0 | Construction-time candidate pool. |
| vec_file_capacity | int | 100000 | Initial vector file capacity. |
| paged_max_cached_pages | int | 4096 | Page cache size (4 KB pages). |
| vector_storage_type | int | 1 | 0 = basic, 1 = paged. |
| db_target_size | int | ~100 GiB | Aster target file size. |
| random_seed | int | 12345 | RNG seed. |
| enable_stats | bool | False | Collect statistics. |
| enable_batch_read | bool | True | Batch vector reads by page. |
| reinit | bool | False | Wipe existing data on open. |
| vector_file_path | str | "" | Vector storage file path. |
| log_file_path | str | "" | Log file path. |
DistanceMetric
lsm_vec.DistanceMetric.L2 # Euclidean distance
lsm_vec.DistanceMetric.Cosine # 1 - cosine similarity
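For reference, the two metrics can be written out in plain Python as below. This is a sketch of the textbook definitions, not the library's implementation; in particular, whether lsm_vec reports plain or squared Euclidean distance for L2 is not stated here, so the plain form is an assumption.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; undefined when either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


print(l2_distance([0.0, 0.0], [3.0, 4.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Cosine distance ignores vector length, so [1, 0] and [2, 0] are at distance 0 under Cosine but at distance 1 under L2.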
Python API: Error Handling
| C++ Status | Python Exception | When |
|---|---|---|
| InvalidArgument | ValueError | Dimension mismatch, bad parameters. |
| NotFound | KeyError | Vector ID not found. |
| Other errors | RuntimeError | I/O failures, database errors. |
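The mapping in the table can be pictured as a small dispatch function. The sketch below is illustrative only: raise_for_status and the use of status names as strings are assumptions, not the binding's actual code.

```python
_EXCEPTION_FOR_STATUS = {
    "InvalidArgument": ValueError,
    "NotFound": KeyError,
}


def raise_for_status(code, message=""):
    """Raise the Python exception the table assigns to a status code name."""
    if code == "OK":
        return
    # Anything not listed explicitly falls back to RuntimeError.
    exc_type = _EXCEPTION_FOR_STATUS.get(code, RuntimeError)
    raise exc_type(message)


try:
    raise_for_status("NotFound", "vector 7 does not exist")
except KeyError as exc:
    print("lookup failed:", exc)
```

In application code this means a missing ID can be handled with an ordinary except KeyError clause, while except RuntimeError covers I/O and database failures.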
Example: C++ Embedding
Include headers from include/ and link against liblsmvec.a (static) or liblsmvec.so / liblsmvec.dylib (shared). Transitive link dependencies: rocksdb (Aster), zstd, snappy, lz4, bz2, z, pthread, dl. On macOS, jemalloc is also required.
#include "lsm_vec_db.h"
lsm_vec::LSMVecDBOptions opts;
opts.dim = 128;
opts.vector_file_path = "./db/vectors.bin";
std::unique_ptr<lsm_vec::LSMVecDB> db;
auto s = lsm_vec::LSMVecDB::Open("./db", opts, &db);
// Insert
std::vector<float> vec(128, 0.1f);
db->Insert(0, vec);
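// Search (k and ef_search are passed via SearchOptions)
lsm_vec::SearchOptions search_opts;
search_opts.k = 10;
search_opts.ef_search = 64;
std::vector<lsm_vec::SearchResult> results;
db->SearchKnn(vec, search_opts, &results);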
// Close
db->Close();
Example: Python Quick Start
import lsm_vec
import os
opts = lsm_vec.LSMVecDBOptions()
opts.dim = 128
db_dir = "./run/db/"
opts.vector_file_path = os.path.join(db_dir, "vectors.bin")
opts.reinit = True
os.makedirs(db_dir, exist_ok=True)
db = lsm_vec.LSMVecDB.open(db_dir, opts)
db.insert(1, [0.1] * 128)
# Search (k and ef_search are passed explicitly)
results = db.search_knn([0.1] * 128, k=1, ef_search=64)
print(results[0].id, results[0].distance)
db.close()
Example: NumPy
import os
import numpy as np
import lsm_vec
opts = lsm_vec.LSMVecDBOptions()
opts.dim = 128
opts.vector_file_path = "./run/db/vectors.bin"
opts.reinit = True
os.makedirs("./run/db/", exist_ok=True)
db = lsm_vec.LSMVecDB.open("./run/db/", opts)
vec = np.random.rand(128).astype(np.float32)
db.insert(42, vec)
query = np.random.rand(128).astype(np.float32)
# k and ef_search are search-time parameters, not LSMVecDBOptions fields
results = db.search_knn(query, k=5, ef_search=100)
for r in results:
print(f"id={r.id} distance={r.distance:.4f}")
db.close()
Implementation Details
LSMVecDB is the public API layer, handling database lifecycle (Open/Close), input validation, metadata serialization, and deleted-ID tracking. It delegates all indexing operations to LSMVec.
LSMVec implements the HNSW algorithm. Layer-0 graph edges are persisted in Aster's RocksGraph (an LSM-tree backed graph store), while upper-layer edges are kept in an in-memory map for fast hierarchical navigation.
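The layered structure comes from assigning each inserted node a random top level. The sketch below uses the rule from the original HNSW formulation, level = floor(-ln(u) * mL) with mL = 1 / ln(m); whether LSM-Vec uses exactly this normalization is an assumption, and make_level_sampler is a hypothetical name. It also shows why a fixed random_seed makes level assignment reproducible.

```python
import math
import random


def make_level_sampler(m, seed):
    """Return a function that draws HNSW node levels from a seeded RNG."""
    if m < 2:
        raise ValueError("m must be at least 2")
    rng = random.Random(seed)
    m_l = 1.0 / math.log(m)  # level normalization factor

    def sample_level():
        u = 1.0 - rng.random()  # in (0, 1], so log(u) is always defined
        return int(-math.log(u) * m_l)  # truncation equals floor for values >= 0

    return sample_level


sampler = make_level_sampler(m=8, seed=12345)
levels = [sampler() for _ in range(10000)]
print("share of nodes that stay on layer 0:", levels.count(0) / len(levels))
```

With m = 8 roughly seven out of eight nodes exist only on layer 0, which is consistent with keeping the comparatively small upper layers in memory.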
Raw vector data is managed by IVectorStorage, which offers two backends: BasicVectorStorage (a contiguous flat file addressed by ID offset) and PagedVectorStorage (a layout organized in 4 KB pages that co-locates vectors sharing the same HNSW entry point for better spatial locality, read through a user-space FIFO page cache).
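A FIFO page cache evicts pages in the order they were loaded, regardless of how recently they were read. The sketch below shows that policy in Python; FifoPageCache and its load_page callback are hypothetical, and the real PagedVectorStorage cache may differ in structure and bookkeeping.

```python
from collections import OrderedDict


class FifoPageCache:
    """Fixed-capacity page cache with first-in, first-out eviction."""

    def __init__(self, max_pages, load_page):
        if max_pages <= 0:
            raise ValueError("max_pages must be positive")
        self._max_pages = max_pages
        self._load_page = load_page  # callable: page_id -> bytes
        self._pages = OrderedDict()  # preserves insertion (load) order
        self.misses = 0

    def get(self, page_id):
        page = self._pages.get(page_id)
        if page is None:
            self.misses += 1
            page = self._load_page(page_id)
            if len(self._pages) >= self._max_pages:
                self._pages.popitem(last=False)  # evict the oldest loaded page
            self._pages[page_id] = page
        # A hit does not reorder entries; that is what separates FIFO from LRU.
        return page


cache = FifoPageCache(max_pages=2, load_page=lambda page_id: bytes([page_id]) * 4096)
for page_id in (0, 1, 0, 2, 0):
    cache.get(page_id)
print("misses:", cache.misses)
```

In this access pattern page 0 is evicted when page 2 arrives even though it was just read, so the final access misses again. FIFO trades some hit rate for very cheap bookkeeping on the read path.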