In a previous article on the filter when searching for nearest neighbors, we discussed the theoretical background: What can be modified in the HNSW algorithm to make it usable for search with constraints?
This time I am going to present a C++ implementation with Python bindings.
As a base implementation of HNSW I took hnswlib, stand-alone header-only implementation of HNSW. This library is highly performance-oriented, so it used some low-level optimization tricks which I had to remove. Because of it, unfortunately, I couldn’t make a pull request to the original repository.
What was changed?
I introduced tags. Now you can assign an arbitrary number of uint32 numbers (representing tags) to any point.
# ids - list of point ids
# tag - uint32 representing some tag
hnsw.add_tags(ids, tag)
The group of points under the same tag could be searched separately from others.
vector_to_search = ...
tag_to_search_in = 42 # Search among points with this tag
condition = [[(False, tag_to_search_in)]]
found_labels, found_dist = hnsw.knn_query(vector_to_search, k=10, conditions=condition)
These groups could also be combined using boolean expressions.
For example (A | !B) & C
is represented as [[(0, A), (1, B)], [(0, C)]]
, where A
, B
, C
are logical clauses if respective tag is assigned to a point.
Let’s say we need to search among multiple tags, but only a point does not have the tag 100
:
tags_to_search_in = [1,2,3,4,5,6]
condition = [
[(0, x) for x in tags_to_search_in],
[(1, 100)]
]
found_labels, found_dist = hnsw.knn_query(vector_to_search, k=10, conditions=condition)
If the group is large enough ( >> 1/M
fraction of all points), knn_query
should work fine.
But if the group is smaller, it may need to build additional connections in the HNSW graph for these groups.
hnsw.index_tagged(tag=42, m=8)
If we want to search within multiple small groups, we should build connections among them:
hnsw.index_cross_tagged(tags=[1,2,3], m=8)
Additional indexing of multiple tags could also be useful if you need to emulate the filtering of continuous values like numerical or geo. In this case, you need to index neighbor tags, so the graph will stay connected within any selected range.
You can find full example here.
Making vector search with support of geo-filtering.
Based on the HNSW with categorical filtering, it is possible to build build a tool that can search in specified geo-region only. A full example could be found on GitHub and here I will reveal some details.
Initially we instantiate Index object with some parameters:
hnsw = hnswlib.Index(space='cosine', dim=dim)
hnsw.init_index(max_elements = elements, M = 16, random_seed=45)
hnsw.set_num_threads(2)
Then we generate some vectors (points) and corresponding geo-coordinates:
# Generate random vectors and geo points
points, geo_points = get_random_data(elements, dim, from_lat, to_lat, from_lon, to_lon)
Next we add this points to the index in regular way:
hnsw.add_items(points)
After that, we need to assign geo-hashes for each point. We will add hashes of different lengths to support searches on different scales.
tags_to_index = defaultdict(int)
# Collect geo-hashes for indexing
for idx, geo_point in enumerate(geo_points):
lat, lon = geo_point
ghsh = geohash.encode(lat, lon, precision=max_precision)
# List all hashes in hierarchy: 'u337jk' -> ['u', 'u3', 'u33', 'u337', 'u337j', 'u337jk']
tags = [ghsh[:i + 1] for i in range(max_precision)]
# Save small geo-hashes for further indexing
tags_to_index[ghsh[:max_precision]] += 1
tags_to_index[ghsh[:max_precision - 1]] += 1
# Assign geo-hashes to points
for tag in tags:
hnsw.add_tags([idx], geohash2int(tag))
Now we can additionally index selected regions:
# Additionally index points inside small regions
for tag in tqdm.tqdm(tags_to_index):
# This will create additional links in a graph for each geohash region.
# So search should work on nodes inside this region only.
hnsw.index_tagged(geohash2int(tag))
# With M=16 additional indexing is only required for regions containing less than ~5% of all points
# Additional info here: https://blog.vasnetsov.com/posts/categorical-hnsw/
for tag in tqdm.tqdm(tags_to_index):
# This code will also create additional connections between points in neighbor regions.
# So search in multiple neighbor regions will also work
neighbors = [geohash2int(ntag) for ntag in geohash.neighbors(tag) if ntag in tags_to_index]
hnsw.index_cross_tagged(neighbors)
And finally, we can query for the closest vector in a specified region:
# Generate integer tag from geohash
target_ghsh = geohash.encode(target_lat, target_lon, precision=target_precision)
target_tag = geohash2int(target_ghsh)
# Obtain search condition from geohash
# You can also search in multiple squares with conjunction:
# [[(False, hash1), (False, hash2), ..., (False, hashN)]]
condition = [[(False, target_tag)]]
found, dist = hnsw.knn_query(target_query, k=3, conditions=condition)
Further work
Sometimes the selected subset is too small, so it would be faster to just brute-force all points. In other cases, if query constraints consist of multiple tag conjunctions, it might be faster to parallel this query. In other words, it would be useful to have a high level query planner for the searcher.
Another potential challenge one can face building a vector search system is scaling. One solution is quantization, but it does not provide a horizontal scaling of the search. Another approach used, for example, in ElasticSearch, is to shard data on multiple machines. In this case, each worker searches in its part of the data. And the final result is obtained from a combination of top shard hits.
The implementation of these ideas can result in an exciting project. It could be a service providing API for searching and indexing data of different types but based on categorical HSNW vector search rather than a regular inverted index. This service should take on the maintenance of data integrity, sharding and query planning.
The closest analog that I could find is a milvus, and it does not support features, presented in this note.
Feel free to contact me if you need any additional information on this or any other of my projects.