Unified Database: Laying the foundation for large language model vertical applications

Published

A diagram showing splitting vector partitions and reallocating vectors in partitions to adapt to changes in data distribution.

Large language models (LLMs) have become a valuable technology in areas such as content creation, language comprehension, and intelligent dialogue, or interactions between people and computer systems. However, these models generate responses based on patterns and rules observed in fixed training data, which can potentially lead them to produce erroneous and even fictitious information. The models can also struggle with real-time knowledge updates. One technique known as retrieval augmented generation (RAG) can organically combine fresh external information with LLMs, putting relevant and precise knowledge into context to help guide the answer generation process, enhancing their performance and reliability.
 
One of the core components of RAG, the vector database, significantly differs from traditional relational databases in its storage and query mechanisms. This presents a challenge to the unified management of increasingly diverse and multimodal knowledge bases. Researchers from the Systems and Networking Group at Microsoft Research Asia believe that a unified database capable of managing rich attributes and modalities of external knowledge will support widespread application and improved reliability of LLMs.
 
“As the capabilities of large models continue to improve, various types of data, such as text, images, and videos, can be encoded into high-dimensional vectors using machine learning technology. Detailed attributes of knowledge, such as the type of images, user preferences, etc., can be converted into different data features. It becomes difficult to achieve efficient and accurate query results among these mixed types of information. Therefore, a unified database is needed to effectively manage the data, providing a more solid knowledge foundation for LLMs,” said Qi Chen, a principal researcher at the Microsoft Research Asia lab in Vancouver, Canada.

VBase query system: Providing a unified foundation for vector index and scalar index scanning

Vector databases and scalar databases have different index scan patterns. Therefore, the lack of a unified foundation is the first problem to be solved in building a unified database.
 
Scalar indexes are based on numerical order, and the scanning of these indexes follows a strictly increasing or decreasing order. This is the primary reason why relational databases can efficiently execute queries. For example, when searching for clothes priced between 100 and 200 Canadian dollars on a shopping platform, the system starts scanning from the price of C$100, and the query stops once the price exceeds C$200. Clearly, this monotonicity-based scalar query is highly efficient.
 
In contrast, vector indexes are built on proximity in high-dimensional space, and index traversal cannot follow a strict order, so they lack monotonicity. Vector indexes only provide approximate spatial navigation for queries, to estimate the nearest subspace. To achieve early termination, the vector index scanning process relies on the TopK algorithm to provide the temporary order. Although the order can be used to terminate the execution in advance, this method is inefficient.

Diagrams illustrating query execution on scalar database and vector database. Left: Scalar index diagram with ordered scanning; Right: Vector index diagram with no strict order.
Figure 1. Query execution on scalar database and vector database

For example, suppose a person has a picture of a garment and wants to find similar items on a shopping platform that are priced below C$200. The traditional method is to first conduct a similarity query to get a large number of candidates, and then filter based on price. For instance, to find the top 10 most similar and appropriately priced results, one can first set the search range to 1,000 candidates, and then filter one by one according to the price condition until 10 results appear that meet the requirements. If the results are insufficient, the search range can be expanded to 2,000 or 3,000 until the requirements are met. 

This method is designed to convert the retrieval results of vector data into a temporary scalar index that follows strict monotonicity and then perform scalar queries. 

The problem with this method is that it cannot guarantee that the K results returned will meet the final filtering query requirements. Therefore, to ensure that the filtering results meet the requirements, either TopK needs to perform a wider similarity query, returning more Ks; or when K is insufficient, TopK queries are repeated. But both methods will lead to suboptimal query performance.

By analyzing a large number of vector indices, researchers have found that vector index queries do not require strict monotonicity for early termination. The traversal of vector indices exhibits a kind of relaxed monotonicity. The traversal of scalar indices is a special case of this relaxed monotonicity.

Based on this discovery, researchers have developed the VBase unified database system. This system provides a unified foundation for efficient scanning of vector indices and scalar indices, making the scanning of various indices follow the same interface and early termination conditions. This innovation not only improves the performance of vector databases in executing complex queries by 10 to 1,000 times, but also improves the accuracy of queries.

VBase makes it possible to build a unified database capable of executing various complex relational vector and scalar mixed queries. Currently, based on the VBase (opens in new tab) system, an open-source database platform has successfully built its own multimodal vector database.

Spotlight: Blog post

Eureka: Evaluating and understanding progress in AI

How can we rigorously evaluate and understand state-of-the-art progress in AI? Eureka is an open-source framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings. Learn more about the extended findings. 

SPFresh: First vector index that supports real-time in-place incremental update 

The RAG technology based on vector database retrieval significantly improves the accuracy of the generation results of LLMs. However, this improvement requires real-time updates to the data in the vector database. For vectors with hundreds to thousands of dimensions, updating is not easy – it can take days to reconstruct the vector index. 

Scalar databases typically use a B-tree or B+ tree index, which can complete updates by directly inserting information after finding the specified location through binary search. However, updating a vector database is much more complicated. 

Take the currently popular fine-grained graph-based vector index and coarse-grained cluster-based vector index as examples. When inserting or deleting vectors in the fine-grained graph vector index, it is necessary to perform large-scale graph scanning to find the appropriate neighbors for update, which requires a lot of computational resources.

Meanwhile, an insufficient update can weaken performance and accuracy. In the update of the coarse-grained cluster index, although the insertion or deletion of vectors only involves the modification of the nearest partition, the cost is lower. But as partition updates accumulate, the data distribution can become unbalanced, which could affect query latency and accuracy, leading to a decline in index quality.

Existing vector index update methods rely on periodic global rebuilding, which is slow and resource intensive. Although performance and accuracy are immediately improved after rebuilding, they gradually decline between rebuilds. In addition, the cost of global rebuilding is very high, requiring more than 10 times the resources of traditional indexing, possibly exceeding the cost of index search services. 

To solve these problems, researchers from Microsoft and their collaborators have proposed SPFresh, which is the first vector index that supports real-time, in-place, incremental updating of unified databases. The core of SPFresh is LIRE – a lightweight incremental rebalancing protocol used to dynamically split or merge vector partitions and reallocate vectors in partitions to adapt to changes in data distribution. LIRE achieves low-resource vector updates by reallocating vectors only at nearby partitions. 

A diagram showing splitting vector partitions and reallocating vectors in partitions to adapt to changes in data distribution.
Figure 2. Partition splitting requires reallocating vector data

Compared to existing periodic index rebuilding methods, SPFresh can greatly reduce the resources required for index rebuilding, and can always maintain a stable high recall rate, low latency, and high query throughput, effectively adapting to dynamic changes in data distribution in a timely manner.

Vector databases are widely used in fields such as natural language processing, information retrieval, and recommendation systems, providing efficient solutions for handling unstructured data. However, various encoding methods for vector data exist, with sparse and dense vectors each having their own advantages for different types of tasks. For example, sparse vectors are suitable for keyword matching tasks, while dense vectors are better for extracting semantic information. Therefore, in practical applications, multi-index mixed queries are widely used, especially in mixed data sets, where the method of finding similar items by combining sparse and dense feature collaborative filtering has been proven to improve the accuracy of query results.

However, due to the special traversal manner of vector indexes, the intersection between multiple vector indexes cannot be directly pushed down, making it difficult to combine search results from multiple indexes.

To overcome this challenge, researchers from Microsoft and their collaborators have introduced OneSparse, a unified index system catering to both sparse and dense vectors. OneSparse enables the execution of multi-index mixed queries and dynamically generates the optimal merge plan, facilitating rapid intersection and union operations within a single index during index traversal.

OneSparse unifies sparse indexes and dense indexes into a single inverted index and rearranges all posting lists according to the document ID, ensuring efficient execution, even when performing complex queries for both semantic and keyword matching. The technology has been successfully applied in Microsoft Bing’s web search and sponsored search. 

Diagram illustrating the OneSparse index overview. For sparse data, OneSparse maintains one dimension of the sparse vectors (i.e., term) per inverted posting list, which allows fast lookup to all relevant documents of a word in a query. The values stored in an inverted posting list are pairs of ID and a single dimensional feature (e.g., term frequency). For dense vectors, OneSparse clusters them into several posting lists by SPANN. Besides, it builds a SPTAG in-memory ANN index on cluster centroids to quickly navigate to the nearest SPANN posting lists. The values stored in a SPANN posting list are pairs of ID and dense vector in this cluster. All inverted posting lists and SPANN posting lists are saved on disk.
Figure 3. OneSparse index overview

Unified databases accelerate the development of LLMs and hardware innovation 

As early as 2018, Microsoft Research Asia began in-depth research on vector data systems. “At that time, we realized that vectorization would become the cornerstone of deep learning applications,” Qi Chen said. “Therefore, we developed SPTAG (opens in new tab) and SPANN (opens in new tab) technologies one after another, successfully solving the generalization and scalability problems of vector indexing, and applied it to Microsoft Bing search, achieving the world’s largest vector semantic search system.”
 
Researchers at Microsoft Research Asia continue to explore vector database technology. Based on the relaxed monotonicity and the lightweight update method of the LIRE protocol, they have built a unified database system MSVBASE (opens in new tab), which has been open-sourced on GitHub. The MSVBASE system can be used for semantic analysis of multimodal data, providing developers with powerful tools for researching and utilizing the RAG mechanism and designing more complex RAG retrieval queries. RAG technology will not only be able to perform TopK-based vector queries but also make use of more high-dimensional vector data and attributes for retrieval, achieving more accurate query results. 

In the current age of extensive knowledge expansion, unified databases offer better knowledge transfer between multimodal data types. They provide substantial corpus support for large models and are poised to drive innovation in underlying hardware, laying the foundation for data-enhanced AI in the future.

Related publications

Continue reading

See all blog posts