Open Source Similarity Search Toolkit

Introduction to the Project


At the Succinct Information Processing Unit in RIKEN-AIP, we continuously explore the potential of data and dedicate ourselves to developing technologies for more efficient and accurate information processing. Over the course of this research, we have developed a range of similarity search software that accommodates various data formats and search needs.


The purpose of this homepage is to widely publish and share these research outcomes. The similarity search tools we have developed will assist researchers and developers across various fields in delving deeper into data and analyzing it more effectively. Alongside detailed descriptions of our tools, we also provide access to the source code, allowing everyone to freely use and further develop our technologies.


We hope that our technologies contribute to your research or projects, helping us all explore the world of data together. Please feel free to browse our homepage and make the most of our achievements.

Introduction to the Similarity Search Toolkit

The similarity search tools developed in our laboratory aim to efficiently locate similar data within large datasets, with the main features being as follows:

Thanks to these features, our similarity search toolkit can be applied in various fields to extract valuable information from large amounts of data rapidly and efficiently.

The following two tables summarize the currently available tools in our toolkit.

"One-vs-All" Similarity Search Toolkit

The "One-vs-All" Similarity Search Toolkit is a collection of tools that are available for a wide range of applications, accommodating different data formats and similarity measures.

gWT: gWT is a similarity search tool specialized for graph data. For example, it performs excellently with graph data representing compound structures and can conduct rapid and accurate similarity searches in large databases containing over 50 million compounds, such as PubChem. Cosine similarity is used to measure similarity.
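As a small illustration of the similarity measure named above, cosine similarity can be sketched as follows. This is only a sketch of the measure itself: how gWT turns a graph into a feature vector is not described here, so the sparse dictionaries and feature names below (`mol_x`, `mol_y`, "C-C", and so on) are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    # Iterate over the smaller dict so the dot product touches fewer entries.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical substructure counts for two compounds (illustrative only).
mol_x = {"C-C": 2, "C-O": 1}
mol_y = {"C-C": 2, "C-N": 1}
print(cosine_similarity(mol_x, mol_y))
```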

SMBT: SMBT is a similarity search tool optimized for vector data composed of 0s and 1s. It is highly efficient with sparse data such as compound fingerprints, and it calculates similarity using Jaccard similarity.
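For readers unfamiliar with the measure, Jaccard similarity on binary vectors can be sketched as below. The sketch assumes each fingerprint is stored as the set of positions whose bit is 1, which is a common representation for sparse data; it illustrates the measure only and says nothing about how SMBT indexes the data. The fingerprints `fp_a` and `fp_b` are made up.

```python
def jaccard_similarity(x, y):
    """Jaccard similarity of two binary vectors given as sets of 'on' bit positions."""
    x, y = set(x), set(y)
    union = len(x | y)
    if union == 0:
        return 1.0  # convention chosen here: two all-zero vectors are identical
    return len(x & y) / union

# Hypothetical fingerprints: positions of the bits set to 1.
fp_a = {1, 5, 9, 42}
fp_b = {1, 5, 7, 42, 100}
print(jaccard_similarity(fp_a, fp_b))  # 3 shared bits / 6 distinct bits = 0.5
```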

bST and DyFT: These tools provide adaptable similarity search for vector data, calculating similarity using Min-Max similarity. They enable efficient searching even in large datasets.
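Min-Max similarity is often defined for non-negative vectors as the sum of element-wise minima divided by the sum of element-wise maxima. The sketch below assumes that definition and dense Python lists; it shows the measure only, not the indexing used by bST or DyFT.

```python
def minmax_similarity(x, y):
    """Min-Max similarity: sum(min(x_i, y_i)) / sum(max(x_i, y_i)) for non-negative vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("Min-Max similarity assumes non-negative entries")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    if denominator == 0:
        return 1.0  # convention chosen here: two all-zero vectors are identical
    return numerator / denominator

print(minmax_similarity([1, 0, 3, 2], [2, 1, 3, 0]))  # (1+0+3+0) / (2+1+3+2) = 0.5
```

On vectors containing only 0s and 1s, this definition gives the same value as Jaccard similarity.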

frechet_simsearch: This tool specializes in similarity search for trajectory data and measures similarity using the Fréchet distance. It is suited to analyzing movement patterns, for example the movements of basketball players in sports data analysis.
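To make the distance concrete, the sketch below computes the discrete Fréchet distance between two trajectories with the standard dynamic program. It assumes the discrete variant and Euclidean distance between points; which variant frechet_simsearch uses is not stated here, and the trajectories are made up.

```python
import math

def discrete_frechet(p, q):
    """Discrete Frechet distance between two trajectories given as lists of points."""
    if not p or not q:
        raise ValueError("trajectories must be non-empty")
    n, m = len(p), len(q)
    # dp[i][j] = discrete Frechet distance between p[:i+1] and q[:j+1]
    dp = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])  # Euclidean distance (Python 3.8+)
            if i == 0 and j == 0:
                dp[i][j] = d
            elif i == 0:
                dp[i][j] = max(dp[0][j - 1], d)
            elif j == 0:
                dp[i][j] = max(dp[i - 1][0], d)
            else:
                best_prev = min(dp[i - 1][j], dp[i - 1][j - 1], dp[i][j - 1])
                dp[i][j] = max(best_prev, d)
    return dp[n - 1][m - 1]

# Two hypothetical parallel paths one unit apart.
path_a = [(0, 0), (1, 0), (2, 0)]
path_b = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(path_a, path_b))  # 1.0
```

The table costs time and memory proportional to the product of the two trajectory lengths, which is why exhaustive comparison against a large database is expensive.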

The tools in this toolkit are each specialized for a particular data format and similarity measure, so they can be used in a variety of scenarios. With fast and memory-efficient implementations, they perform well even on large datasets, which broadens their range of applications in research and industry.

Related Literature

"All-vs-All" Similarity Search Tool Suite

"All-vs-All" similarity search refers to the process of calculating the similarity between all pairs of data within a database to identify pairs of data that are similar. The suite of tools in this category developed by our laboratory is primarily based on the fast similarity search algorithm "SketchSort".
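The task itself can be illustrated with a naive baseline that compares every pair and keeps those above a threshold. This is not an implementation of SketchSort; it is a brute-force sketch whose cost grows quadratically with the number of items. The function name `all_pairs_above`, the threshold, and the sample token sets are hypothetical.

```python
from itertools import combinations

def jaccard(x, y):
    """Jaccard similarity of two sets (1.0 for two empty sets, by convention)."""
    union = len(x | y)
    return len(x & y) / union if union else 1.0

def all_pairs_above(items, similarity, threshold):
    """Return (i, j, score) for every pair i < j whose similarity meets the threshold."""
    results = []
    for (i, a), (j, b) in combinations(enumerate(items), 2):
        score = similarity(a, b)
        if score >= threshold:
            results.append((i, j, score))
    return results

# Hypothetical documents represented as token sets.
docs = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"x", "y"}, {"x", "y"}]
print(all_pairs_above(docs, jaccard, threshold=0.7))
# [(0, 1, 0.75), (2, 3, 1.0)]
```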

Examples of uses for "All-vs-All" similarity search tools include the removal of duplicates in images or text, the discovery of similar pairs in a compound database, and the extraction of specific motifs from protein sequences. These tasks aim to find important relationships within a large amount of data, requiring fast and accurate similarity searches.

The "All-vs-All" similarity search tools we provide are compatible with various data formats and similarity measures, allowing users to choose the most suitable tool for their needs. Each tool has its own characteristics and is designed to run quickly with minimal memory use.

By selecting the tool most suited to a specific application area or purpose, efficient and accurate similarity searches can be conducted. Our suite of tools enhances the speed and accuracy of data analysis, aiding in the extraction of valuable information from large datasets.

Related Literature