Open Source Similarity Search Toolkit
Introduction to the Project
At the Succinct Information Processing Unit in RIKEN-AIP, we develop technologies for more efficient and accurate information processing. In our past research, we have developed a range of similarity search software to accommodate various data formats and search needs.
The purpose of this homepage is to widely publish and share these research outcomes. The similarity search tools we have developed will assist researchers and developers across various fields in delving deeper into data and analyzing it more effectively. Alongside detailed descriptions of our tools, we also provide access to the source code, allowing everyone to freely use and further develop our technologies.
We hope that our technologies contribute to your research or projects, helping us all explore the world of data together. Please feel free to browse our homepage and make the most of our achievements.
Introduction to the Similarity Search Toolkit
The similarity search tools developed in our laboratory aim to efficiently locate similar data within large datasets, with the main features being as follows:
Diversity in Search Types: The toolkit offers two types of similarity search: “One-vs-All” and “All-vs-All.” In a “One-vs-All” search, a single query item is compared against every item in the database to identify all items whose similarity to the query is above a set threshold. In an “All-vs-All” search, the similarity between every pair of items in the database is evaluated, and all pairs whose similarity is above a threshold are extracted.
Versatility in Data Format: The tools support various data formats, including graphs, trajectory data, and vectors, making them applicable to a wide range of uses.
Diverse Methods of Similarity Calculation: The tools support several similarity measures, such as cosine similarity, Min-Max similarity, and Jaccard similarity. This allows the most appropriate measure to be selected according to the characteristics of the data and the purpose of the search.
Fast and Memory-Efficient: The tools achieve fast search and low memory usage even on large datasets. This makes it possible to search for similar data efficiently in resource-constrained environments.
Thanks to these features, the toolkit can be used in various fields to extract valuable information from large amounts of data quickly and efficiently.
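To make the definitions above concrete, the following is a minimal brute-force sketch in Python of the three similarity measures and of a “One-vs-All” search. It is only an illustration and not the implementation used by the tools, which rely on succinct index structures rather than a linear scan. The function names, the toy data, and the choice to accept scores at or above the threshold are assumptions made for this example.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

def jaccard_similarity(x, y):
    """Jaccard similarity of two 0/1 vectors: |x AND y| / |x OR y|."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0

def min_max_similarity(x, y):
    """Min-Max similarity of non-negative vectors: sum(min) / sum(max)."""
    top = sum(min(a, b) for a, b in zip(x, y))
    bottom = sum(max(a, b) for a, b in zip(x, y))
    return top / bottom if bottom else 0.0

def one_vs_all(query, database, similarity, threshold):
    """Linear scan: (index, score) of every item scoring at or above threshold."""
    hits = []
    for i, item in enumerate(database):
        score = similarity(query, item)
        if score >= threshold:
            hits.append((i, score))
    return hits

# Hypothetical toy data: three 0/1 fingerprints and one query.
database = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
query = [1, 1, 1, 0]
hits = one_vs_all(query, database, jaccard_similarity, 0.5)
print(hits)
```

The similarity function is passed as an argument, so the same scan works with any of the three measures. A linear scan costs one similarity computation per database item, which is the cost that an index is meant to avoid.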
The following two sections summarize the currently available tools in our toolkit.
"One-vs-All" Similarity Search Toolkit
The "One-vs-All" Similarity Search Toolkit is a collection of tools for a wide range of applications, each targeting a particular data format and similarity measure.
gWT: gWT is a similarity search tool specialized for graph data. For example, it performs excellently with graph data representing compound structures and can conduct rapid and accurate similarity searches in large databases containing over 50 million compounds, such as PubChem. Cosine similarity is used to measure similarity.
SMBT: SMBT is a similarity search tool optimized for vector data composed of 0s and 1s. It is highly efficient with sparse data such as compound fingerprints, and it calculates similarity using Jaccard similarity.
bST and DyFT: These tools provide adaptable similarity search for vector data, calculating similarity using Min-Max similarity. They enable efficient searching even in large datasets.
frechet_simsearch: This tool specializes in similarity search for trajectory data and measures similarity using the Fréchet distance. It is suited to analyzing movement patterns, for example the movements of basketball players in sports data analysis.
Each tool in this toolkit is specialized for a particular data format and similarity measure, so the toolkit can be used in a variety of scenarios. With fast and memory-efficient implementations, the tools perform well even on large datasets, which widens their range of applications in research and industry.
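As an illustration of the measure used for trajectory data, the following is a minimal sketch of the discrete Fréchet distance between two trajectories, computed by dynamic programming. It assumes the discrete variant of the distance and 2-D points. The function name and the toy trajectories are hypothetical, and the sketch does not reproduce how frechet_simsearch itself indexes or searches trajectories.

```python
import math

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two non-empty trajectories of points."""
    n, m = len(p), len(q)
    # dp[i][j] holds the discrete Fréchet distance between p[:i+1] and q[:j+1].
    dp = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])  # Euclidean distance (Python 3.8+)
            if i == 0 and j == 0:
                dp[i][j] = d
            elif i == 0:
                dp[i][j] = max(dp[0][j - 1], d)
            elif j == 0:
                dp[i][j] = max(dp[i - 1][0], d)
            else:
                # Advance along p, along q, or along both, whichever is cheapest.
                best_prev = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
                dp[i][j] = max(best_prev, d)
    return dp[n - 1][m - 1]

# Hypothetical toy trajectories: two parallel paths one unit apart.
path_a = [(0, 0), (1, 0), (2, 0)]
path_b = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(path_a, path_b))
```

A smaller distance means more similar trajectories, so a threshold search on this measure keeps trajectories whose distance to the query is below the threshold. The table costs time and memory proportional to the product of the two trajectory lengths.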
Related Literature
Kanda, S., Tabei, Y.: Dynamic similarity search on integer sketches, In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), 2020.
Kanda, S., Takeuchi, K., Fujii, K., Tabei, Y.: Succinct trit-array trie for scalable trajectory similarity search, In Proceedings of the 28th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2020.
Kanda, S., Tabei, Y.: b-bit sketch trie: scalable similarity search on integer sketches, In Proceedings of the 2019 IEEE International Conference on Big Data (IEEE BigData), 2019.
Tabei, Y., Puglisi, S. J.: Scalable similarity search for molecular descriptors, In Proceedings of the 10th International Conference on Similarity Search and Applications, 2017.
Tabei, Y., Kishimoto, A., Kotera, M., Yamanishi, Y.: Succinct interval splitting tree for scalable similarity search of compound-protein pairs with property constraints, In Proceedings of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2013.
Tabei, Y.: Succinct Multibit Tree: Compact representation of multibit trees by using succinct data structures in chemical fingerprint searches, In Proceedings of the 12th Workshop on Algorithms in Bioinformatics (WABI), 2012.
Tabei, Y., Tsuda, K.: Kernel-based similarity search in massive graph databases with wavelet trees, In Proceedings of the 11th SIAM International Conference on Data Mining (SDM), 2011.
"All-vs-All" Similarity Search Tool Suite
“All-vs-All” similarity search refers to the process of calculating the similarity between all pairs of data within a database to identify pairs of data that are similar. The suite of tools in this category developed by our laboratory is primarily based on the fast similarity search algorithm “SketchSort”.
Examples of uses for "All-vs-All" similarity search tools include the removal of duplicates in images or text, the discovery of similar pairs in a compound database, and the extraction of specific motifs from protein sequences. These tasks aim to find important relationships within a large amount of data, requiring fast and accurate similarity searches.
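As a point of reference for the duplicate-removal use case, the following is a minimal brute-force sketch of an “All-vs-All” search over short texts, using Jaccard similarity on word sets. It is a quadratic baseline and not the SketchSort algorithm. The function names, the toy documents, and the threshold of 0.7 are assumptions made for this example.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def all_vs_all(items, similarity, threshold):
    """Compare every pair once; return (i, j, score) for pairs at or above threshold."""
    pairs = []
    for (i, x), (j, y) in combinations(enumerate(items), 2):
        score = similarity(x, y)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

# Hypothetical toy documents: the first two are near-duplicates.
documents = [
    "the quick brown fox",
    "the quick brown fox jumps",
    "an entirely different sentence",
]
token_sets = [set(doc.split()) for doc in documents]
similar_pairs = all_vs_all(token_sets, jaccard, 0.7)
print(similar_pairs)
```

With n items this baseline evaluates n(n-1)/2 pairs, which quickly becomes impractical as the database grows and is the reason a dedicated all-pairs algorithm is needed at scale.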
The "All-vs-All" similarity search tools we provide cover various data formats and similarity measures, allowing users to choose the most suitable tool for their needs. Each tool is designed to run quickly with minimal memory use.
By selecting the tool most suited to a specific application area or purpose, efficient and accurate similarity searches can be conducted. Our suite of tools enhances the speed and accuracy of data analysis, aiding in the extraction of valuable information from large datasets.
Related Literature
Ito, J., Tabei, Y., Shimizu, K., Tsuda, K., Tomii, K.: PoSSuM: A database of similar protein–ligand binding and putative pockets, Nucleic Acids Research, 40 (Database issue), D541-D548, 2012.
Ito, J., Tabei, Y., Shimizu, K., Tomii, K., Tsuda, K.: PDB-scale analysis of known and putative ligand binding sites with structural sketches, Proteins, 80, 747-763, 2012.
Tabei, Y., Tsuda, K.: SketchSort: Fast all pairs similarity search for large databases of molecular fingerprints, Molecular Informatics, 30, 801-807, 2011.
Tabei, Y., Uno, T., Sugiyama, M., Tsuda, K.: Single versus multiple sorting in all pairs similarity search, In Proceedings of the 2nd Asian Conference on Machine Learning (ACML), 2010.