Software Repository

Technology-Enabled Learning

#	Name	Publication Info	Description	Repository Access
5	ARENA	Publication (SIGMOD 2023, SIGMOD 2026)	A key learning goal of learners taking a database systems course is to understand how SQL queries are processed in an RDBMS in practice. To this end, comprehension of different alternative query plans (AQPs) that may be considered during the selection of the query execution plan (QEP) of a query is paramount. ARENA is a novel and generic system that facilitates exploration of informative AQPs of a given SQL query to aid the comprehension of QEP selection.	Software
4	MOCHA	Publication (VLDB 2022)	This software presents a novel system coined MOCHA that facilitates exploration and visualization of the impact of alternative physical operator choices on the QEP of a given SQL query. MOCHA accepts an SQL query as input, and compares and visualizes the QEP and alternative plans which are selected based on learner-specified operator preferences. Furthermore, it intuitively explains why the key operators in a QEP are chosen by connecting them to established knowledge in the literature.	Software
3	JQA	Publication (NCAA 2022)	This software presents an innovative end-to-end framework called Judicial Questioning Aid (JQA), which is capable of proactively leading a multi-party court debate by asking useful questions to a litigant given previous context. It can be used in judicial education to help trainees and newly appointed judges gain experience with various legal scenarios and supplement their training with additional practice and court rehearsal without significant cost.	Code
2	LANTERN	Publication (SIGMOD 2021, SIGMOD 2022)	An RDBMS typically exposes a query execution plan (QEP) in a visual or textual format, which describes the execution steps for a given query. However, it is often daunting for a learner to comprehend these QEPs containing vendor-specific implementation details. LANTERN is a novel, generic, and portable system that generates a natural language (NL)-based description of the execution strategy chosen by the underlying RDBMS to process a query. It provides a declarative framework called POOL for subject matter experts to efficiently create and manipulate the NL descriptions of physical operators of any RDBMS. It then exploits POOL to generate the NL descriptions of QEPs by integrating a rule-based and a deep learning-based technique to infuse language variability in the descriptions. Such an NL generation strategy mitigates the impact of boredom on learners caused by repeated exposure to similar text generated by a rule-based system.	Software
1	NEURON	Publication (SIGMOD 2019)	NEURON is a novel system that facilitates natural language interaction with a relational query execution plan (QEP), which represents an execution strategy for an SQL query, to enhance its understanding. It accepts an SQL query (which may include joins, aggregation, and nesting, among other things) as input, executes it, and generates a simplified natural language description (in both text and voice form) of the execution strategy deployed by the underlying RDBMS. Furthermore, it facilitates understanding of various features related to a QEP through a natural language question answering (NLQA) framework. NEURON, the world's first tool of its kind, can greatly enhance students' learning of the query processing topic.	Software
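
Several entries above revolve around reading a QEP and comparing it with alternative plans. The following is only a minimal sketch of that idea, not code from ARENA, MOCHA, LANTERN, or NEURON: it assumes SQLite (via Python's standard `sqlite3` module) as a stand-in RDBMS, and the table `orders`, the index `idx_orders_customer`, and the helper `explain` are hypothetical names chosen for illustration. It prints the textual plan SQLite reports for the same query before and after an index is created.

```python
import sqlite3


def explain(conn, sql):
    """Return the 'detail' text of each step in SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # Each row has four columns; the last one is the human-readable step description.
    return [row[3] for row in rows]


# Hypothetical in-memory table used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 10, i * 1.5) for i in range(100)],
)

query = "SELECT total FROM orders WHERE customer_id = 3"

# Plan chosen when no secondary index exists (a full table scan is expected).
plan_without_index = explain(conn, query)

# Adding an index gives the optimizer an alternative access path.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = explain(conn, query)

print("Without index:", plan_without_index)
print("With index:   ", plan_with_index)
```

The exact wording of the plan steps varies across SQLite versions, so the sketch prints them rather than assuming a fixed format.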

Data Management and Analytics

#	Name	Publication Info	Description	Repository Access
30	leSAX	Publication (ICDE 2025)	Time series similarity search is a fundamental task across various applications, including classification, motif discovery, and anomaly detection. However, existing iSAX-based index methods, while known for their efficiency, often rely on hand-crafted techniques (e.g., PAA and SAX) for z-normalized time series data. These techniques do not fully exploit the representation space and pose challenges to indexing. This software implements a novel learned index for facilitating time series similarity search.	Code
29	DARKER	Publication (VLDB 2024)	Transformer-based models have facilitated numerous applications with superior performance. A key challenge in transformers is the quadratic dependency of their training time complexity on the length of the input sequence. This software implements an efficient transformer with a novel data-driven kernel-based attention mechanism for time series data.	Code
28	DKWS	Publication (TKDE 2024)	This software implements a novel distributed keyword search framework on graphs.	Code
27	Temporal JSON Keyword Search	Publication (SIGMOD 2024)	This software implements temporal keyword search features on JSON documents. It showcases the support for temporal features at a modest cost.	Code
26	Prilo	Publication (SIGMOD 2023)	This software implements a privacy preserving query service for localized graph pattern queries that enables users to privately obtain the query results.	Code
25	Plug-and-Play SQL	Publication (ER 2023)	This software implements a conceptual model for a database query’s input type. The input type is the shape of the data needed by a query. Pairing a conceptual model with a query creates a plug-and-play query that can be type matched to a database’s schema to determine whether the query can be safely evaluated. The software showcases the portability, ease-of-use, and type safety of plug-and-play queries.	Code
24	IPS	Publication (ICDE 2022)	Time series shapelets (shapelets) are discriminative subsequences that have been recently found both effective and interpretable for solving time series classification problems. However, shapelet discovery is known to be computationally costly. IPS addresses this problem by utilizing the instance profile (IP) to capture the characteristics of shapelets in a robust manner and discover high-quality shapelets efficiently.	Code
23	MIDAS	Publication (ACM SIGMOD 2021)	This software is built on top of CATAPULT and enables efficient and effective maintenance of canned patterns of a visual graph query interface as the underlying collection of small- or medium-sized data graphs evolves. Specifically, MIDAS adopts a selective maintenance strategy that guarantees progressive gain of coverage of the patterns without sacrificing diversity and cognitive load.	Download
22	SSA	Publication (ICDE 2021)	This software implements privacy preserving query services for strong simulation queries in the database outsourcing paradigm. In such a paradigm, clients send their queries to a third-party service provider (SP), who has the outsourced large graph data, and the SP computes the query answers. However, as the SP may not always be trusted, the sensitive information of the clients’ queries, importantly, the query structures, should be protected. This software adopts strong simulation as a practical query semantic for this paradigm.	Download
21	ShapeNet	Publication (AAAI 2021)	This software implements a novel algorithm called ShapeNet, which embeds shapelet candidates of different lengths into a unified space for shapelet selection. The network is trained using our cluster-wise triplet loss, which considers the distance between the anchor and multiple positive (negative) samples and the distance among positive (negative) samples. Then, rather than directly using all the embeddings for model building, it computes representative and diversified final shapelets to avoid spending a large fraction of the computation on non-discriminative shapelet candidates. A classical classifier (e.g., SVM) is then adopted.	Download
20	BSPCover	Publication (IEEE TKDE 2022)	Time-series shapelets are discriminative subsequences, recently found effective for time series classification (TSC). It is evident that the quality of shapelets is crucial to the accuracy of TSC. However, the majority of research has focused on building accurate models from some shapelet candidates. This software implements a novel efficient shapelets discovery method, called BSPCOVER, to discover a set of high-quality shapelet candidates for model building.	Download
19	PANE	Publication (VLDB 2021)	Given a graph where each node is associated with a set of attributes, attributed network embedding (ANE) maps each node to a compact vector, which can be used in downstream machine learning tasks. PANE is an effective and scalable approach to ANE computation for massive graphs that achieves state-of-the-art result quality on multiple benchmark datasets, measured by the accuracy of common prediction tasks.	Download
18	AURORA	Publication (SIGMOD 2020)	AURORA is a plug-and-play visual subgraph query interface (VQI) for a large collection of small- or medium-sized data graphs that constructs the query interface in a data-driven manner. One can simply install it on top of any such graph database and use it to generate data-specific VQI to facilitate top-down and bottom-up visual subgraph query formulation.	Download
17	FERRARI	Publication (VLDB J 2020, ICDE 2019)	This software implements a novel visual exploratory subgraph search paradigm on a large collection of small- or medium-sized data graphs. A preliminary version of the software was demonstrated at VLDB 2017.	Download
16	G-CARE	Publication (SIGMOD 2020)	This software realizes the world's first framework for benchmarking graph cardinality estimation techniques for subgraph matching queries.	Download
15	LATTE	Publication (SIGMOD 2020)	This software is a user-friendly visual interface for constructing Solidity smart contracts. It is targeted at end users who do not have programming skills or a background in Solidity. The system can also serve expert users, who can generate the initial code using LATTE and then augment it to suit their needs.	Download
14	NRP	Publication (VLDB 2020)	Homogeneous network embedding (HNE) maps the graph structure in the vicinity of a node to a compact, fixed-dimensional feature vector. This software focuses on HNE for massive graphs, e.g., with billions of edges. On this scale, most existing approaches fail, as they incur either prohibitively high costs, or severely compromised result utility. Our proposed solution, called Node-Reweighted PageRank (NRP), is based on a classic idea of deriving embedding vectors from pairwise personalized PageRank (PPR) values.	Download
13	PPKWS	Publication (IEEE ICDE 2020)	This software implements a new keyword search framework, called public-private keyword search (PPKWS), on public-private graph models. PPKWS consists of three major steps: partial evaluation, answer refinement, and answer completion.	Download
12	BigIndex	Publication (TKDE 2020)	This software implements a generic ontology-based indexing framework for keyword search for graphs.	Download
11	FROST	Publication (ACM TIST 2020)	The facility relocation (FR) problem, which aims to optimize the placement of facilities to accommodate changes in users’ locations, has a broad spectrum of applications. Despite the significant progress made by existing solutions to the FR problem, they all assume each user is stationary and represented as a single point. Unfortunately, in reality, objects (e.g., people, animals) are mobile. Consequently, these efforts may fail to identify superior solutions to the FR problem. For the first time, this software takes into account the movement history of users to address the above limitation.	Download
10	CATAPULT	Publication (ACM SIGMOD 2019)	This software automatically selects canned patterns for a visual graph query interface designed for a large collection of small- or medium-sized data graphs (e.g., chemical compounds). Given a data graph collection and a pattern budget, it automatically selects the canned patterns to be displayed on a GUI by optimizing coverage, diversity, and cognitive load of the patterns in the underlying data repository. CATAPULT is a core component for realizing plug-and-play visual graph query interfaces.	Download
9	TEA/TEA+	Publication (ACM SIGMOD 2019)	This software implements two novel local graph clustering algorithms based on Heat Kernel PageRank (HKPR) to address the efficiency and accuracy limitations of existing local clustering techniques. Specifically, these algorithms provide non-trivial theoretical guarantees on the relative error of HKPR values and on time complexity. The basic idea is to utilize deterministic graph traversal to produce a rough estimation of the exact HKPR vector, and then exploit Monte Carlo random walks to refine the results in an optimized and non-trivial way.	Download
8	PANDA	Publication (VLDB J 2017, VLDB 2018)	This software implements a novel graph querying paradigm called partial topology-based network search and a query processing system called PANDA to efficiently find top-k matches of a partial topology query (PTQ) on a single machine. A PTQ is a disconnected query graph containing multiple connected query components. PTQs allow an end user to formulate queries without demanding precise information about the complete topology of a query graph.	Download
7	AutoG	Publication (VLDB J 2017, VLDB 2016)	This software implements a novel framework for subgraph query autocompletion (called AUTOG). Given an initial query q and a user’s preference as input, AUTOG returns ranked query suggestions Q′ as output. Users may choose a query from Q′ and iteratively apply AUTOG to compose their queries.	Download
6	PINOCCHIO	Publication (TKDE 2016)	The location selection (LS) problem, which aims to mine the optimal location from a set of candidates to place a new facility such that a score (i.e., benefit or influence on some given objects) can be maximized, has drawn significant research attention in recent years. State-of-the-art LS techniques assume each object is static and can only be influenced by a single facility. However, in reality, objects (e.g., people, vehicles) are mobile and are influenced by multiple facilities, which prevents classical LS solutions from selecting accurate results. This software takes mobility and probability factors into consideration to address the aforementioned limitations. Specifically, given a set of candidate locations, it aims to mine the optimal location that can influence the largest number of moving objects.	Download
5	DUALSIM	Publication (SIGMOD 2016)	Subgraph enumeration is important for many applications such as subgraph frequencies, network motif discovery, graphlet kernel computation, and studying the evolution of social networks. Recently, efforts to enumerate all subgraphs in a large-scale graph have seemed to enjoy some success by partitioning the data graph and exploiting distributed frameworks such as MapReduce and distributed graph engines. However, we notice that all existing distributed approaches have serious performance problems for subgraph enumeration due to the explosive number of partial results. DUALSIM is a disk-based, single machine parallel subgraph enumeration solution that can handle massive graphs without maintaining exponential numbers of partial results. Specifically, it implements a novel concept of the dual approach for subgraph enumeration, which swaps the roles of the data graph and the query graph. DUALSIM outperforms the state-of-the-art methods by up to orders of magnitude, while they fail for many queries due to explosive intermediate results.	Download
4	Structure-Preserving Query Service	Publication (ICDE 2015, TKDE 2015)	This software implements the first practical private approach for subgraph query services, asymmetric structure-preserving subgraph query processing, where the data graph is publicly known and the query structure/topology is kept secret. Such query services are useful when the query computation is outsourced to a third-party service provider.	Download
3	ASTERIX	Publication (SIGIR 2017, SIGMOD 2013)	Existing XML keyword search (XKS) engines primarily suffer from two limitations. First, although the smallest lowest common ancestor (SLCA) algorithm (or a variant, e.g., ELCA) is widely accepted as a meaningful way to identify subtrees containing the query keywords, SLCA typically performs poorly on documents with missing elements, i.e., (sub)elements that are optional, or appear in some instances of an element type but not all. Second, since keyword search can be ambiguous with multiple possible interpretations, it is desirable for an XKS engine to automatically expand the original query by providing a classification of different possible interpretations of the query w.r.t. the original results. However, existing XKS systems do not support such result-based query expansion. ASTERIX is an innovative XKS engine that addresses these limitations.	Download
2	Generalized Subgraph Search	Publication (CIKM 2012)	This software implements a new type of graph query, which injectively maps its edges to paths of the graphs in a given database, where the length of each path is constrained by a given threshold specified by the weight of the corresponding matching edge.	Download
1	MustBlend	Publication (DASFAA 2013, ICDE 2009, ICDE 2006)	MUSTBLEND (MUlti-Source Twig BLENDer) is a novel visual XML querying paradigm where the visual query formulation and processing are interleaved. A key practical feature of MUSTBLEND is its portability as it does not employ any special-purpose storage, indexing, and query cost estimation schemes.	Download
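
The leSAX entry in the table above refers to PAA and SAX as the hand-crafted summarization steps that iSAX-based indexes apply to z-normalized time series. As a rough illustration of those two classical steps only (not of leSAX or any other software listed here), the sketch below z-normalizes a series, averages it over equal-width segments (PAA), and maps each segment mean to a symbol using equiprobable breakpoints of the standard normal distribution (SAX). The function names and the requirement that the series length be divisible by the number of segments are simplifying assumptions.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def z_normalize(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series carries no shape information; map it to all zeros.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-width segment."""
    n = len(series)
    if segments <= 0 or n % segments != 0:
        raise ValueError("this sketch assumes len(series) is divisible by segments")
    width = n // segments
    return [mean(series[i * width:(i + 1) * width]) for i in range(segments)]


def sax(series, segments, alphabet_size):
    """Symbolic Aggregate approXimation of a series as a short lowercase word."""
    if not 2 <= alphabet_size <= 26:
        raise ValueError("alphabet_size must be between 2 and 26")
    # Breakpoints split the standard normal into equiprobable regions.
    breakpoints = [
        NormalDist().inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)
    ]
    symbols = "abcdefghijklmnopqrstuvwxyz"[:alphabet_size]
    reduced = paa(z_normalize(series), segments)
    return "".join(symbols[bisect_right(breakpoints, value)] for value in reduced)


word = sax([1, 2, 3, 4, 5, 6, 7, 8], segments=4, alphabet_size=4)
print(word)  # a steadily rising series maps to "abcd"
```

Because the breakpoints are fixed in advance from the normal distribution, every dataset is summarized with the same grid, which is the kind of hand-crafted choice the leSAX description contrasts with a learned index.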

Social Analytics

#	Name	Publication Info	Description	Repository Access
3	Kandinsky Mobile	Publication (IEEE Data Engineering Bulletin)	The software implements a novel framework to visualize social discussions on YouTube that is inspired by the work of Wassily Kandinsky, the father of abstract art.	Code
2	CHASE	Publication (ACM SIGIR 2025)	The software implements a framework to evaluate the cohesiveness of existing community search algorithms through the lens of social psychology.	Code
1	PIANO	Publication (IEEE TCSS 2023)	The software implements a novel paradigm where influence maximization (IM) meets deep reinforcement learning to estimate the expected influence. Specifically, it realizes a framework called PIANO that incorporates network embedding and reinforcement learning techniques to address the IM problem.	Download

Biological Data Science

#	Name	Publication Info	Description	Repository Access
10	PANACEA	Publication (BCB 2024)	This software is designed to profile known cancer target combinations in cancer type-specific signaling networks. Given a large signaling network for a cancer type, known targets from approved anticancer drugs, a set of cancer mutated genes, and a combination size parameter k, it automatically generates a delta histogram that depicts the distribution of k-sized target combinations based on their topological influence on cancer mutated genes and other nodes. PANACEA can significantly reduce the candidate k-node combination exploration space, addressing a longstanding challenge for tasks such as in silico target combination prediction in large signaling networks.	Download
9	ArcheGEO	Publication (BCB 2022)	Transcriptomic data stored in the Gene Expression Omnibus (GEO) serves thousands of queries per day, but a lack of standardized machine-readable metadata causes many searches to return irrelevant hits, which impedes convenient access to useful data in the GEO repository. ArcheGEO is novel end-to-end software that improves results from the GEO Browser by automatically determining the relevance of these results. Unlike existing tools, ArcheGEO reports on the irrelevant results and provides reasoning for their exclusion. Such reasoning can be leveraged to improve annotations of metadata.	Download
8	TROVE	Publication (Bioinformatics 2017)	Cancer hallmarks, a concept that seeks to explain the complexity of cancer initiation and development, provide a new perspective for studying cancer signaling which could lead to a greater understanding of this complex disease. However, to the best of our knowledge, there is currently a lack of tools that support such hallmark-based study of the cancer signaling network, thereby impeding the gain of knowledge in this area. TROVE is user-friendly, novel software that facilitates hallmark annotation, visualization and analysis in cancer signaling networks. It can be used to build further network-based analytics applications for cancer.	Download
7	TINTIN	Publication (ACM BCB 2017)	A network-based approach that ranks a given set of networks based on their "similarity" to a reference network. TINTIN exploits target feature-based network similarity in order to determine if two networks are similar. Specifically, it leverages topological and dynamic features of targets to compute similarity distances between signaling networks and rank them accordingly. TINTIN is useful for addressing problems such as target prioritization and drug target repositioning.	Download
6	TAPESTRY	Publication (ACM BCB 2016)	Target prioritization ranks molecules in biological networks according to a score that seeks to identify molecules that fulfill particular roles (e.g., drug targets). TAPESTRY is a network-based approach that prioritizes candidate targets in a given signaling network with unknown targets by utilizing knowledge (target characteristics) gained from curated targets in another set of signaling networks. It exploits a knowledge base of characterization models and predictive topological features of a set of signaling networks (candidate networks) with curated targets. Given a signaling network G with unknown targets, TAPESTRY identifies a candidate network most similar to G and selects its characterization model as a prioritization model for computing a topological feature-based rank of each candidate node in G. Then, a dynamic feature-based rank is computed for these nodes by leveraging the time-series curves of ODEs associated with the edges in G. Finally, these two ranks are integrated and used for prioritizing candidate targets.	Download
5	TENET	Publication (Bioinformatics 2015)	A network-based approach that characterizes known targets in signaling networks using topological features. TENET first computes a set of topological features and then leverages a support vector machine-based approach to identify predictive topological features that characterize known targets. A characterization model is generated and it specifies which topological features are important for discriminating the targets and how these features should be combined to quantify the likelihood of a node being a target.	Download
4	DUALALIGNER	Publication (Bioinformatics 2014)	DualAligner performs dual network alignment, in which both region-to-region alignment, where a whole subgraph of one network is aligned to a subgraph of another, and protein-to-protein alignment, where individual proteins in networks are aligned to one another, are performed to achieve higher accuracy network alignments. Dual network alignment is achieved in DualAligner via background information provided by a combination of Gene Ontology annotation information and protein interaction network data.	Download
3	DiffNet	Publication (Methods 2014)	The study of genetic interaction networks that respond to changing conditions is an emerging research problem. Bandyopadhyay et al. (2010) proposed a technique to construct a differential network (dE-MAP network) from two static gene interaction networks in order to map the interaction differences between them under environment or condition changes (e.g., DNA-damaging agents). This differential network is then manually analyzed to conclude that DNA repair is differentially affected by the condition change. Unfortunately, manual construction of a differential functional summary from a dE-MAP network that summarizes all pertinent functional responses is time-consuming, laborious and error-prone, impeding large-scale analysis on it. DiffNet is a novel data-driven algorithm that leverages Gene Ontology (GO) annotations to automatically summarize a dE-MAP network to obtain a high-level map of functional responses due to condition changes.	Download
2	FACETS	Publication (Bioinformatics 2012)	FACETS is a novel PPI network decomposition algorithm to make sense of the deluge of interaction data using Gene Ontology (GO) annotations. It finds not just a single functional decomposition of the PPI network, but a multi-faceted atlas of functional decompositions that portray alternative perspectives of the functional landscape of the underlying PPI network. Each facet in the atlas represents a distinct interpretation of how the network can be functionally decomposed and organized. Our algorithm maximizes the interpretive value of the atlas by optimizing inter-facet orthogonality and intra-facet cluster modularity.	Download
1	BIDEL	Publication (DASFAA 2007)	Warehousing heterogeneous, dynamic biological data is a key technique for biological data integration as it greatly improves performance. However, it requires complex maintenance procedures to update the warehouse in light of changes to the sources. Consequently, a key issue to address is how to detect changes to the underlying biological data sources. BIDEL is software for detecting exact changes to biological annotations. In our approach, we transform heterogeneous biological data to XML format and then detect changes between two versions of the XML representation of biological data.	Download

Sourav S Bhowmick

We make the source code of selected research software publicly available. Please note that our software and code are freely available for non-commercial use only.
