Hadoop Data Explorer

 

The system prototype interface is available at: 

http://lincs.ischool.drexel.edu/sgbrowser/hdex/

 

Large-scale Data Exploration

Hadoop is a powerful framework for processing large-scale data, but it works primarily in batch mode, without user interaction. There are many scenarios in which users such as business analysts and data scientists need to:

• Browse and interact with large-scale data;
• Identify major themes and patterns in the data;
• Explore and scrutinize subsets of data;
• Find and collect data that can be used for further investigation.
 

In this project, we use Hadoop/MapReduce to perform data-intensive pre-computation, which enables efficient interactive text clustering for on-line data exploration. The system supports on-the-fly Scatter/Gather, allowing the user to iteratively select data clusters and zoom in/out on selected subsets. 

 

 

 

System Architecture

The system architecture is based on the Hadoop framework for distributed clustering and the Hadoop Distributed File System (HDFS) for data storage.
 
 

* Off-line Text Clustering on Hadoop

Hadoop nodes cluster the data to generate a hierarchical structure and redo the clustering (incrementally) only when the data on HDFS changes. This process can run in batch mode using any classic hierarchical clustering algorithm, such as hierarchical agglomerative clustering or bisecting (canopy) k-means. Pre-computed hierarchies are stored in a relational metadata store.
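As a rough illustration of the off-line step, the sketch below builds a binary cluster tree with hierarchical agglomerative clustering and emits (node, parent) rows of the kind a relational metadata store could hold. It is a single-process toy, not the project's MapReduce implementation: the function names, the node-id scheme (`d0`, `c0`, ...), the whitespace tokenizer, and the row layout are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_hierarchy(docs):
    """Return (node_id, parent_id) rows for a binary cluster tree.

    Hypothetical row layout: leaves are 'd<i>', merged clusters are 'c<j>',
    and the root has parent None.
    """
    # Each active cluster maps a node id to its summed term-count vector.
    active = {f"d{i}": Counter(text.lower().split())
              for i, text in enumerate(docs)}
    if not active:
        return []
    rows = []
    next_id = 0
    while len(active) > 1:
        # Merge the most similar pair of active clusters. Cosine is scale
        # invariant, so summed counts behave like centroids here.
        a, b = max(combinations(sorted(active), 2),
                   key=lambda pair: cosine(active[pair[0]], active[pair[1]]))
        parent = f"c{next_id}"
        next_id += 1
        rows += [(a, parent), (b, parent)]
        active[parent] = active.pop(a) + active.pop(b)
    root = next(iter(active))
    rows.append((root, None))
    return rows
```

Calling `build_hierarchy` on a handful of short strings yields one row per tree node. In a real deployment the pairwise similarity search would be the data-intensive part delegated to Hadoop, and only the resulting rows would need to be kept for on-line use.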

* On-line Interactive Clustering using Metadata

Highly efficient clustering can be performed using the hierarchical metadata to support interactive Scatter/Gather. Given the user's selection of clusters, sub-clusters and related documents can be identified efficiently with simple cut-tree operations.
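One way to picture a cut-tree operation is sketched below. Given the clusters the user has gathered, it repeatedly replaces the largest cluster with its children until the requested number of sub-clusters is reached. The in-memory tables `CHILDREN` and `LEAF_DOCS`, the function names, and the "split the largest cluster first" rule are assumptions for illustration; the actual system may cut the tree differently.

```python
import heapq

# Hypothetical metadata: node_id -> child node ids (empty list for leaves).
CHILDREN = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1", "B2"],
    "A1": [], "A2": [], "B1": [], "B2": [],
}
# Hypothetical leaf contents: node_id -> document ids stored at that leaf.
LEAF_DOCS = {"A1": [1, 2, 3], "A2": [4], "B1": [5, 6], "B2": [7, 8, 9, 10]}


def documents(node):
    """Collect all document ids under a node of the pre-computed tree."""
    if not CHILDREN[node]:
        return list(LEAF_DOCS[node])
    docs = []
    for child in CHILDREN[node]:
        docs.extend(documents(child))
    return docs


def scatter(selected, k):
    """Cut the subtrees rooted at `selected` into up to k clusters.

    Assumes a binary hierarchy, so each split adds exactly one cluster.
    """
    # Max-heap on cluster size (negated, because heapq is a min-heap).
    heap = [(-len(documents(n)), n) for n in selected]
    heapq.heapify(heap)
    leaves = []  # clusters that cannot be split any further
    while heap and len(heap) + len(leaves) < k:
        _, node = heapq.heappop(heap)
        kids = CHILDREN[node]
        if not kids:
            leaves.append(node)
            continue
        for kid in kids:
            heapq.heappush(heap, (-len(documents(kid)), kid))
    return sorted(leaves + [n for _, n in heap])
```

For example, `scatter(["root"], 3)` splits the root and then its larger child, and `documents` lists what a chosen cluster contains. No document vectors are touched at query time, which is what keeps the interaction fast; in practice the cluster sizes would be stored with the metadata rather than recomputed.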

* Hadoop Data Explorer Interface

A system prototype has been developed to support users' interaction with and exploration of the data. The interface allows the user to visualize major themes in the data and iteratively zoom in/out on multiple data subsets.