Wednesday, February 12, 2014

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop


Abstract
MapReduce is a simplified programming model for distributed parallel computing. It is an important technology of Google and is commonly used for data-intensive distributed parallel computing. Cluster analysis is one of the most important data mining methods. Efficient parallel algorithms and frameworks are the key to meeting the scalability and performance requirements entailed in such scientific data analyses. To address the long running time of large-scale data set clustering on a single computer, we design and implement a parallel K-Means algorithm based on the MapReduce framework. The algorithm runs on a Hadoop cluster. The results show that the MapReduce-based K-Means clustering algorithm achieves higher performance when handling large-scale automatic document classification, and they demonstrate the effectiveness and accuracy of the algorithm.

Keywords: MapReduce; Hadoop; Parallel K-Means Clustering; Large-Scale Data Sets

1. Introduction
Cluster analysis is one of the most important data mining methods. Document clustering is the task of grouping similar documents into classes, where similarity is defined by some function over documents. Document clustering needs neither a separate training process nor manual labeling of groups in advance. Documents in the same cluster are more similar to one another, while documents in different clusters are more dissimilar.
MapReduce is a parallel programming technique derived from functional programming concepts and was proposed by Google for large-scale data processing in a distributed computing environment. The Hadoop project [1] provides a distributed file system (HDFS). Hadoop is a Java-based software framework that enables data-intensive applications in a distributed environment. Hadoop allows applications to work with thousands of nodes and terabytes of data without burdening the user with the details of how data and computation are allocated and distributed. Hadoop and MapReduce [2] provide the technology for mass data storage, analysis and exchange management.
In this paper, we introduce the MapReduce parallel programming model; propose a distributed clustering algorithm based on MapReduce and Hadoop, together with the design and implementation of a large-scale data processing model on MapReduce and Hadoop; and present and discuss experimental results of distributed clustering based on MapReduce and Hadoop.


2. Background
2.1. Hadoop Overview
When a data set grows beyond the storage capacity of a single machine, it becomes necessary to distribute it across multiple independent computers. A file management system that stores data across a network of computers is called a distributed file system. A typical Hadoop distributed file system contains thousands of servers, each of which stores part of the file system's data. HDFS cluster configuration is simple: adding more servers and a small amount of configuration is enough to increase the Hadoop cluster's computing power, storage capacity and I/O bandwidth. In addition, HDFS provides reliable data replication, fast fault detection and automatic recovery.

2.2. MapReduce Overview
In a distributed data storage setting, parallel processing of the data requires us to consider many details of the underlying system, such as synchronization, concurrency and load balancing, which makes even simple computations complex. The MapReduce programming model [3] was proposed by Google in 2004 for processing and generating large data sets. This framework solves many of these problems, such as data distribution, job scheduling, fault tolerance and machine-to-machine communication.
MapReduce is applied in Google's web search. Programmers need to write many special-purpose programs to process the massive data distributed and stored in the server cluster, such as crawled documents and web request logs, in order to derive different kinds of results, such as inverted indices, various representations of web documents, summaries of the number of pages crawled per host, and the most frequent queries within a given day.

3. MapReduce Programming Model
In the MapReduce programming model, the user-supplied map and reduce functions realize the Mapper and Reducer interfaces; together they form the core of a task.
3.1. Mapper
The map function requires the user to handle an input <key, value> pair and produce a group of intermediate <key, value> pairs. A <key, value> pair consists of two parts: the value stands for the data related to the task, and the key stands for the "group number" of the value. MapReduce combines the intermediate values with the same key and then sends them to the reduce function.
The map algorithm is described as follows:


Step 1. The Hadoop MapReduce framework produces one map task for each InputSplit, and each InputSplit is generated by the InputFormat of the job. Each InputSplit corresponds to a map task.

Step 2. Execute the map task: process the input <key, value> pair to form a new intermediate <key, value> pair. This process is called "divide into groups"; that is, correlated values are made to correspond to the same key. The output key-value pairs are not required to be of the same types as the input key-value pairs. A given input pair can be mapped into zero or more output pairs.


Step 3. The Mapper's output is sorted and then partitioned among the Reducers. The total number of partitions equals the number of reduce tasks of the job. Users can implement the Partitioner interface to control which keys are assigned to which Reducer.
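
As a concrete illustration of the map step, below is a minimal Hadoop Mapper sketch in Java (new org.apache.hadoop.mapreduce API). It assumes one document per input line and emits an intermediate <word, 1> pair for every term; the class name and the tokenization are illustrative assumptions, not code from the paper.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch: one document per input line; emit <term, 1> for every term.
public class TermCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                term.set(token);
                context.write(term, ONE);   // intermediate <key, value> pair
            }
        }
    }
}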

3.2. Reducer
The reduce function is also provided by the user. It handles an intermediate key and the set of values relevant to that key, merging these values to obtain a smaller set of values; this process is called "merge". But it is not simple accumulation; there can be complex operations in the process. The Reducer makes the set of intermediate values associated with the same key smaller.
In the MapReduce framework, the programmer does not need to care about the details of data communication, so the <key, value> pair is the communication interface for the programmer in the MapReduce model. A <key, value> pair can be seen as a "letter": the key is the letter's posting address and the value is the letter's content. Letters with the same address are delivered to the same place. Programmers only need to set up the <key, value> pairs correctly, and the MapReduce framework will automatically and accurately group together the values with the same key.
The reducer algorithm is described as follows:


Step 1. Shuffle. The input of the Reducer is the sorted output of the Mapper. In this stage, MapReduce assigns the relevant partition to each Reducer.

Step 2. Sort. In this stage, the input of the Reducer is grouped according to the key (because the output of different Mappers may contain the same key). The Shuffle and Sort stages proceed simultaneously.

Step 3. Secondary sort. If the key grouping rule in the intermediate process differs from the rule used before reduce, the user can define a Comparator, which is used to group the intermediate keys a second time.
Map tasks and reduce tasks form a whole and cannot be separated; they must be used together in a program. We call one such MapReduce pass an MR process. Within an MR process, map tasks run in parallel, reduce tasks run in parallel, and the map and reduce phases run serially. One MR process and the next run serially as well; synchronization between these operations is guaranteed by the MR system, without the programmer's involvement.
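
To complete the picture, a matching Reducer sketch is shown below; it merges all intermediate values grouped under one key into a single sum. As above, the class name is an illustrative assumption rather than the paper's code.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch: merge all counts that share the same key into one total.
public class TermCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();           // "merge" step: accumulate the grouped values
        }
        total.set(sum);
        context.write(term, total);       // final <key, value> pair for this key
    }
}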


4. MapReduce Parallel K-Means Based Clustering Algorithm
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

4.1. Document Vector Representation
The vector space model is the most widely used method in information retrieval. This model uses feature entries and their weights to express the document information. The vector d = (w1, w2, ..., wm) stands for the feature entries and their weights in document d, where m is the number of feature entries and wi (i = 1, ..., m) is the weight of entry ti in document d. The set of document vectors is the pattern, or data object, of document clustering.
The similarity function for web document vectors is the cosine function, whose return value lies in [0, 1]. The greater the return value, the greater the similarity.
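
As a reference for the similarity computation, the following is a small sketch of the cosine function over two document vectors, assuming dense double[] weight arrays of equal length; it is illustrative and not the paper's code.

// Cosine similarity of two document weight vectors (illustrative sketch).
public class CosineSimilarity {
    public static double similarity(double[] d1, double[] d2) {
        double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
        for (int i = 0; i < d1.length; i++) {
            dot   += d1[i] * d2[i];
            norm1 += d1[i] * d1[i];
            norm2 += d2[i] * d2[i];
        }
        if (norm1 == 0.0 || norm2 == 0.0) {
            return 0.0;   // treat similarity with an all-zero vector as 0
        }
        // Lies in [0, 1] when all weights are non-negative, e.g. TF-IDF weights.
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }
}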

4.2. Document Preprocessing Based on MapReduce
We now describe the calculation of word frequency with MapReduce. When calculating TF, the map function reads the text; each line of the text represents a document and its type. The key is the document's type and the value is the document's content. In the intermediate output data, the keys are the document's type together with a word of the document, and the values are the frequency of that word in the document. The reduce function accumulates all the values with the same key to obtain the frequency of the word t over all documents. The calculation of DF, the number of clusters, the document corpus and the total number of words is similar to that of TF.
First, we need to preprocess the documents; that is, extract every word of a document and calculate its frequency. In the Map stage, each time a word is met, a pair consisting of that word and the count 1 is output. The Hadoop platform ensures that all values with the same key are assigned to the same reducer. Here doci denotes a document, wij a word in it, and nij the word frequency.
 In the Map phase, each input to Map is a document id and its content d. The map function first creates a set H, then calculates the frequency of each term t in H, and finally outputs all elements of H. The keys of the output are <id, t> pairs and the values are the corresponding term frequencies.
 In the Reduce stage, every reduce call receives a key-value pair. An INITIALIZE function creates a list PostingList, and then each input is processed in turn: when a new word is met, the existing list is output; otherwise the document id and document frequency are added to the list PostingList. Finally, all the words and their corresponding PostingLists are output.
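
The paper does not state the exact weighting formula used when generating the vector space model, but one common choice consistent with the TF and DF quantities above is the TF-IDF weight tf * log(N / df); a minimal, illustrative sketch:

// Illustrative sketch of one common weighting scheme; the paper does not specify its formula.
public class TfIdf {
    // tf: raw frequency of term t in document d; df: number of documents containing t;
    // totalDocs: corpus size N. Weight is tf * log(N / df).
    public static double weight(int tf, int df, int totalDocs) {
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        return tf * Math.log((double) totalDocs / df);
    }
}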


4.3. K-Means Clustering Methods
The K-Means algorithm [4,5,6] is a widely used clustering algorithm. First, the algorithm randomly selects k initial objects, each of which represents a cluster center. Each of the remaining objects is assigned to the nearest cluster according to its distance to the different centers. Then every center is recalculated. This operation is repeated until the criterion function converges.
 The algorithm is described as follows:
 Input: the number of clusters k and n documents
 Output: k clusters
 Step 1. Randomly select k documents from the n documents as the initial cluster centers.
 Step 2. Calculate the distance from each remaining document to every cluster center, and assign each of these documents to the nearest cluster.
 Step 3. Recalculate and adjust each cluster center.
 Step 4. Repeat Step 2 and Step 3 until the criterion function converges. The program ends.
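
A compact single-machine sketch of these four steps is given below, assuming documents are dense double[] vectors and using squared Euclidean distance for the assignment step (the paper measures document similarity with the cosine function instead); it is illustrative only.

import java.util.Random;

public class SerialKMeans {

    // Cluster n vectors of dimension m into k groups; returns each vector's cluster index.
    public static int[] cluster(double[][] docs, int k, int maxIterations) {
        int n = docs.length, m = docs[0].length;
        Random random = new Random(42);
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++) {
            centers[j] = docs[random.nextInt(n)].clone();   // Step 1: random initial centers
        }
        int[] assignment = new int[n];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int i = 0; i < n; i++) {                   // Step 2: assign to nearest center
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double dist = squaredDistance(docs[i], centers[j]);
                    if (dist < bestDist) { bestDist = dist; best = j; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            double[][] sums = new double[k][m];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {                   // Step 3: recompute each center
                counts[assignment[i]]++;
                for (int x = 0; x < m; x++) sums[assignment[i]][x] += docs[i][x];
            }
            for (int j = 0; j < k; j++) {
                if (counts[j] > 0) {
                    for (int x = 0; x < m; x++) centers[j][x] = sums[j][x] / counts[j];
                }
            }
            if (!changed) break;                            // Step 4: stop once assignments converge
        }
        return assignment;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; sum += diff * diff; }
        return sum;
    }
}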

4.4. Parallel K-Means Clustering Algorithm Based on MapReduce and Hadoop
 Step 1. The first stage is document preprocessing. We divide a data set D into m subsets. There are two MapReduce jobs in this stage. The first job calculates the parameters that are required in the next step. The second MapReduce job calculates the DF and TF-IDF of each term, then extracts terms and generates the VSM. We create a file that contains: the iteration number, cluster id, cluster coordinates, and the number of documents assigned to the cluster.

 Step 2. The second stage is the map function: it reads the input data and calculates the distance between each document and each cluster center. For each document, it produces an output pair of the form <id of the nearest cluster, document vector>. A lot of data is produced in this stage, so we can use a combine function to reduce its size before it is sent to Reduce. The combine function is described as follows: it calculates the average of the coordinates for each cluster id, along with the number of documents. All data of the same current cluster are sent to a single reducer.


 Step 3. The third stage is the reduce function, in which the new cluster center coordinates are computed. Its output is written to the cluster file and contains: the iteration number, cluster id, the cluster center coordinates, and the size of the cluster.

Step 4. Finally, the new cluster coordinates are compared with the previous ones. If the criterion function converges, the program ends and we have found the clusters. If not, the newly generated cluster centers are used and Steps 2 to 4 are repeated.
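
The following is a minimal Hadoop sketch of Steps 2 and 3 (the combine function of Step 2 is omitted for brevity). It assumes the current centers are passed to the mappers through the job configuration and that each input line is one document vector encoded as comma-separated weights; the class names, the vector encoding and the helper methods are illustrative assumptions, not the paper's implementation.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

    // Step 2: assign each document vector to the nearest current cluster center.
    public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private double[][] centers;

        @Override
        protected void setup(Context context) {
            // Current centers passed via the job configuration as "x,y,...;x,y,..." (an assumption).
            centers = parseCenters(context.getConfiguration().get("kmeans.centers"));
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            double[] doc = parseVector(line.toString());
            int nearest = 0;
            double best = Double.MAX_VALUE;
            for (int j = 0; j < centers.length; j++) {
                double dist = squaredDistance(doc, centers[j]);
                if (dist < best) { best = dist; nearest = j; }
            }
            context.write(new IntWritable(nearest), line);    // <cluster id, document vector>
        }
    }

    // Step 3: recompute each cluster center as the mean of the vectors assigned to it.
    public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable clusterId, Iterable<Text> docs, Context context)
                throws IOException, InterruptedException {
            double[] sum = null;
            long count = 0;
            for (Text doc : docs) {
                double[] v = parseVector(doc.toString());
                if (sum == null) sum = new double[v.length];
                for (int i = 0; i < v.length; i++) sum[i] += v[i];
                count++;
            }
            StringBuilder center = new StringBuilder();
            for (int i = 0; i < sum.length; i++) {
                if (i > 0) center.append(',');
                center.append(sum[i] / count);
            }
            // Emit the new center coordinates and the cluster size for the cluster file.
            context.write(clusterId, new Text(center + "\t" + count));
        }
    }

    static double[][] parseCenters(String text) {
        String[] rows = text.split(";");
        double[][] centers = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) centers[i] = parseVector(rows[i]);
        return centers;
    }

    static double[] parseVector(String line) {
        String[] parts = line.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i].trim());
        return v;
    }

    static double squaredDistance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }
}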

6. Conclusions and Future Work
This work represents only a small first step in applying the MapReduce programming technique to the processing of large-scale data sets.
We take advantage of the parallelism of MapReduce to design a parallel K-Means clustering algorithm based on MapReduce. This algorithm can automatically cluster massive data, making full use of the Hadoop cluster's performance, and it can finish text clustering in a relatively short period of time. Experiments show that it achieves high accuracy.
The main deficiency is that the number k (the number of clusters to be generated) has to be set in advance, and the algorithm is sensitive to the initial values: different initial values may lead to different results. It is also sensitive to "noise" and outlier data.
The following three aspects can be optimized to reduce the running time on the Hadoop cluster platform:

(1) Run the Reduce phase on multiple nodes instead of a single node.
(2) Platform optimization. We have not optimized the platform yet; it spends extra management time in each iteration, and this management time can be reduced.
(3) Optimization of the algorithm itself. We need to customize the outputs of the mapper and reducer functions. A more efficient data format for storing the intermediate results would enable faster reads and writes.




 
