Hadoop for Beginners: Full Tutorial
Hadoop is an open-source framework developed by the Apache Software Foundation. It's used to store and process large datasets across clusters of computers using simple programming models. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks.
Key Concepts of Hadoop
Distributed Storage: Hadoop stores data in a distributed manner across a network of machines.
Fault Tolerance: Data is replicated across machines, so the cluster keeps working even if some nodes fail.
Scalability: You can easily add more nodes to the cluster to handle more data.
Batch Processing: Hadoop is optimized for processing huge volumes of data in batches.
Components of Hadoop
Hadoop consists of four main components:
HDFS (Hadoop Distributed File System):
Purpose: Distributed storage system.
Function: Splits large files into blocks (128 MB by default) and distributes them across multiple nodes. Each block is replicated (three copies by default) to ensure fault tolerance; the sketch after this list shows how to read these settings through the HDFS Java API.
YARN (Yet Another Resource Negotiator):
Purpose: Resource management.
Function: Allocates system resources to applications running in the Hadoop cluster. It also schedules tasks and ensures they are completed.
MapReduce:
Purpose: Distributed data processing framework.
Function: Processes large datasets in parallel across a Hadoop cluster. It splits data into independent tasks and processes them in two stages:
Map: Processes each block of input in parallel and emits intermediate key-value pairs (for example, (word, 1) for every word it sees).
Reduce: Aggregates the intermediate values for each key and combines them into a final result (for example, the total count per word).
Hadoop Common:
Purpose: Provides the libraries and utilities needed by the other components.
Function: Includes Java libraries and scripts required for Hadoop to run.
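The block size and replication factor described for HDFS above can also be read programmatically through HDFS's Java client API. The following is a minimal, illustrative sketch, assuming the Hadoop client libraries are on the classpath and HDFS is running at hdfs://localhost:9000 (the address configured later in this tutorial); the class name and file path are hypothetical examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: print the block size and replication factor of a file in HDFS.
public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes HDFS is running locally on port 9000 (see core-site.xml below).
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/input/sample.txt");   // hypothetical example path
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size:  " + status.getBlockSize() + " bytes");
        System.out.println("Replication: " + status.getReplication());
        fs.close();
    }
}

From the command line, hdfs fsck offers a similar view of a file's blocks.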
How Hadoop Works
Hadoop works by dividing large datasets into smaller chunks and distributing them across the nodes (machines) in a cluster. Then, it processes the data in parallel using the MapReduce programming model.
Example: Processing Large Data with Hadoop
Data Splitting: Suppose you have a 1 TB file. Hadoop splits it into blocks (128 MB by default in current versions, 64 MB in older ones) and distributes these blocks across the machines in the cluster.
Processing (Map Phase): Each machine processes the blocks it holds, performing some computation (e.g., counting how many times each word appears).
Aggregation (Reduce Phase): The results from each machine are combined and aggregated to produce the final result (e.g., the total count for each word); the code sketch below shows what these two phases look like in practice.
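In code, these two phases correspond to a Mapper and a Reducer class. The sketch below is a minimal word count written against Hadoop's MapReduce Java API; the class names are illustrative, and the same logic ships with Hadoop as the ready-made WordCount example used later in this tutorial.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in every input line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }
}

// Reduce phase: sum the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));   // final (word, total) pair
    }
}

A small driver class then wires the two into a Job and submits it to the cluster; when you run the bundled examples JAR later, that driver is already provided for you.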
Key Benefits of Hadoop
Cost-Effective: Uses commodity hardware, so no need for expensive servers.
Fault-Tolerant: Replicates data across machines, so no data is lost if one machine fails.
Scalable: Can easily add more machines to a cluster to handle growing data.
Flexible: Can handle structured, semi-structured, and unstructured data (text, images, videos).
Setting up Hadoop
Prerequisites:
Java: Hadoop is written in Java, so you need to have the Java Development Kit (JDK) installed.
SSH: Hadoop uses Secure Shell (SSH) to manage its daemons; for a single-node setup you also need passwordless SSH to localhost (see the commands after this list).
Linux/Unix OS: Most commonly used, though it can be installed on Windows with some additional effort.
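Hadoop's start/stop scripts log in to each node over SSH; in pseudo-distributed mode that node is localhost, so it is worth setting up passwordless SSH before installing Hadoop. A typical setup with OpenSSH (the key file names below are the usual defaults) looks like this:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost    # should now log in without asking for a password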
Single Node Setup (Pseudo-Distributed Mode)
Install Java:
Install the JDK (Java Development Kit) if it is not already installed; Hadoop 3.3 runs on Java 8 or Java 11:
sudo apt-get install openjdk-8-jdk
Download and Install Hadoop:
Download Hadoop from the official website:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Extract the Hadoop file:
tar -xzf hadoop-3.3.0.tar.gz
Configure Hadoop:
Set the environment variables in ~/.bashrc (then run source ~/.bashrc to apply them):
export HADOOP_HOME=/path/to/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/path/to/java
Update hadoop-env.sh (in $HADOOP_HOME/etc/hadoop) to specify the Java path:
export JAVA_HOME=/path/to/java
Configure HDFS:
Edit the core-site.xml file (also in $HADOOP_HOME/etc/hadoop) to specify the default file system:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Edit hdfs-site.xml to set the replication factor (1 is enough for a single-node cluster):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
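Note: to run MapReduce jobs on YARN (as in the sample job later in this tutorial), two more files in the same directory usually need minimal entries as well. The snippets below follow the standard pseudo-distributed configuration; depending on your Hadoop version you may also need classpath-related properties (for example mapreduce.application.classpath) as described in the official setup guide.
In mapred-site.xml, tell MapReduce to use YARN:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
In yarn-site.xml, enable the shuffle service:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>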
Format the Namenode:
Before starting Hadoop for the first time, format the NameNode (this initializes the HDFS metadata; do not repeat it on a cluster that already holds data):
hdfs namenode -format
Start Hadoop Services:
Start the HDFS and YARN services:
start-dfs.sh
start-yarn.sh
Access Hadoop UI:
Namenode web UI:
http://localhost:9870/
ResourceManager web UI:
http://localhost:8088/
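Before running any jobs, you can confirm that the daemons started correctly with the JDK's jps tool. In pseudo-distributed mode the output should list roughly the following processes (process IDs will differ):
jps
# NameNode, DataNode, SecondaryNameNode  (started by start-dfs.sh)
# ResourceManager, NodeManager           (started by start-yarn.sh)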
Running a Sample Job
After setting up Hadoop, you can run a MapReduce example to test the setup:
Copy input files into HDFS:
hdfs dfs -mkdir /input
hdfs dfs -put /path/to/local/file /input
Run the MapReduce example (e.g., WordCount):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount /input /output
View the results:
hdfs dfs -cat /output/part-r-00000
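The output is written by the reducer as one word per line, with the word and its count separated by a tab. For a small input file it looks something like this (the words and counts here are purely illustrative):
hadoop	5
hello	3
world	2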
Conclusion
Hadoop is a powerful framework that enables efficient storage and processing of big data. As a beginner, starting with a single-node setup will help you learn the core concepts before moving on to multi-node clusters. Once comfortable, you can explore other tools in the Hadoop ecosystem like Hive, Pig, HBase, and Spark for advanced data processing.
Next Steps
Learn HDFS in-depth: Study how data is stored, replicated, and accessed in HDFS.
Explore MapReduce: Understand its working mechanism by writing simple MapReduce jobs.
Get familiar with the Hadoop ecosystem: Tools like Hive (SQL for Hadoop), Pig (high-level data flow language), and HBase (NoSQL database) extend Hadoop’s functionality.