Hadoop

Hadoop for Beginners: Full Tutorial

Hadoop is an open-source framework developed by the Apache Software Foundation. It is used to store and process large datasets across clusters of computers using simple programming models. It provides massive storage for any kind of data, enormous processing power, and the ability to run very large numbers of tasks concurrently.

Key Concepts of Hadoop

  1. Distributed Storage: Hadoop stores data in a distributed manner across a network of machines.

  2. Fault Tolerance: Even if some machines in the cluster fail, Hadoop continues to work by replicating data.

  3. Scalability: You can easily add more nodes to the cluster to handle more data.

  4. Batch Processing: Hadoop is optimized for processing huge volumes of data in batches.

Components of Hadoop

Hadoop consists of four main components:

  1. HDFS (Hadoop Distributed File System):

    • Purpose: Distributed storage system.

    • Function: Breaks large data files into blocks (default 128MB) and distributes them across multiple nodes. Each block is replicated (by default three times) to ensure fault tolerance. A short Java sketch of writing a file to HDFS appears after this list.

  2. YARN (Yet Another Resource Negotiator):

    • Purpose: Resource management.

    • Function: Allocates system resources to applications running in the Hadoop cluster. It also schedules tasks and ensures they are completed.

  3. MapReduce:

    • Purpose: Distributed data processing framework.

    • Function: Processes large datasets in parallel across a Hadoop cluster. The framework splits the work into independent tasks and processes the data in two stages:

      • Map: Reads each input split and transforms it into intermediate key-value pairs (for example, one (word, 1) pair per word).

      • Reduce: Aggregates the intermediate values for each key and writes the final result (for example, the total count per word).

  4. Hadoop Common:

    • Purpose: Provides the libraries and utilities needed by the other components.

    • Function: Includes Java libraries and scripts required for Hadoop to run.
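
To make the HDFS component more concrete, here is a minimal Java sketch (not part of the setup steps below) that writes a small file to HDFS and then reports how it was stored. It uses Hadoop's FileSystem API; the class name HdfsWriteExample, the path /demo/hello.txt, and the NameNode address hdfs://localhost:9000 are assumptions that match the single-node configuration shown later in this tutorial.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; matches the fs.defaultFS value configured later in this tutorial.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/demo/hello.txt");   // illustrative path

        // Write a small file. Larger files are automatically split into blocks.
        try (FSDataOutputStream out = fs.create(path, true)) {
          out.writeBytes("Hello, HDFS!\n");
        }

        // Show how HDFS stored the file: block size and replication factor.
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        fs.close();
      }
    }

Compile it against the jars reported by the hadoop classpath command and run it once the single-node cluster described below is up; with the dfs.replication=1 setting used in this tutorial it should report a replication factor of 1.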

How Hadoop Works

Hadoop works by dividing large datasets into smaller chunks and distributing them across the nodes (machines) in a cluster. Then, it processes the data in parallel using the MapReduce programming model.

Example: Processing Large Data with Hadoop
  1. Data Splitting: Suppose you have a 1TB file. Hadoop splits the file into blocks (128MB by default in current versions; 64MB in older releases) and distributes these blocks across the machines in the cluster.

  2. Processing (Map Phase): Each machine processes the data blocks it has, performing some computation (e.g., counting the number of times a word appears).

  3. Aggregation (Reduce Phase): The results from each machine are combined and aggregated to produce a final result (e.g., total word count). The WordCount sketch below shows what the Map and Reduce steps look like in Java.
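
The word-count walkthrough above maps directly onto code. Below is a sketch of the classic WordCount program written against Hadoop's Java MapReduce API; it is essentially the same job that the bundled examples jar runs in the "Running a Sample Job" section later on, with class and variable names following the convention of the official MapReduce tutorial.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce phase: sum all counts emitted for the same word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

If you package this class into a jar (the jar name here is just an example), you can submit it the same way as the bundled example: hadoop jar wordcount.jar WordCount /input /output.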

Key Benefits of Hadoop

  • Cost-Effective: Uses commodity hardware, so no need for expensive servers.

  • Fault-Tolerant: Replicates data across machines, so no data is lost if one machine fails.

  • Scalable: Can easily add more machines to a cluster to handle growing data.

  • Flexible: Can handle structured, semi-structured, and unstructured data (text, images, videos).

Setting up Hadoop

Prerequisites:

  • Java: Hadoop is written in Java, so you need to have the Java Development Kit (JDK) installed.

  • SSH: Hadoop requires Secure Shell (SSH) for communication between nodes.

  • Linux/Unix OS: Most commonly used, though it can be installed on Windows with some additional effort.

Single Node Setup (Pseudo-Distributed Mode)

  1. Install Java:

    • Install JDK (Java Development Kit) if not already installed:

        sudo apt-get install openjdk-8-jdk
      
  2. Download and Install Hadoop:

    • Download Hadoop from the official Apache website (check the downloads page for the current release and adjust the version number below; older releases are moved to archive.apache.org):

        wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
      
    • Extract the Hadoop file:

        tar -xzf hadoop-3.3.0.tar.gz
      
  3. Configure Hadoop:

    • Set environment variables in ~/.bashrc (reload the shell or run source ~/.bashrc afterwards). Note that sbin is added to the PATH so the start-dfs.sh and start-yarn.sh scripts used below can be found:

        export HADOOP_HOME=/path/to/hadoop
        export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
        export JAVA_HOME=/path/to/java
      
    • Update $HADOOP_HOME/etc/hadoop/hadoop-env.sh to specify the Java path:

        export JAVA_HOME=/path/to/java
      
  4. Configure HDFS:

    • Edit the core-site.xml file (in $HADOOP_HOME/etc/hadoop/) to specify the default file system:

        <configuration>
          <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
          </property>
        </configuration>
      
    • Edit hdfs-site.xml (in the same directory) to set the replication factor; 1 is sufficient for a single-node setup:

        <configuration>
          <property>
            <name>dfs.replication</name>
            <value>1</value>
          </property>
        </configuration>
      
  5. Format the NameNode:

    • Before starting Hadoop for the first time, format the NameNode to initialize the HDFS metadata (this only needs to be done once):

        hdfs namenode -format
      
  6. Start Hadoop Services:

    • Start the HDFS and YARN daemons (the scripts are in $HADOOP_HOME/sbin):

        start-dfs.sh
        start-yarn.sh
      
  7. Access the Hadoop Web UIs:

    • With the services running, the NameNode web UI is available at http://localhost:9870 and the YARN ResourceManager UI at http://localhost:8088 (default ports in Hadoop 3.x).

Running a Sample Job

After setting up Hadoop, you can run a MapReduce example to test the setup:

  1. Copy input files into HDFS:

     hdfs dfs -mkdir /input
     hdfs dfs -put /path/to/local/file /input
    
  2. Run the MapReduce example (e.g., WordCount). Note that the /output directory must not exist before the job runs, otherwise the job will fail:

     hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount /input /output
    
  3. View the results:

     hdfs dfs -cat /output/part-r-00000
    

Conclusion

Hadoop is a powerful framework that enables efficient storage and processing of big data. As a beginner, starting with a single-node setup will help you learn the core concepts before moving on to multi-node clusters. Once comfortable, you can explore other tools in the Hadoop ecosystem like Hive, Pig, HBase, and Spark for advanced data processing.

Next Steps

  • Learn HDFS in-depth: Study how data is stored, replicated, and accessed in HDFS.

  • Explore MapReduce: Understand its working mechanism by writing simple MapReduce jobs.

  • Get familiar with the Hadoop ecosystem: Tools like Hive (SQL for Hadoop), Pig (high-level data flow language), and HBase (NoSQL database) extend Hadoop’s functionality.