Who is Hadoop?

Hadoop is a new way for enterprises to store

and analyze data.


Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop’s contributors work for some of the world’s biggest technology companies. That diverse, motivated community has produced a genuinely innovative platform for consolidating, combining and understanding data.

Enterprises today collect and generate more data than ever before. Relational and data warehouse products excel at OLAP and OLTP workloads over structured data. Hadoop, however, was designed to solve a different problem: the fast, reliable analysis of both structured and complex data. As a result, many enterprises deploy Hadoop alongside their legacy IT systems, which allows them to combine old and new data sets in powerful new ways.

Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.

Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data — and can run large-scale, high-performance processing jobs — in spite of system changes or failures.

Originally developed and employed by dominant Web companies like Yahoo and Facebook, Hadoop is now widely used in finance, technology, telecom, media and entertainment, government, research institutions and other markets. With Hadoop, enterprises can easily explore complex data using custom analyses tailored to their information and questions.

Hadoop Overview

Hadoop is a scalable, fault-tolerant system for data storage and processing. Hadoop is economical and reliable, which makes it perfect to run data-intensive applications on commodity hardware.

Hadoop excels at doing complex analyses, including detailed, special-purpose computation, across large collections of data. Hadoop handles search, log processing, recommendation systems, data warehousing and video/image analysis. Unlike traditional databases, Hadoop scales to address the needs of data-intensive distributed applications in a reliable, cost-effective manner.

HDFS and MapReduce

Hadoop creates clusters of machines and coordinates work among them. Clusters can be built and scaled out with inexpensive computers.

The Hadoop software package includes the robust, reliable Hadoop Distributed File System (HDFS), which splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.

In addition, Hadoop includes MapReduce, a parallel distributed processing system that is different from most similar systems on the market. It was designed for clusters of commodity, shared-nothing hardware. No special programming techniques are required to run analyses in parallel using MapReduce; most existing algorithms work without changes. MapReduce takes advantage of the distribution and replication of data in HDFS to spread execution of any job across many nodes in a cluster.

If a machine fails, Hadoop continues to operate the cluster by shifting work to the remaining machines. It automatically creates an additional copy of the data from one of the replicas it manages. As a result, clusters are self-healing for both storage and computation without requiring intervention by systems administrators.

Why Hadoop?

Hadoop is an ideal platform for consolidating data from a variety of new and legacy sources. It complements existing data management solutions with new analyses and processing tools. It delivers immediate value to companies in a variety of vertical markets. Examples include:


  • Recommendation engines — increase average order size by recommending complementary products based on predictive analysis for cross-selling.
  • Cross-channel analytics — sales attribution, average order value, lifetime value (e.g., how many in-store purchases resulted from a particular recommendation, advertisement or promotion).
  • Event analytics — what series of steps (golden path) led to a desired outcome (e.g., purchase, registration).


  • Merchandizing and market basket analysis.
  • Campaign management and customer loyalty programs.
  • Supply-chain management and analytics.
  • Event- and behavior-based targeting.
  • Market and consumer segmentations.

Financial Services

  • Compliance and regulatory reporting.
  • Risk analysis and management.
  • Fraud detection and security analytics.
  • CRM and customer loyalty programs.
  • Credit scoring and analysis.
  • Trade surveillance.


  • Revenue assurance and price optimization.
  • Customer churn prevention.
  • Campaign management and customer loyalty.
  • Call Detail Record (CDR) analysis.
  • Network performance and optimization.


  • Fraud detection and cyber security.
  • Compliance and regulatory analysis.
  • Energy consumption and carbon footprint management.

Web & Digital Media Services

  • Large-scale clickstream analytics.
  • Ad targeting, analysis, forecasting and optimization.
  • Abuse and click-fraud prevention.
  • Social graph analysis and profile segmentation.
  • Campaign management and loyalty programs.

Health & Life Sciences

  • Campaign and sales program optimization.
  • Brand management.
  • Patient care quality and program analysis.
  • Supply-chain management.
  • Drug discovery and development analysis.
Did you know?

Hadoop was created by Doug Cutting who named it after his child’s stuffed elephant. It was originally developed to support distribution for the Nutch search engine project.

IBM Adobe Amazon Facebook Yahoo LinkedIn