Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop’s contributors work for some of the world’s biggest technology companies. That diverse, motivated community has produced a genuinely innovative platform for consolidating, combining and understanding data.
Enterprises today collect and generate more data than ever before. Relational databases and data warehouse products excel at OLAP and OLTP workloads over structured data. Hadoop, however, was designed to solve a different problem: the fast, reliable analysis of both structured and complex data. As a result, many enterprises deploy Hadoop alongside their legacy IT systems, which allows them to combine old and new data sets in powerful new ways.
Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.
Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data — and can run large-scale, high-performance processing jobs — in spite of system changes or failures.
Originally developed and employed by dominant Web companies like Yahoo and Facebook, Hadoop is now widely used in finance, technology, telecom, media and entertainment, government, research institutions and other markets. With Hadoop, enterprises can easily explore complex data using custom analyses tailored to their information and questions.
Hadoop is a scalable, fault-tolerant system for data storage and processing. Its economics and reliability make it well suited to running data-intensive applications on commodity hardware.
Hadoop excels at complex analyses, including detailed, special-purpose computation, across large collections of data. It handles search, log processing, recommendation systems, data warehousing and video/image analysis. Unlike traditional databases, Hadoop scales to address the needs of data-intensive distributed applications in a reliable, cost-effective manner.
Hadoop creates clusters of machines and coordinates work among them. Clusters can be built and scaled out with inexpensive computers.
The Hadoop software package includes the robust, reliable Hadoop Distributed File System (HDFS), which splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.
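As an illustration (not drawn from the text above), the number of copies HDFS keeps of each block is a cluster-wide setting that defaults to three and can be tuned per file; the standard property lives in hdfs-site.xml:

```xml
<!-- hdfs-site.xml: keep three replicas of every block (the stock default) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

With three replicas spread across nodes (and, in rack-aware deployments, across racks), the simultaneous loss of any two machines still leaves a live copy of every block.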
In addition, Hadoop includes MapReduce, a parallel distributed processing system that differs from most similar systems on the market. It was designed for clusters of commodity, shared-nothing hardware. Developers write simple map and reduce functions; the framework handles scheduling, data movement and parallel execution, so no specialized parallel-programming expertise is required and many existing algorithms port with little change. MapReduce takes advantage of the distribution and replication of data in HDFS to spread execution of any job across many nodes in a cluster.
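To make the map/reduce division of labor concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, which lets any executable act as a mapper or reducer. This is an illustrative example, not part of the original text; the local chaining at the bottom stands in for the sort-and-shuffle the framework performs between the two phases.

```python
#!/usr/bin/env python3
"""Word count as a pair of map and reduce functions (a sketch)."""
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: Hadoop delivers mapper output grouped by key;
    # here we sort locally to simulate that, then sum counts per word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Under Hadoop Streaming the two functions would run as separate
    # processes on many nodes; locally we simply chain them on stdin.
    for word, count in reducer(mapper(sys.stdin)):
        print(f"{word}\t{count}")
```

Because the mapper treats each line independently and the reducer treats each key independently, the framework is free to run thousands of copies of each in parallel, wherever the HDFS blocks happen to live.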
If a machine fails, Hadoop continues to operate the cluster by shifting its work to the remaining machines and by automatically re-creating any lost block replicas from the surviving copies. As a result, clusters are self-healing for both storage and computation without requiring intervention by systems administrators.
Hadoop is an ideal platform for consolidating data from a variety of new and legacy sources. It complements existing data management solutions with new analyses and processing tools. It delivers immediate value to companies in a variety of vertical markets. Examples include:
Web & Digital Media Services
Health & Life Sciences
Hadoop was created by Doug Cutting, who named it after his child’s stuffed elephant. It was originally developed to support distribution for the Nutch search engine project.