Hadoop High Availability strategies

Scalability, availability, resilience – these are common examples of computer system requirements that strongly shape an overall application architecture and have a direct impact on indicators such as customer satisfaction, revenue, cost, etc. The weakest part of the system has the biggest impact on those parameters. The topic of this post, availability, is defined as the percentage of time that a system is capable of serving its intended function.

In the big data era, Apache Hadoop is a common component of nearly every solution. As system requirements shift from purely batch-oriented systems to near-real-time systems, the pressure on availability only grows. Clearly, if a batch system runs every midnight, then two hours of downtime is not such a big deal, as opposed to a near-real-time system where a result delayed by ten minutes is pointless.

In this post I will try to summarize the Hadoop high availability strategies, as complete and ready-to-use solutions, that I encountered during my research on this topic.

In Hadoop 1.x it is a well-known fact that the NameNode is a single point of failure, and all high availability strategies try to cope with that – to strengthen the weakest part of the system. Just to clarify a widespread myth: the Secondary NameNode is not a backup or recovery node by nature. It has different tasks than the NameNode, BUT with some changes the Secondary NameNode can be started in the role of the NameNode. However, this does not happen automatically, and it was never the original role of the SNN.
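To make the distinction concrete, here is a minimal sketch of how the Secondary NameNode's checkpointing is configured in Hadoop 1.x; the directory path below is a hypothetical placeholder and the values are the usual defaults:

```xml
<!-- core-site.xml (Hadoop 1.x) - Secondary NameNode checkpoint settings -->
<property>
  <!-- where the SNN stores the merged fsimage; path is a placeholder -->
  <name>fs.checkpoint.dir</name>
  <value>/data/hadoop/namesecondary</value>
</property>
<property>
  <!-- how often (seconds) the SNN merges the edit log into a new fsimage -->
  <name>fs.checkpoint.period</name>
  <value>3600</value>
</property>
```

The SNN merely merges the edit log into a fresh fsimage on this schedule. If the NameNode's metadata is lost, the last checkpoint can be loaded onto a replacement NameNode with hadoop namenode -importCheckpoint – exactly the kind of manual, non-automatic step mentioned above.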

High availability strategies can be categorized by the state of the standby: hot/warm standby or cold standby. This correlates directly with failover (start-up) time. To give a rough idea (according to the documentation): for a cluster of 1500 nodes with petabyte capacity, the start-up time is close to one hour. Start-up consists of two major phases: restoring the metadata, and then every node in the HDFS cluster has to report its block locations.
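The block-report phase is also why a freshly started NameNode sits in safe mode until enough blocks have been reported; as a rough illustration, the relevant Hadoop 1.x knobs look roughly like this (values are the usual defaults, not tuning advice):

```xml
<!-- hdfs-site.xml - safe mode is left only after enough block reports arrive -->
<property>
  <!-- fraction of blocks that must be reported before leaving safe mode -->
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999</value>
</property>
<property>
  <!-- extra time (ms) to stay in safe mode after the threshold is reached -->
  <name>dfs.safemode.extension</name>
  <value>30000</value>
</property>
```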

A typical solution for Hadoop 1.x makes use of NFS and a logical group of NameNodes. Some resources claim that if NFS becomes unavailable the NameNode process aborts, which would effectively stop the cluster. I couldn't verify that fact in other sources of information, but I feel it is important to mention. Writing NameNode metadata to NFS needs to be exclusive to a single machine in order to keep the metadata consistent. To prevent collisions and possible data corruption, a fencing method needs to be defined. The fencing method assures that if the NameNode isn't responsive, it really is down. To gain real confidence, a sequence of fencing strategies can be defined, and they are executed in order. The strategies range from a simple SSH call to a power supply controlled over the network. This concept is sometimes called STONITH – "shoot the other node in the head". The failover is usually manual but can be automated as well. This strategy works as a cold standby, and Hadoop providers typically ship this solution in their high availability kits.
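For a rough idea of the shared-NFS setup (hostnames and mount points below are hypothetical): in Hadoop 1.x the NameNode can simply be told to write its metadata both to local disk and to an NFS mount, while the fencing chain itself usually comes from the vendor's HA kit. The same idea of an ordered list of fencing methods was later formalized in Hadoop's own HA configuration, shown here only for illustration:

```xml
<!-- hdfs-site.xml (Hadoop 1.x) - metadata written to local disk and to an NFS mount -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>

<!-- Ordered fencing chain (as formalized later in Hadoop 2.x HA):
     try an SSH kill of the NameNode process first, then fall back to a
     custom script, e.g. one that cuts power via a network-controlled PDU -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence(hdfs:22)
shell(/usr/local/sbin/fence-namenode.sh)</value>
</property>
```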

Because of the relatively long start-up time of a backup NameNode, some companies (e.g. Facebook) developed their own solutions which provide a hot or warm standby. Facebook's solution to this problem is called the AvatarNode. The idea behind it is relatively simple: every node is wrapped into a so-called avatar (no change to the original code base needed!). The primary avatar NameNode writes to a shared NFS filer. The standby avatar node consists of a secondary NameNode and a backup NameNode. This node continuously reads the HDFS transaction log and keeps feeding those transactions to the encapsulated NameNode, which is kept in safe mode to prevent it from performing any active duties. This way all NameNode metadata is kept hot. The avatar in standby mode also performs the duties of the secondary NameNode. DataNodes are wrapped into avatar DataNodes, which send block reports to both the primary and the standby avatar node. Failover time is about a minute. More information can be found here.

Another attempt to create a Hadoop 1.x hot standby, coming from the China Mobile Research Institute, is based on running synchronization agents and a sync master. This solution raises further questions, and it seems to me that it isn't as mature and clear as the previous one. Details can be found here.

The ultimate solution to high availability comes with Hadoop 2.x, which removes the single point of failure through a different architecture. For resource management there is YARN (Yet Another Resource Negotiator), also called MapReduce 2. For HDFS there is another concept: the NameNodes' shared edit log can be kept either on NFS or in the Quorum Journal Manager (QJM), with ZooKeeper serving as the coordination framework for automatic failover. These architectural changes provide the option of running two redundant NameNodes in the same cluster in an active/passive configuration with a hot standby.
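As a minimal sketch of what such an HA setup looks like with QJM and automatic failover (the nameservice ID mycluster, the NameNode IDs and all hostnames are hypothetical placeholders):

```xml
<!-- hdfs-site.xml - two NameNodes behind one logical nameservice -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
<!-- shared edit log kept on a quorum of JournalNodes (QJM) -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<!-- automatic failover coordinated through ZooKeeper (ZKFC) -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml - ZooKeeper ensemble used by the failover controllers -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```

With this in place, the active/standby state can be inspected or switched manually with hdfs haadmin -getServiceState nn1 and hdfs haadmin -failover nn1 nn2.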

This post just scratches the surface of Hadoop high availability and doesn't go into detail daemon by daemon, but I hope it is a good starting point. If any reader is aware of some other possibility, I am looking forward to seeing it in the comments section.