10 February 2010

Hadoop HDFS: Deceived by Xciever

It's undocumented. It's misspelled. And without it your (insert euphemism for "really big" here) Hadoop grid won't be able to crunch those multi-hundred gigabyte input files. It's the "xciever" setting for HDFS DataNode daemons, and the absence of it in your configuration is waiting to bite you in the butt.


The software components that make up HDFS include the NameNode, the SecondaryNameNode, and the DataNode daemons. Typical installations place the NameNode and SecondaryNameNode on a single "master" node. Installations following best practices go the extra mile and place the NameNode and SecondaryNameNode daemons on separate nodes. DataNode daemons are placed on the participating slave nodes. At the highest level the NameNode/SecondaryNameNode daemons manage HDFS metadata (the data used to map file names to server/block lists, etc.), while the DataNode daemons handle the mechanics of reading from and writing to disk, and serve up blocks of data to processes on the node or to requesting processes on other nodes.
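To make that division of labor concrete, here is a minimal client-side read sketch (the file path is hypothetical): the FileSystem call consults the NameNode for metadata, while the bytes themselves are streamed from DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal client-side read sketch; the file path below is made up for illustration.
public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // metadata requests go to the NameNode
        FSDataInputStream in = fs.open(new Path("/data/example/part-00000"));
        byte[] buf = new byte[4096];
        while (in.read(buf) > 0) {
            // the actual bytes are served by whichever DataNode holds the current block
        }
        in.close();
        fs.close();
    }
}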

There are many settings that can be used to configure and tune HDFS (as well as the MapReduce engine and other Hadoop components). The HDFS documentation listed 45 of them (along with their default values) last time I checked. These configuration settings are a somewhat disorganized mix of elements used to configure the NameNode and the DataNodes.
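As a rough sketch of how such a setting is consumed at runtime (this is illustrative, not the actual DataNode source), daemon code reads hdfs-site.xml through the Configuration class and falls back to a compiled-in default when the key is absent:

import org.apache.hadoop.conf.Configuration;

// Illustrative sketch only (not DataNode source): read an hdfs-site.xml setting,
// falling back to the shipped default of 256 when the key is not configured.
public class ConfigLookupSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");  // resolved from the classpath
        int maxXceivers = conf.getInt("dfs.datanode.max.xcievers", 256);
        System.out.println("effective dfs.datanode.max.xcievers = " + maxXceivers);
    }
}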

The DataNode daemons have an upper limit on the number of threads they can support, which is set to the absurdly small (IMHO) value of 256. If you have a large job (or even a moderately-sized job) that has many files open, you can exhaust this limit. When you do, a variety of failure modes can arise. Some of the exception traces point to this value being exhausted, while others do not. In either case, understanding what has gone wrong is difficult. For example, one failure mode we observed raised java.io.EOFException, with a traceback flowing down into DFSClient.
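For a sense of how easily a job can hit this ceiling, here is a contrived sketch (the paths and the stream count are made up) that keeps many HDFS streams reading at once; with the default limit of 256, a few hundred concurrent readers against a small cluster is all it takes:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Contrived sketch: each stream that is in the middle of reading a block ties up
// a thread on the DataNode serving that block, so a few hundred concurrent readers
// can run into the default limit of 256.
public class ManyOpenStreamsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        List<FSDataInputStream> streams = new ArrayList<FSDataInputStream>();
        byte[] buf = new byte[1];
        for (int i = 0; i < 500; i++) {                                           // hypothetical count
            FSDataInputStream in = fs.open(new Path("/data/example/part-" + i));  // hypothetical paths
            in.read(buf);   // start a block read, holding a thread on the serving DataNode
            streams.add(in);
        }
        // ... a real job would be doing useful work here across many tasks ...
        for (FSDataInputStream in : streams) {
            in.close();
        }
        fs.close();
    }
}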

The solution to all this is to configure the xcievers setting to raise the limit on the number of threads the DataNode daemons are willing to manage. For modern Hadoop releases this should be done in conf/hdfs-site.xml. Here is an example:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
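Note that the DataNode daemons typically need to be restarted before a change like this takes effect.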


Interestingly, HBase users seem to have more conversations about this setting and related issues than users who mainly run MapReduce jobs do.

The central reason for this posting is to point out this setting, as it might be helpful to others. We encountered this problem while porting from an older version of Hadoop to 0.20 (via the Cloudera distribution, which I highly recommend). The failure mode was not immediately evident and we wasted a lot of time debugging and chasing assorted theories before we recognized what was happening and changed the configuration. Ironically, the bread crumbs that led us to change this setting came from a JIRA I opened regarding a scalability issue in 2008. As I recall, the setting was not supported by the version of Hadoop we were using at that time.

A secondary reason for this posting is to point out the crazy name associated with this setting. The term "xceiver" ("i" before "e" except after "c", blah blah blah...) is shorthand for transceiver, a device that both transmits and receives. But does the term "transceiver" really describe "maximum threads a DataNode can support"? At the very least the spelling of this setting should be corrected, and the underlying code should recognize both spellings. Even better would be to add a setting called "transceivers" or, even more explicitly, "concurrent_threads", like this (a hypothetical compatibility sketch follows the list):

dfs.datanode.max.transceivers
or
dfs.datanode.max.concurrent_threads
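If the code were to honor both spellings, the fallback might look like the following sketch (the "transceivers" key below is the proposed name, not one Hadoop actually recognizes):

import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch: prefer a corrected key name if present, otherwise fall back
// to the legacy misspelled key, otherwise the shipped default of 256.
public class XceiverKeyFallbackSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");
        int legacy = conf.getInt("dfs.datanode.max.xcievers", 256);
        int max = conf.getInt("dfs.datanode.max.transceivers", legacy); // proposed key, not real
        System.out.println("max concurrent DataNode threads = " + max);
    }
}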

Finally, why is the default value for this setting so low? Why not have it default to a more reasonable value like 2048 or 4096 out of the box? Memory and CPU are cheap; chasing infrastructure issues is expensive.

So, summing it all up: lack of documentation and the poor spelling of this setting caused us to lose hours of productivity. If you're deploying a big grid, be sure to configure this setting properly. If you're seeing strange failures related to I/O within your Hadoop infrastructure, be sure to check this value and potentially increase it.

2 comments:

Edward Capriolo said...

I feel you. If you are talking about why things are tuned so low... As funny as it sounds, MySQL server has out-of-the-box settings for a machine with 8 MB of RAM.

Eli said...

Hey Christopher,

Thanks for the suggestion. I filed a jira (http://issues.apache.org/jira/browse/HDFS-1861) and submitted a patch.

Feel free to chime in on the mailing lists (http://hadoop.apache.org/mailing_lists.html) with additional suggestions.

Thanks,
Eli