25 March 2008

Hadoop Summit: Bryan Duxbury and HBase

Rapleaf is a people search, profile aggregation, and data API company.

Application: a custom Ruby web crawler that indexes structured data from profiles. Currently, a page is indexed once and then gone forever:

Internet -> Web Crawler -> Index Processing -> MySQL

Now using HBase to store pages, accessed via a REST servlet. This allows reindexing at a later date, a better-factored workflow, and easier debugging:

Internet -> Web Crawler -> HBase -> Index Processing -> MySQL Database
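
The talk didn't include code, but the crawler's write path presumably looks something like this minimal Ruby sketch; the servlet URL scheme, port, table, and column names here are all assumptions, not the actual Rapleaf setup:

    require 'net/http'
    require 'uri'
    require 'base64'

    # Store a crawled page in HBase through the REST servlet. The path
    # layout (/table/row/rowkey/column) and the port are hypothetical
    # placeholders.
    def store_page(person_id, site_id, html)
      row_key = "#{person_id}:#{site_id}"   # hypothetical key encoding
      uri = URI("http://hbase-rest:60050/pages/row/#{row_key}/content:page")
      Net::HTTP.start(uri.host, uri.port) do |http|
        # Pages are base64-encoded before being sent to the servlet,
        # per the talk's description of the write path.
        http.send_request('PUT', uri.path, Base64.encode64(html),
                          'Content-Type' => 'application/octet-stream')
      end
    end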

Data is similar to the webtable: keyed on person, not URL; the webtable is less structured, and the Rapleaf table is not focused on links.

Schema: 2 column families: content (stores search and profile pages) and meta (stores info about the retrieval: when, how long, who it's about). Keys are tuples of (person ID, site ID).
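
A hypothetical Ruby sketch of that layout; the exact serialization of the (person ID, site ID) tuple and the column qualifiers weren't given in the talk:

    require 'time'

    # Zero-pad the tuple so row keys sort numerically -- one plausible
    # encoding, not necessarily the one Rapleaf uses.
    def row_key(person_id, site_id)
      format('%010d:%05d', person_id, site_id)
    end

    # One cell in the content family for the page itself, plus retrieval
    # metadata (when, how long, who) in the meta family.
    def page_cells(html, fetched_at, duration_ms, person_id)
      {
        'content:page'     => html,
        'meta:fetched_at'  => fetched_at.utc.iso8601,
        'meta:duration_ms' => duration_ms.to_s,
        'meta:person_id'   => person_id.to_s
      }
    end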

Cluster specs: HDFS/HBase cluster of 16 machines, 2TB disk, 64 cores, 64GB of memory.

Load pattern: approx. 3.6T per month (830G compressed); average row size 64K, 14K gzipped.
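
(Those figures are self-consistent: 64K/14K is about 4.6x per-row compression, and 3.6T/830G is about 4.3x overall.)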
Performance: average write time is 31 seconds (yikes! this accounts for Ruby time, base64 encoding, sending to the servlet, writing to HBase, etc.), but the median write time is 0.201 seconds and the max is 359 seconds, so a small number of very slow writes dominate the average. Reads aren't used much at the moment. Note that some of these performance issues are due to Ruby running on green threads instead of native threads, etc., and the REST servlet hasn't been profiled either.

General observations: Record compression hurt performance in testing; compressing in the client gives a big boost, and is a possible addition to the standard HBase client. HBase write logs are stored in HDFS, where files don't exist until they are closed, which means HBase has durability issues (this will be resolved by HADOOP-1700).
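
A sketch of what that client-side compression might look like in the Ruby client, gzipping the page before it's base64-encoded and shipped to the servlet (the function name is made up):

    require 'zlib'
    require 'stringio'
    require 'base64'

    # Gzip the page body client-side before base64-encoding it for the
    # REST servlet, so the region server stores and ships fewer bytes.
    def compress_for_hbase(html)
      buffer = StringIO.new
      gz = Zlib::GzipWriter.new(buffer)
      gz.write(html)
      gz.close
      Base64.encode64(buffer.string)
    end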

