HBase: Distributed database modelled on Bigtable (Google)
Runs on top of Hadoop core
Column-store: Wide tables cost only the data stored, NULLs in rows are "free", columns compress well
Not a SQL database: no joins, no transactions, no column typing, no sophisticated query engine, no SQL, ODBC, etc.
Use HBase when: scale is large; access can be basic and mostly table scans.
The canonical use case is web-table: table of web crawls keyed by URL, columns of webpage, parse, attributes, etc.
Data Model: Table of rows X columns; timestamp is 3rd dimension
Cells are uninterpreted byte arrays
Row keys can be any Text value, e.g. a URL; rows are ordered lexicographically; row writes are atomic
Columns grouped into column families: column names have a family prefix and a qualifier (attribute:mimetype, attribute:language)
Members of a column family have similar character/access
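A toy sketch of the data model above, not HBase code: rows sorted by key, columns named family:qualifier, timestamp as the 3rd dimension, cells as raw bytes. ToyTable and its methods are hypothetical names made up for illustration, and the in-memory dicts only stand in for real storage.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical in-memory model: row -> 'family:qualifier' -> {timestamp: bytes}."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        if not isinstance(value, bytes):
            raise TypeError("cells are uninterpreted byte arrays")
        self.rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before timestamp (newest overall if None)."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row="", stop_row=None):
        """Yield (row, columns) in lexicographic row-key order."""
        for row in sorted(self.rows):
            if row < start_row:
                continue
            if stop_row is not None and row >= stop_row:
                break
            yield row, dict(self.rows[row])


table = ToyTable(families=["contents", "attribute"])
table.put("com.example/b", "attribute:mimetype", b"text/html", timestamp=1)
table.put("com.example/a", "attribute:mimetype", b"text/plain", timestamp=1)
table.put("com.example/a", "attribute:mimetype", b"text/html", timestamp=2)
```

A column that was never written takes no space at all, which is the sense in which NULLs are "free": get() on it just returns None.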
Implementation: Client, a Master, and one or more region servers (analogous to slaves in Hadoop).
Cluster carries 0->N Tables
Master assigns table regions to region servers
Region server carries 0->N regions: region server keeps a write-ahead log (WAL) of every update, used when recovering lost region servers; edits first go to the WAL, then to the Region
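The WAL-then-region ordering can be sketched roughly as below. This is a toy under stated assumptions, not the HBase implementation: ToyRegionServer is a hypothetical name, a Python list stands in for a durable log file, and a dict stands in for the in-memory region state.

```python
class ToyRegionServer:
    """Hypothetical sketch of write-ahead logging."""

    def __init__(self, wal=None):
        self.wal = wal if wal is not None else []  # stand-in for a durable log
        self.memcache = {}                         # in-memory edits, lost on a crash

    def put(self, row, column, value):
        self.wal.append((row, column, value))      # 1. record the edit in the log
        self.memcache[(row, column)] = value       # 2. only then apply it

    @classmethod
    def recover(cls, wal):
        """Rebuild in-memory state by replaying the log in order."""
        server = cls(wal=list(wal))
        for row, column, value in wal:
            server.memcache[(row, column)] = value
        return server


server = ToyRegionServer()
server.put("com.example/a", "attribute:mimetype", b"text/plain")
server.put("com.example/a", "attribute:mimetype", b"text/html")
recovered = ToyRegionServer.recover(server.wal)
```

Because every edit reaches the log before the in-memory state, replaying the log in order reproduces that state after the server is lost.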
Each region is made of MemCache and Stores
Row CRUD: client initially goes to Master for row location; client caches it and goes direct to region server thereafter; on fault returns to master to freshen cache
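A rough sketch of that lookup-cache-retry pattern. All class names here (ToyMaster, ToyServer, ToyClient, NotServingRegion) are hypothetical, and the single retry on a fault is an assumption made for illustration.

```python
class NotServingRegion(Exception):
    """Raised by a toy server that no longer holds the requested row."""


class ToyServer:
    def __init__(self, rows):
        self.rows = dict(rows)

    def get(self, row):
        if row not in self.rows:
            raise NotServingRegion(row)
        return self.rows[row]


class ToyMaster:
    def __init__(self, regions):
        # regions: list of (start_key, end_key, server_name); end_key None = open-ended
        self.regions = list(regions)
        self.lookups = 0

    def locate(self, row):
        self.lookups += 1
        for start, end, server in self.regions:
            if start <= row and (end is None or row < end):
                return (start, end, server)
        raise KeyError(row)


class ToyClient:
    def __init__(self, master, servers):
        self.master = master
        self.servers = servers   # server name -> ToyServer
        self.cache = []          # cached (start, end, server) entries

    def _cached(self, row):
        for entry in self.cache:
            start, end, _ = entry
            if start <= row and (end is None or row < end):
                return entry
        return None

    def get(self, row):
        entry = self._cached(row)
        if entry is None:                      # first contact: ask the master
            entry = self.master.locate(row)
            self.cache.append(entry)
        try:
            return self.servers[entry[2]].get(row)   # go direct to the server
        except NotServingRegion:
            self.cache.remove(entry)           # stale entry: freshen from master
            entry = self.master.locate(row)
            self.cache.append(entry)
            return self.servers[entry[2]].get(row)
```

The master is consulted only on a cache miss or after a fault, so steady-state reads and writes never touch it.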
When regions get too big they are split: region server manages the split; master is informed, parent is off-lined, new daughters are deployed.
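A minimal sketch of a split, assuming the split point is the median row key. The talk does not say how the split point is chosen, so that choice, the split_region name, and the dict layout are all assumptions for illustration.

```python
def split_region(region):
    """Split a toy region {'start', 'end', 'rows'} into two daughters.

    Each daughter covers a contiguous key range: [start, mid) and [mid, end).
    """
    keys = sorted(region["rows"])
    if len(keys) < 2:
        raise ValueError("need at least two rows to split")
    mid = keys[len(keys) // 2]  # assumed split point: median row key
    lower = {
        "start": region["start"],
        "end": mid,
        "rows": {k: v for k, v in region["rows"].items() if k < mid},
    }
    upper = {
        "start": mid,
        "end": region["end"],
        "rows": {k: v for k, v in region["rows"].items() if k >= mid},
    }
    return lower, upper


parent = {"start": "", "end": None,
          "rows": {"a": b"1", "b": b"2", "c": b"3", "d": b"4"}}
lower, upper = split_region(parent)
```

Since rows are ordered lexicographically, the two daughters together cover exactly the parent's key range with no overlap.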
All data persisted to HDFS
Connecting: Java is the first-class client. Non-Java clients: Thrift server hosting an HBase client instance (Ruby, C++, Java); REST server also hosts an HBase client (Ruby gem, Active Record via REST). SQL-like shell (HQL); TableInput/OutputFormat for map/reduce
History: 2006 Powerset interest in Bigtable; 02/2007 Mike Cafarella provides initial code drop, cleaned up by Powerset and added as Hadoop contrib; first usable HBase in Hadoop 0.15.0; 01/2008 HBase becomes a subproject of Hadoop; Hadoop 0.16.0 incorporates it into code base
HBase 0.1.0 release candidate is effectively Hadoop 0.16.0 with lots of fixes; HBase now stands outside of Hadoop contrib.
Focus on developing use/developer base: 3 committers, tech support (other people's clusters via VPN), 2 User Group meetings - more to follow; working on documentation and ease-of-use.
Performance: Performance good and improving
Known users: Powerset, Rapleaf, WorldLingo, Wikia
Near Future: release 0.2.9 to follow release of Hadoop 0.17.0; theme is robustness and scalability: rebalancing of regions across the cluster; replace HQL with jirb, jython, or beanshell; repair and debugging tools.
HBase and HDFS: No appends in HDFS means data loss; the WAL is useless without them. HADOOP-1700 is making progress
HBase usage pattern is not the same as Map/Reduce: random reads and keeping files open raise "too many open files" issues
HBase or HDFS errors: hard to figure out without debug logging enabled
Visit hbase.org: mailing lists, source, etc.
25 March 2008
Hadoop Summit: Michael Stack and HBase