tag:blogger.com,1999:blog-135963392024-02-03T05:34:19.222-05:00ccg.tech...everything, all at once...ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.comBlogger27125tag:blogger.com,1999:blog-13596339.post-73296757991665954292012-01-17T23:41:00.000-05:002012-01-17T23:41:26.148-05:00CENSORED<div class="separator" style="clear: both; text-align: center;"><a href="https://encrypted-tbn3.google.com/images?q=tbn:ANd9GcT4SHcRfsH7DHQdxa_WDKX-t3OP_bK1dr3JPJkMQdOwm9xnNvHVTA" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="348" src="https://encrypted-tbn3.google.com/images?q=tbn:ANd9GcT4SHcRfsH7DHQdxa_WDKX-t3OP_bK1dr3JPJkMQdOwm9xnNvHVTA" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><br />
</div><div class="separator" style="clear: both; text-align: center;">For the next 24 hours this blog is censored. This is what blogs look like on SOPA (and PIPA). Read up and make your own decision: <a href="http://www.blogger.com/%C2%A0http://projects.propublica.org/sopa/"> http://projects.propublica.org/sopa/ </a>. And if you agree, join Wikipedia, Reddit, and others by "going dark" for 24 hours. You can support the protest via Facebook, Twitter, and Google+ by changing your profile picture: <a href="http://www.blackoutsopa.org/">http://www.blackoutsopa.org/</a></div><div class="separator" style="clear: both; text-align: center;"><br />
</div><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://encrypted-tbn0.google.com/images?q=tbn:ANd9GcQmpHGeFNrUnY9Qvd1hlNx64RAqfzxjO44AbQaZQGT0tCNVMtqlqA" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="380" src="https://encrypted-tbn0.google.com/images?q=tbn:ANd9GcQmpHGeFNrUnY9Qvd1hlNx64RAqfzxjO44AbQaZQGT0tCNVMtqlqA" width="400" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://encrypted-tbn3.google.com/images?q=tbn:ANd9GcQqMq2JZwWmLSkg6iW4AXIP1nN-q8mZM2W5LKywYpXdX8aGXEBwCA" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="512" src="https://encrypted-tbn3.google.com/images?q=tbn:ANd9GcQqMq2JZwWmLSkg6iW4AXIP1nN-q8mZM2W5LKywYpXdX8aGXEBwCA" width="640" /></a></div>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-24628693628409600542010-02-10T15:13:00.000-05:002010-02-10T15:13:00.564-05:00Hadoop HDFS: Deceived by Xciever<span class="Apple-style-span"><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: x-large;">It</span><span class="Apple-style-span" style="font-size: small;">'s undocumented. It's misspelled. And without it your (insert euphemism for "really big" here) <a href="http://hadoop.apache.org/">Hadoop</a> grid won't be able to crunch those multi-hundred gigabyte input files. It's the "xciever" setting for <a href="http://hadoop.apache.org/hdfs/">HDFS</a> <a href="http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html#NameNode+and+DataNodes">DataNode</a> daemons, and the absence of it in your configuration is waiting to bite you in the butt.</span></span></span><br />
<div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><br />
</span> </span></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">The software components that make up HDFS include the NameNode, the SecondaryNameNode, and DataNode daemons. Typical installations place the NameNode and SecondaryNameNode on a single "master" node. Installations following best practices go the extra mile and place the NameNode and SecondaryNameNode daemons on separate nodes. DataNode daemons are placed on participating slave nodes. At the highest level the NameNode/SecondaryNameNode daemons manage HDFS metadata (the data used to map file names to server/block lists, etc.), while the DataNode daemons handle the mechanics of reading from and writing to disk, and of serving up blocks of data to processes on the node or to requesting processes on other nodes.</span></span></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><br />
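</span></span></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">To make that division of labor concrete, here's a minimal sketch (my own illustration, not from the Hadoop docs; the path is hypothetical) of a client round trip through the FileSystem API. The NameNode answers the metadata questions, while the bytes themselves stream to and from DataNodes:</span></span></div><pre style="background-color: #cccccc;">import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up conf/*-site.xml
    FileSystem fs = FileSystem.get(conf);       // NameNode handles metadata operations

    Path p = new Path("/tmp/example.txt");      // hypothetical path
    FSDataOutputStream out = fs.create(p);      // block data streams to DataNodes
    out.writeUTF("hello, HDFS");
    out.close();

    FSDataInputStream in = fs.open(p);          // NameNode returns block locations,
    System.out.println(in.readUTF());           // DataNodes serve the actual bytes
    in.close();
  }
}</pre><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><br />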
</span> </span></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">There are many settings that can be used to configure and tune HDFS (as well as the <a href="http://hadoop.apache.org/mapreduce/">MapReduce</a> engine and other Hadoop components). The </span></span><a href="http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html"><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">HDFS documentation</span></span></a><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"> lists </span></span><a href="http://hadoop.apache.org/common/docs/current/hdfs-default.html"><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">45 of them</span></span></a><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"> (along with their default values) last time I checked. These configuration settings are a somewhat disorganized mix of elements used to configure the NameNode and DataNodes.</span> </span><br />
<span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><br />
</span></span></div><div><span class="Apple-style-span"><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">The DataNode daemons have an upper limit on the number of threads they can support, which is set to the absurdly small (IMHO) value of 256. If you have a large job (or even a moderately-sized job) that has many files open you can exhaust this limit. When you do exceed this limit there are a variety of failure modes that can arise. Some of the exception traces point to this value being exhausted, while others do not. In either case, understanding what has gone wrong is difficult. For example, one failure mode observed <br />
raised</span><span class="Apple-style-span" style="font-size: small;"> </span></span></span><span class="Apple-style-span" style="white-space: pre;"><span class="Apple-style-span"><span class="Apple-style-span" style="font-family: 'courier new';"><span class="Apple-style-span" style="font-size: small;">java.io.EOFException</span></span><span class="Apple-style-span" style="font-family: arial;">, <span class="Apple-style-span" style="font-size: small;">with a traceback flowing down into <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">DFSClient</span>.</span></span></span></span></div><div><span class="Apple-style-span" style="white-space: pre;"><span class="Apple-style-span" style="white-space: normal;"><span class="Apple-style-span"><span class="Apple-style-span" style="font-family: arial;"><br />
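</span></span></span></span></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">To get a feel for how quickly the default of 256 can be reached: each active read or write stream occupies a DataXceiver thread on the DataNodes serving its blocks. Here's a contrived sketch of my own (hypothetical paths, not anything we ran) that would apply exactly that kind of pressure:</span></span></div><pre style="background-color: #cccccc;">import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class XceiverPressure {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    List&lt;FSDataOutputStream&gt; open = new ArrayList&lt;FSDataOutputStream&gt;();
    // Each open write pipeline ties up an xceiver thread on every DataNode
    // in its pipeline, so on a small cluster the default cap of 256 threads
    // per DataNode can be exhausted well before this loop finishes.
    for (int i = 0; i &lt; 1000; i++) {
      open.add(fs.create(new Path("/tmp/xceiver-demo/part-" + i)));
    }
    for (FSDataOutputStream out : open) {
      out.close();
    }
  }
}</pre><div><span class="Apple-style-span" style="white-space: pre;"><span class="Apple-style-span" style="white-space: normal;"><span class="Apple-style-span"><span class="Apple-style-span" style="font-family: arial;"><br />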
</span></span></span></span></div><div><span class="Apple-style-span" style="white-space: pre;"><span class="Apple-style-span" style="white-space: normal;"><span class="Apple-style-span"><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">The solution to all this is to configure the Xcievers setting to raise the maximum limit on the number of threads the DataNode daemons are willing to manage. For modern Hadoop releases this should be done in <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">conf/hdfs-site.xml</span>. Here is an example:<br />
</span></span></span></span></span><br />
<span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><span class="Apple-style-span" style="background-color: #cccccc;">< property > </span></span></span><br />
<span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><span class="Apple-style-span" style="background-color: #cccccc;"> < name >dfs.datanode.max.xcievers< /name > </span></span></span><br />
<span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><span class="Apple-style-span" style="background-color: #cccccc;"> < value >4096< /value > </span></span></span><br />
<span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><span class="Apple-style-span" style="background-color: #cccccc;">< /property > </span></span></span><br />
<div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><br />
</span></span></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><br />
</span></span></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">Interestingly, <a href="http://hadoop.apache.org/hbase/">HBase</a> users seem to have more <a href="http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A5">conversations</a> about this setting and related issues than predominantly MapReduce-oriented users do.</span></span></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><br />
</span></span></div></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">The central reason for this posting is to point out this setting as it might be helpful to others. We encountered this problem while porting from an older version of Hadoop to 0.20 (via the </span><a href="http://cloudera.com/"><span class="Apple-style-span" style="font-size: small;">Cloudera</span></a><span class="Apple-style-span" style="font-size: small;"> </span><a href="http://www.cloudera.com/developers/downloads/"><span class="Apple-style-span" style="font-size: small;">distribution</span></a><span class="Apple-style-span" style="font-size: small;">, which I highly recommend). The failure mode was not immediately evident and we wasted a lot of time debugging and chasing assorted theories before we recognized what was happening and changed the configuration. Ironically the<a href="http://issues.apache.org/jira/browse/HDFS-162?focusedCommentId=12636150&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12636150"> bread crumbs</a> which led us to change this setting came from a <a href="http://issues.apache.org/jira/browse/HDFS-162">JIRA I opened</a> regarding a scalability issue in 2008. As I recall the setting was not supported by the version of Hadoop we were using at that time.</span></span></div>
</span> </span></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">A secondary reason for this posting is to point out the crazy name associated with this setting. The term "xceiver" ("i" before "e" except after "c" blah blah blah...) is short-hand for transceiver, a device which transmits and receives. But does "transceiver" really describe "maximum threads a DataNode can support"? At the very least the spelling of this setting should be corrected and the underlying code should recognize both spellings. What would be even better would be to add a setting called "transceivers" or, even more explicitly, "concurrent_threads" like this:</span></span></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><br />
</span> </span></div><div><div><span class="Apple-style-span" style="font-family: 'courier new';"><span class="Apple-style-span" style="font-size: small;">dfs.datanode.max.transceivers </span></span><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"> </span></span></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">or</span></span></div><div><div><span class="Apple-style-span" style="font-family: 'courier new';"><span class="Apple-style-span" style="font-size: small;">dfs.datanode.max.concurrent_threads</span></span></div><div><span class="Apple-style-span" style="font-family: arial;"> </span><br />
<span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">Finally, why is the default value for this setting so low? Why not have it default to a more reasonable value like 2048 or 4096 out of the box? Memory and CPU is cheap, chasing infrastructure issues is expensive.</span></span><br />
<span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;"><br />
</span></span></div><div><span class="Apple-style-span" style="font-family: arial;"><span class="Apple-style-span" style="font-size: small;">Summing it all up: the lack of documentation and the poor spelling of this setting cost us hours of productivity. If you're deploying a big grid, be sure to configure this properly. If you're seeing strange I/O-related failures within your Hadoop infrastructure, be sure to check this value and potentially increase it.</span></span></div><div><span class="Apple-style-span"><span class="Apple-style-span" style="font-family: arial;"><br />
</span></span></div></div></div><div><span class="Apple-style-span" style="white-space: pre;"><span class="Apple-style-span"><span class="Apple-style-span" style="font-family: arial;"> </span></span></span></div>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com2tag:blogger.com,1999:blog-13596339.post-83428555815808465722008-12-31T02:54:00.009-05:002008-12-31T03:22:24.436-05:00Down Time == Productive Time<span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:small;">Back in my college days - long enough ago that mainframe computing was still the rage - I discovered the pleasure of The Holiday Break. Classes were over, everybody went home, the university was largely empty, and as a result you could grab all the computing time you wanted, and you could work for hours undisturbed. The Christmas holiday was always the best. </span></span><div><span class="Apple-style-span" style="font-size:small;"></span><span class="Apple-style-span" style="font-family:arial;"><br /></span><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:small;">Fast forward to present day, and I continue to find the time from just before Christmas until just after the New Year a sort of magic time to get things done. Many offices are shut down for the holiday, so my unpredictable 45-minutes-or-maybe-3-hours commute becomes predictable. The shock jocks and on-air "personalities" on drive time FM radio are all on vacation, and so the radio actually plays music. Things in the office are quiet - folks don't normally schedule releases, death marches, etc., around the holidays. The time of year puts people in a mellow mood. All-in-all, it's a great time of year to grab some quiet time.</span></span></div><div><span class="Apple-style-span" style="font-family: arial;"><br /></span></div><div><span class="Apple-style-span" style="font-family: arial; ">Quiet time becomes productive time in unexpected ways. Since I'm not in the throes of a major release crunch I actually have time to catch up on some reading. Today, for example, I read about 20-30 pages from some of the books shown on my current reading list at the top of the blog. I also read a few great blog posts about a number of things technical. So great, I'm reading and surfing...but is it productive?</span><br /></div><div><span class="Apple-style-span" style="font-family:arial;"><br /></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:small;">Oh yeah, it sure is. I concocted a new way to visualize how well or poorly a very complicated n-tier application is performing, in part based on some blog reading. Very cool stuff, for which I cannot take all the credit, but about which I will blog more in the near future.</span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><br /></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:small;">I also got a couple hours to write up some much needed documentation, and spent time catching up with colleagues who actually have time to talk about what they are working on and what sorts of challenges they are facing. 
This naturally leads to more ideas, and the ball starts rolling.</span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><br /></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:small;">People often talk about the December holidays as down-time for business, and I'm sure that it is. But the flip-side of the coin is that it can be a great time to open up the mind, solve some problems, and come up with good stuff to tackle in the upcoming year. Down time offers the chance to productively daydream without (usually) getting off schedule. So for engineers, down time can be super productive time. And maybe that's why when the weather gets cold, and the Christmas bell-ringers appear on every street corner, my mind turns to thoughts of software design, architecture, and coding. </span></span></div><div><br /></div></div>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-67301694564847602622008-06-03T23:06:00.004-04:002008-06-04T00:06:41.124-04:00Visible Measures wins 2008 MITX Technology Award!<span style="font-family:Arial;font-size:85%;"><a href="http://www.visiblemeasures.com/">Visible Measures </a>(my current gig) won an award tonight at the <a href="http://www.mitx.org/">Massachusetts Innovation and Technology Exchange </a>(MITX) <a href="http://www.mitxawards.org/technologyawards/finalists_winners.aspx?year=2008">2008 What's Next Forum and Technology Awards</a>. We were recognized in the Analytics and Business Intelligence category - the same category that <a href="http://www.compete.com/">Compete</a> (my old company) was entered in last year.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">A lot of great companies were finalists in our category, including Progress Software, salary.com, SiteSpect, and Lexalytics. This was tough competition, which made winning this award all the more sweet. A big shout out to <a href="http://www.v2comms.com/">Version 2 Communications</a> as well - we were their guests at the awards.</span><span style="font-family:Arial;font-size:85%;"><br /></span><br /><span style="font-family:Arial;font-size:85%;">Visible Measures is an awesome company, with an extremely hard-working, highly motivated team. I am extremely proud and humbled to be part of this company.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">The MITX event was very nice. There were plenty of opportunities to network with a lot of interesting people doing a lot of cool stuff. It was great to listen to Larry Weber (Chairman of the Board for MITX and founder of W2 Group) host the awards and dispense free advice ("...with 37 offices worldwide - that's too much overhead..."). MITX honored Amar Bose, who gave a very interesting talk. Bose is legendary - at least in the New England high tech community and particularly within MIT - so hearing him speak live was a privilege.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">The only downside to the evening was the fire alarm going off mid-way through the ceremony. This led to a rather awkward pause in the action while the fire department made sure nothing was wrong. 
</span><br /><span style="font-family:Arial;font-size:85%;"></span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-53887092575497337512008-04-22T16:40:00.003-04:002008-04-22T16:45:09.855-04:00Hadoop Summit Slides<span style="font-family:arial;font-size:85%;">A few weeks ago I went to California for the Hadoop Summit. I posted a bunch of notes in real-time during the summit until the network connection became too flakey to continue. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">The Yahoo folks have come to the rescue however. The slides from the presentations, which are tons better than my notes, are freely available on-line <a href="http://research.yahoo.com/node/2104">here</a>. There are also slides from the Data Intensive Computing Symposium which was held the next day. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">I wish I had know about the Data Intensive symposium as it looks really really interesting (not to mention an excuse to stay in Califorinia one more day...).</span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-11065227619461948672008-04-10T01:57:00.002-04:002008-04-10T02:37:28.718-04:00Infrant/NetGear ReadyNAS NV+<span style="font-family:Arial;font-size:85%;">Just picked up one of these last week. The plan is to use the box as a shared storage resource to back up family data (pictures, etc.), and to back up other systems, and the grid machines in the rack.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">I was originally going to build a box to handle the task, but a friend of mine recommended the ReadyNAS server as a cost effective (and less labor intensive) alternative. This box is basically plug-and-play...the operating system is delivered in firmware, and you configure and operate the box via a web interface and with a program called RAIDar. </span><span style="font-family:Arial;font-size:85%;">The box speaks a variety of protocols and can talk to Windows, Linux, Macs, and streaming media players so it should get along well with all the servers, workstations, etc. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">I bought a diskless version, and populated it with 2x500G Western Digital drives. Initially nothing worked and for a brief time I thought the server was DOA. After a bunch of trial and error I concluded that one of the WD drives was DOA. I brought the box up on 1 drive, configured things, and it just worked. NewEgg RMA'd the bad drive (and even gave me freebie shipping label to send the bad device back...good stuff). </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">I've got 2 more 500G drives arriving tomorrow - the box is hot-pluggable so in theory installation is simple. It should be interesting to get the box up to 2T with X-RAID and do some performance testing.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Product reviews of the ReadyNAS have been widely varied, but so far all things look positive. 
I'll post more about the box once I get my bad drive issues sorted out...</span><br /><span style="font-family:Arial;font-size:85%;"></span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com2tag:blogger.com,1999:blog-13596339.post-17582414116696040362008-04-05T23:18:00.003-04:002008-04-05T23:36:52.242-04:00United Economy Section not so good for laptop users...<span style="font-family:Arial;font-size:85%;">Thankfully I don't have to travel too much on business - there are plenty of things to do right here in the office most of the time. However, I do get a chance to escape the office now and again, and the Hadoop Summit in California a couple weeks ago was one such opportunity.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">The summit was awesome; you can see some of the notes I took in earlier postings here. I talked to a lot of people, heard a lot of good presentations, and got tons of good information about the Hadoop roadmap, future directions, etc. All very good stuff.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">The point of this post, however, is not to heap praise on the Yahoo! folks for such a great meeting, it's to chat a bit about Economy Class on United Airlines. When I did the on-line check-in thing for the flight from BOS to SFO, United offered me the chance to upgrade to "Economy Plus" for an extra $60 to get more legroom. "No thanks", I clicked and thought nothing further about it. Then I got on the airplane (wasn't it George Carlin who said "Let Evel Knievel get ON the plane, I am getting IN the plane...", but I digress...). As we reached the altitude "at which it is safe for portable electronic devices to be used", I reached for my laptop...just as the guy ahead of me reclined for what turned out to be a 6 hour snooze across the country. Hmmm, uhhhhh......not enough room between seats to uhhh open the lid on my Lenovo T60. Ugh. So I broke out the pad of paper and pen to jot down some notes and uhhhhhhh not enough room to even write comfortably. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Fortunately, the guy sitting next to me turned out to be an interesting fellow starting up a hedge fund, and so I spent the rest of the trip happily chatting away about everything from technology to politics. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">On the return flight I was nobody's fool. I really, really wanted to get a good 5 hours of work done on the plane (or should that be "in the plane"...) and so I ponied up the extra $60 for the Economy Plus seat. It was a night-and-day difference. I flipped open the laptop, stretched out, and used up both batteries writing code. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Now I'm not gonna gripe too much about the Welcome-To-Economy-Sorry-About-The-Laptop seat on the flight out...the ticket was cheap enough, particularly for a non-stop flight ($339, BOS->SFO), but I would like to:</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">A. Recommend the hell out of Economy Plus if you have to fly United and you want to work</span><br /><span style="font-family:Arial;font-size:85%;">B. 
Encourage the airlines to think about the impact such narrow seating has on business travelers</span><br /><span style="font-family:Arial;font-size:85%;">C. Remind myself to get to the gym a little more often...maybe if I skinny up enough the Economy section won't be so bad...</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Given the State of the Airline Industry I am pessimistic about anything happening with respect to (B) above, but I sure do feel better getting this rant off my chest....</span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-36140346991865701292008-03-25T18:23:00.002-04:002008-03-25T23:50:53.317-04:00Hadoop Summit: Internet connectivity..<span style="font-family:arial;font-size:85%;">The conference is rolling along - a lot of great information and good presentations all around. Unfortunately, there has been some network flakiness particularly during the afternoon...so I've stopped trying to blog each talk. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">I'll try to summarize some of the more interesting points later...</span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-66224422057794318452008-03-25T16:48:00.001-04:002008-03-25T23:50:53.320-04:00Hadoop Summit: Bryan Duxbury and HBase<span style="font-family:arial;font-size:85%;">Rapleaf is a people search, profile aggregation, and data API company</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Application: custom Ruby web crawler - indexes structured data from profiles. Currently, a page is indexed once and then gone forever:</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"> Internet -> Web Crawler -> Index Processing -> MySQL</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Using HBase to store pages, accessing HBase via a REST servlet. Allows: reindexing at a later date, a better-factored workflow, easier debugging:</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"> Internet -> Web Crawler -> HBase -> Index Processing -> MySQL Database</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Data similar to webtable: keyed on person, not URL; webtable less structured; Rapleaf table not focused on links.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Schema: 2 column families: content (stores search and profile pages), meta (stores info about the retrieval [when, how long, who it's about]). 
Keys are tuples of (person ID, site ID)</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Cluster specs: HDFS/HBase cluster of 16 machines, 2TB disk, 64 cores, 64G of memory</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Load pattern: Approx 3.6T per month (830G compressed), average row size 64K, 14K gzipped</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Performance: average write time 31 seconds (yikes! accounts for ruby time, base64, sending to servlet, writing to HBase, etc.)...median write time is 0.201 seconds. Max write time 359 seconds. Reads not used much at the moment. Note some of these perf. issues are due to Ruby on GreenThreads instead of native threads, etc....haven't profiled the REST servlet either.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">General Observations: Record compression hurt performance in testing; compressing in client gives a big boost; possible addition to standard HBase client. HBase write logs stored in HDFS which don't exist until closed, which means HBase has durability issues. (this will be resolved by Hadoop-1700).</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"></span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-89892726287989772222008-03-25T16:17:00.003-04:002008-03-25T23:50:53.321-04:00Hadoop Summit: Michael Stack and HBase<span style="font-family:arial;font-size:85%;">HBase: Distributed database modelled on BigTable (Google)</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Runs on top of Hadoop core</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Column-store: Wide tables cost only the data stored, NULLs in rows are "free", columns compress well</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Not a SQL database: no joins, no transactions, no column typing, no sophisticated query engine, no SQL, ODBC, etc.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Use HBase: Scale is large; access can be basic and mostly table scans.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">The canonical use case is web-table: table of web crawls keyed by URL, columns of webpage, parse, attributes, etc.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Data Model: Table of rows X columns; timestamp is 3rd dimension</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Cells: uninterpreted byte arrays</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Rows can be any Text value, e.g. 
a URL: rows are ordered lexicographically, row writes are atomic</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Columns grouped into column families: have prefix and a qualifier (attribute: mimetype, attribute: language)</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Members of a column family group have similar character/access</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Implementation: Client, a Master, and one or more region servers (analogous to slaves in Hadoop).</span><br /><span style="font-family:Arial;font-size:85%;">Cluster carries 0->N Tables</span><br /><span style="font-family:Arial;font-size:85%;">Master assigns table regions to region servers</span><br /><span style="font-family:Arial;font-size:85%;">Region server carries 0->N regions: region server keeps write-ahead log of every update; used for recovering lost regionservers; edits first go to WAL, then to Region</span><br /><span style="font-family:Arial;font-size:85%;">Each region is made of MemCache and Stores</span><br /><span style="font-family:Arial;font-size:85%;">Row CRUD: client initially goes to Master for row location; client caches and goes direct to region server thereafter; on fault returns to master to freshen cache</span><br /><span style="font-family:Arial;font-size:85%;">When regions get too big they are split: regionserver manages split, master is informed, parent is off-lined, new daughters deployed.</span><br /><span style="font-family:Arial;font-size:85%;">All data persisted to HDFS</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Connecting: Java is the first-class client; Non-Java clients: thrift server hosting hbase client instance (ruby, c++, java). 
Also REST server hosts hbase client (ruby gem, Active Record via REST); SQL-like shell (HQL); TableInput/OutputFormat for map/reduce</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">History: 2006 Powerset interest in Bigtable; 02/2007 Mike Cafarella provides initial code drop - cleaned up by Powerset and added as Hadoop contrib; First usable HBase in Hadoop 0.15.0; 01/2008 HBase becomes a subproject of Hadoop; Hadoop 0.16.0 incorporates it into the code base</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">HBase 0.1.0 release candidate is effectively Hadoop 0.16.0 with lots of fixes...HBase now stands outside of Hadoop contrib.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Focus on developing user/developer base: 3 committers, tech support (other people's clusters via VPN), 2 User Group meetings - more to follow; working on documentation and ease-of-use.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Performance: good and improving</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Known users: Powerset, Rapleaf, WorldLingo, Wikia</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Near Future: release 0.2.9 to follow release of Hadoop 0.17.0, theme robustness and scalability - rebalancing of regions of cluster, replace HQL with jirb, jython, or beanshell, repair and debugging tools.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">HBase and HDFS: No appends in HDFS: data loss...WAL is useless without it, Hadoop-1700 making progress</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">HBase usage pattern is not the same as Map/Reduce - random reading and keeping files open raise "too many open files" issues</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">HBase or HDFS errors: Hard to figure out without debug logging enabled</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Visit hbase.org: mailing lists, source, etc.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><br /><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"></span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-69345728062975201532008-03-25T14:50:00.002-04:002008-03-25T23:50:53.321-04:00Hadoop Summit: Ben Reed and Zookeeper<span style="font-family:Arial;font-size:85%;">Zookeeper: General, robust coordination service...</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Observations: Distributed systems always need some form of coordination. 
Programmers cannot use locks correctly (note that Google uses the "Chubby" lock service): you can learn locks in 5 minutes, but spend a lifetime learning how to use them properly; message-based coordination can be hard to use in some apps.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Wants: Simple, robust, good performance; tuned for read-dominant workloads; familiar models and interfaces; wait-free: a failed client will not interfere with the requests of a fast client; need to be able to wait efficiently.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Design starting point: start with the File API and strip out what we don't need: partial writes/reads, name. Add what is needed: ordered updates and strong persistence guarantees, conditional updates, watches for data changes, ephemeral nodes (clients create files; if the client goes away the file goes away), generated file names (i.e. mktemp()).</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Data Model: Hierarchical namespace, each znode has data and children, data is read and written in its entirety</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Zookeeper API: create(), delete(), setData(), getData(), exists(), getChildren(), sync()...also some security APIs ("but nobody cares about security"...). getChildren() is like "ls".</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Create flags: ephemeral - the znode gets deleted when the session that created it times out; sequence - the path name will have a monotonically increasing counter, relative to the parent, appended.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">How Zookeeper works: Service made up of a set of machines. Each server stores a full copy of data in-memory....gives very low latency and high throughput. A leader is elected at startup, the rest of the servers become followers. Followers service clients, all updates go through the leader. Update responses are sent when a majority of servers agree.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Example of Zookeeper service: Hadoop on Demand. I got distracted by something during this part of the presentation and didn't get all the basic pieces here, but this uses Zookeeper to track interaction with Torque, and handles graceful shutdown, etc.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Status: Project started in 2006, prototyped in fall 2006, initial implementation March 2007, code moved to zookeeper.sourceforge.net and Apache License November 2007. Everything is pure Java: quorum version and standalone version. 
Clients are available for Java and C.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><br /><span style="font-family:Arial;font-size:85%;"></span><br /><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"></span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-42352089996432087822008-03-25T14:19:00.004-04:002008-03-25T23:50:53.321-04:00Hadoop Summit: Andy Konwinski & X-trace<span style="font-family:arial;font-size:85%;">Monitoring Hadoop using X-trace: Andy Konwinski is from UC Berkeley RAD Lab</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Motivation: Hadoop-style processing masks failures and performance problems</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Objectives: Help Hadoop developers debug and profile Hadoop; help operators monitor and optimize Map/Reduce jobs</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">[CCG: This is cool, we REALLY need this stuff NOW]</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">RAD Lab: Reliable Adaptive Distributed Systems </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Approach: Instrument Hadoop using the X-Trace framework. Trace Analysis: visualization via web-based UI; statistical analysis and anomaly detection</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">So what's X-Trace: Path-based tracing framework; generate event graph to capture causality of events across network (RPC, HTTP, etc.). Annotate message with trace metadata carried along execution path (instrument protocol APIs and RPC libraries). 
Within Hadoop the RPC library has been instrumented.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">DFS Write Message Flow Example: client -> DataNode1 -> DataNode2 -> DataNode3</span><br /><span style="font-family:Arial;font-size:85%;">Report Graph Node: Report Label, Trace ID#, Report ID#, Hostname, timestamp</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Builds a graph which can then be walked.</span><br /><br /><span style="font-family:Arial;font-size:85%;"></span><span style="font-family:Arial;font-size:85%;">[CCG: again you need to see the picture, but you get the idea right?]</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Andy showed some cool graphs representing map/reduce operations in flight...</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Properties of Path Tracing: Deterministic causality and concurrence; control over which events get traced; cross-layer; low-overhead; modest modification of the app source code.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Architecture: Xtrace front ends on each Hadoop node, communicate with Xtrace backend via TCP/IP, backend stores data using BDB, trace analysis web UI communicates with Xtrace backend. Also cool fault detection programs can be run - interact with backend via HTTP.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Trace Analysis UI: "we have pretty pictures which can tell you a lot": perf stats, graphs of utilization, critical path analysis...</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Applications of Trace Analysis: Examined perf of the Apache Nutch web indexing engine on a Wikipedia crawl. Time to create an inverted link index of a 50G crawl - with default configuration, ran in 2 hours.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Then by using trace analysis, was able to make changes to run the same workload in 7 minutes. Used workload analysis to determine that one single reduce task actually failed several times at the beginning: 3 ten-minute timeouts. Bumped max reducers to 2 per node, and dropped execution time to 7 minutes.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Behold the power of pretty pictures!</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Off-line machine learning: faulty machine detection, buggy software detection. Current work on graph processing and analysis.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Future work: Tracing more production map/reduce applications. 
More advanced trace processing tools, migrating code into Hadoop codebase.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><br /></span><span style="font-family:Arial;font-size:85%;"></span><span style="font-family:Arial;font-size:85%;"></span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-25472134913905831972008-03-25T13:47:00.002-04:002008-03-25T23:50:53.322-04:00Hadoop Summit: Michael Isard and DryadLINQ<span style="font-family:arial;font-size:85%;">Michael Isard is from Microsoft Research:</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">"Dryad: Beyond Map-Reduce"</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">What is Map-Reduce: An implementation, a computational model, a programming model.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Implementation Performance: Literally map, reduce, and that's it...reducers write to replicated storage. Complex jobs require pipelining multiple stages....no fault tolerance between stages. Output of Reduce: 2 network copies, 3 disks</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Computational Model: Join combines inputs of different types, "split" produces outputs of different types. This can be done with map-reduce, but leads to ugly programs. Hard to avoid performance penalty described above. Some merge-joins are very expensive. Finally, baking in more operators adds complexity.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Dryad Middleware Layer: Address flexibility and performance issues, more generalized than map-reduce, interface is more complex.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Computational Model: Job is a DAG, each node takes any number of inputs and produces any number of outputs (you need to see the picture). </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">DAG Abstraction Layer: Scheduler handles arbitrary graphs independent of vertex semantics, simple uniform state machine for scheduling and fault-tolerance. Higher levels build plan from application code: Layers isolated, many optimizations are simple graph manipulations, graph can be modified at runtime.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">MapReduce Programming Model: opaque <key,value> pairs are flexible. Front-ends like Sawzall and Pig help, but domain-specific simplifications limit some applications. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Moving beyond simple data-mining to machine learning, etc.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">LINQ: Extensions to .NET in Visual Studio 2008...general-purpose data-parallel programming constructs. 
Data elements are arbitrary .NET types, combined in a generalized framework</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">DryadLINQ: Automagically distribute a LINQ program; some Dryad-specific extensions: same source program runs on single-core to multi-core to cluster. Execution model depends on data source.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">LINQ designed to be extensible, LINQ+C# provides parsing, type checking. LINQ builds expression tree. Root provider class called on evaluation (has access to entire tree, reflection allows very powerful operations). Can add custom operators.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">PLINQ - Running Queries on Multi-Core Processors - parallel implementation of LINQ.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">SQL Server to LINQ: cluster computers run SQL Server. Partitioned tables in local SQL Server DBs. DryadLINQ process can use "SQL to LINQ" provider - "best of both worlds".</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Continuing research: Application-level research (what can we do), system-level research (how can we improve performance), LINQDB?</span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-89653708850967448712008-03-25T12:59:00.002-04:002008-03-25T23:50:53.322-04:00Hadoop Summit: Kevin Beyer and JAQL...<span style="font-family:Arial;font-size:85%;">Kevin Beyer is from IBM Almaden Research Center</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">JAQL (pronounced "jackal") - Query Language for JSON</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"> JSON: JavaScript Object Notation: simple, self-describing and designed for data</span><br /><span style="font-family:Arial;font-size:85%;"> Why: want complete entity in one place, support schemas that vary or evolve over time; standard: used in web 2.0 applications, bindings available for many languages, etc.; not XML - XML designed for doc markup, not for data (hooray, thanks for saying THAT).</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">JAQL processes any data that can be interpreted as JSON (JSON text files, binary, CSV, etc.). 
Internally, JAQL processes binary data-structures.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">JAQL similar to Pig Latin, goal is to get JAQL accepted anywhere JSON might be used (document analytics, doc management [CouchDB], ETL, mashups, ...).</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Immediate Goal: Hide grungy details of writing map-reduce jobs for ETL and analysis workloads; compile JAQL queries into map/reduce jobs.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">JAQL Goals: designed for JSON data, functional query language (few side effects, not a scripting language - set-oriented, highly transformed), composable expressions, draws on other languages, operator plan -> query (rewrites are transforms within the language, any plan is representable in the language itself).</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Core operations: iterations, grouping, joining, combining, sorting, projection, constructors for arrays, records, atomic values, "unnesting", function definition/evaluation.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Some good examples were presented that are too long to type in...wait for the presentations to appear on-line I guess...sorry. Good stuff though, I am liking the language presented more than Pig Latin.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Implementation: JAQL input/output designed for extensibility...basically reads/writes JSON values. Examples: Hadoop InputFormat and OutputFormat. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Roadmap: Another release is imminent, next release this summer (open the source, indexing support, native callout, server implementation with a REST interface).</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><br /><span style="font-family:Arial;font-size:85%;"></span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-89633616783394510502008-03-25T12:33:00.003-04:002008-03-25T23:50:53.323-04:00Hadoop Summit - Christopher Olston and PIG<span style="font-family:arial;font-size:85%;">I saw Chris talk about Pig at MIT a few weeks ago...this looks like the same presentation..</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Example: Tracking users who visit good pages (pagerank type thing). Typical operations: loading data, canonicalizing, database JOIN-type operation, database GROUP BY operation, leading to something which computes the pagerank. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">note: the drawings within the presentation make the above very clear. 
<span style="font-family:Arial;font-size:85%;">Striping visits and pages across multiple servers gives highly parallel processing...a fairly straightforward approach.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">But using just map/reduce: Write the join yourself, as sketched above (ccg - been there, done that; thanks Joydeep Sen Sarma for getting me started). Hadoop users tend to share code among themselves for doing JOINs, and how best to do the join operation, etc. In short, things get ugly in a hurry - gluing map/reduce jobs together, etc. You have to do a lot of low-level operations by hand, and the result is potentially hard to understand and maintain.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">So: A data flow language could easily synthesize map/reduce sequences.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">PIG accomplishes this, using a dataflow language called Pig Latin. Essentially a terse language where each step is loading data, doing some sort of filtering/canonical operation, or doing custom work (via map/reduce). Operators include: FILTER, FOREACH, GENERATE, GROUP, JOIN, COGROUP, UNION. Also support for sorting, splitting, etc. Goal is a very simple language to do powerful things.</span><br /><span style="font-family:Arial;font-size:85%;"><br />Related languages:<br /> SQL - declarative language (i.e. what, not how).<br /> Pig Latin: Sequence of simple steps - close to imperative/procedural programming, semantic order of operations is obvious, incremental construction, debug by viewing intermediate results, etc.<br /><br /> Map/Reduce: welds together primitives (process records -> create groups -> process groups)<br /> Pig Latin: Map/Reduce is basically a special case of Pig; Pig adds built-in primitives for the most-used transformations.<br /><br />So: Is Pig+Hadoop a database system? Not really...<br /><br />Workload: DBMS does tons of stuff, P+H does bulk reads/writes only...just sequential scans<br />Data Representation: DBMS controls format...must predeclare schema, etc.; Pigs eat anything :-)<br />Programming Style: DBMS - system of constraints, P+H: sequence of steps<br />Custom Processing: DBMS: functions second class to logic expressions, P+H: Easy to extend<br /><br />Coming Soon to Pig: Streaming (external executables), static type checking, error handling (partial evaluation), development environment (Eclipse).<br /><br /><br /> </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"></span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-51890194368044567702008-03-25T11:58:00.005-04:002008-03-26T00:55:15.124-04:00Hadoop Summit: Doug Cutting and Eric Baldeschwieler<span style="font-family:arial;font-size:85%;"><em>Hadoop Overview - Doug Cutting and Eric Baldeschwieler</em></span><br /><span style="font-family:Arial;font-size:85%;"><em></em></span><br /><span style="font-family:arial;font-size:85%;">Doug Cutting - pretty much the father of Hadoop - gave an overview of Hadoop history. Interesting comment was that Hadoop achieved web-scale in early 2008...</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Eric14 (Eric B...): Grid computing at Yahoo. 
500M unique users per month, billions of interesting events per day. "Data analysis is the inner loop" at Yahoo. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Y's vision and focus: On-demand shared access to vast pool of resources, support for massively parallel execution, Data Intensive Super Computer (DISC), centrally provisioned and managed, service oriented...Y's focus is not grid computing in terms of Globus, etc., not focused on external usage a la Amazon EC2/S3. Biggest grid is about 2,000 nodes.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Open Source Stack: Commitment to open source development, Yahoo is an Apache Platinum Sponsor</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Tools used to implement Yahoo's grid vision: Hadoop, Pig, Zookeeper (high-availability directory and config services), Simon (cluster and app monitoring).</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Simon: Very early days, internal to Yahoo right now...similar to Ganglia "but more configurable". Highly configurable aggregation system - gathering data from various nodes to produce (customizable?) reports. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">HOD - Hadoop On Demand. The current Hadoop scheduler is FIFO - jobs will run in parallel to the extent that the previous job doesn't saturate the node. HOD is built on Torque (<a href="http://www.clusterresources.com/">http://www.clusterresources.com/</a>) to build virtual clusters, separate file systems, etc. Yahoo has taken development about as far as they want...cleaning up code, etc. Future direction for Yahoo is to invest more heavily in the Hadoop scheduler. Does HOD disrupt data locality - yup, it does. Good news: Future Hadoop work will improve rack locality handling significantly.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Hadoop, HOD, Pig all part of Apache today.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Multiple grids inside Yahoo: tens of thousands of nodes, hundreds of thousands of cores, TBs of memory, PBs of disk...ingests TBs of data daily. </span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">M45 Project: Open Academic Cluster in collaboration with CMU: 500 nodes, 3TB RAM, 1.5PB disk, high bandwidth, located conveniently in a semi-truck trailer</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Open source project and Apache: Goal is for Hadoop to remain a viable open source project. Yahoo has invested heavily...very excited to see additional contributors and committers. "Yahoo is very proud of what we've done with Hadoop over the past few years." 
Interesting metric: Megawatts of Hadoop</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Hadoop Interest Growing: 400 people expressed interest in today's conference, 28 organizations registered their Hadoop usage/cluster, in use in universities on multiple continents, Y has now started hiring employees with Hadoop experience.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">GET INVOLVED: Fix a bug, submit a test case, write some docs, help out!</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Random notes: More than 50% of conference attendees are running Hadoop, many with grids of more than 20 nodes, and several with grids > 100 nodes.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Yahoo just announced collaboration with Computational Research Labs (CRL) in India to "jointly support cloud computing research"...CRL runs EKA - the 4th fastest supercomputer on the planet.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;"></span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-71120287984991276352008-03-25T11:40:00.004-04:002008-03-25T23:50:53.324-04:00Hadoop Summit - 25-March-2008 8:45AM<span style="font-family:arial;font-size:85%;">Huge crowd, various Yahoo celebrities like JeremyZ and EricB, Doug Cutting floating around...it's fun to put faces to names with the various mailing list participants. Looks like good network access, and there are even plug strips lying around everywhere to plug in laptops. Looking forward to hearing some excellent presentations...</span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-26009645492633853522008-03-22T00:34:00.002-04:002008-03-25T23:50:53.324-04:00Hadoop Summit 25-March-2008<span style="font-family:Arial;font-size:85%;">I'm packing my bags for the 6-1/2 hour flight from right coast to left for the <a href="http://upcoming.yahoo.com/event/436226/">Hadoop Summit </a>next week. This promises to be a great event with a lot of good material on the agenda, plus the opportunity to chat with Hadoop contributors and application developers as well. I just wish it was longer than 1 day...</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">There are 215 registered attendees - which seems to be a larger number than Yahoo anticipated when the event was initially announced - </span><span style="font-family:Arial;font-size:85%;">so there is evidently significant interest in Hadoop despite map/reduce technology being declared <a href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html">"a major step backwards"</a> by <a href="http://www.databasecolumn.com/2007/09/stonebraker.html">Mike Stonebraker </a>and <a href="http://www.databasecolumn.com/2007/09/dewitt.html">Dave DeWitt</a>. 
Can't we all just get along?</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">I anticipate having network access at the summit, so I'll try to blog about the presentations and discussions throughout the day.</span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-41576233030067977332008-03-16T00:51:00.003-04:002008-03-16T01:26:59.088-04:00Compete acquired by Taylor Nelson Sofres<span style="font-family:arial;font-size:85%;"><a href="http://www.compete.com/">Compete</a>, a company I worked for as Chief Software Architect, was just recently acquired by <a href="http://www.tnsglobal.com/">Taylor Nelson Sofres</a>. This is a remarkable achievement for Compete and a great deal for both Compete and TNS. If you're interested, you can <a href="http://blog.compete.com/2008/03/03/tns-acquires-compete/">read the official postings </a>about the acquisition. Congratulations to everyone at Compete and TNS! </span><br /><span style="font-family:arial;font-size:85%;"><br />Learning about this acquisition made me feel very proud and quite nostalgic. It was back in April 2001 that I got involved with Compete. I had wrapped up a long contract at Oracle, spent a month in India with my wife visiting her extended family, and was back in Nashua, NH contemplating the irrational exuberance of what we would all soon call "dot bomb." I had been contracting for a while, and while that was good fun I felt it was time to do something a bit more daring/challenging.<br /><br />One thing led to another and I found myself in Boston, on oh-so-chic Newbury Street of all places, sitting in a literally transparent office on the upper floors of a converted church building, discussing a company called Compete with its CTO (and now my good friend) <a href="http://davidcancel.com/">David Cancel</a>. I wound up joining Compete a few weeks later as its Chief Software Architect, a position that I held for almost 5 years. I think I was the 11th or 12th person Compete hired. </span><span style="font-family:arial;font-size:85%;">Compete rode the waves of dot com and dot bomb, and eventually found its way to become the strong player it is today, thanks to its excellent <a href="http://www.competeinc.com/about_compete/management/">management team</a> and a lot of hard work by everybody.</span><br /><span style="font-family:arial;font-size:85%;"><br />Fast forward 7 years and Compete is now a recognized industry force - an established player with a work force of close to 100 extremely talented and dedicated individuals. Expect more great things from this company in the future!</span><br /><span style="font-family:Arial;font-size:85%;"></span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-88897137199758741322007-01-14T02:35:00.000-05:002007-01-14T03:12:55.494-05:00The Grid in My Basement, part 3: That Sinking Feeling<span style="font-family:trebuchet ms;font-size:85%;">Size matters. At least when you are building rackmount machines. 
Of course, were I not suffering from sleep deprivation when I made my hardware purchasing decisions, I would have realized that you can't put a MASSIVE heat sink into a tiny space, but such is life.</span><br /><span style="font-family:Trebuchet MS;font-size:85%;"></span><br /><span style="font-family:Trebuchet MS;font-size:85%;">Anyway, the <a href="http://www.newegg.com/product/product.asp?item=N82E16835106069">very spiffy Blue Orb II CPU cooler </a>is never ever gonna fit in the 2U case I bought. That was evident by inspection before I even unpacked the coolers. Had I done my homework on the motherboard and case dimensions I would have realized that a package with a combined fan + heatsink height of 90.3mm would never fit. Not only that, but the heatsink has length and width dimensions of 140x140mm, which means it might not fit the motherboard at all. There's a huge row of capacitors next to the retention module base, and the DIMM sockets are proximate on the other side. This is all badness from the perspective of installing a massive heatsink.</span><br /><span style="font-family:Trebuchet MS;font-size:85%;"></span><br /><span style="font-family:Trebuchet MS;font-size:85%;">So with a heavy sigh I file for my first RMA from Newegg and package the Orbs for shipment back. Bummer drag, they looked so cool too. So I start looking for an appropriate K8 heatsink for my new nodes, and the fun really begins.</span><br /><span style="font-family:Trebuchet MS;font-size:85%;"></span><br /><span style="font-family:Trebuchet MS;font-size:85%;">First, you may be wondering why I didn't use the cooler that came with the CPU when I bought it. Well, in order to save money I bought the OEM version of everything I could. That eliminates a lot of unnecessary packaging, instruction manuals, and in some cases features - like the CPU cooler on my AMD X2s. So I need to buy a cooler on the aftermarket. </span><br /><span style="font-family:Trebuchet MS;font-size:85%;"></span><br /><span style="font-family:Trebuchet MS;font-size:85%;">The assumption that manufacturers make is that you WILL overclock if you are buying an aftermarket cooler. Therefore, the heatsinks reflect this assumption and most are massive. Looking at heatsinks in close up is sorta like looking at big scary machinery. Pipes and tubes run in all directions, massive banks of fins jut out at weird angles and rise up toward the sky, towering over the motherboard. None of these devices are particularly well suited for the tight space of a 2U (or heaven help you a 1U) case. </span><br /><span style="font-family:Trebuchet MS;font-size:85%;"></span><br /><span style="font-family:Trebuchet MS;font-size:85%;">I start shopping around for low-profile CPU coolers for 2U cases and run into several problems. First, there aren't too many cooler vendors out there that make this stuff. Second, the ones that do aren't terribly interested in Socket 939 applications. Third, the low-profile stuff tends to be crazy expensive - $95 for a low-profile heat sink and fan? No thanks...</span><br /><span style="font-family:Trebuchet MS;font-size:85%;"></span><br /><span style="font-family:Trebuchet MS;font-size:85%;">So I pick up a ruler, open up the case, and start measuring. And measuring. After a good deal of plotting, I calculate that my heat sink can be no more than 70 x 70 x 65mm. And then I start shopping. 
And shopping.</span><br /><br /><span style="font-family:Trebuchet MS;font-size:85%;">Finally, after literally 2 evenings wasted googling around, I hit on a cooler/heatsink sold by ASUS - the same manufacturer that makes the motherboard I am using. I look at the height dimension and am psyched - the combined total of both devices is only 55mm tall! The bad news is that the heatsink runs 77 x 68 x 40mm - meaning that it's potentially too big. I look on the ASUS website (sidebar your honor: remind me to rant and rave later about web sites that provide everything BUT the information you need) and find nothing helpful regarding compatibility with their own motherboards. </span><br /><br /><span style="font-family:Trebuchet MS;font-size:85%;">So I reason as follows: The height dimension will fit just fine; the heatsink will probably fit an ASUS motherboard since ASUS makes both; the absence of a compatibility list means it's compatible with all their offerings or somebody is just lazy. So I bite the bullet and order up the <a href="http://www.newegg.com/Product/Product.asp?Item=N82E16835101202">ASUS Crux K8 MH7S 70mm Hydraulic CPU Cooling Fan with Heatsink</a> and hope for the best. </span><br /><span style="font-family:Trebuchet MS;font-size:85%;"></span><br /><span style="font-family:Trebuchet MS;font-size:85%;">2 days later I get the parts, and a couple days after that I open up the build-in-progress machine and install the heatsink. Have I mentioned how stressful putting a heatsink in can be? I mean there you are with all this expensive hardware that looks pretty darn fragile, and you are pushing down on it with no small amount of force trying to more-or-less permanently mate the CPU to the heatsink. Every time I do this I expect the motherboard to crack or something equally awful.</span><br /><br /><span style="font-family:Trebuchet MS;font-size:85%;">Good news! The new cooler fits perfectly. It clears the lid of the case beautifully, and the dimensions of both heatsink and cooler are within the perimeter of the retention module.</span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-71968906249121801462007-01-14T01:23:00.000-05:002007-01-14T02:30:12.149-05:00The Grid in my Basement, part 2: Parts is Parts...<span style="font-family:Trebuchet MS;font-size:85%;">When I began this project the original goal was "reasonable performance for $500 per machine". That is turning out to be a bit of a challenge, especially since I decided not to cut corners on the rackmount chassis. There is nothing like working in a case for an hour and emerging with half a dozen cuts to your hands from rough edges to cost-justify a clean, well-made chassis. Further challenging the $500 bottom line was the desire to run either a dual core or dual CPU configuration.</span><br /><span style="font-family:Trebuchet MS;font-size:85%;"></span><br /><em><span style="font-family:Trebuchet MS;font-size:85%;"><strong>Form Factor, Topology, etc.</strong></span><br /></em><span style="font-family:Trebuchet MS;font-size:85%;">Socket 939 is fading away, and my research showed me that prices for 939 gear were falling similarly. So as a money-saving technique, I decided to actively seek out socket 939 hardware for this project. I also decided to focus on a good quality motherboard while not necessarily using a server motherboard...this may turn out to be a poor decision - we'll see once things are up and running. 
After reviewing the data sheets and specs on a number of motherboards I decided to use an ATX form factor. </span><br /><span style="font-family:Trebuchet MS;font-size:85%;"></span><br /><strong><em><span style="font-family:Trebuchet MS;font-size:85%;">Performance and Cost Considerations</span><br /><span style="font-family:Trebuchet MS;font-size:85%;"></span></em></strong><br /><span style="font-family:Trebuchet MS;font-size:85%;">I want good performance without breaking the bank. While a sweet dual-CPU/dual-core system with tons of memory and a massive SCSI array would make me smile, it would put the project way beyond budget. So here are the tradeoffs I made:</span><br /><br /><ul><li><span style="font-family:Trebuchet MS;font-size:85%;">Running a single dual-core CPU instead of dual CPUs with dual cores. This means that each box will only be 2-way instead of 4-way; then again, with 4 nodes running a clustering tool like <a href="http://openmosix.sourceforge.net">openMosix</a> I will have an 8-way system, which is still pretty cool.<br /></span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">Have you noticed how pricey memory is lately? I'll start out with 1G per node, but make sure that my motherboard can support at least 4G for future expansion. Note to self: hoard memory later when it's cheap and make a killing on eBay when prices go up again.<br /></span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">SAS or SCSI gives killer I/O performance but at a price. I'll build these machines with SATA-II devices in the 250-320G range, perhaps spending a little more for a larger on-device cache.<br /></span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">My original plan was to build a blade server system, using DIY parts from a vendor like <a href="http://atxblade.com/Rackmt/ATXBladeS100.htm">ATXBlade</a>. But in analyzing the cost - $550 for the blade storage unit, $325 for each blade chassis - I decided that I didn't really need to build a dense server farm. 
After all, I have a full-height rack and will probably not build out enough systems to exceed the capacity of the rack.</span></li></ul><span style="font-family:Trebuchet MS;font-size:85%;"><strong><em>Parts Manifest</em></strong></span><br /><span style="font-family:Trebuchet MS;font-size:85%;">After a significant amount of consideration, here is the parts manifest for each node:</span><br /><br /><ul><li><span style="font-family:Trebuchet MS;font-size:85%;">Motherboard: <a href="http://www.newegg.com/product/product.asp?item=N82E16813131569">ASUS A8N5X Socket 939 NVIDIA nForce4 ATX AMD motherboard </a>- retail packaging</span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">CPU: AMD Athlon 64 X2 4400+ (Toledo core) 2.2GHz processor - OEM packaging</span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">Memory: <a href="http://www.newegg.com/product/product.asp?item=N82E16820141307">Kingston ValueRAM 1G 184 pin (PC3200) DDR400 memory </a>- retail packaging</span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">Disk: <a href="http://www.newegg.com/product/product.asp?item=N82E16822136003">Western Digital Caviar SE16 320G 7200RPM SATA 3.0Gb/s hard drive </a>- OEM packaging</span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">CPU cooler: <a href="http://www.newegg.com/product/product.asp?item=N82E16835106069">Thermaltake CL-P0257 "Blue Orb II" CPU cooler for K8 </a>- retail packaging</span></li></ul><span style="font-family:Trebuchet MS;font-size:85%;">Here is the parts manifest for the chassis:</span><br /><br /><ul><li><span style="font-family:Trebuchet MS;font-size:85%;">Case: <a href="http://www.newegg.com/product/product.asp?item=N82E16856999304">iStarUSA Storm Series D-200 Black 2U rackmount case </a>- retail packaging</span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">Rails: iStarUSA TC-RSL-20 sliding rail kit for rackmount chassis - retail packaging</span></li></ul><span style="font-family:Trebuchet MS;font-size:85%;">The links above are all to Newegg.com product pages, because that's where I bought everything. The CPU I spec above is presently not available, but <a href="http://www.newegg.com/Product/Product.asp?Item=N82E16819103543">this CPU </a>has very similar specs.</span><br /><br /><span style="font-family:trebuchet ms;font-size:85%;">Motherboard selection was driven by the following criteria:</span><br /><ul><li><span style="font-family:Trebuchet MS;font-size:85%;">Socket 939, ATX form factor</span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">Able to support AMD Athlon X2</span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">At least 4GB memory</span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">Support for at least 4 SATA-II devices</span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">On-board RAID support for RAID-0, RAID-1, RAID 0+1, and JBOD</span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">On-board video support</span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">Front Side Bus speed of at least 1000MHz</span></li><li><span style="font-family:Trebuchet MS;font-size:85%;">On-board gigabit network support</span></li></ul><p><span style="font-family:Trebuchet MS;font-size:85%;">The up-to-speed reader will note that the motherboard I chose does not come with on-board video support. I noticed that too - AFTER I had ordered the motherboards. 
There is a whole story behind this that I'll write down later. There is also a question about SATA performance - some spec sheets state the motherboard is SATA-I (1.5Gb/sec) while other spec sheets state it's SATA-II (3.0Gb/sec). I think the board was rev'd at some point and this may have been part of the rev. At any rate, if it turns out to be SATA-I then I can still do some benchmarking and perhaps install a SATA-II card later.</span></p><p></p>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-58487471508357882472007-01-04T01:47:00.000-05:002007-01-14T02:29:50.599-05:00The Grid in My Basement, part 1<span style="font-family:Trebuchet MS;font-size:85%;">The name of the game is parallelism...in short: take apart a problem, break it up into independent pieces, and run as many of those independent pieces at once on separate computers (well, at least on separate CPUs). This is nothing new...parallel computing has been around since Cro-Magnon Geek solved problems by dropping boxes of punch cards bearing almighty FORTRAN into card reading machines and then lurking impatiently in the line printer room for pages of greenbar while multi-gazillion dollar <a href="http://www.oref.co.uk/images/mainframe.jpg">CPUs the size of refrigerators </a>cogitated about his fast Fourier transforms and what not. What is compelling in THIS century is that you can do it cost effectively. Ok, not even cost effectively - downright cheap.</span><span style="font-family:arial;font-size:85%;"><br /></span><span style="font-family:arial;font-size:85%;"><br /></span><span style="font-family:Trebuchet MS;font-size:85%;">In its cheapest form, parallel computing has become nearly free. Witness the <a href="http://www.amazon.com/gp/browse.html?node=201590011">Elastic Compute Cloud </a>over at Amazon (buzz kill: ECC is in beta and they aren't accepting new users at the moment; double plus bad: the largest number of machines you can have is something like 20) where you can rent time on virtual machines for cents per hour. In more expensive forms parallel computing is, well, still pricey. If you're a high energy physicist or financial type with a big <a href="http://en.wikipedia.org/wiki/Grid_computing">grid</a> and you are running name brand hardware you are putting up some mighty big dollars.</span><span style="font-family:arial;font-size:85%;"><br /></span><span style="font-family:arial;font-size:85%;"><br /></span><span style="font-family:Trebuchet MS;font-size:85%;">So I find myself - as an employed practitioner wanting to test my <a href="http://www.vertica.com">employer's</a> new software, and as an entrepreneur designing and trying new concepts in search of the Next Big Geeky Thing - wanting to have access to my very own grid. Thus is born the idea of the Grid in my Basement, or as I prefer to call it: The Data Basement.</span><span style="font-family:arial;font-size:85%;"><br /></span><span style="font-family:arial;font-size:85%;"><br /></span><span style="font-family:Trebuchet MS;font-size:85%;">My plan is to build and deploy a basic grid capable of doing real work in a cost-effective manner. I want to accomplish this with some fairly real-world parameters, so I need real computing horsepower. 
So after hours and hours of combing through CPU specs, motherboard specs, CPU cooler specs (naw, I would NEVER overclock...), power supply specs and the like, I now have boxes and boxes of cool stuff en route from my good friends at <a href="http://newegg.com/">NewEgg </a>(what self-respecting techno-weenie doesn't love newegg?). Over the next few weeks I'll write about the Data Basement as it gets built out and evolves into something useful. With any luck I'll also publish a tutorial with photos about rolling your own rackables.</span><span style="font-family:arial;font-size:85%;"><br /></span><span style="font-family:arial;font-size:85%;"><br /></span><span style="font-family:arial;font-size:85%;"><strong><span style="font-family:Trebuchet MS;">Budgetary Issues and Physical Plant</span><br /></strong></span><span style="font-family:arial;font-size:85%;"><br /></span><span style="font-family:Trebuchet MS;font-size:85%;">I need to build my Data Basement on a budget. After all, I still need to pay the mortgage, feed the family, buy Hoosiers for the racecar, and pay for all the electricity my spiffy new grid will need. I randomly chose a budget in the range of $3,000 - $3,500. The grid will live in the basement, and I want to save space there, so I will use a single standard rack that I picked up in the past. The rack will sit on a wooden platform I'll build from scrap wood (cheap protection in case of minor flooding). Since my basement has reasonably high humidity, I'll put a dehumidifier plumbed to a waste line to keep humidity levels under control. The grid will pull around 1.4kW (about 13 amps @ 110VAC), so there will not be a need for any special electrical work. I want UPS support eventually on the grid, but initially I'll just use some spike filters on each AC line. I already have broadband via DSL with a wired/wireless router, but I'll need a gig switch with several ports so that the grid nodes can communicate with each other at speed.</span><br /><span style="font-family:Trebuchet MS;font-size:85%;"></span><br /><span style="font-family:Trebuchet MS;font-size:85%;">The next posting will discuss parts selection for the individual compute nodes.</span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-1165302165105103932006-12-05T02:01:00.000-05:002007-01-14T23:13:24.979-05:0065 nanometer chip from AMD: The war continues...<span style="font-family:trebuchet ms;font-size:85%;">If you're a fan of AMD, and it's a poorly kept secret that I am one, then you'll be pleased to read the latest blurb in CNET regarding new 65 nanometer chips from AMD:<br /><br /></span><a href="http://news.com.com/2100-1006_3-6140764.html?part=rss&tag=2547-1_3-0-5&subj=news"><span style="font-family:trebuchet ms;font-size:85%;">http://news.com.com/2100-1006_3-6140764.html?part=rss&tag=2547-1_3-0-5&subj=news</span></a><br /><span style="font-family:trebuchet ms;font-size:85%;"><br />This is good stuff, and helps AMD catch up with Intel, who rolled out a 65nm process over a year ago.<br /><br />Netting out all the process changes, the new chips will draw about 30% less power and will be cheaper to produce. This means AMD can compete more heavily against Intel. 
All this, of course, is great news for chip purchasers.<br /><br />More performance, less power, less money...gotta love it!</span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-1160372451343768872006-10-09T01:39:00.000-04:002006-10-09T22:14:17.580-04:00More good reading<span style="font-family:arial;font-size:85%;">Interesting research from MIT about column store databases. Mike Stonebraker and several other database luminaries - and a bunch of smart students - wrote this paper for VLDB.</span><br /><br /><a href="http://db.csail.mit.edu/projects/cstore/vldb.pdf"><span style="font-family:arial;font-size:85%;">http://db.csail.mit.edu/projects/cstore/vldb.pdf</span></a><br /><br /><p><span style="font-family:arial;font-size:85%;">Mike is also the founding executive for both Streambase and Vertica...hmmm.....</span></p>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0tag:blogger.com,1999:blog-13596339.post-1120179239293156432005-06-30T20:36:00.000-04:002005-06-30T20:53:59.300-04:00Good reading...<span style="font-family:arial;font-size:85%;">My interests these days are around searching, indexing, and processing large data sets...datamining on steroids to some extent. Think terabytes not gigabytes, think hundreds of thousands of files instead of thousands, and that's getting to the scale that interests me. Parallel programming, grid processing, and all those new (old) technologies are a big part of making that sort of processing scalable and affordable, especially for small companies. Since I've been in a research/learn/academic state of mind this summer, here's part of my current reading list:</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">Hariri, Parashar et al., <em>"Tools and Environments for Parallel and Distributed Computing"</em> (Wiley, 2004)</span><br /><span style="font-family:Arial;font-size:85%;">Chakrabarti, Soumen, <em>"Mining the Web: Discovering Knowledge from Hypertext Data"</em> (Morgan Kaufmann, 2003)</span><br /><span style="font-family:Arial;font-size:85%;">Morse, H. Stephen, <em>"Practical Parallel Computing"</em> (Academic Press, 1994)</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">The "Tools and Environments" book is a solid overview/review. The Chakrabarti book is fascinating...I can nitpick a bit about it being dated (AltaVista was the BIG search engine when the book was written), but the presentation of algorithms and the depth of detail are impressive...a real "page turner" in a geeky sort of way. I bought "Practical Parallel Computing" back in 1994 when I was prepping for an interview with Thinking Machines (remember them?). The material is a bit dated, but if you view "Tools and Environments" and "Practical" as a combined skim-through/review it's worth the time.</span><br /><span style="font-family:Arial;font-size:85%;"></span><br /><span style="font-family:Arial;font-size:85%;">I really have enjoyed reading Chakrabarti thus far...not yet sure what to do with all the ideas coming out of the reading, but I've got some interesting thoughts.</span>ccghttp://www.blogger.com/profile/01309764084115413592noreply@blogger.com0