25 March 2008

Hadoop Summit - Christopher Olston and Pig

I saw Chris talk about Pig at MIT a few weeks ago; this looks like the same presentation.

Example: tracking users who visit good pages (a pagerank-type thing). Typical operations: loading data, canonicalizing it, a database JOIN-type operation, and a database GROUP BY operation, leading to something which computes the pagerank.

Note: the drawings within the presentation make the above very clear. Visits and pages are striped across multiple servers, with highly parallel processing; a fairly straightforward approach.
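To make the example pipeline concrete, here is a minimal single-machine sketch in plain Python (not Pig, and not from the talk). The record layouts, the `canonicalize` rule, the per-user averaging, and the `threshold` value are all my assumptions for illustration; the talk only named the steps.

```python
from collections import defaultdict

def canonicalize(url):
    # Hypothetical normalization rule: lowercase, drop a trailing slash.
    return url.lower().rstrip("/")

def good_page_visitors(visits, pages, threshold=0.5):
    """visits: iterable of (user, url); pages: iterable of (url, pagerank)."""
    # Load + canonicalize the pages side.
    rank_by_url = {canonicalize(url): rank for url, rank in pages}

    ranks_by_user = defaultdict(list)
    for user, url in visits:
        key = canonicalize(url)
        if key in rank_by_url:                            # JOIN on url
            ranks_by_user[user].append(rank_by_url[key])  # GROUP BY user

    # Assumed final step: average pagerank per user, keep the high ones.
    averages = {u: sum(r) / len(r) for u, r in ranks_by_user.items()}
    return {u: avg for u, avg in averages.items() if avg > threshold}

visits = [("amy", "WWW.CNN.com/"), ("amy", "www.frogs.com"), ("fred", "www.snails.com")]
pages = [("www.cnn.com", 1.0), ("www.frogs.com", 0.5), ("www.snails.com", 0.25)]
result = good_page_visitors(visits, pages)
```

On one machine this is trivial; the interesting part of the talk is doing the same steps when visits and pages are striped across many servers.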

But using just map/reduce: you write the join yourself (ccg - been there, done that; thanks to Joydeep Sen Sharma for getting me started). Hadoop users tend to share code among themselves for doing JOINs, along with advice on how best to do the join operation. In short, things get ugly in a hurry: you glue map/reduce jobs together and do a lot of low-level operations by hand, and the resulting code is potentially hard to understand and maintain.
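For a feel of what "write the join yourself" means, here is a rough sketch of a reduce-side join, simulated in plain Python rather than on Hadoop. The function names (`map_visits`, `map_pages`, `shuffle`, `reduce_join`, `run_join`) and the tagging scheme are hypothetical, picked for illustration; real Hadoop code would be Java mapper/reducer classes with considerably more plumbing.

```python
from collections import defaultdict
from itertools import product

def map_visits(record):
    user, url = record
    yield url, ("visit", user)      # tag each value with its source table

def map_pages(record):
    url, rank = record
    yield url, ("page", rank)

def shuffle(pairs):
    # Stand-in for the framework's sort/shuffle: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_join(url, tagged_values):
    # Separate the two sides again, then emit their cross product.
    users = [v for tag, v in tagged_values if tag == "visit"]
    ranks = [v for tag, v in tagged_values if tag == "page"]
    for user, rank in product(users, ranks):
        yield user, url, rank

def run_join(visits, pages):
    pairs = []
    for record in visits:
        pairs.extend(map_visits(record))
    for record in pages:
        pairs.extend(map_pages(record))
    joined = []
    for url, values in shuffle(pairs).items():
        joined.extend(reduce_join(url, values))
    return joined

joined = run_join([("amy", "a"), ("fred", "b"), ("amy", "c")],
                  [("a", 1.0), ("b", 0.5)])
```

Even in this toy form, the join logic is smeared across two mappers and a reducer, which is the maintenance problem the talk was pointing at.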

So: a dataflow language could easily synthesize map/reduce sequences.

Pig accomplishes this using a dataflow language called Pig Latin. It is essentially a terse language where each step either loads data, does some sort of filtering/canonicalizing operation, or does custom work (via map/reduce). Operators include: FILTER, FOREACH, GENERATE, GROUP, JOIN, COGROUP, UNION. There is also support for sorting, splitting, etc. The goal is a very simple language that does powerful things.

Related languages:
SQL - declarative language (i.e. what, not how).
Pig Latin: Sequence of simple steps - close to imperative/procedural programming, semantic order of operations is obvious, incremental construction, debug by viewing intermediate results, etc.

Map/Reduce: welds together primitives (process records -> create groups -> process groups)
Pig Latin: Map/Reduce is basically a special case of Pig; Pig adds built-in primitives for the most-used transformations.
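The three welded-together primitives can be sketched as a single function in plain Python. This is only an in-memory illustration of the shape of map/reduce, not how Hadoop implements it; `map_reduce`, `mapper`, and `reducer` are names I made up for the sketch.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:                  # 1. process records
        for key, value in mapper(record):
            groups[key].append(value)       # 2. create groups
    # 3. process groups
    return {key: reducer(key, values) for key, values in groups.items()}

# Usage: count visits per url from (user, url) records.
counts = map_reduce(
    [("amy", "a"), ("fred", "a"), ("amy", "b")],
    mapper=lambda record: [(record[1], 1)],
    reducer=lambda key, values: sum(values),
)
```

The point of the comparison is that in map/reduce these three stages always come as one fixed package, whereas Pig Latin lets you use per-record processing, grouping, and per-group processing as separate steps and combine them freely.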

So: is Pig+Hadoop a database system? Not really...

Workload: DBMS does tons of stuff; P+H does bulk reads/writes only, just sequential scans
Data Representation: DBMS controls the format (must predeclare schema, etc.); Pigs eat anything :-)
Programming Style: DBMS: system of constraints; P+H: sequence of steps
Custom Processing: DBMS: functions are second class to logic expressions; P+H: easy to extend

Coming Soon to Pig: Streaming (external executables), static type checking, error handling (partial evaluation), development environment (Eclipse).




