25 March 2008

Hadoop Summit: Michael Isard and DryadLINQ

Michael Isard is from Microsoft Research:

"Dryad: Beyond Map-Reduce"

What is Map-Reduce: An implementation, a computational model, a programming model.

Implementation Performance: Literally map, reduce, and that's it. Reducers write to replicated storage. Complex jobs require pipelining multiple stages, with no fault tolerance between stages. Output of Reduce: 2 network copies, 3 disks.

Computational Model: Join combines inputs of different types; "split" produces outputs of different types. Both can be done with map-reduce, but this leads to ugly programs, and it is hard to avoid the performance penalty described above. Some merge-joins are very expensive. Finally, baking in more operators adds complexity.
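To make the "ugly programs" point concrete, here is a minimal sketch of a join forced through a toy, single-process map-reduce. It is not from the talk: the `map_reduce` helper, the `users` and `clicks` data, and the tagging scheme are all assumptions chosen for illustration. Because every record goes through one mapper, the two input types have to be tagged and then untangled again in the reducer.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy single-process map-reduce: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Hypothetical inputs of two different types.
users = [(1, "ann"), (2, "bob")]                        # (user_id, name)
clicks = [(1, "/home"), (1, "/search"), (2, "/home")]   # (user_id, url)

# One mapper sees every record, so each record carries a tag for its type.
tagged = [("user", u) for u in users] + [("click", c) for c in clicks]

def join_mapper(record):
    tag, (user_id, payload) = record
    yield user_id, (tag, payload)

def join_reducer(user_id, values):
    # Untangle the two types again before pairing them up.
    names = [payload for tag, payload in values if tag == "user"]
    urls = [payload for tag, payload in values if tag == "click"]
    for name in names:
        for url in urls:
            yield (user_id, name, url)

joined = map_reduce(tagged, join_mapper, join_reducer)
print(joined)
```

The join logic itself is two lines; the rest is bookkeeping to squeeze two typed inputs into one untyped stream.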

Dryad Middleware Layer: Address flexibility and performance issues, more generalized than map-reduce, interface is more complex.

Computational Model: Job is a DAG, each node takes any number of inputs and produces any number of outputs (you need to see the picture).

DAG Abstraction Layer: Scheduler handles arbitrary graphs independent of vertex semantics, with a simple uniform state machine for scheduling and fault-tolerance. Higher levels build the plan from application code: layers are isolated, many optimizations are simple graph manipulations, and the graph can be modified at runtime.
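The idea of a scheduler that knows nothing about what the vertices do can be sketched roughly as follows. This is not Dryad's API: `run_dag`, the vertex names, and the retry policy are hypothetical, and the sketch simplifies "any number of outputs" to a single output value that several downstream vertices may consume. It uses Python's standard `graphlib` (3.9+) to order the graph.

```python
from graphlib import TopologicalSorter

def run_dag(vertices, upstream, max_attempts=3):
    """Run opaque vertex functions in dependency order, retrying on failure.

    vertices: name -> function taking a list of upstream results
    upstream: name -> list of upstream vertex names (input order)
    """
    results, attempts = {}, {}
    for name in TopologicalSorter(upstream).static_order():
        inputs = [results[u] for u in upstream.get(name, [])]
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                results[name] = vertices[name](inputs)
                break
            except RuntimeError:
                # Same retry rule for every vertex, whatever it computes.
                if attempt == max_attempts:
                    raise
    return results, attempts

calls = {"count": 0}

def flaky_merge(inputs):
    """Two-input vertex that fails on its first run (simulated failure)."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated machine failure")
    left, right = inputs
    return sorted(left + right)

vertices = {
    "read_a": lambda inputs: [3, 1],
    "read_b": lambda inputs: [2],
    "merge": flaky_merge,
    "total": lambda inputs: sum(inputs[0]),
    "largest": lambda inputs: max(inputs[0]),
}
upstream = {
    "read_a": [],
    "read_b": [],
    "merge": ["read_a", "read_b"],   # two inputs
    "total": ["merge"],              # merge fans out to two consumers
    "largest": ["merge"],
}

results, attempts = run_dag(vertices, upstream)
print(results["total"], results["largest"], attempts["merge"])
```

Only the failed vertex is rerun; its upstream results are reused, which is the kind of per-vertex fault tolerance a chain of separate map-reduce jobs does not give you between stages.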

MapReduce Programming Model: opaque key/value pairs are flexible. Front-ends like Sawzall and Pig help, but domain-specific simplifications limit some applications.

Moving beyond simple data-mining to machine learning, etc.

LINQ: Extensions to .NET in Visual Studio 2008: general-purpose data-parallel programming constructs. Data elements are arbitrary .NET types, combined in a generalized framework.

DryadLINQ: Automagically distribute a LINQ program; some Dryad-specific extensions: same source program runs on single-core to multi-core to cluster. Execution model depends on data source.

LINQ is designed to be extensible. LINQ+C# provides parsing and type checking. LINQ builds an expression tree. The root provider class is called on evaluation (it has access to the entire tree, and reflection allows very powerful operations). You can add custom operators.
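The "build a tree, hand it to a provider on evaluation" pattern can be illustrated with a small Python analogy. This is not LINQ or DryadLINQ code: `Query`, `LocalProvider`, and `plan` are hypothetical names, and unlike real LINQ expression trees the lambdas here stay opaque rather than being captured as inspectable expressions.

```python
class Query:
    """Records operators as a tree instead of running them immediately."""

    def __init__(self, op, arg=None, source=None):
        self.op, self.arg, self.source = op, arg, source

    def where(self, predicate):
        return Query("where", predicate, self)

    def select(self, projection):
        return Query("select", projection, self)

    def plan(self):
        """Walk from the root back to the data source; return operator names in execution order."""
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.source
        return list(reversed(ops))

    def __iter__(self):
        # Nothing runs until the query is enumerated; the provider then
        # receives the root and can inspect the whole tree.
        return iter(LocalProvider().execute(self))


class LocalProvider:
    """Hypothetical provider that evaluates the tree in-process."""

    def execute(self, node):
        if node.op == "source":
            return list(node.arg)
        rows = self.execute(node.source)
        if node.op == "where":
            return [row for row in rows if node.arg(row)]
        if node.op == "select":
            return [node.arg(row) for row in rows]
        raise ValueError(f"unknown operator: {node.op}")


query = (
    Query("source", [1, 2, 3, 4, 5, 6])
    .where(lambda x: x % 2 == 0)
    .select(lambda x: x * x)
)
print(query.plan())
print(list(query))
```

Swapping `LocalProvider` for one that partitions the work across cores or machines would leave the query text unchanged, which is the property the talk attributes to DryadLINQ.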

PLINQ - Running Queries on Multi-Core Processors - parallel implementation of LINQ.

SQL Server to LINQ: cluster computers run SQL Server. Partitioned tables in local SQL Server DBs. DryadLINQ process can use "SQL to LINQ" provider - "best of both worlds".

Continuing research: Application-level research (what can we do), system-level research (how can we improve performance), LINQDB?
