Michael Isard is from Microsoft Research:
"Dryad: Beyond Map-Reduce"
What is Map-Reduce: An implementation, a computational model, a programming model.
Implementation Performance: Literally map, reduce, and that's it...reducers write to replicated storage. Complex jobs require pipelining multiple stages...no fault tolerance between stages. Output of Reduce: 2 network copies, 3 disks.
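To make the "pipeline of stages" point concrete, here is a minimal in-memory sketch of two chained map-reduce stages. It is not from the talk: the helper `map_reduce` and the word-count example are my own illustration, and nothing here models replication or the network.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run one map-reduce stage in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # In a real deployment this stage boundary is where reducer output
    # would be written to replicated storage before the next stage reads it.
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Stage 1: count occurrences of each word.
def split_words(line):
    return [(word, 1) for word in line.split()]

def sum_counts(word, counts):
    return (word, sum(counts))

# Stage 2: group words by how often they occurred.
def key_by_count(pair):
    word, count = pair
    return [(count, word)]

def collect_words(count, words):
    return (count, sorted(words))

lines = ["a b a", "b c a"]
word_counts = map_reduce(lines, split_words, sum_counts)
histogram = map_reduce(word_counts, key_by_count, collect_words)
print(word_counts)
print(histogram)
```

Every extra question asked of the data means another full stage, and each stage boundary pays the storage cost noted above.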
Computational Model: Join combines inputs of different types, "split" produces outputs of different types. This can be done with map-reduce, but leads to ugly programs. Hard to avoid the performance penalty described above. Some merge-joins are very expensive. Finally, baking in more operators adds complexity.
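A rough sketch of why joins get ugly in map-reduce: both inputs have to be forced into one record type by tagging each value with its source, and the reducer then has to untangle them again. The data and the names `tag` and `reduce_side_join` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical inputs of two different types, both keyed by user id.
users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "lamp")]

def tag(records, source):
    # "Map" side: wrap every value with a tag naming its input.
    return [(key, (source, value)) for key, value in records]

def reduce_side_join(tagged):
    # Shuffle: group all tagged values by key.
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)
    # "Reduce" side: separate the two inputs again and pair them up.
    joined = []
    for key in sorted(groups):
        names = [v for source, v in groups[key] if source == "user"]
        items = [v for source, v in groups[key] if source == "order"]
        joined.extend((key, name, item) for name in names for item in items)
    return joined

result = reduce_side_join(tag(users, "user") + tag(orders, "order"))
print(result)
```

The tagging and untagging is pure bookkeeping that a model with multi-input vertices would not need.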
Dryad Middleware Layer: Address flexibility and performance issues, more generalized than map-reduce, interface is more complex.
Computational Model: Job is a DAG, each node takes any number of inputs and produces any number of outputs (you need to see the picture).
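Since the picture is not reproduced here, a toy version of the idea may help. This is only a sketch under my own assumptions, not Dryad's API: each vertex is a plain Python function, `run_dag` is a hypothetical helper, and "any number of outputs" is approximated by letting one vertex's result feed several downstream vertices. It uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

def run_dag(vertices):
    """vertices maps a name to (function, [names of upstream vertices])."""
    dependencies = {name: set(deps) for name, (_, deps) in vertices.items()}
    results = {}
    # static_order() yields each vertex only after all of its inputs.
    for name in TopologicalSorter(dependencies).static_order():
        func, deps = vertices[name]
        results[name] = func(*(results[dep] for dep in deps))
    return results

job = {
    "read":  (lambda: [1, 2, 3, 4, 5, 6], []),
    # One output of "read" fans out to two consumers.
    "evens": (lambda xs: [x for x in xs if x % 2 == 0], ["read"]),
    "odds":  (lambda xs: [x for x in xs if x % 2 == 1], ["read"]),
    # "merge" takes two inputs, which plain map or reduce steps do not.
    "merge": (lambda evens, odds: sum(evens) - sum(odds), ["evens", "odds"]),
}

results = run_dag(job)
print(results["merge"])
```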
DAG Abstraction Layer: Scheduler handles arbitrary graphs independent of vertex semantics, simple uniform state machine for scheduling and fault-tolerance. Higher levels build plan from application code: Layers isolated, many optimizations are simple graph manipulations, graph can be modified at runtime.
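One way to picture a "simple uniform state machine" is a scheduler that treats every vertex the same way and simply re-runs one that fails. The sketch below is my own guess at the flavor of this, not Dryad's actual scheduler: the states, the `schedule` function, and the retry limit are all assumptions.

```python
from enum import Enum

class State(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"

def schedule(vertices, max_attempts=3):
    """vertices maps a name to (function, [names of upstream vertices])."""
    state = {name: State.WAITING for name in vertices}
    attempts = {name: 0 for name in vertices}
    outputs = {}
    while any(s is not State.DONE for s in state.values()):
        progressed = False
        for name, (func, deps) in vertices.items():
            if state[name] is State.DONE:
                continue
            if not all(state[dep] is State.DONE for dep in deps):
                continue
            # The scheduler never looks inside func: same transitions for all.
            state[name] = State.RUNNING
            attempts[name] += 1
            progressed = True
            try:
                outputs[name] = func(*(outputs[dep] for dep in deps))
                state[name] = State.DONE
            except Exception:
                if attempts[name] >= max_attempts:
                    state[name] = State.FAILED
                    raise
                state[name] = State.WAITING  # eligible to be re-run
        if not progressed:
            raise RuntimeError("no runnable vertex: the graph has a cycle")
    return outputs, attempts

calls = {"total": 0}

def flaky_total(xs):
    # Simulate a machine failure on the first attempt only.
    calls["total"] += 1
    if calls["total"] == 1:
        raise IOError("simulated failure")
    return sum(xs)

outputs, attempts = schedule({
    "source": (lambda: [1, 2, 3], []),
    "total": (flaky_total, ["source"]),
})
print(outputs["total"], attempts)
```

Only the failed vertex is re-run; the completed `source` vertex keeps its output.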
MapReduce Programming Model: opaque
Moving beyond simple data-mining to machine learning, etc.
LINQ: Extensions to .NET in Visual Studio 2008...general purpose data-parallel programming constructs. Data elements are arbitrary .NET types, combined in generalized framework.
DryadLINQ: Automagically distribute a LINQ program; some Dryad-specific extensions: same source program runs on single-core to multi-core to cluster. Execution model depends on data source.
LINQ designed to be extensible, LINQ+C# provides parsing, type checking. LINQ builds expression tree. Root provider class called on evaluation (has access to entire tree, reflection allows very powerful operations). Can add custom operators.
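The "build a tree now, let a provider decide how to run it later" idea can be imitated in a few lines. This is a Python analogy, not LINQ itself: the `Query` class, the flat list of operators standing in for a real expression tree, and both providers are hypothetical names of mine.

```python
class Query:
    """Records operators instead of running them (deferred evaluation)."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def where(self, predicate):
        return Query(self.source, self.ops + (("where", predicate),))

    def select(self, func):
        return Query(self.source, self.ops + (("select", func),))

    def evaluate(self, provider):
        # The provider sees the whole recorded plan and chooses a strategy.
        return provider(self)

def local_provider(query):
    data = iter(query.source)
    for kind, func in query.ops:
        data = filter(func, data) if kind == "where" else map(func, data)
    return list(data)

def partitioned_provider(query, partitions=2):
    # Same plan, different execution: split the input and run each part.
    items = list(query.source)
    size = max(1, -(-len(items) // partitions))  # ceiling division
    out = []
    for start in range(0, len(items), size):
        out.extend(local_provider(Query(items[start:start + size], query.ops)))
    return out

q = Query(range(10)).where(lambda x: x % 2 == 0).select(lambda x: x * x)
local_result = q.evaluate(local_provider)
partitioned_result = q.evaluate(partitioned_provider)
print(local_result, partitioned_result)
```

The query text never changes; only the provider handed to `evaluate` does, which is the property the notes describe for running one source program at different scales.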
PLINQ - Running Queries on Multi-Core Processors - parallel implementation of LINQ.
SQL Server to LINQ: cluster computers run SQL Server. Partitioned tables in local SQL Server DBs. DryadLINQ process can use "SQL to LINQ" provider - "best of both worlds".
Continuing research: Application-level research (what can we do), system-level research (how can we improve performance), LINQDB?
25 March 2008
Hadoop Summit: Michael Isard and DryadLINQ