Thursday, September 20, 2012

Pregel:Google's way of processing graphs


Introduction
Many applications can be modeled as graphs. Google has introduced the Pregel to process large scale graph processing applications. Lets delve more into this Pregel framework.

Few inspirations towards graph processing frameworks:
1) Facebook’s social graph contains 721 million users, 69 billion friendship links. Average distance between two users are 4.74
2) Representation of Wikipedia articles in a graph
3) Database trend is moving towards Graph Database.

Pros:
Large scale graph processing distributed system
Provides fault-tolerance capabilities through checkpointing
Performance of system is improved by bulk synchronous computation
API is modeled as a ‘think like a vertex’

Overall, Pregel has influenced state-of-art towards graphs.

Cons:
Lets talk about issues with Pregel using an example.
In one case,micro data centers are distributed geographically. Nodes in a data center are heterogeneous in nature.
In other case, graph is processed in homogenous mega data center. Performance of Pregel(considering a design in paper) in case 2 is much better than case 1 for following reasons.
1) Pregel doesn’t consider heterogeneous nature of system while partitioning the data. Slow node gets same size of data but can’t perform in similar speed as a fast node.
2) Due to single synchronization barrier, slow node slows down entire whole graph computation.
3) What if Pregel partitions graph data such that there involved huge communication between nodes which are far away.
4) Shape of graph is not considered in partitioning data.

Current state:
Twitter has introduced Cassovary, another big graph processing library. Many projects has inspired from Pregel. Open source projects are Apache Hama and Giraph. Giraph is has strong contributors from Twitter, Facebook, LinkedIn.

Brief descripion of inspired projects:

Apache HamaPure BSP implementation over Hadoop
GiraphIt is almost similar to Hama
HipGJava based library and no single synchronization barrier
Signal/Collectgives same importance to vertices & edges instead of focusing on vertex.
PhoebusPregel in Erlang


Discussion Points:
MapReduce vs Pregel
Issues with single synchronization barrier
Best way of partitioning graph data



2 comments:

jon_weissman said...

good point about load balancing. what does the paper claim about this? does it allocate an equal number of nodes to each worker or is it doing something smarter?

Unknown said...

Paper didn't discuss about design decisions on load balancing.

Even during experiments, graph partitions are assigned to nodes based random hash function. Though authors mentioned topology aware assignment might give better results.