Introduction
Many applications can be modeled as graphs. Google introduced Pregel to process large-scale graph applications. Let's take a closer look at the Pregel framework.
A few motivations for graph processing frameworks:
1) Facebook’s social graph contains 721 million users and 69 billion friendship links. The average distance between two users is 4.74.
2) Wikipedia articles can be represented as a graph.
3) The database trend is moving towards graph databases.
Pros:
A distributed system for large-scale graph processing
Provides fault tolerance through checkpointing
Bulk synchronous computation improves system performance
The API follows a ‘think like a vertex’ model
Overall, Pregel has influenced the state of the art in graph processing.
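To make the ‘think like a vertex’ model and bulk synchronous computation concrete, here is a minimal single-process sketch in Python. It is not Pregel's actual API: the names run_supersteps and max_value_compute are hypothetical, and distribution, checkpointing, and combiners are left out. Each vertex runs a compute function once per superstep, reads only the messages sent to it in the previous superstep, and votes to halt; an incoming message wakes it up again.

```python
from collections import defaultdict


def run_supersteps(graph, values, compute, max_supersteps=30):
    """Run a vertex-centric BSP loop on one machine (illustrative only).

    graph:  dict mapping each vertex to a list of its out-neighbors
    values: dict mapping each vertex to its current value (updated in place)
    """
    inbox = {v: [] for v in graph}
    active = set(graph)
    superstep = 0
    while superstep < max_supersteps and (active or any(inbox.values())):
        outbox = defaultdict(list)
        next_active = set()
        for v in graph:
            messages = inbox[v]
            # A halted vertex is skipped unless a message reactivates it.
            if v not in active and not messages:
                continue
            new_value, outgoing, halt = compute(
                superstep, v, values[v], graph[v], messages
            )
            values[v] = new_value
            for target, message in outgoing:
                outbox[target].append(message)
            if not halt:
                next_active.add(v)
        # Synchronization barrier: messages sent in this superstep
        # become visible only in the next one.
        inbox = {v: outbox.get(v, []) for v in graph}
        active = next_active
        superstep += 1
    return values


def max_value_compute(superstep, vertex, value, neighbors, messages):
    """Propagate the largest value in the graph to every vertex."""
    new_value = max([value] + messages)
    changed = superstep == 0 or new_value > value
    outgoing = [(n, new_value) for n in neighbors] if changed else []
    # Always vote to halt; a later message will wake the vertex up.
    return new_value, outgoing, True


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    values = {"a": 3, "b": 6, "c": 2, "d": 1}
    print(run_supersteps(graph, values, max_value_compute))
```

The computation stops when every vertex has voted to halt and no messages are in flight. In a real deployment the inner loop would be spread across workers, each owning one partition of the vertices.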
Cons:
Let's discuss the issues with Pregel using an example.
In the first case, micro data centers are distributed geographically, and the nodes within a data center are heterogeneous.
In the second case, the graph is processed in a homogeneous mega data center. The performance of Pregel (assuming the design described in the paper) is much better in case 2 than in case 1, for the following reasons.
1) Pregel doesn’t consider the heterogeneity of the system when partitioning the data. A slow node gets the same amount of data as a fast node but can’t process it at the same speed.
2) Because of the single synchronization barrier, a slow node slows down the entire graph computation.
3) Pregel may partition the graph data in a way that requires heavy communication between nodes that are far apart.
4) The shape of the graph is not considered when partitioning the data.
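Reasons 1 and 2 can be illustrated with a small back-of-the-envelope model in Python. This is a sketch under simplifying assumptions, not a measurement: the worker speeds and vertex counts are made up, communication cost is ignored, and the function names are hypothetical. The point is only that with a single barrier, the time of a superstep is set by the slowest worker.

```python
def superstep_time(partition_sizes, speeds):
    """Time of one superstep: the barrier waits for the slowest worker.

    partition_sizes: vertices assigned to each worker
    speeds:          vertices each worker processes per second (assumed)
    """
    return max(size / speed for size, speed in zip(partition_sizes, speeds))


def equal_partition(total_vertices, speeds):
    """Give every worker the same share, ignoring its speed."""
    return [total_vertices / len(speeds)] * len(speeds)


def proportional_partition(total_vertices, speeds):
    """Give each worker a share proportional to its speed."""
    total_speed = sum(speeds)
    return [total_vertices * speed / total_speed for speed in speeds]


if __name__ == "__main__":
    speeds = [4, 4, 1]  # two fast workers and one slow worker (made-up numbers)
    total = 900

    equal = equal_partition(total, speeds)
    weighted = proportional_partition(total, speeds)

    print("equal split:       ", equal, superstep_time(equal, speeds))
    print("proportional split:", weighted, superstep_time(weighted, speeds))
```

With these made-up numbers, the equal split is held back by the slow worker, while the speed-proportional split lets all workers reach the barrier at the same time. It says nothing about reasons 3 and 4, which depend on network distance and graph shape.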
Current state:
Twitter has introduced Cassovary, another big graph processing library. Many projects have been inspired by Pregel. Open source projects include Apache Hama and Giraph. Giraph has strong contributors from Twitter, Facebook, and LinkedIn.
Brief description of the inspired projects:
Apache Hama | Pure BSP implementation over Hadoop |
Giraph | Very similar to Hama |
HipG | Java-based library with no single synchronization barrier |
Signal/Collect | Gives equal importance to vertices and edges instead of focusing on vertices |
Phoebus | Pregel in Erlang |
Discussion Points:
MapReduce vs Pregel
Issues with single synchronization barrier
Best way of partitioning graph data