**Introduction**

Many applications can be modeled as graphs. Google introduced Pregel to process large-scale graphs. Let's delve into the Pregel framework.

A few motivations for graph processing frameworks:

1) Facebook’s social graph contains 721 million users and 69 billion friendship links. The average distance between two users is 4.74.

2) Wikipedia articles can be represented as a graph.

3) The database trend is moving towards graph databases.

Pros:

A distributed system for large-scale graph processing

Provides fault tolerance through checkpointing

Bulk synchronous computation improves system performance

The API follows a ‘think like a vertex’ model

Overall, Pregel has influenced the state of the art in graph processing.
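To make the ‘think like a vertex’ model and the bulk synchronous supersteps more concrete, here is a minimal single-process sketch in Python. It is not the Pregel API: the names `compute` and `run_supersteps` are hypothetical, and the example (propagating the maximum value through the graph) is only an illustration. Each vertex sees only its own value and its incoming messages, and messages sent in one superstep are delivered in the next.

```python
from collections import defaultdict


def compute(value, messages, superstep):
    """Vertex program: adopt the largest value seen so far.

    Returns (new_value, send). `send` tells the runtime whether to message
    the neighbours; a vertex that learns nothing new stays silent (halts).
    """
    if superstep == 0:
        return value, True  # every vertex announces its initial value
    best = max(messages, default=value)
    if best > value:
        return best, True
    return value, False


def run_supersteps(edges, values, max_supersteps=50):
    """Toy BSP loop (hypothetical helper, assumes every edge target is in `values`)."""
    values = dict(values)
    inbox = defaultdict(list)
    for superstep in range(max_supersteps):
        # All vertices are active in superstep 0; afterwards only vertices
        # that received a message are woken up.
        active = list(values) if superstep == 0 else list(inbox)
        if not active:
            return values, superstep
        outbox = defaultdict(list)
        for v in active:
            values[v], send = compute(values[v], inbox.get(v, []), superstep)
            if send:
                for neighbour in edges.get(v, []):
                    outbox[neighbour].append(values[v])
        # Barrier: messages sent in this superstep are visible only in the next.
        inbox = outbox
    return values, max_supersteps


if __name__ == "__main__":
    edges = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    final, steps = run_supersteps(edges, {"a": 3, "b": 6, "c": 2})
    print(final, steps)
```

In a real deployment the vertices would be spread over many workers and the barrier would be a cluster-wide synchronization point; the loop above only mimics that ordering on one machine.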

Cons:

Let's discuss the issues with Pregel using an example.

In case 1, micro data centers are distributed geographically, and the nodes in a data center are heterogeneous.

In case 2, the graph is processed in a homogeneous mega data center. Pregel (as designed in the paper) performs much better in case 2 than in case 1, for the following reasons.

1) Pregel doesn’t consider the heterogeneity of the system when partitioning the data. A slow node gets the same amount of data as a fast node but can’t process it at the same speed.

2) Because of the single synchronization barrier, a slow node slows down the entire graph computation.

3) Pregel may partition the graph such that a huge amount of communication takes place between nodes that are far apart.

4) The shape of the graph is not considered when partitioning the data.
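Points 1 and 2 can be illustrated with a rough cost model. The sketch below assumes that a worker's time in a superstep is simply its partition size divided by its processing speed, and that the barrier makes the superstep as long as the slowest worker. Communication cost is ignored, all numbers are made up, and the speed-weighted split is a hypothetical alternative, not something the paper proposes.

```python
def superstep_time(partition_sizes, speeds):
    """Time of one superstep: the barrier waits for the slowest worker."""
    return max(size / speed for size, speed in zip(partition_sizes, speeds))


def equal_partitions(total_vertices, speeds):
    """Same share for every worker, regardless of its speed."""
    return [total_vertices / len(speeds)] * len(speeds)


def speed_weighted_partitions(total_vertices, speeds):
    """Share proportional to worker speed (hypothetical alternative)."""
    total_speed = sum(speeds)
    return [total_vertices * s / total_speed for s in speeds]


# Made-up cluster: three fast nodes and one slow node (vertices per time unit).
speeds = [4.0, 4.0, 4.0, 1.0]
total = 1300  # made-up number of vertices

equal = superstep_time(equal_partitions(total, speeds), speeds)
weighted = superstep_time(speed_weighted_partitions(total, speeds), speeds)
print(f"equal split:    {equal:.1f} time units per superstep")
print(f"weighted split: {weighted:.1f} time units per superstep")
```

With these made-up numbers the equal split takes 325 time units per superstep, because every worker waits for the slow node, while the speed-weighted split takes 100. The model says nothing about points 3 and 4, which depend on network distance and graph shape.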

Current state:

Twitter has introduced Cassovary, another big graph processing library. Many projects have been inspired by Pregel. Open source projects include Apache Hama and Giraph. Giraph has strong contributors from Twitter, Facebook, and LinkedIn.

Brief description of inspired projects:

| Project | Description |
|---|---|
| Apache Hama | Pure BSP implementation over Hadoop |
| Giraph | Very similar to Hama |
| HipG | Java-based library with no single synchronization barrier |
| Signal/Collect | Gives equal importance to vertices and edges instead of focusing on vertices |
| Phoebus | Pregel in Erlang |

Discussion Points:

MapReduce vs Pregel

Issues with single synchronization barrier

Best way of partitioning graph data

## 2 comments:

good point about load balancing. what does the paper claim about this? does it allocate an equal number of nodes to each worker or is it doing something smarter?

The paper doesn't discuss design decisions on load balancing.

Even in the experiments, graph partitions are assigned to nodes based on a random hash function, though the authors mention that topology-aware assignment might give better results.
