In this section we provide a brief tutorial on how the pieces of GraphLab come together to form a simple GraphLab program. The tutorial constructs a synthetic application that uses many of the GraphLab concepts; however, most real GraphLab applications will not need every piece described here. A syntax-highlighted, condensed version of the code is available in demo.cpp.
We begin by describing the synthetic problem we will be solving. We are given a d by d grid of magnetic coins. Each coin only interacts with its 4 neighboring coins, as illustrated by the edges in Figure 1(a). When a coin is flipped, it lands red side up with probability equal to the proportion of its neighbors that are red, and black side up otherwise.
Figure 1: The grid of red/black coins. Each coin is connected to its 4 cardinal neighbors (as shown by the edges). (a) An intermediate state of the system. (b, c) The two stable states of the system.
The only two stable joint states are when all coins are black, Figure 1(b), or when all coins are red, Figure 1(c). We are interested in computing the average number of flips for each coin before a stable state is reached.
We first need to include some headers. The graphlab.hpp header includes everything you will ever need from GraphLab.
// standard C++ headers
#include <iostream>

// includes the entire graphlab framework
#include <graphlab.hpp>
In some of our demo code you may occasionally encounter the following additional included header:
#include <graphlab/macros_def.hpp>
If you use macros_def.hpp in a header file, it must be paired with a matching #include <graphlab/macros_undef.hpp> at the end of the header file.
The first step to designing a GraphLab program is setting up the data graph. To do this we will need to define the data elements and their dependencies. The primary data element in this simple program is the data at each vertex which records the number of flips so far and the current color of the vertex. Here we will assume the color red is true and the color black is false.
struct vertex_data {
  size_t numflips;
  bool color;
};
GraphLab provides facilities to directly save/load graphs from disk. However, to do so, you must implement save and load functions so that GraphLab can understand your data structures. The serialization mechanism is simple to use and it understands all basic datatypes as well as standard STL containers. If an STL container contains non-basic datatypes (such as a struct), save/load functions must be written for that datatype. If we wanted to be able to save the graph to a file, we would implement the following functions. See Serialization for more details.
struct vertex_data {
  size_t numflips;
  bool color;

  void save(graphlab::oarchive& archive) const {
    archive << numflips << color;
  }
  void load(graphlab::iarchive& archive) {
    archive >> numflips >> color;
  }
};
In this example, we do not need edge data. However GraphLab currently does not have a mechanism to completely disable the use of edge data. Therefore, we will just put an arbitrary small placeholder type on the edges.
typedef char edge_data;
Note that we do not need to write a save/load function here since GraphLab's serializer already understands basic datatypes. One could imagine an alternative problem where edge weights are associated with each pair of interacting coins.
The GraphLab graph is templatized over the vertex data as well as the edge data. Here we define the type of the graph using a typedef for convenience.
typedef graphlab::graph<vertex_data, edge_data> graph_type;
Since GraphLab is heavily templatized and can be inconvenient to use in its standard form, the graphlab::types structure provides convenient typedef "shortcuts" to the other GraphLab types.
typedef graphlab::types<graph_type> gl;
Rather than needing to spell out the full template instantiation:

graphlab::iscope<graph_type>* scope;

we can use the simpler syntax:
gl::iscope* scope;
The next step in constructing a GraphLab program is to construct the actual graph. The core graph data structure is documented here: graphlab::graph . To simplify the presentation we will define a function which takes a reference to a graph and populates it. Later we will show how to construct the empty graph. The init_graph function takes an additional argument which describes the number of coins along each dimension of the grid.
void init_graph(graph_type& g, size_t dim) {
We first create d * d vertices. We use the graph's gl::vertex_id_t add_vertex(vertex_data) method, which takes the vertex data as input and returns the vertex id of the new vertex. The ids are guaranteed to be sequentially numbered. The graph data structure behaves like an STL container and stores the vertex data by value.
for (size_t i = 0; i < dim * dim; ++i) {
  // create the vertex data, randomizing the color
  vertex_data vdata;
  vdata.numflips = 0;
  // Flip a uniform coin to obtain the initial color
  if (graphlab::random::rand_int(1) == 1) vdata.color = true;
  else vdata.color = false;
  // create the vertex
  g.add_vertex(vdata);
}
Now we add all the edges. To add edges we use the graph's gl::edge_id_t add_edge(src, target, edgedata) function, which creates an edge from src to target with the edge data given by edgedata, and returns the id of the new edge. The ids are guaranteed to be sequentially numbered. GraphLab does NOT support duplicated edges, and currently has no facilities for checking for accidental duplicated edge insertions at the graph construction stage (it is quite costly to do so). Any duplicated edges will result in an assertion failure at the later finalize stage. Furthermore, the current version does not support vertex or edge removal. These constraints are imposed to enable the efficient construction of massive graphs while retaining fast look-up. For more details about the graph data structure see graphlab::graph
edge_data edata;
for (size_t i = 0; i < dim; ++i) {
  for (size_t j = 0; j < dim - 1; ++j) {
    // add the horizontal edges in both directions
    g.add_edge(dim * i + j, dim * i + j + 1, edata);
    g.add_edge(dim * i + j + 1, dim * i + j, edata);
    // add the vertical edges in both directions
    g.add_edge(dim * j + i, dim * (j + 1) + i, edata);
    g.add_edge(dim * (j + 1) + i, dim * j + i, edata);
  }
}
The above block of code connects all the vertices in the grid pattern illustrated in Figure 1(a). Now that the graph is fully constructed, we need to call graphlab::graph::finalize:
g.finalize();
} // end of init_graph function
The finalize function reorders the vertex adjacency tables so that the in_edge_ids(vertex_id_t) returns edges in order of the source vertex id and the out_edge_ids(vertex_id_t) returns edges in order of the target vertex id. This sorting also enables O(log(degree)) edge retrieval.
Now we define the update function, which represents the basic block of computation in this program. The update function is applied to each vertex and has read/write access to the data at that vertex, as well as all adjacent edges and vertices. You may specify more than one update function, but we only need one for this application. Let's first outline the computation that will occur in the update function:

1. Count the number of red neighbors.
2. Draw the new color: red with probability equal to the proportion of red neighbors.
3. If the color changed, increment the flip counter and store the new color.
4. Schedule new tasks: the neighbors if the color changed, and this vertex again if the draw was not deterministic.
Each update function call has the option of inserting new tasks into the scheduler: in this case, itself and its neighbors. The algorithm terminates when there are no tasks remaining. There are other methods for terminating execution, such as registering a termination evaluator with the engine, but we are not going to describe that here. We now present the entire update function and then discuss each part individually.
void update_function(gl::iscope& scope, gl::icallback& scheduler) {
  // Get a reference to the vertex data and in edges
  vertex_data& curvdata = scope.vertex_data();
  gl::edge_list in_edges = scope.in_edge_ids();

  // Count the number of red neighbors
  size_t num_red_neighbors = 0;
  for (size_t i = 0; i < in_edges.size(); ++i) {
    // eid is the current edge id
    size_t eid = in_edges[i];
    size_t sourcev = scope.source(eid);
    const vertex_data& nbrvertex = scope.neighbor_vertex_data(sourcev);
    if (nbrvertex.color) ++num_red_neighbors;
  }

  // get the total number of neighbors we have
  size_t num_neighbors = in_edges.size();

  // Draw the new color
  bool new_color =
      graphlab::random::rand01() < (double(num_red_neighbors) / num_neighbors);

  // Determine if the draw was deterministic
  bool is_deterministic =
      num_neighbors == num_red_neighbors || num_red_neighbors == 0;

  // see if I flip and update the current vertex data.
  bool color_changed = new_color != curvdata.color;
  if (color_changed) ++curvdata.numflips;

  // Assign the new color
  curvdata.color = new_color;

  // If I flipped, all my neighbors could be affected; loop through
  // all my neighboring vertices and add them as tasks.
  if (color_changed) {
    for (size_t i = 0; i < in_edges.size(); ++i) {
      size_t sourcev = scope.source(in_edges[i]);
      scheduler.add_task(gl::update_task(sourcev, update_function), 1.0);
    }
  }

  // Reschedule myself if this was not a deterministic draw
  if (is_deterministic == false) {
    scheduler.add_task(gl::update_task(scope.vertex(), update_function), 1.0);
  }
}
All update functions must have the following form:
void update_function(gl::iscope& scope,
gl::icallback& scheduler) {
The parameters are:
- scope: provides access to the data associated with the vertex, its neighbors, and any inbound or outbound edges. Vertex ids are sequential integers in the range 0 to |num_vertices| - 1, and similarly for edges. All edges are directed.
- scheduler: a callback used to insert new update tasks into the scheduler.

First we get a mutable reference to the vertex data on this vertex.
vertex_data& curvdata = scope.vertex_data();
Note that the function scope.vertex_data() returns a reference to the vertex data. Modifications to this reference will directly modify the data on the graph. Then we get a constant reference to the vector of edge ids for all edges inbound to this vertex:
const std::vector<gl::edge_id_t>& in_edges = scope.in_edge_ids();
We now compute the total number of red neighbors. This is done by looping through all my neighboring vertices and counting the number of red vertices. To do this I have to look at my neighboring vertices' data. The neighbor_vertex_data(vid) function allows me to read the vertex data of a vertex adjacent to the current vertex. If we had edge data, the function edge_data(eid) would return a reference to the edge data on the edge eid. Since I am not going to change this data, I can just grab a const reference. You should always try to use const references whenever you know that you will definitely not be changing the data, since GraphLab could make additional optimizations for "read-only" operations. Similarly, you should never cast a constant reference to a regular reference; modifying data through such a cast results in undefined behavior.
// Count the number of red neighbors
size_t num_red_neighbors = 0;
for (size_t i = 0; i < in_edges.size(); ++i) {
  size_t eid = in_edges[i];
  size_t sourcev = scope.source(eid);
  const vertex_data& nbrvertex = scope.neighbor_vertex_data(sourcev);
  if (nbrvertex.color) ++num_red_neighbors;
}
Now we can decide on the new color of the vertex. We either match the majority color, or in the case of no majority color, we flip a coin. Extensive random number support is provided through graphlab::random. random::rand01() provides a random floating point number between 0 and 1. A number of other distribution-based generators are available in Random Number Generators.
// Draw the new color
bool new_color =
    graphlab::random::rand01() < (double(num_red_neighbors) / num_neighbors);
Now that we have decided on the new color of the vertex, we can go ahead and update the value of the vertex. Once again recall that curvdata is a reference to the actual graph data, so we can just modify it directly. Here we will track whether the flip was deterministic and whether a new color was chosen.
// Determine if the draw was deterministic
bool is_deterministic =
    num_neighbors == num_red_neighbors || num_red_neighbors == 0;

// see if I flip and update the current vertex data.
bool color_changed = new_color != curvdata.color;
if (color_changed) ++curvdata.numflips;

// Assign the new color
curvdata.color = new_color;
Now for the task creation algorithm. There are two basic cases:

1. If my color changed, all of my neighbors may be affected, so I schedule an update on each of them.
2. If my draw was not deterministic, I reschedule myself, since another draw could flip my color again.

The first case is handled as follows:
if (color_changed) {
  for (size_t i = 0; i < in_edges.size(); ++i) {
    size_t sourcev = scope.source(in_edges[i]);
    scheduler.add_task(gl::update_task(sourcev, update_function), 1.0);
  }
}
Now for the second case: if I chose my color based on a random draw, then updating myself again could flip my color once more. Therefore I should reschedule myself to run again in the future.
  // Reschedule myself if this was not a deterministic draw
  if (is_deterministic == false) {
    scheduler.add_task(gl::update_task(scope.vertex(), update_function), 1.0);
  }
} // end of update_function
The shared variable system serves two roles. First, it provides access to data which is globally accessible from all update functions. Second, shared variables provide the capability to compute aggregations of all graph data. The first capability may not appear useful in the shared memory setting, since one could simply define global variables. However, using the shared data manager allows GraphLab to manage these "global variables" in a platform-independent fashion (such as in the distributed setting) and leads to more portable code. Additionally, the shared variable system uses an RCU mechanism to ensure safe access to shared data.
For this application, we are interested in having an incremental counter which provides the total number of flips executed so far, as well as computing the proportion of red vertices in the graph. We can achieve these tasks using the shared data Sync mechanism.
The Sync mechanism allows you to build a 'Fold / Reduce' operation across all the vertices in the graph, and store the results in a shared variable.
We will therefore define the following 3 shared variables:
gl::glshared_const<size_t> NUM_VERTICES;
gl::glshared<double> RED_PROPORTION;
gl::glshared<size_t> NUM_FLIPS;
The number of vertices in the graph is a constant, and will not be changed throughout execution of the GraphLab program. We simply set its value using the set() function.
NUM_VERTICES.set(DIM * DIM);
To get the value of a shared variable:
size_t numvertices = NUM_VERTICES.get_val();
A sync is defined by a pair of functions: a reducer and an apply. The reducer is exactly a fold over all the vertices, and the apply takes the final value at the end of the reduce and performs whatever transformation it needs before writing it into the shared variable. For instance, an L2 norm can be computed by having the reducer add the squares of the values at each vertex, and the apply function perform the square root.
We will use this to implement the RED_PROPORTION sync. The reducer will count the number of red vertices, and the apply function will then divide the result by the value in the NUM_VERTICES shared variable. The reducer is a function of the form:
void reduce_red_proportion(gl::iscope& scope,
                           graphlab::any& accumulator) {
In this reducer, we will simply increment the accumulator if the color of the vertex is red.
  if (scope.vertex_data().color) accumulator.as<double>()++;
}
The apply function takes the following two parameters:
void apply_red_proportion(graphlab::any& current_data,
                          const graphlab::any& new_data) {
The reduced result in new_data will be the count of the number of red vertices. We can get the total number of vertices from NUM_VERTICES, and compute the proportion of red vertices by dividing the number of red vertices by the total number of vertices.
  size_t numvertices = NUM_VERTICES.get();
  // new_data is the reduced result, which is the number
  // of red vertices
  double numred = new_data.as<double>();
  // compute the proportion
  double proportion = numred / numvertices;
  // here we can output something as a progress monitor
  std::cout << "Red Proportion: " << proportion << std::endl;
  // write the final result into the shared variable
  current_data.as<double>() = proportion;
}
Now that both reduce and apply functions have been defined, we can create the sync, by calling in the beginning of the program
core.set_sync(RED_PROPORTION,
reduce_red_proportion,
apply_red_proportion,
double(0),
100);
Here we have set up the RED_PROPORTION sync to fold reduce_red_proportion over all the vertices, starting from the value double(0), with the final result transformed and written by apply_red_proportion. The operation will be performed approximately every 100 updates.
GraphLab provides a number of predefined syncing operations which allow simple reduce/apply pairs, such as sums and sums of squares, to be implemented very quickly. We will implement the NUM_FLIPS entry using one of these predefined operations. Since the vertex data could be any arbitrary type, the predefined operations typically require the user to provide a simple function which extracts the information of interest from the vertex data. In this case, we are interested in the numflips field.
size_t get_flip(const vertex_data& v) { return v.numflips; }
To create the sync, we use the set_sync function as well, but using functions from graphlab::glshared_sync_ops and graphlab::glshared_apply_ops. In this case, our reduction function is a simply "sum", while our apply function should do nothing more than copy the result of the reduction into the shared variable.
core.set_sync(NUM_FLIPS,
gl::glshared_sync_ops::sum<size_t, get_flip>,
gl::glshared_apply_ops::identity<size_t>,
size_t(0),
100);
We can now write a simple init_shared_data function to take a reference to a core and initialize all the shared variables.
void init_shared_data(gl::core& core, size_t dim) {
  NUM_VERTICES.set(dim * dim);
  core.set_sync(RED_PROPORTION,
                reduce_red_proportion,
                apply_red_proportion,
                double(0),
                100);
  core.set_sync(NUM_FLIPS,
                gl::glshared_sync_ops::sum<size_t, get_flip>,
                gl::glshared_apply_ops::identity<size_t>,
                size_t(0),
                100);
}
A merge operation can also be provided which will allow the sync operation to be parallelized. This is highly recommended.
If a merge function is defined for a sync, the set of vertices will be partitioned into a collection of disjoint sets. The sync function is then performed on each set in parallel, producing a number of partial results. These partial results are then combined with the merge function. Finally, the apply function is executed on the final result and written to the shared variable.
Since the intermediate result of the sync is simply a count of the number of red vertices seen so far, the merge is just a sum.
void merge_red_proportion(graphlab::any& target,
                          const graphlab::any& source) {
  target.as<double>() += source.as<double>();
}
To sync using the merge operation, the set_sync is called using:
core.set_sync(RED_PROPORTION,
reduce_red_proportion,
apply_red_proportion,
double(0),
100,
merge_red_proportion);
Similarly, a number of pre-defined merge functions are defined in graphlab::glshared_merge_ops, and the NUM_FLIPS sync can be defined in the same way.
core.set_sync(NUM_FLIPS,
gl::glshared_sync_ops::sum<size_t, get_flip>,
gl::glshared_apply_ops::identity<size_t>,
size_t(0),
100,
gl::glshared_merge_ops::sum<size_t>);
Finally, the main function, where everything begins. Here we will demonstrate the minimal code needed to start a GraphLab job using all the parts we defined above. We will use the GraphLab command line tools to set up a basic engine from command line options. We first present the complete main and then discuss each of the parts.
int main(int argc, char* argv[]) {
  // Parse command line options
  graphlab::command_line_options opts;
  size_t dimensions = 20;
  opts.attach_option("dim", &dimensions, dimensions,
                     "the dimension of the grid");
  opts.scheduler_type = "fifo";
  opts.scope_type = "edge";
  if (!opts.parse(argc, argv)) return EXIT_FAILURE;

  // Create the core which contains the graph and engine
  gl::core glcore;
  // Initialize engine with command line options
  glcore.set_engine_options(opts);

  // Initialize the data structures
  init_graph(glcore.graph(), dimensions);
  init_shared_data(glcore, dimensions);

  // Add all starting tasks
  glcore.add_task_to_all(update_function, 1.0);

  // Run the graphlab engine
  double runtime = glcore.start();

  // Output the results
  std::cout << "Completed in " << runtime << " seconds" << std::endl;
  glcore.sync_now(NUM_FLIPS);
  glcore.sync_now(RED_PROPORTION);

  // now we can read the values using the get_val() function
  size_t numberofflips = NUM_FLIPS.get_val();
  double redprop = RED_PROPORTION.get_val();
  std::cout << "Number of flips: " << numberofflips << std::endl;
  std::cout << "Red prop: " << redprop << std::endl;

  // output the graph
  size_t ctr = 0;
  for (size_t i = 0; i < dimensions; ++i) {
    for (size_t j = 0; j < dimensions; ++j) {
      std::cout << size_t(glcore.graph().vertex_data(ctr).color) << " ";
      ++ctr;
    }
    std::cout << std::endl;
  }
}
Since the GraphLab engine can take many options, we have built in some command line parsing tools. In this program we add an additional command line argument "dim", which specifies the size of the grid, with a default value of 20. In addition, we set the default scheduler to "fifo" and the default scope type to "edge". If the user provides different values, the defaults are replaced. The call to opts.parse(argc, argv) invokes the parser and fills in the fields.
The gl::core object bundles an empty graph, and engine configuration into a single object. The core.set_engine_options(opts) takes the engine options from the command line and uses them to configure the internal engine. The graph can be retrieved from the glcore by calling the glcore.graph() function. The glcore.add_task_to_all function adds an update task to each vertex with the desired update function and priority value.
Finally, the engine is run by calling glcore.start(), which runs until there are no more tasks remaining or until a termination condition is reached.
After compilation,
./demo --help
will produce a list of all the available options.
GraphLab program.:
  --help                   Print this help message.
  --dim arg (=20)          the dimension of the grid
  --ncpus arg (=2)         Number of cpus to use.
  --engine arg (=async)    Options are {async, async_sim, synchronous}
  --affinities arg (=0)    Enable forced assignment of threads to cpus
  --schedyield arg (=1)    Enable yielding when threads conflict in the
                           scheduler.
  --scope arg (=edge)      Options are {none, vertex, edge, full}
  --metrics arg (=basic)   Options are {none, basic, file, html}
  --schedhelp arg          Display help for a particular scheduler.
  --scheduler arg (=fifo)  Supported schedulers are: chromatic, sweep, fifo,
                           priority, multiqueue_fifo, multiqueue_priority,
                           splash, round_robin, clustered_priority, sampling.
                           To see options for each scheduler, run the program
                           with the option --schedhelp=[scheduler_name]
Observe that the dim
option defined by the attach_option() call in the main function appears as an available option.
The GraphLab command line permit quite flexible manipulation of the scheduler capabilities and options through the command line using the --scheduler arg
options. Running --schedhelp
displays all the available scheduler options.
When this tutorial was written, the output of running ./demo --schedhelp
is
chromatic scheduler
--------------------------------------------------
a scheduler which performs #iterations sweeps of the graph using a graph
color ordering.
Options:
  max_iterations = [integer, default = 0]
  update_function = [update_function_type, default = set on add_task]

sweep scheduler
--------------------------------------------------
very fast dynamic scheduler. Scans all vertices in sequence, running all
update tasks on each vertex evaluated.
Options:
  ordering = [string: linear/permute, default = linear]

fifo scheduler
--------------------------------------------------
Standard FIFO task queue, poor parallelism, but task evaluation sequence
is highly predictable. Useful for debugging and testing.
Options:

priority scheduler
--------------------------------------------------
Standard Priority queue, poor parallelism, but task evaluation sequence
is highly predictable. Useful for debugging
Options:

multiqueue_fifo scheduler
--------------------------------------------------
One or more FIFO task queues is assigned to each processor, where the
queues are stochastically load balanced. Like the fifo scheduler, but
less predictable, and much faster.
Options:

multiqueue_priority scheduler
--------------------------------------------------
One or more Priority task queues is assigned to each processor, where
the queues are stochastically load balanced. Like the priority
scheduler, but less predictable, and much faster.
Options:

splash scheduler
--------------------------------------------------
Similar to the priority queue scheduler, but allows for only one update
function. Updates are evaluated in a "splash" ordering
Options:
  splash_size = [integer, default = 100]
  update_function = [update_function_type, default = set on add_task_to_all]

round_robin scheduler
--------------------------------------------------
Loops over a sequence of tasks repeatedly for # iterations.
Options:
  max_iterations = [integer, default = 0]
  start_vertex = [integer, default = 0]

clustered_priority scheduler
--------------------------------------------------
Like the priority scheduler, but groups vertices into clusters where the
entire cluster has a single priority
Options:
  partition_method = [string: metis/random/bfs, default = metis]
  vertices_per_partition = [integer, default = 100]

sampling scheduler
--------------------------------------------------
A scheduler which samples vertices to update based on a multinomial
probability which can be updated dynamically.
This lists all the available schedulers and the available options for each scheduler. For instance: to run the demo process using the sweep scheduler with a randomly permuted ordering:
./demo --scheduler="sweep(ordering=permute)"
Additional scheduler options are separated with a comma. For instance, to run with the clustered priority scheduler using random partitioning and 50 vertices per partition:
./demo --scheduler="clustered_priority(partition_method=random,vertices_per_partition=50)"
When an update function is executed on a vertex, it can access all graph data on adjacent edges and adjacent vertices. The different scoping consistency models provide different data consistency guarantees when accessing graph data. There are three scoping models, vertex, edge, and full.
The user should try to pick the lowest consistency model which satisfies the needs of the algorithm. For instance, in this demo application, since the update function only requires reading of neighboring vertex data, the edge consistency model is sufficient to guarantee sequential consistency, and the algorithm is therefore guaranteed to be correct (assuming GraphLab is bug-free) if executed with the edge consistency model or the full consistency model.
Note that the sync operation is guaranteed to be sequentially consistent.
GraphLab provides eight schedulers. The Synchronous scheduler, the Round-robin scheduler, five task schedulers, and the Splash scheduler.
All five task schedulers behave similarly in the demo application, but each has a different set of scheduling guarantees.