First we need to create C++ file for our application and configure it for compilation. This is easy.

**1**. Create directory pagerank under directory apps: cd apps mkdir pagerank

**2**. Create file CMakeLists.txt in the new directory and add following lines to it:

project(GraphLab) add_graphlab_executable(pagerank pagerank.cpp)

**3**. Then create file pagerank.cpp with following content:

/* PageRank tutorial application. pagerank.cpp For description of the PageRank algorithm, see Wikipedia article http://en.wikipedia.org/wiki/Pagerank */ #include <string> #include <graphlab.hpp> #include <graphlab/macros_def.hpp> int main(int argc, char** argv) { global_logger().set_log_level(LOG_INFO); global_logger().set_log_to_console(true); logger(LOG_INFO, "PageRank starting\n"); // Setup the parser graphlab::command_line_options clopts("Run the PageRank algorithm."); // Create a graphlab core gl_types::core core; // Parse arguments if(!clopts.parse(argc, argv)) { std::cout << "Error in parsing input." << std::endl; return EXIT_FAILURE; } // Set the engine options core.set_engine_options(clopts); // Create a synthetic graph //create_graph(core.graph()); // Schedule all vertices to run pagerank update on the // first round. // core.add_task_to_all(pagerank_update, 100.0); // Run the engine double runtime = core.start(); return EXIT_SUCCESS; }

Above is just the bare skeleton for the application, and contains the code for initializing graphlab. We will soon add more code to include the actual functionality.

**4**. Then finally, we need to finish the setup and check we can compile our new app. Go to the root of graphlab and run:

./configure

Then, to compile the new app go to directory debug/apps/pagerank and run `make`

. You should compilation errors, but do not worry about them now!

The objective of PageRank algorithm is to assign a score of the "importance" of each webpage (or scientific article). The score (Pagerank) of each page depends on the pageranks of pages *linking to it*. In essence, the PageRank of a page is a weighted sum of pageranks of articles linking to it.

That is, each page gives 1/N of its pagerank to each page its links to, where N is the number of links from the page. This can be interpreted by thinking that there is probablity of 1/N to move to any of the linked pages. In addition, for a more realistic ranking, a certain weight, depending on a *damping factor* is given to all other pages in the Web. This models the situation when surfer moves to a page not linked from a page, for example by selecting a bookmark.

In this Tutorial, we implement PageRank inference algorithm using a *power iteration method*. At each iteration, we set the pagerank of each page to be the weighted average of pageranks of pages linking to the page, plus the fraction of possibility to moving to a random page. Algorithm is terminated when the pageranks change less than a predefined threshold. The exact algorithm will become clear when we implement the **update function**.

For more information, please consult the Wikipedia article about PageRank ( http://en.wikipedia.org/wiki/PageRank )

The data model of this algorithm is readily in the form of a graph. Each page is represented by a **vertex**, of which data contains simply the current estimate of the rank. In addition, since GraphLab graphs do not support self-edges, we store the weight of the possible self-edge as a field in the vertex.

Each page has an outgoing **edge** to each vertex representing a page it has a link to. **Edge data** consists of a weight-attribute, which is 1/N for each outgoing edge (N was the number of outgoing edges of a vertex).

In addition, each edge contains a field for storing the *previous pagerank* used from the *neighboring* vertex. This is used for computing the change of values between iterations, and is used to determined when to terminate. As an example, let there be a link between vertex 1 and 3. At time T, vertex 3 is updated and it reads the pagerank of vertex 1, which is `0.5`

. It writes this value to the edge from 1 to 3. Then, at timestep T+k, vertex 1 is updated and it computes new pagerank `0.6`

. Since `|0.6-0.5| = 0.1`

is larger than our predefined threshold (0.00001 for example), the update function asks the scheduler to update vertex 3 again.

Ok, we are now ready to define our graph data types and create a simple graph. Later on, we give also code and data for a larger graph to be loaded from a file.

**1.** First we define structs for our vertex and edge datatypes. Add following code before the ` main()`

Function:

// // Stores the pagerank and the self weight // struct vertex_data { float value; float self_weight; vertex_data(float value = 1) : value(value), self_weight(0) { } }; // // Edge data represents the weight as well as the weight times the // last value of the source vertex when the target value was computed. // struct edge_data { float weight; float old_source_value; edge_data(float weight) : weight(weight), old_source_value(0) { } };

**2.** Since GraphLab uses C++ templates for convenience and for avoiding casting errors, we need to tell the compiler the type of our graph, which we name `pagerank_graph`

. In addition, we declare that all GraphLab classes will be templated based on the `pagerank_graph`

. Add following code after the previous definitions of vertex and edge data structures.

// The type of graph used in this program typedef graphlab::graph<vertex_data, edge_data> pagerank_graph; // // The collection of graphlab types restricted to the graph type used // in this program. // typedef graphlab::types<pagerank_graph> gl_types;

**3.** To introduce basics of graph creation, we construct a 5-vertex (page) graph by hand. In the following code snippet, function `create_graph()`

first creates five vertices with default vertex data. Vertices will get ids from 0 to 4. After that, a set of edges are created. In addition, the fifth page (id 4) links to all pages, including itself, and thus needs to set the `self_weight`

parameter of the vertex data. Please note that each edge has a weight that is inverse of the number of edges from the vertex it is from.

// Creates simple 5 vertex graph void create_graph(pagerank_graph& graph) { // Create 5 vertices graph.add_vertex(vertex_data()); graph.add_vertex(vertex_data()); graph.add_vertex(vertex_data()); graph.add_vertex(vertex_data()); graph.add_vertex(vertex_data()); // Page 0 links to page 3 only, so weight is 1 graph.add_edge(0, 3, edge_data(1)); // Page 1 links to 0 and 2 graph.add_edge(1, 0, edge_data(0.5)); graph.add_edge(1, 2, edge_data(0.5)); // ... and so on graph.add_edge(2, 0, edge_data(1.0/3)); graph.add_edge(2, 1, edge_data(1.0/3)); graph.add_edge(2, 3, edge_data(1.0/3)); graph.add_edge(3, 0, edge_data(0.25)); graph.add_edge(3, 1, edge_data(0.25)); graph.add_edge(3, 2, edge_data(0.25)); graph.add_edge(3, 4, edge_data(0.25)); graph.add_edge(4, 0, edge_data(0.2)); graph.add_edge(4, 1, edge_data(0.2)); graph.add_edge(4, 2, edge_data(0.2)); graph.add_edge(4, 3, edge_data(0.2)); // and self edge from 4 to 4, which must be handled specially. graph.vertex_data(4).self_weight = 0.2; }

**4.** Then *uncomment* following line from the `main()`

:

```
// create_graph(core.graph());
```

Graphlab object `core`

contains handle to the **graph**, computation **engine** and allows configuration of the runtime. Please see section Introduction for more information.

Now we are ready to write the actual computation, which was described above. To accomplish this, we define an **update function** `pagerank_update`

. Update function operates on one vertex a time. A typical GraphLab update function consists of three steps:

- 1. Read values from (inbound) neighbor vertices and edges; accumulate into a local variable to get a new value.
- 2. Write the new value to the vertex data.
- 3. If new value changed more than a threshold, ask scheduler to update the (outbound) neighbors vertices. Optionally set a priority for the update.

Algorithm will end when no update function decides to add neighbors again. It is important to note, that if a vertex was already scheduled by another neighbor, it is not added again to the task list (we call this *task pruning*).

**1.** First we define two constant parameters for the algorithm: the damping factor and termination threshold. Usually, it is better to use **shared variables** for managing such paramaters, but to keep this Tutorial simple, we "hard-code" them. See the supplementary sections below how to use the SDT for this purpose.

Add following lines somewhere in the beginning of the file.

#define termination_bound 1e-5 #define damping_factor 0.85 // PageRank damping factor

**2.** Now let's add the empty update function, before the `main()`

function:

// // The PageRank update function // void pagerank_update(gl_types::iscope &scope, gl_types::icallback &scheduler) { }

All update functions share the same signature. Note that the parameters are passed as *references*.

**3.** Now, we start by adding functionality to the update function. First we assign a reference of the vertex the scope is operating on to variable `vdata`

:

```
// Get the data associated with the vertex
vertex_data& vdata = scope.vertex_data();
```

**4.** Then we add the main computation loop of the update function. We loop over each *incoming* edge, and read the pagerank value of the associated vertex and multiply it by the weight of the edge. These values are accumulated to local variable `sum`

. Immediatelly after reading the value, we record the read value to the edge. This is later used, when the neighboring vertex is updated, to determine the change between previous used value.

// Start with including contribution of the self-edge float sum = vdata.value * vdata.self_weight; foreach(graphlab::edge_id_t eid, scope.in_edge_ids()) { // Get the neighobr vertex value const vertex_data& neighbor_vdata = scope.const_neighbor_vertex_data(scope.source(eid)); double neighbor_value = neighbor_vdata.value; // Get the edge data for the neighbor edge_data& edata = scope.edge_data(eid); // Compute the contribution of the neighbor double contribution = edata.weight * neighbor_value; // Add the contribution to the sum sum += contribution; // Remember this value as last read from the neighbor edata.old_source_value = neighbor_value; }

**5.** Now we are ready to compute the new PageRank for the vertex (page), which is a weighted sum of the computed `sum`

and the *damping factor*. New value is stored in field `value`

of the vertex.

```
// compute the new pagerank
sum = (1-damping_factor)/scope.num_vertices() + damping_factor*sum;
vdata.value = sum;
```

**6.** Now we need to check if the new value differed from previous value significantly. If it did, we ask the scheduler to update the neighbor vertices. Note that we check the change for each neighbor separately, by computing a weighted *residual* (this is the most efficient way, but requires a lot of memory if the graph is big, because the previous value needs to be stored in each edge separately. Simpler, but less accurate way, is to just compare old and new value of the vertex).

// Schedule the neighbors as needed foreach(graphlab::edge_id_t eid, scope.out_edge_ids()) { edge_data& outedgedata = scope.edge_data(eid); // Compute edge-specific residual by comparing the new value of this // vertex to the previous value seen by the neighbor vertex. double residual = outedgedata.weight * std::fabs(outedgedata.old_source_value - vdata.value); // If the neighbor changed sufficiently add to scheduler. if(residual > termination_bound) { gl_types::update_task task(scope.target(eid), pagerank_update); scheduler.add_task(task, residual); } }

**7.** Finally, we need to kick-start the computation by scheduling each vertex for update. To achieve this, *uncomment* following line from the `main()`

function:

```
// core.add_task_to_all(pagerank_update, 100.0);
```

After the algorithm has converged, we would like to output the results. In addition, to get comparable results, we should normalize the results. The best way would be to use **sync** facility, but for simplicity we do it manually.

**1.** To compute the normalizer (i.e sum of pageranks), please add following code in the end of the `main()`

function, before `return`

:

double norm = 0.0; for(graphlab::vertex_id_t vid=0; vid < core.graph().num_vertices(); vid++) { norm += core.graph().vertex_data(vid).value; }

**2.** Then we simply output normalized pagerank of five first vertices:

for(graphlab::vertex_id_t vid=0; vid < 5 ; vid++) { std::cout << "Page " << vid << " pagerank = " << core.graph().vertex_data(vid).value / norm << std::endl; }

**1.** To compile, do as before and go to `debug/apps/pagerank`

and enter `make pagerank`

. For better code performance, you can instead go to `release/apps/pagerank`

.

**2.** Now you can run the algorithm by simply entering `./pagerank`

.

You should see approximately following output:

Graphlab finished, runtime: 0.000952 seconds. Page 0 pagerank = 0.235752 Page 1 pagerank = 0.165445 Page 2 pagerank = 0.183704 Page 3 pagerank = 0.301708 Page 4 pagerank = 0.11339

**3.** You can also try running with different schedulers, number of threads and consistency settings. With such a small data, you should not expect to see any difference in performance, but it is good to learn how to declare runtime parameters. Here are some examples:

./pagerank --scheduler=multiqueue_fifo ./pagerank --ncpus=1 ./pagerank --ncpus=8 ./pagerank --ncpus=16 ./pagerank --scope=edge ./pagerank --scope=full

Following code reads a dataset for PageRank from a file. The format is explained in the header. To use it, simply replace the call to `create_graph()`

in the main with ` load_graph("mydata.txt", core.graph());`

/* Load a graph file specified in the format: source_id, target_id, weight source_id, target_id, weight source_id, target_id, weight .... The file should not contain repeated edges. */ bool load_graph(const std::string& filename, pagerank_graph& graph) { std::ifstream fin(filename.c_str()); if(!fin.good()) return false; // Loop through file reading each line while(fin.good()) { size_t source = 0; size_t target = 0; float weight = -1; fin >> source; fin.ignore(1); // skip comma fin >> target; fin.ignore(1); // skip comma fin >> weight; // fin.ignore(1); // skip comma // Ensure that the number of vertices is correct if(source >= graph.num_vertices() || target >= graph.num_vertices()) graph.resize(std::max(source, target) + 1); if(source != target) { // Add the edge edge_data edata(weight); graph.add_edge(source, target, weight); } else { // add the self edge by updating the vertex weight graph.vertex_data(source).self_weight = weight; } } graph.finalize(); return true; }

Instead of defining termination threshold and damping factor as hard-coded constants, it is better to manage them with the **Shared Variables**. This makes it easy to pass them from command-line, for example. Following code achieves this:

First define the variables:

```
// Keys for shared data
gl_types::glshared_const<float> DAMPING;
gl_types::glshared_const<float> TERMINATION_BOUND;
```

Then before starting graphlab, assign values for the keys as constants:

DAMPING.set(0.85);

Then in the update function, you can access the values:

```
float damping_factor = DAMPING.get_val();
```

GraphLab uses Boost Program Options package for command line option parsing. You can add your own options, such as `--dampingfactor=xx`

easily. Following code shows an example how to add one custom string parameter for input file and float parameter for the termination bound:

// Setup the parser graphlab::command_line_options clopts("Run the PageRank algorithm on a input file."); clopts.attach_option("infile", &filename, "PageRank input file. In src, dest, weight format."); clopts.attach_option("bound", &termination_bound, termination_bound, "Termination bound for maximum change in vertex values."); clopts.add_positional("infile"); if(!clopts.parse(argc, argv)) { std::cout << "Error in parsing input." << std::endl; return EXIT_FAILURE; } // Set the engine options core.set_engine_options(clopts);

Generated on Tue May 17 2011 15:43:54 for GraphLab: A framework for Parallel Machine Learning by 1.7.1