GraphLab

A New Parallel Framework for Machine Learning

Datasets and Benchmarks

Please note that this webpage is deprecated: the instructions are for GraphLab v1. See the newer GraphLab instructions.

This page lists datasets that can be used for benchmarking GraphLab on your system.

Datasets

Dataset name: Yahoo! KDD CUP 2011 - music rating
Dataset size: 1M users, 600K songs, 260M ratings
GraphLab algorithm: Matrix factorization
Download instructions: Instructions
Credit: Yahoo! KDD CUP

Dataset name: Twitter social graph
Dataset size: 8K x 8K Twitter users, 62M links
GraphLab algorithm: Matrix factorization with sparse factors
Download instructions: Instructions
Credit: Timmy Wilson, smarttypes.org

Dataset name: PlanetLab network flows
Dataset size: 16M x 16M computers, 200M flows
GraphLab algorithm: Bayesian probabilistic tensor factorization
Download instructions:
Download the files netflow2 and netflow2e from here.
Run: ./pmf netflow2e 2 --ncpus=8 --float=true --scheduler="round_robin(max_iterations=10,block_size=1)" --bptf_burn_in=5
Credit: Danny Bickson

Dataset name: audikw_1 - structural problem
Dataset size: ~1M x ~1M, 70M nnz
GraphLab algorithm: Matrix factorization
Download instructions:
1. Download the mat file from here.
2. In MATLAB: "load audikw_1.mat; [i1,i2,i3] = find(Problem.A); save_c_gl_mat('audikw_1', [i1 i2 ones(length(i1),1) i3]);"
3. Run using pmf: ./pmf audikw_1 0 --ncpus=8 --scheduler="round_robin(max_iterations=10)"
Credit: Univ. Florida sparse matrix collection

Dataset name: Bone10 - model reduction problem
Dataset size: 1M x 10M, 50M nnz
GraphLab algorithm: Linear solver: Gaussian BP
Download instructions:
1. Download the mat file from here.
2. In MATLAB: "load bone10.mat; save_c_gl('bone10', Problem.A, ones(length(Problem.A),1),zeros(length(Problem.A),1));"
3. Run using gabp: ./gabp 0 bone10 --ncpus=8 --scheduler="round_robin(max_iterations=10,blocksize=1)" --syncinterval=1000000 --regularization=10000
Credit: Univ. Florida sparse matrix collection

Dataset name: Netflix - collaborative filtering (subset)
Dataset size: 1M x 17K, 3M nnz
GraphLab algorithm: Alternating least squares
Download instructions: Due to copyright, the Netflix data is not available for download. It is recommended to use the KDD CUP data instead.
Credit: Netflix

Dataset name: NPIC 500 Dataset (natural language processing dataset)
Dataset size: 88K noun phrases, 99K contexts, 20M occurrences
GraphLab algorithm: SVD
Download instructions:
1. Download the dataset from here.
2. Extract the tgz file using: "tar xvzf all-pairs-t500-matrix-data-code.tar.gz"
3. Find the file matrix.txt, and add the following two lines at the top:
%%MatrixMarket matrix coordinate real general
88322 99400 20597287
4. Run SVD using: ./pmf matrix.txt 13 --ncpus=8 --matrixmarket=true --max_iter=10
Credit: Tom Mitchell, CMU
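
Step 3 of the NPIC 500 instructions (adding the two Matrix Market header lines at the top of matrix.txt) can be done by hand in an editor, but the file is large, so a small script may be more convenient. The following is a minimal Python sketch, not part of GraphLab: the function name prepend_matrix_market_header is hypothetical, and the header text is copied from the instructions above.

```python
import os
import shutil
import tempfile

# Header lines quoted in the NPIC 500 instructions (banner, then rows cols nnz).
NPIC500_HEADER = (
    "%%MatrixMarket matrix coordinate real general\n"
    "88322 99400 20597287\n"
)


def prepend_matrix_market_header(path, header=NPIC500_HEADER):
    """Insert `header` at the top of `path` unless a banner is already there.

    Hypothetical helper for illustration only; it is not a GraphLab tool.
    Returns True if the file was modified, False if it already had a banner.
    """
    with open(path, "r") as src:
        first_line = src.readline()
        if first_line.startswith("%%MatrixMarket"):
            return False  # header already present, leave the file untouched
        src.seek(0)
        # Stream into a temporary file in the same directory so that a large
        # matrix never has to be held in memory all at once.
        directory = os.path.dirname(os.path.abspath(path))
        with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as dst:
            dst.write(header)
            shutil.copyfileobj(src, dst)
            tmp_name = dst.name
    # Swap the new file into place once both handles are closed.
    os.replace(tmp_name, path)
    return True
```

Calling prepend_matrix_market_header("matrix.txt") once in the extracted directory should leave the file ready for step 4; a second call is a no-op because the banner line is detected.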

Bigger Datasets (above half a billion non-zeros)

Dataset name: Wikipedia term occurrences dataset
Dataset size: 4.3M terms, 3.3M documents, 513M occurrences
GraphLab algorithm: SVD
Download instructions: Download the file medwiki.gz
Credit: Contributed by Andrew Onley, The University of Memphis.

Dataset name: Wikipedia term occurrences dataset
Dataset size: 40K terms, 10M documents, 689M occurrences
GraphLab algorithm: SVD
Download instructions: Download the file bigwiki.gz
Credit: Contributed by Jamie Callan, Brian Murphy, and Partha Talukdar, CMU.

Dataset name: Wikipedia term occurrences dataset
Dataset size: 40K terms, 50M documents, 3.3G occurrences
GraphLab algorithm: SVD
Download instructions: Download the file hugewiki.gz
Credit: Contributed by Jamie Callan, Brian Murphy, and Partha Talukdar, CMU.

Dataset name: Mouse Visual Cortex
Dataset size: 26K x 21K image (572M non-zeros)
GraphLab algorithm: Spectral clustering
Download instructions: Original data here. Download the Matrix Market format file mouse_brain from here.
Credit: Contributed by Joshua Vogelstein, OpenConnectToMe Project, Johns Hopkins University.

Dataset name: Twitter graph
Dataset size: 41M nodes, 1.4 billion edges
GraphLab algorithm: K-cores
Download instructions: Download instructions
Credit: Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon

Benchmarks

Algorithm: BPTF
Dataset: KDDCUP
Command string: omplace -nt 32 ./pmf kddcup 2 --scheduler="round_robin(max_iterations=50,block_size=1)" --float=true --zero=true --maxval=1 --minval=0 --scalerating=100 --burn_in=5 --ncpus=32 --outputvalidation=true --D=20
Num CPUs: 32
Total running time: 10573.402626
Platform: BlackLight
Train accuracy: 0.2146
Test accuracy: 0.2221

Algorithm: SGD
Dataset: KDDCUP
Command string: omplace -nt 16 ./pmf kddcup 6 --sgd_lambda=0.0025 --sgd_gamma=2e-2 --float=true --zero=true --ncpus=16 --scheduler="round_robin(max_iterations=100)" --sgd_step=0.999 --aggregatevalidation=true --scalerating=100 --minval=0 --maxval=1 --D=15
Num CPUs: 16
Total running time: 11042.167419
Platform: BlackLight
Train accuracy: 0.2065
Test accuracy: 0.1721

Algorithm: WALS
Dataset: KDDCUP
Command string: omplace -nt 16 ./pmf kddcup 9 --ncpus=16 --scheduler="round_robin(max_iterations=25,block_size=1)" --float=true --zero=true --scalerating=100 --scaling=5000 --truncating=2276 --maxval=1 --minval=0 --scope=null --D=18 --lambda=0.0001 --aggregatevalidation=true
Num CPUs: 16
Total running time: 4858
Platform: BlackLight
Train accuracy: 0.1426
Test accuracy: 0.1125

Algorithm: BPTF
Dataset: NETFLIX
Command string: omplace -nt 16 ./mkl_seq netflix-r 2 --float=false --ncpus=16 --scheduler="round_robin(max_iterations=10,block_size=1)" --burn_in=10 --minval=1 --maxval=5 --D=30
Num CPUs: 16
Total running time: 534
Platform: BlackLight + Intel MKL
Train accuracy: 0.8424
Test accuracy: 0.9659

Algorithm: BPTF
Dataset: NETFLIX
Command string: omplace -nt 16 ./pmf netflix-r 2 --float=false --ncpus=16 --scheduler="round_robin(max_iterations=10,block_size=1)" --burn_in=10 --minval=1 --maxval=5 --D=30
Num CPUs: 16
Total running time: 2517
Platform: BlackLight
Train accuracy: 0.8434
Test accuracy: 0.9369

Algorithm: ALS
Dataset: NETFLIX
Command string: ./pmf netflix-r 0 --scheduler="round_robin(max_iterations=10,block_size=1)" --float=false --lambda=0.065 --ncpus=8
Num CPUs: 8
Total running time: 283
Platform: Intel(R) Xeon(R) 8 x CPU X5550 @ 2.67GHz (using Eigen)
Train accuracy: 0.7982
Test accuracy: 0.9326

Algorithm: BPTF
Dataset: NETFLIX
Command string: ./pmf netflix-r 2 --scheduler="round_robin(max_iterations=10,block_size=1)" --float=false --lambda=0.065 --ncpus=8
Num CPUs: 8
Total running time: 447
Platform: Intel(R) Xeon(R) 8 x CPU X5550 @ 2.67GHz (using Eigen)
Train accuracy: 0.8202
Test accuracy: 0.9633

Algorithm: SVD
Dataset: NPIC500
Command string: ./pmf matrix.txt 13 --matrixmarket=true --ncpus=16 --max_iter=24 --scope=null
Num CPUs: 16
Total running time: 28
Platform: Intel(R) Xeon(R) 8 x CPU X5550 @ 2.67GHz (using Eigen)
Train accuracy: N/A
Test accuracy: N/A
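
The two 16-CPU BPTF runs on the Netflix data use the same flags and differ only in the binary (./mkl_seq on BlackLight + Intel MKL versus ./pmf on BlackLight), so their total running times can be compared directly. A minimal sketch of that arithmetic, using only the figures listed above; the variable names are illustrative, and the page does not state the time unit.

```python
# Total running times of the two 16-CPU BPTF Netflix runs, copied from the
# benchmark listing (same unit for both; the unit itself is not stated).
time_pmf = 2517.0  # ./pmf on BlackLight
time_mkl = 534.0   # ./mkl_seq on BlackLight + Intel MKL

# Ratio of the slower run to the faster run.
speedup = time_pmf / time_mkl
print(f"Intel MKL build is about {speedup:.1f}x faster")  # about 4.7x
```

The reported accuracies of the two runs differ slightly (test 0.9659 versus 0.9369), so the ratio describes running time only.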

Acknowledgements

  • Thanks to Abhay Harpale, CMU for providing the NPIC500 dataset.