=== ABOUT ===
This folder includes the source code for the following paper:

Deep Graph Kernels, Pinar Yanardag, SVN Vishwanathan, KDD 2015

Please visit www.mit.edu/~pinary/kdd for more information and direct 
your queries to pinary@mit.edu.

=== LICENSE ===

The scripts in this folder are licensed under the MIT License, a copy of 
which you can find below:

Copyright (c) 2016 Pinar Yanardag

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

=== REQUIRED LIBRARIES ===

Required libraries to run the package: numpy, scipy, networkx, gensim, pynauty.
For Gensim, please see: https://radimrehurek.com/gensim
For pynauty, please see: https://web.cs.dal.ca/~peter/software/pynauty

=== HOW TO GENERATE KERNELS === 

You should have the following folders and files to make the code run:

deep_kernel.py (the kernel functions discussed in the paper)
canonical_maps (these files are required for graphlet kernel)
graphlet_counter_maps (these files are required for graphlet kernel)
datasets (see DATASETS section below for more information)
kernels (the generated kernels will be saved here)

The deep_kernel.py script takes the following parameters, in order:

num_dimensions: the dimensionality of the space in which sub-structures 
are embedded, e.g. if you want to represent each sub-structure as a 
10-dimensional vector, then this parameter should be 10.

kernel_type: if you just want to compute the MLE kernel (without using the 
embeddings), then this parameter should be 3. If you want to use the deep 
kernels, this parameter should be 2, which simply takes the dot product 
between the embeddings.

feature_type: use 1 if you want the graphlet kernel, use 2 if you want 
the shortest-path kernel, or use 3 if you want the Weisfeiler-Lehman kernel.

ds_name: name of the dataset under datasets folder, e.g. mutag or enzymes.

window_size: the size of the window the skip-gram and CBOW models use, e.g. 
if you use 5, then these models will build the model by looking at the 5 
previous words and the 5 next words.

ngram_type: use 1 if you want the skip-gram representation, use 0 if 
you want the CBOW representation.

sampling_type: use 1 if you want hierarchical softmax, use 0 if 
you want negative sampling.

graphlet_size: the number of nodes in the graphlets you want to use, e.g. 
if you use 7, then the code will sample graphlets of size 7 from the graphs. 
Currently, graphlet sizes from 2 to 8 are supported.

sample_size: number of graphlets you want to sample from the dataset, e.g. 
if you use 100, then the code will sample 100 random graphlets.

For ease of use, you can export the parameters as environment variables 
and pass them as follows:

export num_dimensions=10
export kernel_type=3
export feature_type=2
export ds_name=mutag
export window_size=5
export ngram_type=1
export sampling_type=1
export graphlet_size=7
export sample_size=100

python deep_kernel.py $num_dimensions $kernel_type $feature_type $ds_name $window_size $ngram_type $sampling_type $graphlet_size $sample_size

Once you run this script, the .mat-formatted kernel file will be saved under the "kernels" folder. 
This kernel file can then be used directly in an SVM library (see below).
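Before handing a generated kernel to an SVM, it can be worth sanity-checking its shape and symmetry from Python with scipy.io. The sketch below uses a toy matrix stand-in rather than a real generated file; the "kernel" field name is an assumption taken from the `k_mle.kernel` access in the Matlab example later in this README.

```python
import numpy as np
from scipy.io import savemat, loadmat

# Toy stand-in for a generated kernel: a small symmetric PSD matrix.
X = np.random.rand(5, 3)
K = X @ X.T

# "kernel" is the assumed field name, matching the k_mle.kernel access
# in the Matlab example in this README.
savemat("toy_kernel.mat", {"kernel": K})

# Load it back and sanity-check shape and symmetry before feeding it to an SVM.
data = loadmat("toy_kernel.mat")
K_loaded = data["kernel"]
print(K_loaded.shape)                     # (5, 5)
print(np.allclose(K_loaded, K_loaded.T))  # True
```

A non-symmetric or wrongly sized matrix at this point usually means the wrong file or field was loaded.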

=== HOW TO RUN SVM AFTER GENERATING THE KERNELS ===

We use the following Matlab toolbox from Shervashidze et al., which provides 
various scripts that take a kernel matrix and output prediction 
performance: http://cbio.ensmp.fr/~nshervashidze/code/Graphkernels/graphkernels.zip

Here is example Matlab code showing how to load a kernel generated by deep_kernel.py:

k_mle = load('kernels/graphlet_kernel_mutag.mat');
label = load('kernels/mutag_labels.mat'); % the labels in Matlab format are generated by the deep_kernel.py script
label = label.label;
r_mle = runntimes(k_mle.kernel, label', 10); % run the experiment 10 times and report the average accuracy

An important note: 
the runntimes.m script in the Matlab toolbox from Shervashidze et al. does not have a fixed 
random seed, so it outputs a different result every time you run the runntimes() function. In 
order to have a consistent experimental setup, you need to add the following line right after 
the function definition in runntimes.m:

rand('seed', 666) % we used 666 as our seed in the experiments. 

=== DATASETS === 
Please visit www.mit.edu/~pinary/kdd to download the datasets 
used in the paper, and extract them under the "datasets" folder.

Each dataset is a dictionary saved as a pickle. The dictionary of the dataset 
looks as follows:

dataset = {"graph": graphs, "labels": labels}

labels: this is an array where each element corresponds to the class of an individual graph. The class 
labels are expected to be integers.

graphs: this is a dictionary where each key is the index number of a graph. So if the dataset has 
600 graphs, then this dictionary has 600 keys, ranging from 0 to 599. 
Each graph in this dictionary points to another dictionary that represents all the nodes in the graph, 
where the nodes themselves also have index numbers from 0 to max_number_of_nodes. For instance, 
accessing the node indexed "nidx" in the graph indexed "gidx" reveals the following dictionary:

graph_data[gidx][nidx] = {"neighbors": list_of_neighbors, "label": label_of_the_node}

where list_of_neighbors is an array listing the ids of all other nodes that node "nidx" 
has an edge to, and label_of_the_node is the label associated with the node. Node labels 
are not used in the graphlet kernel since it works on unlabeled graphs.
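To run the code on your own data, a dataset pickle matching the format above can be built as follows. This is a minimal sketch: the file name is illustrative, the graphs are toy examples, and real datasets such as mutag store neighbors as numpy arrays rather than plain lists.

```python
import pickle

# Two toy graphs in the format described above: "graphs" is a dict keyed by
# graph index; each graph maps node index -> {"neighbors": ..., "label": ...}.
graphs = {
    0: {  # a triangle
        0: {"neighbors": [1, 2], "label": (0,)},
        1: {"neighbors": [0, 2], "label": (0,)},
        2: {"neighbors": [0, 1], "label": (1,)},
    },
    1: {  # a single edge
        0: {"neighbors": [1], "label": (0,)},
        1: {"neighbors": [0], "label": (1,)},
    },
}
labels = [1, -1]  # one integer class label per graph

dataset = {"graph": graphs, "labels": labels}

# Illustrative file name; place real dataset files under the "datasets" folder.
with open("toy.graph", "wb") as f:
    pickle.dump(dataset, f)
```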

As an example, here is how the dictionaries look for the mutag dataset (which can be found 
under datasets/mutag.graph).

# load the data
>>> import pickle
>>> f = open("datasets/mutag.graph", "rb")
>>> data = pickle.loads(f.read())
>>> f.close()

# the fields this data has
>>> data.keys()
['graph', 'labels']

# let's look at the labels array
>>> data['labels']
array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1], dtype=int16)

# these are the graph indices (mutag has 188 graphs)
>>> data['graph'].keys()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187]

# let's access the first graph (indexed by 0) and its first node (indexed by 0)
# here, [1, 13] are the two nodes that node 0 is connected to, and 'label' holds 
# the discrete label associated with the node (in this case, 3).
>>> data['graph'][0][0]
{'neighbors': array([ 1, 13], dtype=uint8), 'label': (3,)}
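Since networkx is already a required library, a per-graph dictionary in this format can be converted to a networkx graph for inspection or reuse. This helper is not part of the package, just a sketch assuming the dataset format described above.

```python
import networkx as nx

def to_networkx(graph_dict):
    """Convert one graph from the dataset format into an undirected nx.Graph."""
    g = nx.Graph()
    for nidx, node in graph_dict.items():
        # the node label may be stored as a 1-tuple, e.g. (3,) in mutag
        g.add_node(nidx, label=node["label"])
        for neighbor in node["neighbors"]:
            g.add_edge(nidx, neighbor)
    return g

# toy graph in the dataset format: a path 0-1-2
toy = {
    0: {"neighbors": [1], "label": (0,)},
    1: {"neighbors": [0, 2], "label": (1,)},
    2: {"neighbors": [1], "label": (0,)},
}
g = to_networkx(toy)
print(g.number_of_nodes(), g.number_of_edges())  # 3 2
```

Since edges appear in the neighbor lists of both endpoints, nx.Graph deduplicates them automatically.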

=== ERRATA === 

The results for the graphlet kernel in the paper were generated with graphlet 
size k=8 instead of the k=7 reported in the paper. Please use graphlet size 
k=8 when reproducing our results.