Tutorial

In this tutorial, we will run the Fraud Eagle algorithm [1] on a synthetic data set provided in [2].

Installation

An implementation of Fraud Eagle, rgmining-fraud-eagle, is available on PyPI, as is a synthetic review data set, rgmining-synthetic-dataset. First, we install both packages via pip:

pip install --upgrade rgmining-fraud-eagle rgmining-synthetic-dataset

Constructing a review graph

The rgmining-synthetic-dataset package provides the synthetic module, which exports the function synthetic.load() and the constant synthetic.ANOMALOUS_REVIEWER_SIZE. The function synthetic.load() loads the synthetic data set and adds it to a review graph. The constant synthetic.ANOMALOUS_REVIEWER_SIZE is the number of anomalous reviewers in the data set, i.e. \(57\). Every anomalous reviewer has "anomaly" in their name.
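As a quick sanity check, the constant can be inspected from an interactive session (a minimal sketch; the printed value should be \(57\) as stated above):

import synthetic

# Number of anomalous reviewers contained in the synthetic data set.
print(synthetic.ANOMALOUS_REVIEWER_SIZE)  # -> 57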

We omit an explanation of how the synthetic data set is constructed; if you are interested, please read the original article.

To create a review graph object for the Fraud Eagle algorithm and load the synthetic data set into it, run

import fraud_eagle as feagle
import synthetic

graph = feagle.ReviewGraph(0.10)
synthetic.load(graph)

where \(0.10\) is the parameter of Fraud Eagle. See fraud_eagle.ReviewGraph for more information.
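To confirm that the data set has been loaded, we can inspect the graph; a minimal sketch that relies only on the graph.reviewers attribute and the reviewer naming convention described above:

# Count all reviewers and the anomalous ones among them.
reviewers = graph.reviewers
anomalous = [r for r in reviewers if "anomaly" in r.name]
print(len(reviewers), len(anomalous))  # the second number should be 57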

Running Fraud Eagle algorithm

Since Fraud Eagle takes one parameter, a natural question is which value best detects the anomalous reviewers. In this tutorial, we evaluate each parameter value by the precision of the top-57 reviewers, i.e. how many actual anomalous reviewers appear among the 57 reviewers with the highest anomalous scores.
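Concretely, if \(k\) of those 57 reviewers are actually anomalous, the precision is \(k/57\). A tiny worked example (the value 31 is hypothetical here, though it happens to match the result we obtain below):

# Precision of the top-57 reviewers: hits divided by 57.
hits = 31
print(hits / 57.0)  # -> 0.5438596491...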

To evaluate one parameter value, we write a simple script, analyze.py:

#!/usr/bin/env python
import click
import fraud_eagle as feagle
import synthetic

@click.command()
@click.argument("epsilon", type=float)
def analyze(epsilon):
    """Evaluate Fraud Eagle with the given epsilon on the synthetic data set."""
    graph = feagle.ReviewGraph(epsilon)
    synthetic.load(graph)

    # Update the graph until convergence, with at most 100 iterations.
    for _ in range(100):
        diff = graph.update()
        print("Iteration end: {0}".format(diff))
        if diff < 1e-4:
            break

    # Pick the 57 reviewers with the highest anomalous scores.
    reviewers = sorted(
        graph.reviewers,
        key=lambda r: -r.anomalous_score)[:synthetic.ANOMALOUS_REVIEWER_SIZE]

    # Precision: the fraction of actual anomalous reviewers among them.
    # Note: this relies on true division (Python 3); on Python 2, wrap the
    # denominator in float().
    print(len([r for r in reviewers if "anomaly" in r.name]) / len(reviewers))

if __name__ == "__main__":
    analyze()

Note that the above script uses the command-line parser click.

With this script, to evaluate a parameter, e.g. \(0.1\), run:

$ chmod u+x analyze.py
$ ./analyze.py 0.1

The result might be:

$ ./analyze.py 0.10
Iteration end: 0.388863491546
Iteration end: 0.486597792445
Iteration end: 0.679722652169
Iteration end: 0.546349261422
Iteration end: 0.333657951459
Iteration end: 0.143313076183
Iteration end: 0.0596751050403
Iteration end: 0.0265415183341
Iteration end: 0.0109979501706
Iteration end: 0.00584731865022
Iteration end: 0.00256288275348
Iteration end: 0.00102187920468
Iteration end: 0.000365458293609
Iteration end: 0.000151984909839
Iteration end: 4.14654814812e-05
0.543859649123

This means about 54% (31 of 57) of the top-57 reviewers ranked by anomalous score are actual anomalous reviewers.

Parallel evaluation

We need to run analyze.py with several parameter values to determine the best one. Since that would take a long time sequentially, we employ Google Cloud Platform and Roadie to evaluate them in parallel.

To use Google Cloud Platform, you need to register for it first. After registration, set up the Google Cloud SDK and Roadie.

In order to run analyze.py in another environment, we need to prepare requirements.txt, a list of the required libraries, in the same directory as analyze.py:

click==6.6
rgmining-fraud-eagle==0.9.2
rgmining-synthetic-dataset==0.9.0
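Locally, the same pinned dependencies can be installed in one step with a standard pip feature:

pip install -r requirements.txt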

Roadie requires a configuration file, written in YAML, which specifies the programs to be run on a cloud server. We create analyze.yml, which simply runs analyze.py:

run:
- python analyze.py {{epsilon}}

where {{epsilon}} is a placeholder to which we will assign several values; with epsilon=0.01, for example, the run section expands to python analyze.py 0.01.

First, we upload our source code and create an instance on the cloud with parameter \(0.01\):

roadie run --local . --name feagle0.01 --queue feagle -e epsilon=0.01 analyze.yml

where --local . means the root of our source code is the current directory, and --queue feagle means the new task belongs to a set of tasks, i.e. a queue, named feagle.

Next, we create tasks for the other parameter values. Those tasks reuse the source code uploaded with the task named feagle0.01. Since seq -w 2 25 generates the zero-padded indices 02 through 25, the following loop covers the parameters \(0.02\) to \(0.25\):

$ for i in `seq -w 2 25`; do
    roadie run --source "feagle0.01.tar.gz" --name "feagle0.${i}" --queue feagle -e "epsilon=0.$i" analyze.yml
done

By default, Roadie creates one instance for a set of tasks. To run tasks in parallel, we need more instances, so we create 7 more:

$ roadie queue instance add --instances 7 feagle

and roadie status shows the current status of every instance; each instance has a name starting with the queue name followed by a random number. Once roadie status shows nothing, all tasks have finished.

The results are stored in Google Cloud Storage, and roadie result show <task name> prints the result of a task. To gather all results into a CSV file, run

$ for i in `seq -w 1 25`; do
    echo "0.${i}, `roadie result show feagle0.${i} | tail -1`" >> result.csv
done
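The resulting result.csv has one line per parameter value: the first column is epsilon and the second is the measured precision. For example, the line for the run shown earlier would be (the remaining lines will vary):

0.10, 0.543859649123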

We now create a simple script, plot.py, to plot the results:

#!/usr/bin/env python
import click
from matplotlib import pyplot
import pandas as pd

@click.command()
@click.argument("infile")
def plot(infile):
    """Plot precision (column 1) against the parameter value (column 0)."""
    # result.csv has no header: column 0 is epsilon, column 1 is precision.
    data = pd.read_csv(infile, header=None)
    pyplot.plot(data[0], data[1])
    pyplot.show()

if __name__ == "__main__":
    plot()

The above script requires click, matplotlib, and pandas.
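Optionally, the axes can be labeled in the figure itself by inserting two standard matplotlib calls just before pyplot.show(); the label texts below are merely suggestions:

pyplot.xlabel("epsilon")
pyplot.ylabel("precision (top-57)")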

After running the script with

$ chmod u+x plot.py
$ ./plot.py result.csv

we get the following graph, where the x-axis is the parameter value and the y-axis is the precision.

[Figure 1: precision (y-axis) versus the parameter value (x-axis)]

From the graph, the parameter should be less than \(0.1\).