# dataset_io package

Define dataset file format and load/output methods.

This package provides a set of methods to load review datasets and output mining results in JSON format.

## Review data

Review data are a set of tuples. Each tuple consists of a reviewer’s ID, the ID of the reviewed product, a five-star rating score, and the review date. In the JSON format, the corresponding keys are member_id, product_id, rating, and date, respectively. A review tuple looks like:

    {
        "member_id": "A1AF30H2MPOO9",
        "product_id": "0001056530",
        "rating": 4.0,
        "date": "2000-08-21"
    }


A review data file must consist of such JSON objects, with exactly one object per line.
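As a rough illustration of the one-object-per-line layout, the sketch below writes and re-reads two reviews using only the standard json module. The second record and the use of io.StringIO in place of a real file are assumptions made for the example.

```python
import io
import json

# Two reviews in the one-object-per-line layout described above.
reviews = [
    {"member_id": "A1AF30H2MPOO9", "product_id": "0001056530",
     "rating": 4.0, "date": "2000-08-21"},
    {"member_id": "EXAMPLE_MEMBER", "product_id": "0001056530",
     "rating": 2.0, "date": "2000-09-01"},  # made-up record for illustration
]

# io.StringIO stands in for a real file opened in text mode.
fp = io.StringIO()
for review in reviews:
    fp.write(json.dumps(review) + "\n")

# Reading back: every non-empty line is parsed on its own.
fp.seek(0)
parsed = [json.loads(line) for line in fp if line.strip()]
```

Because each line is an independent JSON document, a file in this layout can be processed as a stream without loading it whole.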

dataset_io.loader.load(), or its alias dataset_io.load(), parses a JSON file and adds the review data to a graph.

## Mining results

Mining results are described as a set of state information. Because we assume that mining algorithms follow a repeated-improvement principle, every iteration outputs a new state.

In the output format, each line represents a reviewer or a product object. Reviewer objects are defined as

    {
        "iteration": <the iteration number given as i>,
        "reviewer": {
            "reviewer_id": <Reviewer's ID>,
            "score": <Anomalous score of the reviewer>
        }
    }


Product objects are defined as

    {
        "iteration": <the iteration number given as i>,
        "product": {
            "product_id": <Product's ID>,
            "summary": <Summary of the reviews for the product>
        }
    }


dataset_io.helper.print_state(), or its alias dataset_io.print_state(), outputs the state of a graph.

The output state information can be used to restore a state with dataset_io.resume.resume() or its alias dataset_io.resume(). This function takes state data and constructs a graph which has the same state.
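The line-oriented state format can be sketched with the standard json module alone. This is not the package's print_state. The function name write_state, the (id, value) pair inputs, and the wrapper key "product" for product records are all assumptions of this sketch.

```python
import io
import json


def write_state(reviewers, products, i, output):
    """Write one JSON object per line for each reviewer and product.

    `reviewers` holds (reviewer_id, score) pairs and `products` holds
    (product_id, summary) pairs; both shapes are assumptions of this sketch.
    """
    for reviewer_id, score in reviewers:
        record = {"iteration": i,
                  "reviewer": {"reviewer_id": reviewer_id, "score": score}}
        output.write(json.dumps(record) + "\n")
    for product_id, summary in products:
        # The wrapper key "product" is assumed here.
        record = {"iteration": i,
                  "product": {"product_id": product_id, "summary": summary}}
        output.write(json.dumps(record) + "\n")


buf = io.StringIO()
write_state([("A1AF30H2MPOO9", 0.2)], [("0001056530", 0.75)], 1, buf)
lines = buf.getvalue().splitlines()
```

Writing every iteration to the same stream yields a log in which each line carries its own iteration number, which is what makes restoring a chosen iteration possible later.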

## Graph interface

This package assumes that graph, reviewer, and product objects follow certain APIs.

### Graph object

A graph object maintains the relationships between reviewers and products. Most algorithms model it as a bipartite graph, but any modeling is allowed.

Any graph object needs to supply some methods and properties. The required methods are the following:

- new_reviewer(name, anomalous): create and register a new reviewer that has the given name and is initialized with the given anomalous score,
- new_product(name): create and register a new product that has the given name,
- find_reviewer(name): find and return the reviewer that has the given name,
- find_product(name): find and return the product that has the given name,
- a method that adds a new review from a reviewer to a product, issued on a given date, where the review is a float value.

The required properties are the following:

- reviewers: a set of reviewers,
- products: a set of products.

### Reviewer object

A reviewer object represents a reviewer who has a name and an anomalous score. A reviewer object is required to have two properties:

- the name of the reviewer,
- a float value of the reviewer’s anomalous score.

### Product object

A product object represents a product which has a name and summarized reviews, called a summary. A product object is required to have two properties:

- the name of the product,
- a float value of the summarized reviews.
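A minimal sketch of objects that would satisfy the graph interface described in this section is shown below. Several names are assumptions rather than documented API: the attribute names name, anomalous_score, and summary, the method name add_review and its argument order, and the use of a plain average as the summary.

```python
class Reviewer:
    """Reviewer with the two required properties (attribute names assumed)."""

    def __init__(self, name, anomalous_score):
        self.name = name
        self.anomalous_score = anomalous_score


class Product:
    """Product whose summary is a plain average of its ratings."""

    def __init__(self, name):
        self.name = name
        self.ratings = []

    @property
    def summary(self):
        # Averaging is only a placeholder; real algorithms compute their own.
        return sum(self.ratings) / len(self.ratings) if self.ratings else 0.0


class SimpleGraph:
    """Keeps reviewers and products in dictionaries keyed by name."""

    def __init__(self):
        self._reviewers = {}
        self._products = {}

    @property
    def reviewers(self):
        return set(self._reviewers.values())

    @property
    def products(self):
        return set(self._products.values())

    def new_reviewer(self, name, anomalous=None):
        self._reviewers[name] = Reviewer(name, anomalous)
        return self._reviewers[name]

    def new_product(self, name):
        self._products[name] = Product(name)
        return self._products[name]

    def find_reviewer(self, name):
        return self._reviewers.get(name)

    def find_product(self, name):
        return self._products.get(name)

    def add_review(self, reviewer, product, review, date=None):
        # Hypothetical method name; only the rating is kept in this sketch.
        product.ratings.append(float(review))


graph = SimpleGraph()
alice = graph.new_reviewer("A1AF30H2MPOO9", anomalous=0.5)
book = graph.new_product("0001056530")
graph.add_review(alice, book, 0.75, date=20000821)
```

A real mining algorithm would also store the reviewer side of each review and update anomalous scores per iteration; this sketch omits both to keep the interface visible.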

## Aliases

The top-level module provides the following aliases:

- dataset_io.load(): alias of dataset_io.loader.load(),
- dataset_io.print_state(): alias of dataset_io.helper.print_state(),
- dataset_io.parse_state(): alias of dataset_io.helper.parse_state(),
- dataset_io.quiet(): alias of dataset_io.helper.quiet(),
- dataset_io.normalize_rating(): alias of dataset_io.helper.normalize_rating(),
- dataset_io.resume(): alias of dataset_io.resume.resume(),
- dataset_io.UniformSampler: alias of dataset_io.sampler.UniformSampler,
- dataset_io.RatingBasedSampler: alias of dataset_io.sampler.RatingBasedSampler.

## dataset_io.constants module

Define public constants for dataset.

dataset_io.constants.MEMBER_ID = 'member_id'

Key of member ID.

dataset_io.constants.PRODUCT_ID = 'product_id'

Key of product ID.

dataset_io.constants.REVIEWER_ID = 'reviewer_id'

Key of reviewer ID.

## dataset_io.helper module

Provide helper functions and classes.

class dataset_io.helper.Product(product_id, summary)

Bases: tuple

Named tuple to access product’s attribute easily.

product_id

Alias for field number 0

summary

Alias for field number 1

class dataset_io.helper.Reviewer(reviewer_id, score)

Bases: tuple

Named tuple to access reviewer’s attributes easily.

reviewer_id

Alias for field number 0

score

Alias for field number 1

dataset_io.helper.convert_date(date)

Convert date-type data to int.

For example, date 2016-01-02 is converted to integer 20160102.

Parameters: date – date to convert.

Returns: Int-type date data.
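The conversion in the example above can be sketched as follows. The function name convert_date_sketch is hypothetical, and accepting either an ISO-formatted string or a datetime.date is an assumption; the accepted input types of the real function are not stated here.

```python
import datetime


def convert_date_sketch(date):
    """Turn an ISO date string or a datetime.date into an int like 20160102."""
    if isinstance(date, str):
        # Assumes the YYYY-MM-DD layout used in the review data.
        date = datetime.date.fromisoformat(date)
    return date.year * 10000 + date.month * 100 + date.day


as_int = convert_date_sketch("2016-01-02")
```

Integers in this layout sort in the same order as the dates they encode, which makes them convenient for comparing review dates.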
dataset_io.helper.normalize_rating(v)

Normalize five-star ratings to the range from 0 to 1.

Parameters: v – rating, which is between 1 and 5.

Returns: Normalized rating between 0 and 1.
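One natural way to map the range 1 to 5 onto 0 to 1 is a linear rescaling. The sketch below assumes that mapping, since only the input and output ranges are stated here; normalize_rating_sketch is a hypothetical name.

```python
def normalize_rating_sketch(v):
    """Linearly rescale a rating from [1, 5] to [0, 1] (assumed mapping)."""
    return (float(v) - 1.0) / 4.0


normalized = [normalize_rating_sketch(star) for star in (1, 2, 3, 4, 5)]
```

Under this mapping one star becomes 0, three stars become 0.5, and five stars become 1.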
dataset_io.helper.parse_state(fp, reviewer_handler=None, product_handler=None, iteration='final')

Parse a state of a graph from an iterable.

Parse a state output by print_state and call callback functions. The callback for reviewers must receive two arguments: the iteration and a reviewer object. The reviewer object has two attributes: reviewer_id and score. The callback for products must receive two arguments: the iteration and a product object. The product object has two attributes: product_id and summary. See print_state for more detail.

If a callback is set to None, the associated objects are not parsed.

Parameters:

- fp – An iterable object containing state data.
- reviewer_handler – A callback for reviewers (default: None).
- product_handler – A callback for products (default: None).
- iteration – The iteration to be parsed (default: ‘final’).
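The callback protocol can be illustrated with a small stand-alone parser. This is a sketch rather than the package's implementation: parse_state_sketch is a hypothetical name, the wrapper key "product" is assumed for product records, and the iteration is matched by plain equality, so how the real function resolves 'final' is left open.

```python
import collections
import io
import json

Reviewer = collections.namedtuple("Reviewer", ("reviewer_id", "score"))
Product = collections.namedtuple("Product", ("product_id", "summary"))


def parse_state_sketch(fp, reviewer_handler=None, product_handler=None,
                       iteration="final"):
    """Call the handlers for every record of the requested iteration."""
    for line in fp:
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumption: records are selected by direct comparison.
        if record["iteration"] != iteration:
            continue
        if reviewer_handler is not None and "reviewer" in record:
            reviewer_handler(iteration, Reviewer(**record["reviewer"]))
        elif product_handler is not None and "product" in record:
            product_handler(iteration, Product(**record["product"]))


state = io.StringIO(
    '{"iteration": 1, "reviewer": {"reviewer_id": "r1", "score": 0.9}}\n'
    '{"iteration": 2, "reviewer": {"reviewer_id": "r1", "score": 0.4}}\n'
    '{"iteration": 2, "product": {"product_id": "p1", "summary": 0.75}}\n'
)
seen = []
parse_state_sketch(state, reviewer_handler=lambda i, r: seen.append((i, r)),
                   iteration=2)
```

Since no product handler is passed, the product record of iteration 2 is skipped, and only the reviewer record of that iteration reaches the callback.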
dataset_io.helper.print_state(g, i, output=sys.stdout)

Print the current state of a given graph.

This method outputs the current state of a graph as a set of JSON objects. Graph objects must have two properties: reviewers and products. Those properties return a set of reviewers and a set of products, respectively.

In this output format, each line represents a reviewer or a product object. Both object formats are defined in the Mining results section above.

Parameters:

- g – Graph instance.
- i – Iteration number.
- output – A writable object (default: sys.stdout).
dataset_io.helper.quiet(f)

Decorator ignoring ValueError.

Parameters: f – A function.

Returns: A decorated function which ignores ValueError and returns None when such an exception happens.
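A decorator with the behavior described above can be sketched in a few lines. The names quiet_sketch and to_int are hypothetical, and the use of functools.wraps is a choice of this sketch, not something the source states.

```python
import functools


def quiet_sketch(f):
    """Return a wrapper that swallows ValueError and returns None instead."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except ValueError:
            return None
    return wrapper


@quiet_sketch
def to_int(text):
    return int(text)


good = to_int("12")
bad = to_int("not a number")
```

Other exception types still propagate, so only the ValueError case is silenced.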

## dataset_io.loader module

Load review data formatted in JSON to a graph object.

dataset_io.loader.load(g, fp, anomalous=None, normalize=normalize_rating)

Load a review dataset to a given graph object.

The graph object must implement the Graph interface, i.e. it must have the following methods:

- new_reviewer(name, anomalous): create and register a new reviewer that has the given name and is initialized with the given anomalous score,
- new_product(name): create and register a new product that has the given name,
- find_reviewer(name): find and return the reviewer that has the given name,
- find_product(name): find and return the product that has the given name,
- a method that adds a new review from a reviewer to a product, issued on a given date, where the review is a float value,

and must have the following properties:

- a set of reviewers,
- a set of products.

fp is an iterable object which yields JSON strings, each representing a review. Each review must have the following elements: member_id, product_id, rating, and date,

where member_id is the reviewer’s ID (i.e. name) and product_id is the ID of the product the reviewer reviewed. rating is a five-star score for the product, and date is the date the review was issued.

Parameters:

- g – graph object where loaded review data are stored.
- fp – readable object containing JSON data of a loading table.
- anomalous – default anomalous scores (default: None).
- normalize – normalize function of rating scores; if set to None, scores are not normalized.

Returns: The graph instance, which is the same as g.
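The loading loop described above might look roughly like the following. It is a sketch under several assumptions: load_sketch and RecordingGraph are hypothetical names, the review-adding method is called add_review with the argument order (reviewer, product, review, date), the find methods return None for unknown names, the default normalizer is a linear rescaling, and the date is passed through unchanged.

```python
import io
import json


class RecordingGraph:
    """Stand-in graph that only records what the loader asks of it."""

    def __init__(self):
        self.reviewer_names = set()
        self.product_names = set()
        self.reviews = []

    def new_reviewer(self, name, anomalous=None):
        self.reviewer_names.add(name)
        return name

    def new_product(self, name):
        self.product_names.add(name)
        return name

    def find_reviewer(self, name):
        return name if name in self.reviewer_names else None

    def find_product(self, name):
        return name if name in self.product_names else None

    def add_review(self, reviewer, product, review, date):
        # Hypothetical method name for the review-adding method.
        self.reviews.append((reviewer, product, review, date))


def load_sketch(g, fp, anomalous=None, normalize=lambda v: (v - 1.0) / 4.0):
    """Add every review yielded by fp to g and return g."""
    for line in fp:
        if not line.strip():
            continue
        record = json.loads(line)
        reviewer = g.find_reviewer(record["member_id"])
        if reviewer is None:
            reviewer = g.new_reviewer(record["member_id"], anomalous)
        product = g.find_product(record["product_id"])
        if product is None:
            product = g.new_product(record["product_id"])
        rating = normalize(record["rating"]) if normalize else record["rating"]
        g.add_review(reviewer, product, rating, record["date"])
    return g


data = io.StringIO(
    '{"member_id": "A1AF30H2MPOO9", "product_id": "0001056530", '
    '"rating": 4.0, "date": "2000-08-21"}\n'
)
graph = load_sketch(RecordingGraph(), data)
```

Looking up a reviewer or product before creating it keeps repeated IDs in the data from producing duplicate nodes.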

## dataset_io.resume module

Provide a function to resume mining.

dataset_io.resume.resume(graph, state, iteration='final')

Reconstruct a bipartite graph from an original file and an output state file.

Parameters:

- graph – An empty bipartite graph object.
- state – A readable object containing state data output by helper.print_state.
- iteration – The iteration to load (default: final).

Returns: The graph instance, which is the same as graph.

## dataset_io.sampler module

Provide samplers for generating random ratings.

This module provides two samplers which generate random ratings. One is dataset_io.sampler.UniformSampler, which generates ratings from a uniform distribution. The other is dataset_io.sampler.RatingBasedSampler, which generates ratings from a rating distribution computed from a real dataset provided by Amazon.com.

Note that generated ratings are normalized into [0, 1].

class dataset_io.sampler.RatingBasedSampler

Bases: object

Sampling review scores from a distribution based on actual reviews.

This sampler generates ratings from a rating distribution computed from a real dataset provided by Amazon.com.

According to the dataset, the distribution is as follows:

| Rating | Number of reviews |
|--------|-------------------|
| 1      | 167137            |
| 2      | 122025            |
| 3      | 189801            |
| 4      | 422698            |
| 5      | 1266919           |

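This sampler is a callable object. To generate random ratings:

    from dataset_io import RatingBasedSampler

    sampler = RatingBasedSampler()
    for rating in sampler():
        print(rating)  # use the rating.
        break  # stop explicitly; the sampler never ends on its own.

Note that the sampler never ends on its own, so a break is required to stop the generation.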
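Sampling from the review counts listed for this class can be sketched with random.choices from the standard library. This is not the package's implementation: rating_based_sketch is a hypothetical generator, and mapping stars onto [0, 1] linearly is an assumption about how the normalization is done.

```python
import itertools
import random

# Review counts per star rating, as listed for this sampler.
COUNTS = {1: 167137, 2: 122025, 3: 189801, 4: 422698, 5: 1266919}


def rating_based_sketch(rng=random):
    """Yield normalized ratings forever, weighted by the observed counts."""
    stars = list(COUNTS)
    weights = list(COUNTS.values())
    while True:
        star = rng.choices(stars, weights=weights, k=1)[0]
        # Assumed linear normalization of 1..5 stars into [0, 1].
        yield (star - 1) / 4.0


# itertools.islice bounds the otherwise endless generator.
first_five = list(itertools.islice(rating_based_sketch(random.Random(0)), 5))
```

Bounding the generator with itertools.islice is an alternative to an explicit break when a fixed number of ratings is needed.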

class dataset_io.sampler.UniformSampler

Bases: object

Sampling review scores from a uniform distribution.

This sampler is a callable object. To generate random ratings:

    from dataset_io import UniformSampler

    sampler = UniformSampler()
    for rating in sampler():
        print(rating)  # use the rating.
        break  # stop explicitly; the sampler never ends on its own.

Note that the sampler never ends on its own, so a break is required to stop the generation.