dataset_io package

Define dataset file format and load/output methods.

This package provides a set of methods to load review datasets and output mining results in JSON format.

Review data

Review data are a set of tuples. Each tuple consists of a reviewer’s ID, the ID of the reviewed product, a five-star rating score, and the review date. In the JSON format, the corresponding keys are member_id, product_id, rating, and date. A review tuple looks like:

{
    "member_id": "A1AF30H2MPOO9",
    "product_id": "0001056530",
    "rating": 4.0,
    "date": "2000-08-21"
}

A review data file must consist of such JSON objects, with exactly one object per line.

dataset_io.loader.load(), or its alias dataset_io.load(), parses such a file and adds the review data to a graph.
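As an illustration of the one-object-per-line layout, the following minimal sketch reads review records with the standard json module only. It does not use dataset_io; the helper name read_reviews and the second sample record are hypothetical and exist only for this example.

```python
import io
import json

# Two reviews in the one-object-per-line layout. The first record is the
# example from the text; the second one is made up for illustration.
SAMPLE = io.StringIO(
    '{"member_id": "A1AF30H2MPOO9", "product_id": "0001056530", '
    '"rating": 4.0, "date": "2000-08-21"}\n'
    '{"member_id": "reviewer-x", "product_id": "0001056530", '
    '"rating": 2.0, "date": "2000-09-01"}\n'
)


def read_reviews(fp):
    """Yield one dict per non-empty line of a review data file."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)


reviews = list(read_reviews(SAMPLE))
for review in reviews:
    print(review["member_id"], review["product_id"], review["rating"], review["date"])
```

Because every line is a complete JSON object, a file can be processed as a stream without loading it entirely into memory.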

Mining results

Mining results are described as a sequence of states. Since we assume mining algorithms follow an iterative improvement principle, every iteration outputs a new state.

In the output format, each line represents a reviewer or a product object. Reviewer objects are defined as

{
   "iteration": <the iteration number given as i>
   "reviewer":
   {
      "reviewer_id": <Reviewer's ID>
      "score": <Anomalous score of the reviewer>
   }
}

Product objects are defined as

{
   "iteration": <the iteration number given as i>
   "product":
   {
      "product_id": <Product's ID>
      "summary": <Summary of the reviews for the product>
   }
}

dataset_io.helper.print_state(), or its alias dataset_io.print_state(), outputs the state of a graph.

The output state information can be used to restore a state with dataset_io.resume.resume() or its alias dataset_io.resume(). This function takes state data and constructs a graph which has the same state.

Graph interface

This package assumes that graph, reviewer, and product objects follow certain APIs.

Graph object

A graph object maintains the relationships between reviewers and products. Most algorithms model it as a bipartite graph, but any model is allowed.

Any graph object needs to supply certain methods and properties. The required methods are the following:

new_reviewer(name, anomalous)
create and register a new reviewer which has the given name and is initialized with the given anomalous score,
new_product(name)
create and register a new product which has the given name,
find_reviewer(name)
find and return the reviewer which has the given name,
find_product(name)
find and return the product which has the given name,
add_review(reviewer, product, review, date)
add a new review from reviewer to product issued on date, where review is a float value.

The required properties are the following:

reviewers (readable)
a set of reviewers,
products (readable)
a set of products.

Reviewer object

A reviewer object represents a reviewer who has a name and an anomalous score. The reviewer object is required to have two properties:

name (readable)
the name of the reviewer,
anomalous_score (readable)
a float value of the reviewer’s anomalous score.

Product object

A product object represents a product which has a name and summarized reviews, called a summary. The product object is required to have two properties:

name (readable)
the name of the product,
summary (readable)
a float value of the summarized reviews.
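The three interfaces described above can be sketched as plain Python classes. This is only one possible shape under stated assumptions, not the implementation any particular mining algorithm uses: the class names, the internal dictionaries, the ratings and reviews attributes, the choice to return None for unknown names, and the plain average used as the summary are all hypothetical.

```python
class Reviewer:
    """Reviewer exposing the readable properties name and anomalous_score."""

    def __init__(self, name, anomalous=None):
        self._name = name
        self._anomalous_score = anomalous

    @property
    def name(self):
        return self._name

    @property
    def anomalous_score(self):
        return self._anomalous_score


class Product:
    """Product exposing the readable properties name and summary."""

    def __init__(self, name):
        self._name = name
        self.ratings = []  # hypothetical storage for received ratings

    @property
    def name(self):
        return self._name

    @property
    def summary(self):
        # Assumption: summarize reviews by their plain average.
        if not self.ratings:
            return 0.0
        return sum(self.ratings) / len(self.ratings)


class Graph:
    """Graph supplying the methods and properties listed above."""

    def __init__(self):
        self._reviewers = {}
        self._products = {}
        self.reviews = []  # hypothetical (reviewer, product, review, date) log

    @property
    def reviewers(self):
        return set(self._reviewers.values())

    @property
    def products(self):
        return set(self._products.values())

    def new_reviewer(self, name, anomalous=None):
        reviewer = Reviewer(name, anomalous)
        self._reviewers[name] = reviewer
        return reviewer

    def new_product(self, name):
        product = Product(name)
        self._products[name] = product
        return product

    def find_reviewer(self, name):
        # Assumption: None signals that no reviewer has this name.
        return self._reviewers.get(name)

    def find_product(self, name):
        return self._products.get(name)

    def add_review(self, reviewer, product, review, date):
        product.ratings.append(float(review))
        self.reviews.append((reviewer, product, float(review), date))


graph = Graph()
alice = graph.new_reviewer("A1AF30H2MPOO9", anomalous=0.5)
book = graph.new_product("0001056530")
graph.add_review(alice, book, 0.75, "2000-08-21")
```

Any object with the same method and property names can stand in for these classes, since the package relies on the interface rather than on concrete types.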

Aliases

The top-level module provides the following aliases:

dataset_io.load()
alias of dataset_io.loader.load(),
dataset_io.print_state()
alias of dataset_io.helper.print_state(),
dataset_io.parse_state()
alias of dataset_io.helper.parse_state(),
dataset_io.quiet()
alias of dataset_io.helper.quiet(),
dataset_io.normalize_rating()
alias of dataset_io.helper.normalize_rating(),
dataset_io.resume()
alias of dataset_io.resume.resume(),
dataset_io.UniformSampler
alias of dataset_io.sampler.UniformSampler,
dataset_io.RatingBasedSampler
alias of dataset_io.sampler.RatingBasedSampler.

Submodules

dataset_io.constants module

Define public constants for dataset.

dataset_io.constants.MEMBER_ID = 'member_id'

Key of member ID.

dataset_io.constants.PRODUCT_ID = 'product_id'

Key of product ID.

dataset_io.constants.REVIEWER_ID = 'reviewer_id'

Key of reviewer ID.

dataset_io.helper module

Provide helper functions and classes.

class dataset_io.helper.Product(product_id, summary)

Bases: tuple

Named tuple to access product’s attribute easily.

product_id

Alias for field number 0

summary

Alias for field number 1

class dataset_io.helper.Reviewer(reviewer_id, score)

Bases: tuple

Named tuple to access reviewer’s attributes easily.

reviewer_id

Alias for field number 0

score

Alias for field number 1

dataset_io.helper.convert_date(date)

Convert date-type data to int.

For example, date 2016-01-02 is converted to integer 20160102.

Parameters: date – date to convert.
Returns: Int-type date data.
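The conversion described above can be sketched as follows. The function name date_to_int and the accepted input types (an ISO-formatted string or a datetime.date) are assumptions for this example; the input type that convert_date itself expects is not stated here.

```python
import datetime


def date_to_int(date):
    """Encode a date as the integer YYYYMMDD."""
    if isinstance(date, str):
        # Assumption: strings are ISO formatted, e.g. "2016-01-02".
        date = datetime.date.fromisoformat(date)
    return date.year * 10000 + date.month * 100 + date.day


encoded = date_to_int("2016-01-02")
print(encoded)
```

One convenient property of this encoding is that comparing the integers orders the dates chronologically.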

dataset_io.helper.normalize_rating(v)

Normalize five-star ratings to the range from 0 to 1.

Parameters: v – rating, which is between 1 and 5.
Returns: Normalized rating between 0 and 1.
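A normalization of this kind can be sketched as a linear map from [1, 5] onto [0, 1]. The linear formula, the function name normalize_five_star, and the ValueError raised for out-of-range input are assumptions for this example rather than documented behavior of normalize_rating.

```python
def normalize_five_star(v):
    """Map a rating in [1, 5] linearly onto [0, 1]."""
    if not 1 <= v <= 5:
        # Assumption: out-of-range ratings are rejected.
        raise ValueError("rating must be between 1 and 5")
    return (v - 1) / 4


normalized = [normalize_five_star(v) for v in (1, 3, 4.0, 5)]
print(normalized)
```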

dataset_io.helper.parse_state(fp, reviewer_handler=None, product_handler=None, iteration='final')

Parse a state of a graph from an iterable.

Parse a state output by print_state and call callback functions. The callback for reviewers must receive two arguments: the iteration and a reviewer object. The reviewer object has two attributes: reviewer_id and score. The callback for products must receive two arguments: the iteration and a product object. The product object has two attributes: product_id and summary. See print_state for more detail.

If a callback is set to None, the associated objects are not parsed.

Parameters:
  • fp – An iterable object containing state data.
  • reviewer_handler – A callback for reviewer (default: None).
  • product_handler – A callback for product (default: None).
  • iteration – Choose the iteration to be parsed (default: 'final').
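The callback contract for reviewers can be illustrated with a simplified stand-in parser. The function parse_reviewer_lines is hypothetical and handles only reviewer lines; it ignores the iteration argument of parse_state and is not the library's implementation. The sample state lines are made up.

```python
import json
from collections import namedtuple

# Mirrors the documented named tuple with fields reviewer_id and score.
Reviewer = namedtuple("Reviewer", ["reviewer_id", "score"])


def parse_reviewer_lines(lines, reviewer_handler):
    """Call reviewer_handler(iteration, reviewer) for each reviewer line."""
    for line in lines:
        obj = json.loads(line)
        body = obj.get("reviewer", {})
        if "reviewer_id" in body:
            reviewer = Reviewer(body["reviewer_id"], body["score"])
            reviewer_handler(obj["iteration"], reviewer)


state = [
    '{"iteration": 1, "reviewer": {"reviewer_id": "r1", "score": 0.2}}',
    '{"iteration": 1, "reviewer": {"reviewer_id": "r2", "score": 0.9}}',
    '{"iteration": 2, "reviewer": {"reviewer_id": "r1", "score": 0.1}}',
    '{"iteration": 2, "reviewer": {"reviewer_id": "r2", "score": 0.95}}',
]

history = {}


def collect(iteration, reviewer):
    """Example callback: record each reviewer's score per iteration."""
    history.setdefault(iteration, {})[reviewer.reviewer_id] = reviewer.score


parse_reviewer_lines(state, collect)
```

A callback with this two-argument shape is what reviewer_handler is expected to be; a product callback would follow the same pattern with product_id and summary.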

dataset_io.helper.print_state(g, i, output=sys.stdout)

Print the current state of a given graph.

This method outputs the current state of a graph as a set of JSON objects. Graph objects must have two properties: reviewers and products. Those properties return a set of reviewers and a set of products, respectively.

In this output format, each line represents a reviewer or product object.

Reviewer objects are defined as

{
   "iteration": <the iteration number given as i>
   "reviewer":
   {
      "reviewer_id": <Reviewer's ID>
      "score": <Anomalous score of the reviewer>
   }
}

Product objects are defined as

{
   "iteration": <the iteration number given as i>
   "product":
   {
      "product_id": <Product's ID>
      "summary": <Summary of the reviews for the product>
   }
}
Parameters:
  • g – Graph instance.
  • i – Iteration number.
  • output – A writable object (default: sys.stdout).
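The reviewer half of this output format can be reproduced with a short stand-in writer. The function write_reviewer_state is hypothetical and takes plain (reviewer_id, score) pairs instead of a graph; it only illustrates the line layout and how a writable object such as io.StringIO can receive the output.

```python
import io
import json


def write_reviewer_state(reviewers, i, output):
    """Write one JSON object per line for (reviewer_id, score) pairs."""
    for reviewer_id, score in reviewers:
        record = {
            "iteration": i,
            "reviewer": {"reviewer_id": reviewer_id, "score": score},
        }
        json.dump(record, output)
        output.write("\n")


buffer = io.StringIO()
write_reviewer_state([("r1", 0.2), ("r2", 0.9)], 1, buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

Writing a newline after every object keeps the output in the same one-object-per-line layout that the state parser reads.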

dataset_io.helper.quiet(f)

Decorator ignoring ValueError.

Parameters: f – A function.
Returns: A decorated function which ignores ValueError and returns None when such an exception happens.
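A decorator with the behavior described above can be sketched as follows. The names ignore_value_error and parse_rating are hypothetical; this is a minimal illustration of the technique, not the source of quiet.

```python
import functools


def ignore_value_error(f):
    """Return a wrapper that yields None when f raises ValueError."""

    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except ValueError:
            return None

    return wrapper


@ignore_value_error
def parse_rating(text):
    """Convert text to a float; float() raises ValueError on bad input."""
    return float(text)


results = [parse_rating("4.0"), parse_rating("n/a")]
print(results)
```

Only ValueError is swallowed; any other exception still propagates to the caller.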

dataset_io.loader module

Load review data formatted in JSON to a graph object.

dataset_io.loader.load(g, fp, anomalous=None, normalize=normalize_rating)

Load a review dataset to a given graph object.

The graph object must implement the Graph interface, i.e. it must have the following methods:

new_reviewer(name, anomalous)
create and register a new reviewer which has the given name and is initialized with the given anomalous score,
new_product(name)
create and register a new product which has the given name,
find_reviewer(name)
find and return the reviewer which has the given name,
find_product(name)
find and return the product which has the given name,
add_review(reviewer, product, review, date)
add a new review from reviewer to product issued on date, where review is a float value.

and must have the following properties:

reviewers (readable)
a set of reviewers,
products (readable)
a set of products.

fp is an iterable object which yields JSON strings, each representing a review. Each review must have the following elements:

{
    "member_id": "A1AF30H2MPOO9",
    "product_id": "0001056530",
    "rating": 4.0,
    "date": "2000-08-21"
}

where member_id is the reviewer’s ID, i.e. name, and product_id is the ID of the product which the reviewer reviews. rating is a five-star score for the product, and date is the date the review was issued.

Parameters:
  • g – graph object where loaded review data are stored.
  • fp – readable object containing JSON data of a loading table.
  • anomalous – default anomalous scores (Default: None).
  • normalize – function to normalize rating scores; if set to None, scores are not normalized.
Returns:

The graph instance, which is the same as g.
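One plausible way a loader could drive the Graph interface is a find-or-create loop over the review lines. The sketch below is an assumption about the control flow, not the code of load: TinyGraph and load_sketch are hypothetical, find_reviewer and find_product are assumed to return None for unknown names, and the date is passed through unchanged.

```python
import json


class TinyGraph:
    """Hypothetical stand-in exposing the methods a loader needs."""

    def __init__(self):
        self.reviewer_names = {}
        self.product_names = {}
        self.reviews = []

    def new_reviewer(self, name, anomalous=None):
        self.reviewer_names[name] = {"name": name, "anomalous_score": anomalous}
        return self.reviewer_names[name]

    def new_product(self, name):
        self.product_names[name] = {"name": name}
        return self.product_names[name]

    def find_reviewer(self, name):
        return self.reviewer_names.get(name)  # None when unknown (assumption)

    def find_product(self, name):
        return self.product_names.get(name)  # None when unknown (assumption)

    def add_review(self, reviewer, product, review, date):
        self.reviews.append((reviewer["name"], product["name"], review, date))


def load_sketch(g, fp, anomalous=None, normalize=None):
    """Add every review yielded by fp to g, creating nodes on demand."""
    for line in fp:
        review = json.loads(line)
        reviewer = g.find_reviewer(review["member_id"])
        if reviewer is None:
            reviewer = g.new_reviewer(review["member_id"], anomalous)
        product = g.find_product(review["product_id"])
        if product is None:
            product = g.new_product(review["product_id"])
        rating = review["rating"]
        if normalize is not None:
            rating = normalize(rating)
        g.add_review(reviewer, product, rating, review["date"])
    return g


lines = [
    '{"member_id": "A1AF30H2MPOO9", "product_id": "0001056530", "rating": 4.0, "date": "2000-08-21"}',
    '{"member_id": "reviewer-x", "product_id": "0001056530", "rating": 2.0, "date": "2000-09-01"}',
]
g = load_sketch(TinyGraph(), lines, normalize=lambda v: (v - 1) / 4)
```

Looking up a node before creating it keeps each reviewer and product unique even when they appear in many reviews.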

dataset_io.resume module

Provide a function to resume mining.

dataset_io.resume.resume(graph, state, iteration='final')

Reconstruct a bipartite graph from the original file and an output state file.

Parameters:
  • graph – An empty bipartite graph object.
  • state – A readable object containing state data output by helper.print_state.
  • iteration – Iteration to load (default: 'final').
Returns:

The graph instance, which is the same as graph.

dataset_io.sampler module

Provide samplers for generating random ratings.

This module provides two samplers which generate random ratings. One is dataset_io.sampler.UniformSampler, which generates ratings from a uniform distribution. The other is dataset_io.sampler.RatingBasedSampler, which generates ratings from a rating distribution computed from a real dataset provided by Amazon.com.

Note that generated ratings are normalized into [0, 1].

class dataset_io.sampler.RatingBasedSampler

Bases: object

Sampling review scores from a distribution based on actual reviews.

This sampler generates ratings from a rating distribution computed from a real dataset provided by Amazon.com.

According to the dataset, the distribution is as follows:

rating    number of reviews
1         167137
2         122025
3         189801
4         422698
5         1266919

This sampler is a callable object. To generate random ratings:

sampler = RatingBasedSampler()
for rating in sampler():
    # use the rating, then leave the loop explicitly.
    break

Note that in the above example, the sampler never stops by itself; a break is required to end the generation.
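Sampling from the review counts listed above can be sketched with the standard random module. This is an illustrative stand-in, not the sampler's source: the generator rating_stream, the seeded random.Random instance, and the linear mapping of star ratings onto [0, 1] are assumptions for this example.

```python
import itertools
import random

# Review counts per star rating, copied from the table above.
REVIEW_COUNTS = {1: 167137, 2: 122025, 3: 189801, 4: 422698, 5: 1266919}


def rating_stream(rng):
    """Yield normalized ratings forever, weighted by the review counts."""
    stars = list(REVIEW_COUNTS)
    weights = list(REVIEW_COUNTS.values())
    while True:
        star = rng.choices(stars, weights=weights)[0]
        # Assumption: stars 1..5 map linearly onto [0, 1].
        yield (star - 1) / 4


rng = random.Random(0)
samples = list(itertools.islice(rating_stream(rng), 1000))
```

Since five-star reviews make up more than half of the counts, most generated ratings are 1.0.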

class dataset_io.sampler.UniformSampler

Bases: object

Sampling review scores from a uniform distribution.

This sampler is a callable object. To generate random ratings:

sampler = UniformSampler()
for rating in sampler():
    # use the rating, then leave the loop explicitly.
    break

Note that in the above example, the sampler never stops by itself; a break is required to end the generation.
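As an alternative to an explicit break, an endless rating iterator can be bounded with itertools.islice. The class UniformSketch below is a hypothetical stand-in with the same callable shape; it assumes a continuous uniform distribution over [0, 1), which may differ from what UniformSampler actually draws.

```python
import itertools
import random


class UniformSketch:
    """Hypothetical stand-in: calling it returns an endless rating iterator."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def __call__(self):
        while True:
            # Assumption: ratings are drawn uniformly from [0, 1).
            yield self._rng.random()


sampler = UniformSketch(seed=1)
# islice stops after five items, so no explicit break is needed.
first_five = list(itertools.islice(sampler(), 5))
```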