dataset_io package¶
Define dataset file format and load/output methods.
This package provides a set of methods to load review datasets and output mining results in JSON format.
Review data¶
Review data are a set of tuples. Each tuple consists of a reviewer’s ID, the ID of the reviewed product, a five-star rating score, and the review date. In the JSON format, the corresponding keys are member_id, product_id, rating, and date, respectively. A review tuple looks like:
{
"member_id": "A1AF30H2MPOO9",
"product_id": "0001056530",
"rating": 4.0,
"date": "2000-08-21"
}
A review data file must consist of such JSON objects, one object per line.
dataset_io.loader.load() (or its alias dataset_io.load()) parses such a JSON file and adds the review data to a graph.
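As a rough illustration of the file format described above, the following sketch reads one JSON object per line using only the standard library. The helper name `iter_reviews` and the second sample record are made up for this example; they are not part of dataset_io.

```python
import io
import json


def iter_reviews(fp):
    """Yield (member_id, product_id, rating, date) from a JSON-lines stream.

    Hypothetical helper for illustration only; not part of dataset_io.
    """
    for line in fp:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        obj = json.loads(line)
        yield (obj["member_id"], obj["product_id"],
               float(obj["rating"]), obj["date"])


# Two review records, one JSON object per line (the second one is invented).
sample = io.StringIO(
    '{"member_id": "A1AF30H2MPOO9", "product_id": "0001056530", '
    '"rating": 4.0, "date": "2000-08-21"}\n'
    '{"member_id": "EXAMPLE_MEMBER", "product_id": "0001056530", '
    '"rating": 2.0, "date": "2001-01-05"}\n'
)
reviews = list(iter_reviews(sample))
```

Any iterable of lines, such as an open file, could be passed in place of the in-memory sample.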
Mining results¶
Mining results are described as a sequence of states. Since mining algorithms are assumed to work by iterative improvement, every iteration outputs a new state.
In the output format, each line represents a reviewer or a product object. Reviewer objects are defined as
{
"iteration": <the iteration number given as i>
"reviewer":
{
"reviewer_id": <Reviewer's ID>
"score": <Anomalous score of the reviewer>
}
}
Product objects are defined as
{
"iteration": <the iteration number given as i>
"product":
{
"product_id": <Product's ID>
"summary": <Summary of the reviews for the product>
}
}
dataset_io.helper.print_state() (or its alias dataset_io.print_state()) outputs the state of a graph.
The output state information can be used to restore a state with dataset_io.resume.resume() (or its alias dataset_io.resume()). This function takes state data and constructs a graph which has the same state.
Graph interface¶
This package assumes that graph, reviewer, and product objects follow certain APIs.
Graph object¶
A graph object maintains the relationship between reviewers and products. Most algorithms treat it as a bipartite graph, but any model is allowed.
Any graph object needs to supply some methods and properties. The required methods are as follows:
- new_reviewer(name, anomalous)
- create and register a new reviewer which has the given name and is initialized with the given anomalous score,
- new_product(name)
- create and register a new product which has the given name,
- find_reviewer(name)
- find and return the reviewer which has the given name,
- find_product(name)
- find and return the product which has the given name,
- add_review(reviewer, product, review, date)
- add a new review from reviewer to product issued on date, where the review is a float value.
The required properties are as follows:
- reviewers (readable)
- a set of reviewers,
- products (readable)
- a set of products.
Reviewer object¶
A reviewer object represents a reviewer who has a name and an anomalous score. The reviewer object is required to have two properties:
- name (readable)
- the name of the reviewer,
- anomalous_score (readable)
- a float value of the reviewer’s anomalous score.
Product object¶
A product object represents a product which has a name and summarized reviews, called the summary. The product object is required to have two properties:
- name (readable)
- the name of the product,
- summary (readable)
- a float value of the summarized reviews.
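The interface above can be satisfied by quite small classes. The sketch below is one possible minimal implementation, not code from this package: the class names `SimpleGraph`, `SimpleReviewer`, and `SimpleProduct` are hypothetical, the summary is assumed to be a plain average of the ratings, and the find methods are assumed to return None when nothing matches.

```python
class SimpleReviewer:
    """Reviewer with the two required readable properties."""

    def __init__(self, name, anomalous=None):
        self.name = name
        self.anomalous_score = anomalous


class SimpleProduct:
    """Product with the two required readable properties."""

    def __init__(self, name):
        self.name = name
        self.ratings = []

    @property
    def summary(self):
        # Assumption: summarize reviews by their plain average.
        if not self.ratings:
            return 0.0
        return sum(self.ratings) / len(self.ratings)


class SimpleGraph:
    """Keeps reviewers and products in dictionaries keyed by name."""

    def __init__(self):
        self._reviewers = {}
        self._products = {}

    @property
    def reviewers(self):
        return set(self._reviewers.values())

    @property
    def products(self):
        return set(self._products.values())

    def new_reviewer(self, name, anomalous=None):
        reviewer = SimpleReviewer(name, anomalous)
        self._reviewers[name] = reviewer
        return reviewer

    def new_product(self, name):
        product = SimpleProduct(name)
        self._products[name] = product
        return product

    def find_reviewer(self, name):
        # Assumption: return None when no reviewer has this name.
        return self._reviewers.get(name)

    def find_product(self, name):
        # Assumption: return None when no product has this name.
        return self._products.get(name)

    def add_review(self, reviewer, product, review, date):
        # This sketch only stores the rating; reviewer and date are ignored.
        product.ratings.append(float(review))
        return review


graph = SimpleGraph()
alice = graph.new_reviewer("A1AF30H2MPOO9", 0.5)
book = graph.new_product("0001056530")
graph.add_review(alice, book, 0.75, "2000-08-21")
```

A real mining algorithm would also update anomalous scores and summaries between iterations; that logic is deliberately left out here.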
Aliases¶
The top-level module provides the following aliases:
- dataset_io.load()
- alias of dataset_io.loader.load(),
- dataset_io.print_state()
- alias of dataset_io.helper.print_state(),
- dataset_io.parse_state()
- alias of dataset_io.helper.parse_state(),
- dataset_io.quiet()
- alias of dataset_io.helper.quiet(),
- dataset_io.normalize_rating()
- alias of dataset_io.helper.normalize_rating(),
- dataset_io.resume()
- alias of dataset_io.resume.resume(),
- dataset_io.UniformSampler
- alias of dataset_io.sampler.UniformSampler,
- dataset_io.RatingBasedSampler
- alias of dataset_io.sampler.RatingBasedSampler.
Submodules¶
dataset_io.constants module¶
Define public constants for dataset.
-
dataset_io.constants.
MEMBER_ID
= 'member_id'¶ Key of member ID.
-
dataset_io.constants.
PRODUCT_ID
= 'product_id'¶ Key of product ID.
-
dataset_io.constants.
REVIEWER_ID
= 'reviewer_id'¶ Key of reviewer ID.
dataset_io.helper module¶
Provide helper functions and classes.
- class
dataset_io.helper.
Product
(product_id, summary)¶ Bases:
tuple
Named tuple to access product’s attribute easily.
-
product_id
¶ Alias for field number 0
-
summary
¶ Alias for field number 1
-
- class
dataset_io.helper.
Reviewer
(reviewer_id, score)¶ Bases:
tuple
Named tuple to access reviewer’s attributes easily.
-
reviewer_id
¶ Alias for field number 0
-
score
¶ Alias for field number 1
-
-
dataset_io.helper.
convert_date
(date)[source]¶ Convert date-type data to int.
For example, date 2016-01-02 is converted to integer 20160102.
Parameters: date – date to convert.
Returns: Int-type date data.
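The conversion in the example above can be pictured with a one-line function. This is only a sketch: the name `date_to_int` is hypothetical, and it assumes the date arrives as a "YYYY-MM-DD" string, which the documentation does not state explicitly.

```python
def date_to_int(date):
    """Turn a 'YYYY-MM-DD' string into the integer YYYYMMDD.

    Illustrative only; assumes a string input in ISO format.
    """
    return int(date.replace("-", ""))


converted = date_to_int("2016-01-02")
```

Integers in this form sort in the same order as the dates they encode, which makes them convenient for comparisons.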
-
dataset_io.helper.
normalize_rating
(v)[source]¶ Normalize five-star ratings to values between 0 and 1.
Parameters: v – rating, which is between 1 and 5.
Returns: Normalized rating between 0 and 1.
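One natural way to map ratings in [1, 5] onto [0, 1] is a linear rescaling. The sketch below assumes that mapping; the documentation only gives the input and output ranges, so the exact formula and the name `normalize_five_star` are assumptions.

```python
def normalize_five_star(v):
    """Linearly rescale a rating from [1, 5] to [0, 1].

    Assumed formula for illustration: (v - 1) / 4.
    """
    return (v - 1.0) / 4.0


lowest = normalize_five_star(1)
highest = normalize_five_star(5)
four_stars = normalize_five_star(4.0)
```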
-
dataset_io.helper.
parse_state
(fp, reviewer_handler=None, product_handler=None, iteration='final')[source]¶ Parse a state of a graph from an iterable.
Parse a state output by print_state and call callback functions. The callback for reviewers must receive two arguments: the iteration and a reviewer object. The reviewer object has two attributes: reviewer_id and score. The callback for products must receive two arguments: the iteration and a product object. The product object has two attributes: product_id and summary. See print_state for more detail.
If a callback is set to None, the associated objects are not parsed.
Parameters: - fp – An iterable object containing state data.
- reviewer_handler – A callback for reviewer (default: None).
- product_handler – A callback for product (default: None).
- iteration – Choose iteration to be parsed (default: ‘final’).
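The callback style described above can be sketched as follows. This is not the package's implementation: `parse_state_sketch` is a hypothetical name, "final" is assumed to mean the largest iteration number present in the data, and each line is classified by the fields of its nested object (reviewer_id or product_id) rather than by the name of the wrapping key.

```python
import json
from collections import namedtuple

Reviewer = namedtuple("Reviewer", ("reviewer_id", "score"))
Product = namedtuple("Product", ("product_id", "summary"))


def parse_state_sketch(lines, reviewer_handler=None, product_handler=None,
                       iteration="final"):
    """Call the handlers for every state line of the chosen iteration."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return
    if iteration == "final":
        # Assumption: "final" selects the largest iteration number found.
        target = max(rec["iteration"] for rec in records)
    else:
        target = iteration
    for rec in records:
        if rec["iteration"] != target:
            continue
        for value in rec.values():
            if not isinstance(value, dict):
                continue  # skip the iteration number itself
            if "reviewer_id" in value and reviewer_handler is not None:
                reviewer_handler(
                    target, Reviewer(value["reviewer_id"], value["score"]))
            elif "product_id" in value and product_handler is not None:
                product_handler(
                    target, Product(value["product_id"], value["summary"]))


# Invented sample state; the wrapper key names are not relied upon.
state_lines = [
    '{"iteration": 1, "reviewer": {"reviewer_id": "A1AF30H2MPOO9", "score": 0.2}}',
    '{"iteration": 2, "reviewer": {"reviewer_id": "A1AF30H2MPOO9", "score": 0.1}}',
    '{"iteration": 2, "product": {"product_id": "0001056530", "summary": 0.8}}',
]
seen_reviewers, seen_products = [], []
parse_state_sketch(
    state_lines,
    reviewer_handler=lambda i, r: seen_reviewers.append((i, r)),
    product_handler=lambda i, p: seen_products.append((i, p)),
)
```

Leaving a handler as None simply skips the corresponding objects, which mirrors the behaviour described for parse_state.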
-
dataset_io.helper.
print_state
(g, i, output=sys.stdout)[source]¶ Print the current state of a given graph.
This method outputs the current state of a graph as a set of JSON objects. Graph objects must have two properties: reviewers and products. Those properties return a set of reviewers and a set of products, respectively.
In this output format, each line represents a reviewer or product object.
Reviewer objects are defined as
{
"iteration": <the iteration number given as i>
"reviewer":
{
"reviewer_id": <Reviewer's ID>
"score": <Anomalous score of the reviewer>
}
}
Product objects are defined as
{
"iteration": <the iteration number given as i>
"product":
{
"product_id": <Product's ID>
"summary": <Summary of the reviews for the product>
}
}
Parameters: - g – Graph instance.
- i – Iteration number.
- output – A writable object (default: sys.stdout).
dataset_io.loader module¶
Load review data formatted in JSON to a graph object.
-
dataset_io.loader.
load
(g, fp, anomalous=None, normalize=normalize_rating)[source]¶ Load a review dataset to a given graph object.
The graph object must implement the Graph interface, i.e. it must have the following methods:
- new_reviewer(name, anomalous)
- create and register a new reviewer which has the given name and is initialized with the given anomalous score,
- new_product(name)
- create and register a new product which has the given name,
- find_reviewer(name)
- find and return the reviewer which has the given name,
- find_product(name)
- find and return the product which has the given name,
- add_review(reviewer, product, review, date)
- add a new review from reviewer to product issued on date, where the review is a float value.
and must have the following properties:
- reviewers (readable)
- a set of reviewers,
- products (readable)
- a set of products.
fp is an iterable object which yields JSON strings, each representing a review. Each review must have the following elements:
{
"member_id": "A1AF30H2MPOO9",
"product_id": "0001056530",
"rating": 4.0,
"date": "2000-08-21"
}
where member_id is the reviewer’s ID, i.e. name, and product_id is the ID of the product the reviewer reviewed. rating is a five-star score for the product. date is the date the review was issued.
Parameters: - g – graph object where loaded review data are stored.
- fp – readable object containing the JSON review data to load.
- anomalous – default anomalous scores (Default: None).
- normalize – normalize function of rating scores; if set to None, scores are not normalized.
Returns: The graph instance, which is the same as g.
dataset_io.resume module¶
Provide a function to resume mining.
-
dataset_io.resume.
resume
(graph, state, iteration='final')[source]¶ Reconstruct a bipartite graph from the original file and an output state file.
Parameters: - graph – An empty bipartite graph object.
- state – A readable object containing state data output by helper.print_state.
- iteration – Loading iteration. (Default: final)
Returns: The graph instance. This is the same as graph.
dataset_io.sampler module¶
Provide samplers for generating random ratings.
This module provides two samplers which generate random ratings. One sampler is dataset_io.sampler.UniformSampler. This sampler generates ratings from a uniform distribution. The other one is dataset_io.sampler.RatingBasedSampler. This sampler generates ratings from a rating distribution computed from a real dataset provided by Amazon.com.
Note that generated ratings are normalized into [0, 1].
- class
dataset_io.sampler.
RatingBasedSampler
[source]¶ Bases:
object
Sampling review scores from a distribution based on actual reviews.
This sampler generates ratings from a rating distribution computed from a real dataset provided by Amazon.com.
According to the dataset, the distribution is as follows:
rating  number of reviews
1       167137
2       122025
3       189801
4       422698
5       1266919
This sampler is a callable object. To generate random ratings,
from dataset_io import RatingBasedSampler
sampler = RatingBasedSampler()
for rating in sampler():
    print(rating)  # use the rating
    break  # stop explicitly
Note that the sampler never ends on its own, so a break is required to stop the generation.
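To make the idea of sampling from the review counts concrete, here is a small standard-library sketch. It is not the package's implementation: the generator name `rating_stream` is hypothetical, and the linear normalization (star - 1) / 4 into [0, 1] is an assumption. Only the review counts come from the documentation above.

```python
import itertools
import random

# Number of reviews per star rating, as listed in the documentation.
REVIEW_COUNTS = {1: 167137, 2: 122025, 3: 189801, 4: 422698, 5: 1266919}


def rating_stream(counts=REVIEW_COUNTS, rng=None):
    """Endlessly yield normalized ratings weighted by the review counts."""
    rng = rng or random.Random()
    stars = list(counts)
    weights = [counts[s] for s in stars]
    while True:
        star = rng.choices(stars, weights=weights, k=1)[0]
        # Assumption: normalize linearly from [1, 5] into [0, 1].
        yield (star - 1) / 4


# The stream is infinite, so take a bounded number of samples.
samples = list(itertools.islice(rating_stream(rng=random.Random(0)), 1000))
```

Using itertools.islice plays the same role as the explicit break: it bounds an otherwise endless stream.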
- class
dataset_io.sampler.
UniformSampler
[source]¶ Bases:
object
Sampling review scores from a uniform distribution.
This sampler is a callable object. To generate random ratings,
from dataset_io import UniformSampler
sampler = UniformSampler()
for rating in sampler():
    print(rating)  # use the rating
    break  # stop explicitly
Note that the sampler never ends on its own, so a break is required to stop the generation.