dataset module

Analyze and handle datasets.

dataset.active_reviewers(graph, output, threshold=2)

Output the IDs of reviewers who have reviewed at least threshold items.

Parameters:
    graph – Graph instance to which the target dataset is loaded.
    output – a writable object.
    threshold – minimum number of reviewed items (default: 2).
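The thresholding can be pictured with a minimal sketch. It does not use the real Graph class: it assumes the reviews are available as plain (reviewer ID, product ID) pairs and that one ID is written per line. The name `active_reviewers_sketch` is hypothetical and exists only for this illustration.

```python
import io
from collections import Counter


def active_reviewers_sketch(reviews, output, threshold=2):
    """Write the ID of every reviewer with at least `threshold` reviews.

    `reviews` is an iterable of (reviewer_id, product_id) pairs. This is an
    assumption standing in for the Graph instance the real function takes.
    """
    counts = Counter(reviewer_id for reviewer_id, _ in reviews)
    for reviewer_id, count in sorted(counts.items()):
        if count >= threshold:
            # One ID per line is assumed; the real output layout may differ.
            output.write(f"{reviewer_id}\n")


reviews = [("r1", "p1"), ("r1", "p2"), ("r2", "p1")]
buffer = io.StringIO()
active_reviewers_sketch(reviews, buffer)
print(buffer.getvalue(), end="")
```

Any object with a `write` method, such as an open file or `sys.stdout`, can take the place of the `StringIO` buffer.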

dataset.distinct_product(graph, output)

Output distinct product IDs.

Parameters:
    graph – Graph instance to which the target dataset is loaded.
    output – a writable object.

dataset.file_or_list(value)

Argument type for dsargparse.

If the argument is the path of a file, the file is opened and passed as an iterator over its lines. Otherwise, the argument is treated as a comma-separated list.

Parameters:
    value – argument value.
Yields:
    each line in the file if the given value points to a file; otherwise, each item in the given comma-separated list.
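The behaviour described above can be sketched as a generator used as an argument type. This is an approximation, not the module's code: it uses the standard `argparse` rather than dsargparse, detects files with `os.path.isfile`, and strips whitespace from each item, all of which are assumptions. `file_or_list_sketch` and the `--target` option are hypothetical names.

```python
import argparse
import os


def file_or_list_sketch(value):
    """Yield lines of a file, or items of a comma-separated string.

    Illustrative only: the real function may detect files differently and
    may not strip whitespace.
    """
    if os.path.isfile(value):
        with open(value) as fp:
            for line in fp:
                yield line.strip()
    else:
        for item in value.split(","):
            yield item.strip()


# Direct use with a comma-separated string.
items = list(file_or_list_sketch("p1, p2,p3"))

# Use as an argparse type: the parsed attribute is a generator.
parser = argparse.ArgumentParser()
parser.add_argument("--target", type=file_or_list_sketch)
args = parser.parse_args(["--target", "p1,p2"])
targets = list(args.target)
```

Because the result is a generator, it can be consumed only once. Code that needs to test membership repeatedly should copy it into a `set` or `list` first.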

dataset.filter_product(graph, output, target, csv_format=False)

Output reviews of products whose IDs are in the given set of IDs.

The output format is JSON with the following schema:

{
    "member_id": <Reviewer ID>,
    "product_id": <Product ID>,
    "rating": <Rating score>,
    "date": <Date the review was posted>
}

Each output line is one JSON object.

CSV output is also supported. In that case, the first line is a header.

Parameters:
    graph – Graph instance to which the target dataset is loaded.
    output – a writable object.
    target – a list of target product IDs.
    csv_format – if True, output is formatted as CSV.
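The two output modes can be sketched as follows. The sketch assumes the reviews are already available as dictionaries in the schema above, which stands in for the Graph instance. The rating values, the date strings, the CSV column order, and the name `write_reviews_sketch` are all illustrative assumptions.

```python
import csv
import io
import json

# Column order for CSV output; assumed to follow the JSON schema.
FIELDS = ["member_id", "product_id", "rating", "date"]


def write_reviews_sketch(reviews, output, target, csv_format=False):
    """Write reviews whose product_id is in `target`, as JSON lines or CSV."""
    targets = set(target)
    selected = [r for r in reviews if r["product_id"] in targets]
    if csv_format:
        writer = csv.DictWriter(output, fieldnames=FIELDS, lineterminator="\n")
        writer.writeheader()  # first line is the header
        writer.writerows(selected)
    else:
        for review in selected:
            output.write(json.dumps(review) + "\n")  # one object per line


reviews = [
    {"member_id": "r1", "product_id": "p1", "rating": 0.8, "date": "2016-01-01"},
    {"member_id": "r2", "product_id": "p2", "rating": 0.4, "date": "2016-01-02"},
]
json_out = io.StringIO()
write_reviews_sketch(reviews, json_out, ["p1"])
csv_out = io.StringIO()
write_reviews_sketch(reviews, csv_out, ["p1"], csv_format=True)
```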

dataset.filter_reviewers(graph, output, target, csv_format=False)

Output reviews posted by reviewers whose IDs match the given set of IDs.

The output format is JSON with the following schema:

{
    "member_id": <Reviewer ID>,
    "product_id": <Product ID>,
    "rating": <Rating score>,
    "date": <Date the review was posted>
}

Each output line is one JSON object.

CSV output is also supported. In that case, the first line is a header.

Parameters:
    graph – Graph instance to which the target dataset is loaded.
    output – a writable object.
    target – a list of target reviewer IDs.
    csv_format – if True, output is formatted as CSV.
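Since each output line is an independent JSON object, the result can be read back line by line. The sketch below parses two sample lines written in the schema above. The sample values and the helper name `read_json_lines` are made up for illustration.

```python
import io
import json


def read_json_lines(stream):
    """Parse one JSON object per non-empty line of `stream`."""
    return [json.loads(line) for line in stream if line.strip()]


# Sample data standing in for the function's output.
sample = io.StringIO(
    '{"member_id": "r1", "product_id": "p1", "rating": 0.8, "date": "2016-01-01"}\n'
    '{"member_id": "r1", "product_id": "p2", "rating": 0.2, "date": "2016-01-05"}\n'
)
records = read_json_lines(sample)

# Group the reviewed product IDs by reviewer.
products_by_reviewer = {}
for record in records:
    products_by_reviewer.setdefault(record["member_id"], []).append(
        record["product_id"]
    )
```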

dataset.main()

The main function.

dataset.popular_products(graph, output, threshold=2)

Output the IDs of products that have at least threshold reviews.

Parameters:
    graph – Graph instance to which the target dataset is loaded.
    output – a writable object.
    threshold – minimum number of reviews (default: 2).

dataset.rating_average(graph, output, csv_format=False)

Output the average rating score of each product.

The output format is JSON with the following schema:

{
    "product_id": <Product ID>,
    "summary": <Average rating score>
}

Each output line is one JSON object.

CSV output is also supported. In that case, the first line is a header.

Parameters:
    graph – Graph instance to which the target dataset is loaded.
    output – a writable object.
    csv_format – if True, output is formatted as CSV.
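A minimal sketch of the per-product averaging, under the assumption that reviews are given as (product ID, rating) pairs instead of a Graph instance. Only the JSON mode is shown, and `rating_average_sketch` is a hypothetical name.

```python
import io
import json
from collections import defaultdict


def rating_average_sketch(reviews, output):
    """Write one JSON line per product with its mean rating in "summary"."""
    ratings = defaultdict(list)
    for product_id, rating in reviews:
        ratings[product_id].append(rating)
    for product_id in sorted(ratings):
        scores = ratings[product_id]
        summary = sum(scores) / len(scores)  # arithmetic mean
        output.write(
            json.dumps({"product_id": product_id, "summary": summary}) + "\n"
        )


buffer = io.StringIO()
rating_average_sketch([("p1", 1.0), ("p1", 0.5), ("p2", 0.25)], buffer)
lines = [json.loads(line) for line in buffer.getvalue().splitlines()]
```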

dataset.retrieve_reviewers(graph, output, target)

Output the IDs of reviewers who have reviewed at least one of the given products.

Parameters:
    graph – Graph instance to which the target dataset is loaded.
    output – a writable object.
    target – a list of target product IDs.

dataset.review_variance(graph, output, target=None, csv_format=False)

Output the variance of reviews for each product.

Each line of the output is a JSON document with the following schema:

{
  "product_id": <Product ID>,
  "size": <number of reviews>,
  "variance": <variance of reviews>
}

CSV output is also supported. In that case, the first line is a header.

If target is supplied, only products whose IDs are in target are output.

Parameters:
    graph – Graph instance to which the target dataset is loaded.
    output – a writable object to which results are written.
    target – an iterable of target product IDs (default: None).
    csv_format – if True, output is formatted as CSV.
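The optional target filter and the variance computation can be sketched as below. The documentation does not say whether the population or the sample variance is used. This sketch assumes the population variance (`statistics.pvariance`) and takes reviews as (product ID, rating) pairs in place of a Graph instance. `review_variance_sketch` is a hypothetical name.

```python
import io
import json
from collections import defaultdict
from statistics import pvariance


def review_variance_sketch(reviews, output, target=None):
    """Write size and (population) variance of ratings for each product."""
    # Copy target into a set so a one-shot iterable is consumed only once.
    targets = set(target) if target is not None else None
    ratings = defaultdict(list)
    for product_id, rating in reviews:
        if targets is None or product_id in targets:
            ratings[product_id].append(rating)
    for product_id in sorted(ratings):
        scores = ratings[product_id]
        row = {
            "product_id": product_id,
            "size": len(scores),
            "variance": pvariance(scores),  # assumption: population variance
        }
        output.write(json.dumps(row) + "\n")


buffer = io.StringIO()
sample_reviews = [("p1", 1.0), ("p1", 0.0), ("p2", 0.5)]
review_variance_sketch(sample_reviews, buffer, target=["p1"])
rows = [json.loads(line) for line in buffer.getvalue().splitlines()]
```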

dataset.reviewer_size(graph, output, target, csv_format=False)

Output the number of reviews posted by each reviewer of the target products.

The count is computed for every reviewer who has reviewed at least one of the given target products.

The default output format is JSON with the following schema:

{
  "reviewer": <Reviewer ID>,
  "size": <Number of reviews the reviewer has posted>,
  "product": <ID of the target product the reviewer reviewed>
}

Each output line is one JSON object.

CSV output is also supported. In that case, the first line is a header.

Parameters:
    graph – Graph instance to which the target dataset is loaded.
    output – a writable object.
    target – a list of target product IDs.
    csv_format – if True, output is formatted as CSV.
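One possible reading of the schema above is that a line is emitted for each (reviewer, target product) pair, with "size" counting all reviews by that reviewer. The sketch below follows that reading, which is an assumption. It takes a list of (reviewer ID, product ID) pairs in place of a Graph instance, and `reviewer_size_sketch` is a hypothetical name.

```python
import io
import json
from collections import Counter


def reviewer_size_sketch(reviews, output, target):
    """Write the total review count of each reviewer of a target product.

    `reviews` must be a list (it is traversed twice) of
    (reviewer_id, product_id) pairs.
    """
    targets = set(target)
    # Total number of reviews per reviewer, over all products.
    sizes = Counter(reviewer_id for reviewer_id, _ in reviews)
    for reviewer_id, product_id in reviews:
        if product_id in targets:
            row = {
                "reviewer": reviewer_id,
                "size": sizes[reviewer_id],
                "product": product_id,
            }
            output.write(json.dumps(row) + "\n")


pairs = [("r1", "p1"), ("r1", "p2"), ("r2", "p2"), ("r3", "p3")]
buffer = io.StringIO()
reviewer_size_sketch(pairs, buffer, ["p1"])
rows = [json.loads(line) for line in buffer.getvalue().splitlines()]
```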