Recommendation data loader.

```python
tflite_model_maker.recommendation.DataLoader(
    dataset, size, vocab
)
```
| Args | |
| --- | --- |
| `dataset` | `tf.data.Dataset` for recommendation. |
| `size` | int, dataset size. |
| `vocab` | list of dict; each vocab item is described above. |
Methods
download_and_extract_movielens

```python
@classmethod
download_and_extract_movielens(
    download_dir
)
```

Downloads and extracts the movielens dataset, then returns the extracted directory.
from_movielens

```python
@classmethod
from_movielens(
    data_dir,
    data_tag,
    input_spec: tflite_model_maker.recommendation.spec.InputSpec,
    generated_examples_dir=None,
    min_timeline_length=3,
    max_context_length=10,
    max_context_movie_genre_length=10,
    min_rating=None,
    train_data_fraction=0.9,
    build_vocabs=True,
    train_filename='train_movielens_1m.tfrecord',
    test_filename='test_movielens_1m.tfrecord',
    vocab_filename='movie_vocab.json',
    meta_filename='meta.json'
)
```
Generates a data loader from the movielens dataset.

The method downloads and prepares the dataset, then generates examples for train/eval. For the movielens data format, see:

- the function `_generate_fake_data` in `recommendation_testutil.py`
- or the zip file: http://files.grouplens.org/datasets/movielens/ml-1m.zip
| Args | |
| --- | --- |
| `data_dir` | str, path to the dataset containing (unzipped) text data. |
| `data_tag` | str, specifies the dataset in {'train', 'test'}. |
| `input_spec` | InputSpec, specifies the data format for input and embedding. |
| `generated_examples_dir` | str, path for generated preprocessed examples (default: same as `data_dir`). |
| `min_timeline_length` | int, min timeline length to split the train/eval set. |
| `max_context_length` | int, max context length as one input. |
| `max_context_movie_genre_length` | int, max context length of movie genres as one input. |
| `min_rating` | int or None, include only examples with at least this rating. |
| `train_data_fraction` | float, fraction of training data in [0.0, 1.0]. |
| `build_vocabs` | boolean, whether to build vocabs. |
| `train_filename` | str, generated file name for training data. |
| `test_filename` | str, generated file name for test data. |
| `vocab_filename` | str, generated file name for vocab data. |
| `meta_filename` | str, generated file name for metadata. |
| Returns | |
| --- | --- |
| A `DataLoader` instance. |
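The roles of `min_timeline_length` and `max_context_length` can be sketched in plain Python: each item in a user's timeline becomes a label, with the preceding (capped) items as its context, and timelines that are too short are dropped. This is a hypothetical simplification; the real pipeline also handles genres, ratings, and TFRecord output.

```python
# Hypothetical simplification of the example-generation step that
# min_timeline_length and max_context_length control. Each item in a
# user's timeline becomes a label, preceded by a capped context window.

def timeline_to_examples(timeline, min_timeline_length=3, max_context_length=10):
    if len(timeline) < min_timeline_length:
        return []  # too short to split into context/label pairs
    examples = []
    for i in range(1, len(timeline)):
        context = timeline[max(0, i - max_context_length):i]
        examples.append({'context': context, 'label': timeline[i]})
    return examples

timeline_to_examples([1, 2, 3, 4], max_context_length=2)
# -> contexts [1], [1, 2], [2, 3] with labels 2, 3, 4
```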
gen_dataset

```python
gen_dataset(
    batch_size=1,
    is_training=False,
    shuffle=False,
    input_pipeline_context=None,
    preprocess=None,
    drop_remainder=True,
    total_steps=None
)
```
Generates the dataset, overriding the base default so that `drop_remainder=True`.
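What `drop_remainder=True` means for batching can be illustrated with a plain-Python sketch of `tf.data` batching semantics (not the Model Maker internals):

```python
# Illustrates the effect of drop_remainder on batch count: with
# drop_remainder=True the final partial batch is discarded, so every
# batch has the same shape (useful for fixed-shape TFLite models).

def num_batches(size, batch_size, drop_remainder=True):
    """Number of batches produced from `size` examples."""
    full, rem = divmod(size, batch_size)
    return full if drop_remainder or rem == 0 else full + 1

num_batches(10, 3)                        # 3 full batches; 1 leftover example dropped
num_batches(10, 3, drop_remainder=False)  # 4 batches; the last holds 1 example
```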
generate_movielens_dataset

```python
@classmethod
generate_movielens_dataset(
    data_dir,
    generated_examples_dir=None,
    train_filename='train_movielens_1m.tfrecord',
    test_filename='test_movielens_1m.tfrecord',
    vocab_filename='movie_vocab.json',
    meta_filename='meta.json',
    min_timeline_length=3,
    max_context_length=10,
    max_context_movie_genre_length=10,
    min_rating=None,
    train_data_fraction=0.9,
    build_vocabs=True
)
```
Generates the movielens dataset and returns a dict containing metadata.
| Args | |
| --- | --- |
| `data_dir` | str, path to the dataset containing (unzipped) text data. |
| `generated_examples_dir` | str, path for generated preprocessed examples (default: same as `data_dir`). |
| `train_filename` | str, generated file name for training data. |
| `test_filename` | str, generated file name for test data. |
| `vocab_filename` | str, generated file name for vocab data. |
| `meta_filename` | str, generated file name for metadata. |
| `min_timeline_length` | int, min timeline length to split the train/eval set. |
| `max_context_length` | int, max context length as one input. |
| `max_context_movie_genre_length` | int, max context length of movie genres as one input. |
| `min_rating` | int or None, include only examples with at least this rating. |
| `train_data_fraction` | float, fraction of training data in [0.0, 1.0]. |
| `build_vocabs` | boolean, whether to build vocabs. |
| Returns | |
| --- | --- |
| Dict, metadata for the movielens dataset, containing keys: `train_file`, `train_size`, `test_file`, `test_size`, `vocab_file`, `vocab_size`, etc. |
get_num_classes

```python
@classmethod
get_num_classes(
    meta
) -> int
```

Gets the number of classes.

0 is reserved. The number of classes is max id + 1; e.g., if max id = 100, the classes are [0, 100], i.e., 101 classes in total.
| Args | |
| --- | --- |
| `meta` | dict, containing `meta['vocab_max_id']`. |

| Returns | |
| --- | --- |
| Number of classes. |
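The rule above is simple enough to state directly in code; this is a minimal sketch of the documented behavior, not the library implementation:

```python
# Number of classes = max id + 1, since id 0 is reserved and ids are
# contiguous in [0, max_id]. The real classmethod reads the same key.

def get_num_classes(meta):
    return meta['vocab_max_id'] + 1

get_num_classes({'vocab_max_id': 100})  # 101
```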
load_vocab

```python
@classmethod
load_vocab(
    vocab_file
) -> collections.OrderedDict
```

Loads the vocab from file.

The vocab file should be in JSON format: a list of lists of size 4, where the four elements are ordered as [id=int, title=str, genres=str joined with '|', count=int]. It is generated when preparing the movielens dataset.
| Args | |
| --- | --- |
| `vocab_file` | str, path to the vocab file. |

| Returns | |
| --- | --- |
| `vocab` | an OrderedDict mapping id to item. Each item represents a movie: { 'id': int, 'title': str, 'genres': list[str], 'count': int } |
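The documented file format can be parsed with a few lines of plain Python. This sketch reads an inline JSON string rather than a file path, and `parse_vocab` is a hypothetical helper, not the library's classmethod:

```python
import collections
import json

# Sketch of the documented vocab format: a JSON list of
# [id, title, genres-joined-with-'|', count] rows, turned into an
# OrderedDict keyed by id with genres split into a list.

def parse_vocab(json_text):
    vocab = collections.OrderedDict()
    for movie_id, title, genres, count in json.loads(json_text):
        vocab[movie_id] = {
            'id': movie_id,
            'title': title,
            'genres': genres.split('|'),  # 'A|B' -> ['A', 'B']
            'count': count,
        }
    return vocab

raw = '[[1, "Toy Story (1995)", "Animation|Children", 2077]]'
vocab = parse_vocab(raw)
# vocab[1]['genres'] == ['Animation', 'Children']
```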
split

```python
split(
    fraction
)
```

Splits the dataset into two sub-datasets with the given fraction.

Primarily used for splitting the dataset into training and testing sets.
| Args | |
| --- | --- |
| `fraction` | float, the fraction of the original data that goes into the first returned sub-dataset. |

| Returns | |
| --- | --- |
| The two split sub-datasets. |
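A hedged sketch of fraction-based splitting on a plain list (the real method operates on a `tf.data.Dataset` and returns two `DataLoader` objects):

```python
# Fraction-based split: the first fraction of examples goes to the
# first sub-dataset, the remainder to the second.

def split(examples, fraction):
    cut = int(len(examples) * fraction)
    return examples[:cut], examples[cut:]

train, test = split(list(range(10)), 0.8)
# train has 8 examples, test has 2
```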
__len__

```python
__len__()
```