DataLoader for text classifier.
tflite_model_maker.text_classifier.DataLoader(
    dataset, size, index_to_label
)
Args:
  dataset: A tf.data.Dataset object that contains a potentially large set of
    elements, where each element is a pair (input_data, target). input_data is
    the raw input, such as an image or a piece of text, while target is the
    ground truth for that input, such as the image's classification label.
  size: The size of the dataset. tf.data.Dataset doesn't provide a way to get
    the length directly, since it is lazily loaded and may be infinite.
Attributes:
  num_classes
  size: Returns the size of the dataset.
    Note that this may return None because the exact size of the dataset isn't
    a necessary parameter to create an instance of this class, and
    tf.data.Dataset doesn't provide a way to get the length directly, since it
    is lazily loaded and may be infinite.
    In most cases, however, when an instance of this class is created by helper
    functions like 'from_folder', the size of the dataset is computed during
    preprocessing, and this attribute returns an int.
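The reason size can be None mirrors plain Python generators: a lazily produced stream has no length until it is consumed. A minimal stdlib sketch of the same idea (illustrative, not Model Maker code):

```python
def lazy_dataset():
    # Yields (text, label) pairs one at a time; the total count is unknown
    # up front, just like a lazily loaded tf.data.Dataset.
    for i in range(5):
        yield (f"text {i}", i % 2)

stream = lazy_dataset()
# len(stream) would raise TypeError: generators have no length.
# The size only becomes known after materializing the stream:
size = sum(1 for _ in stream)
print(size)  # 5
```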
Methods
from_csv
@classmethod
from_csv(
    filename,
    text_column,
    label_column,
    fieldnames=None,
    model_spec='average_word_vec',
    is_training=True,
    delimiter=',',
    quotechar='"',
    shuffle=False,
    cache_dir=None
)
Loads text with labels from the CSV file and preprocesses the text according to model_spec.
Args:
  filename: Name of the file.
  text_column: String, column name for input text.
  label_column: String, column name for labels.
  fieldnames: A sequence, used in csv.DictReader. If fieldnames is omitted,
    the values in the first row of the file will be used as the fieldnames.
  model_spec: Specification for the model.
  is_training: Whether the loaded data is for training or not.
  delimiter: Character used to separate fields.
  quotechar: Character used to quote fields containing special characters.
  shuffle: Boolean, if True, randomly shuffles the data.
  cache_dir: The cache directory to save preprocessed data. If None, a
    temporary directory is generated to cache preprocessed data.
Returns:
  TextDataset containing text, labels and other related info.
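The fieldnames, delimiter, and quotechar parameters are passed through to Python's csv.DictReader. A minimal stdlib sketch (not Model Maker code) of how they interact:

```python
import csv
import io

# An in-memory CSV with a header row; the quoted field contains the delimiter.
data = 'sentence,label\n"hello, world",pos\ngoodbye,neg\n'

# fieldnames=None: the first row supplies the column names.
rows = list(csv.DictReader(io.StringIO(data), delimiter=',', quotechar='"'))

texts = [row['sentence'] for row in rows]
labels = [row['label'] for row in rows]
print(texts)   # ['hello, world', 'goodbye']
print(labels)  # ['pos', 'neg']
```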
from_folder
@classmethod
from_folder(
    filename,
    model_spec='average_word_vec',
    is_training=True,
    class_labels=None,
    shuffle=True,
    cache_dir=None
)
Loads text with labels and preprocesses the text according to model_spec.
Assumes that text data with the same label are in the same subdirectory and that each file contains one text.
Args:
  filename: Name of the folder.
  model_spec: Specification for the model.
  is_training: Whether the loaded data is for training or not.
  class_labels: Class labels that should be considered. Subdirectories whose
    names are not in class_labels will be ignored. If None, all subdirectories
    will be considered.
  shuffle: Boolean, if True, randomly shuffles the data.
  cache_dir: The cache directory to save preprocessed data. If None, a
    temporary directory is generated to cache preprocessed data.
Returns:
  TextDataset containing text, labels and other related info.
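The directory layout from_folder expects — one subdirectory per class, one text per file — can be sketched with the stdlib alone (this is an illustration, not Model Maker's implementation):

```python
import os
import tempfile

def load_folder(root, class_labels=None):
    """Collect (text, label) pairs from label-named subdirectories."""
    pairs = []
    for label in sorted(os.listdir(root)):
        subdir = os.path.join(root, label)
        if not os.path.isdir(subdir):
            continue
        # Subdirectories whose names are not in class_labels are ignored.
        if class_labels is not None and label not in class_labels:
            continue
        for name in sorted(os.listdir(subdir)):
            with open(os.path.join(subdir, name), encoding='utf-8') as f:
                pairs.append((f.read(), label))
    return pairs

# Build a tiny pos/neg layout and load it back.
with tempfile.TemporaryDirectory() as root:
    for label, text in [('pos', 'great'), ('neg', 'awful')]:
        os.makedirs(os.path.join(root, label))
        with open(os.path.join(root, label, 'a.txt'), 'w', encoding='utf-8') as f:
            f.write(text)
    pairs = load_folder(root)
print(pairs)  # [('awful', 'neg'), ('great', 'pos')]
```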
gen_dataset
gen_dataset(
    batch_size=1,
    is_training=False,
    shuffle=False,
    input_pipeline_context=None,
    preprocess=None,
    drop_remainder=False
)
Generates a sharded and batched tf.data.Dataset for training/evaluation.
Args:
  batch_size: An integer; the returned dataset will be batched by this size.
  is_training: A boolean; when True, the returned dataset will be optionally
    shuffled and repeated as an endless dataset.
  shuffle: A boolean; when True, the returned dataset will be shuffled to
    create randomness during model training.
  input_pipeline_context: An InputContext instance, used to shard the dataset
    among multiple workers when a distribution strategy is used.
  preprocess: A function taking three arguments in order: feature, label and
    boolean is_training.
  drop_remainder: Boolean, whether the final batch drops the remainder.
Returns:
  A TF dataset ready to be consumed by a Keras model.
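The interaction between batch_size and drop_remainder can be illustrated with plain lists (a sketch of the semantics, not the tf.data implementation):

```python
def batch(items, batch_size, drop_remainder=False):
    """Group items into batches of batch_size; optionally drop the short tail."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()  # the final partial batch is dropped
    return batches

data = list(range(7))
print(batch(data, 3))                       # [[0, 1, 2], [3, 4, 5], [6]]
print(batch(data, 3, drop_remainder=True))  # [[0, 1, 2], [3, 4, 5]]
```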
split
split(
    fraction
)
Splits dataset into two sub-datasets with the given fraction.
Primarily used for splitting the data set into training and testing sets.
Args:
  fraction: Float, the fraction of the first returned sub-dataset in the
    original data.
Returns:
  The two split sub-datasets.
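The fraction semantics of split can be sketched on a plain list (illustrative, not Model Maker's code): the first sub-dataset receives fraction of the elements and the second receives the rest.

```python
def split(items, fraction):
    """Split items so the first part holds `fraction` of the elements."""
    cut = int(len(items) * fraction)
    return items[:cut], items[cut:]

# An 80/20 train/test split of ten elements.
train, test = split(list(range(10)), 0.8)
print(len(train), len(test))  # 8 2
```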
__len__
__len__()