This tutorial demonstrates two ways to load and preprocess text.
First, you will use Keras utilities and layers. If you are new to TensorFlow, you should start with these.
Next, you will use lower-level utilities like tf.data.TextLineDataset to load text files, and tf.text to preprocess the data for finer-grained control.
# Be sure you're using the stable versions of both tf and tf-text, for binary compatibility.
pip install -q -U tensorflow
pip install -q -U tensorflow-text
import collections
import pathlib
import re
import string
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import tensorflow_datasets as tfds
import tensorflow_text as tf_text
Example 1: Predict the tag for a Stack Overflow question
As a first example, you will download a dataset of programming questions from Stack Overflow. Each question ("How do I sort a dictionary by value?") is labeled with exactly one tag (Python, CSharp, JavaScript, or Java). Your task is to develop a model that predicts the tag for a question. This is an example of multi-class classification, an important and widely applicable kind of machine learning problem.
Download and explore the dataset
Next, you will download the dataset, and explore the directory structure.
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
dataset = utils.get_file(
'stack_overflow_16k.tar.gz',
data_url,
untar=True,
cache_dir='stack_overflow',
cache_subdir='')
dataset_dir = pathlib.Path(dataset).parent
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz 6053888/6053168 [==============================] - 0s 0us/step
list(dataset_dir.iterdir())
[PosixPath('/tmp/.keras/train'), PosixPath('/tmp/.keras/README.md'), PosixPath('/tmp/.keras/test'), PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz.tar.gz')]
train_dir = dataset_dir/'train'
list(train_dir.iterdir())
[PosixPath('/tmp/.keras/train/java'), PosixPath('/tmp/.keras/train/csharp'), PosixPath('/tmp/.keras/train/javascript'), PosixPath('/tmp/.keras/train/python')]
The train/csharp, train/java, train/python and train/javascript directories contain many text files, each of which is a Stack Overflow question. Print a file and inspect the data.
sample_file = train_dir/'python/1755.txt'
with open(sample_file) as f:
  print(f.read())
why does this blank program print true x=true.def stupid():. x=false.stupid().print x
Load the dataset
Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the text_dataset_from_directory utility to create a labeled tf.data.Dataset. If you're new to tf.data, it's a powerful collection of tools for building input pipelines.
The preprocessing.text_dataset_from_directory utility expects a directory structure as follows.
train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt
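To make the layout concrete, here is a minimal plain-Python sketch of how a directory-based loader could turn subdirectory names into integer labels. It is not the Keras implementation: the helper name infer_labels and the file contents are hypothetical, and the only behavior it mirrors is that class names are taken from the subdirectories in alphabetical order.

```python
import pathlib
import tempfile


def infer_labels(root):
    """Return (class_names, samples); samples is a list of (text, label_index)."""
    root = pathlib.Path(root)
    # One class per subdirectory, sorted so that label indices are stable.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    samples = []
    for index, name in enumerate(class_names):
        for path in sorted((root / name).glob("*.txt")):
            samples.append((path.read_text(), index))
    return class_names, samples


# Build a tiny stand-in for the train/ directory (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["python", "csharp", "javascript", "java"]:
        class_dir = pathlib.Path(tmp) / name
        class_dir.mkdir()
        (class_dir / "1.txt").write_text(f"a question about {name}")
    class_names, samples = infer_labels(tmp)

print(class_names)
print(samples[0])
```

Because the names are sorted, csharp maps to label 0 and python to label 3, regardless of the order in which the directories were created.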
When running a machine learning experiment, it is a best practice to divide your dataset into three splits: train, validation, and test. The Stack Overflow dataset has already been divided into train and test, but it lacks a validation set. Create a validation set using an 80:20 split of the training data by using the validation_split argument below.
batch_size = 32
seed = 42
raw_train_ds = preprocessing.text_dataset_from_directory(
train_dir,
batch_size=batch_size,
validation_split=0.2,
subset='training',
seed=seed)
Found 8000 files belonging to 4 classes. Using 6400 files for training.
As you can see above, there are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training. As you will see in a moment, you can train a model by passing a tf.data.Dataset directly to model.fit. First, iterate over the dataset and print out a few examples, to get a feel for the data.
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])
Question: b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon(). {. mynumsides = 5;. mysidelength = 30;. }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength). {. mynumsides = numsides;. mysidelength = sidelength;. }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);. shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default constructor, which therefor ruins the whole rest of the program. can somebody help me?..for those of you who want to see more of my code: here you go..public double vertexangle(). {. system.out.println(""the vertex angle method: "" + mynumsides);// prints out 5. system.out.println(""the vertex angle method: "" + mysidelength); // prints out 30.. double vertexangle;. vertexangle = ((mynumsides - 2.0) / mynumsides) * 180.0;. return vertexangle;. }//end method vertexangle..public void menu().{. system.out.println(mynumsides); // prints out what the user puts in. system.out.println(mysidelength); // prints out what the user puts in. gotographic();. calcr(mynumsides, mysidelength);. calcr(mynumsides, mysidelength);. print(); .}// end menu...this is my entire tester class:..public static void main(string[] arg).{. int numsides;. double sidelength;. scanner keyboard = new scanner(system.in);.. system.out.println(""welcome to the regular polygon program!"");. system.out.println();.. 
system.out.print(""enter the number of sides of the polygon ==> "");. numsides = keyboard.nextint();. system.out.println();.. system.out.print(""enter the side length of each side ==> "");. sidelength = keyboard.nextdouble();. system.out.println();.. regularpolygon shape = new regularpolygon(numsides, sidelength);. shape.menu();.}//end main...for testing it i sent it numsides 4 and sidelength 100."\n' Label: 1 Question: b'"blank code slow skin detection this code changes the color space to lab and using a threshold finds the skin area of an image. but it\'s ridiculously slow. i don\'t know how to make it faster ? ..from colormath.color_objects import *..def skindetection(img, treshold=80, color=[255,20,147]):.. print img.shape. res=img.copy(). for x in range(img.shape[0]):. for y in range(img.shape[1]):. rgbimg=rgbcolor(img[x,y,0],img[x,y,1],img[x,y,2]). labimg=rgbimg.convert_to(\'lab\', debug=false). if (labimg.lab_l > treshold):. res[x,y,:]=color. else: . res[x,y,:]=img[x,y,:].. return res"\n' Label: 3 Question: b'"option and validation in blank i want to add a new option on my system where i want to add two text files, both rental.txt and customer.txt. inside each text are id numbers of the customer, the videotape they need and the price...i want to place it as an option on my code. right now i have:...add customer.rent return.view list.search.exit...i want to add this as my sixth option. say for example i ordered a video, it would display the price and would let me confirm the price and if i am going to buy it or not...here is my current code:.. import blank.io.*;. import blank.util.arraylist;. import static blank.lang.system.out;.. public class rentalsystem{. static bufferedreader input = new bufferedreader(new inputstreamreader(system.in));. static file file = new file(""file.txt"");. static arraylist<string> list = new arraylist<string>();. static int rows;.. public static void main(string[] args) throws exception{. introduction();. 
system.out.print(""nn"");. login();. system.out.print(""nnnnnnnnnnnnnnnnnnnnnn"");. introduction();. string repeat;. do{. loadfile();. system.out.print(""nwhat do you want to do?nn"");. system.out.print(""n - - - - - - - - - - - - - - - - - - - - - - -"");. system.out.print(""nn | 1. add customer | 2. rent return |n"");. system.out.print(""n - - - - - - - - - - - - - - - - - - - - - - -"");. system.out.print(""nn | 3. view list | 4. search |n"");. system.out.print(""n - - - - - - - - - - - - - - - - - - - - - - -"");. system.out.print(""nn | 5. exit |n"");. system.out.print(""n - - - - - - - - - -"");. system.out.print(""nnchoice:"");. int choice = integer.parseint(input.readline());. switch(choice){. case 1:. writedata();. break;. case 2:. rentdata();. break;. case 3:. viewlist();. break;. case 4:. search();. break;. case 5:. system.out.println(""goodbye!"");. system.exit(0);. default:. system.out.print(""invalid choice: "");. break;. }. system.out.print(""ndo another task? [y/n] "");. repeat = input.readline();. }while(repeat.equals(""y""));.. if(repeat!=""y"") system.out.println(""ngoodbye!"");.. }.. public static void writedata() throws exception{. system.out.print(""nname: "");. string cname = input.readline();. system.out.print(""address: "");. string add = input.readline();. system.out.print(""phone no.: "");. string pno = input.readline();. system.out.print(""rental amount: "");. string ramount = input.readline();. system.out.print(""tapenumber: "");. string tno = input.readline();. system.out.print(""title: "");. string title = input.readline();. system.out.print(""date borrowed: "");. string dborrowed = input.readline();. system.out.print(""due date: "");. string ddate = input.readline();. createline(cname, add, pno, ramount,tno, title, dborrowed, ddate);. rentdata();. }.. public static void createline(string name, string address, string phone , string rental, string tapenumber, string title, string borrowed, string due) throws exception{. 
filewriter fw = new filewriter(file, true);. fw.write(""nname: ""+name + ""naddress: "" + address +""nphone no.: ""+ phone+""nrentalamount: ""+rental+""ntape no.: ""+ tapenumber+""ntitle: ""+ title+""ndate borrowed: ""+borrowed +""ndue date: ""+ due+"":rn"");. fw.close();. }.. public static void loadfile() throws exception{. try{. list.clear();. fileinputstream fstream = new fileinputstream(file);. bufferedreader br = new bufferedreader(new inputstreamreader(fstream));. rows = 0;. while( br.ready()). {. list.add(br.readline());. rows++;. }. br.close();. } catch(exception e){. system.out.println(""list not yet loaded."");. }. }.. public static void viewlist(){. system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");. system.out.print("" |list of all costumers|"");. system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");. for(int i = 0; i <rows; i++){. system.out.println(list.get(i));. }. }. public static void rentdata()throws exception. { system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");. system.out.print("" |rent data list|"");. system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");. system.out.print(""nenter customer name: "");. string cname = input.readline();. system.out.print(""date borrowed: "");. string dborrowed = input.readline();. system.out.print(""due date: "");. string ddate = input.readline();. system.out.print(""return date: "");. string rdate = input.readline();. system.out.print(""rent amount: "");. string ramount = input.readline();.. system.out.print(""you pay:""+ramount);... }. public static void search()throws exception. { system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");. system.out.print("" |search costumers|"");. system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");. system.out.print(""nenter costumer name: "");. string cname = input.readline();. boolean found = false;.. for(int i=0; i < rows; i++){. string temp[] = list.get(i).split("","");.. if(cname.equals(temp[0])){. 
system.out.println(""search result:nyou are "" + temp[0] + "" from "" + temp[1] + "".""+ temp[2] + "".""+ temp[3] + "".""+ temp[4] + "".""+ temp[5] + "" is "" + temp[6] + "".""+ temp[7] + "" is "" + temp[8] + ""."");. found = true;. }. }.. if(!found){. system.out.print(""no results."");. }.. }.. public static boolean evaluate(string uname, string pass){. if (uname.equals(""admin"")&&pass.equals(""12345"")) return true;. else return false;. }.. public static string login()throws exception{. bufferedreader input=new bufferedreader(new inputstreamreader(system.in));. int counter=0;. do{. system.out.print(""username:"");. string uname =input.readline();. system.out.print(""password:"");. string pass =input.readline();.. boolean accept= evaluate(uname,pass);.. if(accept){. break;. }else{. system.out.println(""incorrect username or password!"");. counter ++;. }. }while(counter<3);.. if(counter !=3) return ""login successful"";. else return ""login failed"";. }. public static void introduction() throws exception{.. system.out.println("" - - - - - - - - - - - - - - - - - - - - - - - - -"");. system.out.println("" ! r e n t a l !"");. system.out.println("" ! ~ ~ ~ ~ ~ ! ================= ! ~ ~ ~ ~ ~ !"");. system.out.println("" ! s y s t e m !"");. system.out.println("" - - - - - - - - - - - - - - - - - - - - - - - - -"");. }..}"\n' Label: 1 Question: b'"exception: dynamic sql generation for the updatecommand is not supported against a selectcommand that does not return any key i dont know what is the problem this my code : ..string nomtable;..datatable listeetablissementtable = new datatable();.datatable listeinteretstable = new datatable();.dataset ds = new dataset();.sqldataadapter da;.sqlcommandbuilder cmdb;..private void listeinterets_click(object sender, eventargs e).{. nomtable = ""listeinteretstable"";. d.cnx.open();. da = new sqldataadapter(""select nome from offices"", d.cnx);. ds = new dataset();. da.fill(ds, nomtable);. 
datagridview1.datasource = ds.tables[nomtable];.}..private void sauvgarder_click(object sender, eventargs e).{. d.cnx.open();. cmdb = new sqlcommandbuilder(da);. da.update(ds, nomtable);. d.cnx.close();.}"\n' Label: 0 Question: b'"parameter with question mark and super in blank, i\'ve come across a method that is formatted like this:..public final subscription subscribe(final action1<? super t> onnext, final action1<throwable> onerror) {.}...in the first parameter, what does the question mark and super mean?"\n' Label: 1 Question: b'call two objects wsdl the first time i got a very strange wsdl. ..i would like to call the object (interface - invoicecheck_out) do you know how?....i would like to call the object (variable) do you know how?..try to call (it`s ok)....try to call (how call this?)\n' Label: 0 Question: b"how to correctly make the icon for systemtray in blank using icon sizes of any dimension for systemtray doesn't look good overall. .what is the correct way of making icons for windows system tray?..screenshots: http://imgur.com/zsibwn9..icon: http://imgur.com/vsh4zo8\n" Label: 0 Question: b'"is there a way to check a variable that exists in a different script than the original one? i\'m trying to check if a variable, which was previously set to true in 2.py in 1.py, as 1.py is only supposed to continue if the variable is true...2.py..import os..completed = false..#some stuff here..completed = true...1.py..import 2 ..if completed == true. #do things...however i get a syntax error at ..if completed == true"\n' Label: 3 Question: b'"blank control flow i made a number which asks for 2 numbers with blank and responds with the corresponding message for the case. how come it doesnt work for the second number ? .regardless what i enter for the second number , i am getting the message ""your number is in the range 0-10""...using system;.using system.collections.generic;.using system.linq;.using system.text;..namespace consoleapplication1.{. class program. {. 
static void main(string[] args). {. string myinput; // declaring the type of the variables. int myint;.. string number1;. int number;... console.writeline(""enter a number"");. myinput = console.readline(); //muyinput is a string which is entry input. myint = int32.parse(myinput); // myint converts the string into an integer.. if (myint > 0). console.writeline(""your number {0} is greater than zero."", myint);. else if (myint < 0). console.writeline(""your number {0} is less than zero."", myint);. else. console.writeline(""your number {0} is equal zero."", myint);.. console.writeline(""enter another number"");. number1 = console.readline(); . number = int32.parse(myinput); .. if (number < 0 || number == 0). console.writeline(""your number {0} is less than zero or equal zero."", number);. else if (number > 0 && number <= 10). console.writeline(""your number {0} is in the range from 0 to 10."", number);. else. console.writeline(""your number {0} is greater than 10."", number);.. console.writeline(""enter another number"");.. }. } .}"\n' Label: 0 Question: b'"credentials cannot be used for ntlm authentication i am getting org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials cannot be used for ntlm authentication: exception in eclipse..whether it is possible mention eclipse to take system proxy settings directly?..public class httpgetproxy {. private static final string proxy_host = ""proxy.****.com"";. private static final int proxy_port = 6050;.. public static void main(string[] args) {. httpclient client = new httpclient();. httpmethod method = new getmethod(""https://kodeblank.org"");.. hostconfiguration config = client.gethostconfiguration();. config.setproxy(proxy_host, proxy_port);.. string username = ""*****"";. string password = ""*****"";. credentials credentials = new usernamepasswordcredentials(username, password);. authscope authscope = new authscope(proxy_host, proxy_port);.. 
client.getstate().setproxycredentials(authscope, credentials);.. try {. client.executemethod(method);.. if (method.getstatuscode() == httpstatus.sc_ok) {. string response = method.getresponsebodyasstring();. system.out.println(""response = "" + response);. }. } catch (ioexception e) {. e.printstacktrace();. } finally {. method.releaseconnection();. }. }.}...exception:... dec 08, 2017 1:41:39 pm . org.apache.commons.httpclient.auth.authchallengeprocessor selectauthscheme. info: ntlm authentication scheme selected. dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector executeconnect. severe: credentials cannot be used for ntlm authentication: . org.apache.commons.httpclient.usernamepasswordcredentials. org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials . cannot be used for ntlm authentication: . enter code here . org.apache.commons.httpclient.usernamepasswordcredentials. at org.apache.commons.httpclient.auth.ntlmscheme.authenticate(ntlmscheme.blank:332). at org.apache.commons.httpclient.httpmethoddirector.authenticateproxy(httpmethoddirector.blank:320). at org.apache.commons.httpclient.httpmethoddirector.executeconnect(httpmethoddirector.blank:491). at org.apache.commons.httpclient.httpmethoddirector.executewithretry(httpmethoddirector.blank:391). at org.apache.commons.httpclient.httpmethoddirector.executemethod(httpmethoddirector.blank:171). at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:397). at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:323). at httpgetproxy.main(httpgetproxy.blank:31). dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector processproxyauthchallenge. info: failure authenticating with ntlm @proxy.****.com:6050"\n' Label: 1
The labels are 0, 1, 2 or 3. To see which of these correspond to which string label, you can check the class_names property on the dataset.
for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)
Label 0 corresponds to csharp Label 1 corresponds to java Label 2 corresponds to javascript Label 3 corresponds to python
Next, you will create validation and test datasets. You will use the remaining 1,600 questions from the training set for validation.
raw_val_ds = preprocessing.text_dataset_from_directory(
train_dir,
batch_size=batch_size,
validation_split=0.2,
subset='validation',
seed=seed)
Found 8000 files belonging to 4 classes. Using 1600 files for validation.
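The training and validation subsets are created by two separate calls that share the same validation_split and seed, which is what keeps them from overlapping. The following plain-Python sketch illustrates the idea with a seeded shuffle of example indices. The helper split_indices is hypothetical, and the actual shuffling inside Keras may differ in its details.

```python
import random


def split_indices(num_examples, validation_split, seed):
    """Shuffle indices with a fixed seed and reserve the tail for validation."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_val = int(num_examples * validation_split)
    split_at = num_examples - num_val
    return indices[:split_at], indices[split_at:]


train_idx, val_idx = split_indices(8000, validation_split=0.2, seed=42)
print(len(train_idx), len(val_idx))

# Calling again with the same seed reproduces the same partition.
train_again, val_again = split_indices(8000, validation_split=0.2, seed=42)
print(train_idx == train_again and val_idx == val_again)
```

With a different seed in the second call, the two subsets would be cut from differently shuffled lists and could share examples, leaking training data into validation.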
test_dir = dataset_dir/'test'
raw_test_ds = preprocessing.text_dataset_from_directory(
test_dir, batch_size=batch_size)
Found 8000 files belonging to 4 classes.
Prepare the dataset for training
Next, you will standardize, tokenize, and vectorize the data using the preprocessing.TextVectorization layer.
Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.
Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).
Vectorization refers to converting tokens into numbers so they can be fed into a neural network.
All of these tasks can be accomplished with this layer. You can learn more about each of these in the API doc.
The default standardization converts text to lowercase and removes punctuation.
The default tokenizer splits on whitespace.
The default vectorization mode is int. This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes, like binary, to build bag-of-words models.
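As a rough illustration of the default standardization and tokenization described above, here is a plain-Python sketch. It only approximates the behavior (lowercasing, stripping ASCII punctuation, splitting on whitespace) and is not the layer's own code. The function names are hypothetical.

```python
import re
import string


def standardize(text):
    """Lowercase the text and remove ASCII punctuation characters."""
    text = text.lower()
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text)


def tokenize(text):
    """Split the text on runs of whitespace."""
    return text.split()


tokens = tokenize(standardize("How do I sort a dictionary by value?"))
print(tokens)
# ['how', 'do', 'i', 'sort', 'a', 'dictionary', 'by', 'value']
```

Note that stripping punctuation also removes characters such as braces and semicolons, which may carry signal when the text is source code.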
You will build two models to learn more about these modes. First, you will use the binary mode to build a bag-of-words model. Next, you will use the int mode with a 1D ConvNet.
VOCAB_SIZE = 10000
binary_vectorize_layer = TextVectorization(
max_tokens=VOCAB_SIZE,
output_mode='binary')
For int mode, in addition to maximum vocabulary size, you need to set an explicit maximum sequence length, which will cause the layer to pad or truncate sequences to exactly output_sequence_length values.
MAX_SEQUENCE_LENGTH = 250
int_vectorize_layer = TextVectorization(
max_tokens=VOCAB_SIZE,
output_mode='int',
output_sequence_length=MAX_SEQUENCE_LENGTH)
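The effect of a fixed output length can be sketched in a few lines of plain Python. The helper pad_or_truncate is hypothetical and the token ids are made up. The sketch assumes that padding uses id 0 and is appended at the end, which matches the trailing zeros in the vectorized output shown later in this tutorial.

```python
def pad_or_truncate(token_ids, sequence_length, pad_id=0):
    """Return exactly `sequence_length` ids: cut long inputs, pad short ones."""
    clipped = token_ids[:sequence_length]
    return clipped + [pad_id] * (sequence_length - len(clipped))


short = pad_or_truncate([38, 450, 65], sequence_length=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], sequence_length=5)
print(short)  # [38, 450, 65, 0, 0]
print(long)   # [1, 2, 3, 4, 5]
```

Every example then has the same length, so a batch can be stacked into a single rectangular tensor.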
Next, you will call adapt to fit the state of the preprocessing layer to the dataset. This will cause the layer to build an index of strings to integers.
# Make a text-only dataset (without labels), then call adapt
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)
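Conceptually, adapting an int mode layer amounts to counting tokens and keeping the most frequent ones. The sketch below shows that idea in plain Python. It assumes index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, and that both count toward max_tokens. The helper build_vocabulary and the sample texts are hypothetical.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Return a token list ordered by frequency, after two reserved entries."""
    counts = Counter(token for text in texts for token in text.lower().split())
    # Assumed layout: "" (padding) at index 0, "[UNK]" at index 1.
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


texts = ["print the list", "sort the list", "the end"]
vocabulary = build_vocabulary(texts, max_tokens=4)
token_to_id = {token: index for index, token in enumerate(vocabulary)}

# Tokens outside the vocabulary map to the [UNK] index.
encoded = [token_to_id.get(token, 1) for token in "sort the list".split()]
print(vocabulary)  # ['', '[UNK]', 'the', 'list']
print(encoded)     # [1, 2, 3]
```

Because the vocabulary is built from the training text only, rare or unseen words in the validation and test sets all collapse to the same out-of-vocabulary index.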
See the result of using these layers to preprocess data:
def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label
def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label
# Retrieve a batch (of 32 questions and labels) from the dataset
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)
Question tf.Tensor(b'"function expected error in blank for dynamically created check box when it is clicked i want to grab the attribute value.it is working in ie 8,9,10 but not working in ie 11,chrome shows function expected error..<input type=checkbox checked=\'checked\' id=\'symptomfailurecodeid\' tabindex=\'54\' style=\'cursor:pointer;\' onclick=chkclickevt(this); failurecodeid=""1"" >...function chkclickevt(obj) { . alert(obj.attributes(""failurecodeid""));.}"\n', shape=(), dtype=string) Label tf.Tensor(2, shape=(), dtype=int32)
print("'binary' vectorized question:",
binary_vectorize_text(first_question, first_label)[0])
'binary' vectorized question: tf.Tensor([[1. 1. 1. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)
print("'int' vectorized question:",
int_vectorize_text(first_question, first_label)[0])
'int' vectorized question: tf.Tensor( [[ 38 450 65 7 16 12 892 265 186 451 44 11 6 685 3 46 4 2062 2 485 1 6 158 7 479 1 26 20 158 7 479 1 502 38 450 1 1767 1763 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]], shape=(1, 250), dtype=int64)
As you can see above, binary mode returns an array denoting which tokens exist at least once in the input, while int mode replaces each token by an integer, thus preserving their order. You can look up the token (string) that each integer corresponds to by calling .get_vocabulary() on the layer.
print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))
1289 ---> roman 313 ---> source Vocabulary size: 10000
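The difference between the two modes is easiest to see on a toy example. The sketch below uses a small hypothetical vocabulary shared by both encoders (the real layers keep their own vocabularies and index layouts) and shows that the binary encoding discards word order while the int encoding keeps it.

```python
# Hypothetical toy vocabulary; index 0 stands in for unknown tokens.
vocabulary = ["[UNK]", "the", "list", "sort", "print"]
token_to_id = {token: index for index, token in enumerate(vocabulary)}


def encode_int(tokens):
    """One integer per token, in the order the tokens appear."""
    return [token_to_id.get(token, 0) for token in tokens]


def encode_binary(tokens):
    """One slot per vocabulary entry, set to 1.0 if the token occurs at all."""
    vector = [0.0] * len(vocabulary)
    for token in tokens:
        vector[token_to_id.get(token, 0)] = 1.0
    return vector


first = "sort the list".split()
second = "list the sort".split()

print(encode_int(first), encode_int(second))
print(encode_binary(first) == encode_binary(second))
```

The two sentences get different int encodings but identical binary encodings, which is why a bag-of-words model cannot distinguish them.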
You are nearly ready to train your model. As a final preprocessing step, you will apply the TextVectorization layers you created earlier to the train, validation, and test datasets.
binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)
int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)
Configure the dataset for performance
These are two important methods you should use when loading data to make sure that I/O does not become blocking.
.cache() keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
.prefetch() overlaps data preprocessing and model execution while training.
You can learn more about both methods, as well as how to cache data to disk in the data performance guide.
AUTOTUNE = tf.data.AUTOTUNE
def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)
int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)
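To give an intuition for what caching and prefetching buy you, here is a plain-Python sketch using a background thread and a bounded queue. It is only an analogy for the tf.data behavior, not how tf.data is implemented. The names Cached, prefetch, and load_batches are hypothetical.

```python
import queue
import threading


class Cached:
    """Run the expensive loader once, then replay the stored items."""

    def __init__(self, make_iterator):
        self._make_iterator = make_iterator
        self._items = None

    def __iter__(self):
        if self._items is None:
            self._items = list(self._make_iterator())
        return iter(self._items)


def prefetch(iterable, buffer_size):
    """Yield items while a background thread prepares up to buffer_size more."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # Sentinel marking the end of the stream.

    def producer():
        for item in iterable:
            buffer.put(item)
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


load_calls = {"count": 0}


def load_batches():
    """Stand-in for an expensive read-and-preprocess step."""
    load_calls["count"] += 1
    for i in range(5):
        yield i * i


dataset = Cached(load_batches)
epoch_1 = list(prefetch(dataset, buffer_size=2))
epoch_2 = list(prefetch(dataset, buffer_size=2))
print(epoch_1, epoch_2, load_calls["count"])
```

Both epochs see the same items, but the loader runs only once, and the consumer never waits for more than the time it takes to refill the small buffer.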
Train the model
It's time to create your neural network. For the binary vectorized data, train a simple bag-of-words linear model:
binary_model = tf.keras.Sequential([layers.Dense(4)])
binary_model.compile(
loss=losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer='adam',
metrics=['accuracy'])
history = binary_model.fit(
binary_train_ds, validation_data=binary_val_ds, epochs=10)
Epoch 1/10 200/200 [==============================] - 2s 9ms/step - loss: 1.2359 - accuracy: 0.5427 - val_loss: 0.9108 - val_accuracy: 0.7744 Epoch 2/10 200/200 [==============================] - 1s 3ms/step - loss: 0.8149 - accuracy: 0.8277 - val_loss: 0.7481 - val_accuracy: 0.8031 Epoch 3/10 200/200 [==============================] - 1s 3ms/step - loss: 0.6482 - accuracy: 0.8616 - val_loss: 0.6631 - val_accuracy: 0.8125 Epoch 4/10 200/200 [==============================] - 1s 3ms/step - loss: 0.5492 - accuracy: 0.8832 - val_loss: 0.6100 - val_accuracy: 0.8225 Epoch 5/10 200/200 [==============================] - 1s 3ms/step - loss: 0.4805 - accuracy: 0.9055 - val_loss: 0.5735 - val_accuracy: 0.8294 Epoch 6/10 200/200 [==============================] - 1s 3ms/step - loss: 0.4287 - accuracy: 0.9177 - val_loss: 0.5470 - val_accuracy: 0.8369 Epoch 7/10 200/200 [==============================] - 1s 3ms/step - loss: 0.3876 - accuracy: 0.9286 - val_loss: 0.5270 - val_accuracy: 0.8363 Epoch 8/10 200/200 [==============================] - 1s 3ms/step - loss: 0.3537 - accuracy: 0.9332 - val_loss: 0.5115 - val_accuracy: 0.8394 Epoch 9/10 200/200 [==============================] - 1s 3ms/step - loss: 0.3250 - accuracy: 0.9396 - val_loss: 0.4993 - val_accuracy: 0.8419 Epoch 10/10 200/200 [==============================] - 1s 3ms/step - loss: 0.3003 - accuracy: 0.9479 - val_loss: 0.4896 - val_accuracy: 0.8438
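The loss used above takes the raw scores (logits) from the Dense layer together with an integer label. As a worked example of the underlying formula, the sketch below computes the softmax cross-entropy for a single example in plain Python. The function name and the logit values are hypothetical, and Keras additionally averages this quantity over the batch.

```python
import math


def sparse_crossentropy_from_logits(logits, label):
    """Return -log(softmax(logits)[label]) using a stable log-sum-exp."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[label]


# An untrained model that scores all four tags equally has loss log(4).
uniform_loss = sparse_crossentropy_from_logits([0.0, 0.0, 0.0, 0.0], label=3)
print(round(uniform_loss, 4))  # 1.3863

# A confident, correct prediction has a much smaller loss.
confident_loss = sparse_crossentropy_from_logits([0.0, 5.0, 0.0, 0.0], label=1)
print(confident_loss < uniform_loss)  # True
```

The value log(4) ≈ 1.386 is a useful reference point: a four-class model whose loss stays near it is doing no better than guessing.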
Next, you will use the int vectorized data to build a 1D ConvNet.
def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model
# vocab_size is VOCAB_SIZE + 1 since 0 is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
loss=losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer='adam',
metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)
Epoch 1/5 200/200 [==============================] - 4s 8ms/step - loss: 1.3016 - accuracy: 0.3903 - val_loss: 0.7395 - val_accuracy: 0.6950 Epoch 2/5 200/200 [==============================] - 1s 6ms/step - loss: 0.6901 - accuracy: 0.7170 - val_loss: 0.5435 - val_accuracy: 0.7906 Epoch 3/5 200/200 [==============================] - 1s 6ms/step - loss: 0.4277 - accuracy: 0.8562 - val_loss: 0.4766 - val_accuracy: 0.8194 Epoch 4/5 200/200 [==============================] - 1s 6ms/step - loss: 0.2419 - accuracy: 0.9402 - val_loss: 0.4701 - val_accuracy: 0.8188 Epoch 5/5 200/200 [==============================] - 1s 6ms/step - loss: 0.1218 - accuracy: 0.9767 - val_loss: 0.4932 - val_accuracy: 0.8163
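For reference, the number of time steps that the Conv1D layer produces follows from the standard formula for a convolution with "valid" padding. The short sketch below applies it to this model's settings (sequence length 250, kernel size 5, stride 2). The helper name is hypothetical.

```python
def conv1d_output_length(input_length, kernel_size, strides):
    """Number of windows that fit without padding ("valid" padding)."""
    return (input_length - kernel_size) // strides + 1


steps = conv1d_output_length(input_length=250, kernel_size=5, strides=2)
print(steps)  # 123
```

GlobalMaxPooling1D then reduces those 123 steps to a single 64-dimensional vector by taking the maximum of each filter over time.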
Compare the two models:
print("Linear model on binary vectorized data:")
print(binary_model.summary())
Linear model on binary vectorized data: Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 4) 40004 ================================================================= Total params: 40,004 Trainable params: 40,004 Non-trainable params: 0 _________________________________________________________________ None
print("ConvNet model on int vectorized data:")
print(int_model.summary())
ConvNet model on int vectorized data: Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, None, 64) 640064 _________________________________________________________________ conv1d (Conv1D) (None, None, 64) 20544 _________________________________________________________________ global_max_pooling1d (Global (None, 64) 0 _________________________________________________________________ dense_1 (Dense) (None, 4) 260 ================================================================= Total params: 660,868 Trainable params: 660,868 Non-trainable params: 0 _________________________________________________________________ None
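The parameter counts in the two summaries can be checked by hand. The sketch below reproduces them with the usual formulas for dense, embedding, and 1D convolution layers. The helper names are hypothetical, and the layer sizes are taken from the code above.

```python
def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units


def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    """One kernel per filter plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


linear_total = dense_params(10000, 4)

conv_total = (
    embedding_params(10001, 64)   # VOCAB_SIZE + 1 rows
    + conv1d_params(5, 64, 64)
    + dense_params(64, 4)
)

print(linear_total)  # 40004
print(conv_total)    # 660868
```

Almost all of the ConvNet's parameters sit in the embedding table, so the vocabulary size is the main lever on model size.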
Evaluate both models on the test data:
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)
print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))
250/250 [==============================] - 1s 5ms/step - loss: 0.5166 - accuracy: 0.8139 250/250 [==============================] - 1s 4ms/step - loss: 0.5116 - accuracy: 0.8117 Binary model accuracy: 81.39% Int model accuracy: 81.17%
Export the model
In the code above, you applied the TextVectorization layer to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the TextVectorization layer inside your model. To do so, you can create a new model using the weights you just trained.
export_model = tf.keras.Sequential(
[binary_vectorize_layer, binary_model,
layers.Activation('sigmoid')])
export_model.compile(
loss=losses.SparseCategoricalCrossentropy(from_logits=False),
optimizer='adam',
metrics=['accuracy'])
# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))
250/250 [==============================] - 2s 5ms/step - loss: 0.5187 - accuracy: 0.8138 Accuracy: 81.38%
Now your model can take raw strings as input and predict a score for each label using model.predict
. Define a function to find the label with the maximum score:
def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels
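As a minimal plain-Python sketch of what this function computes (no TensorFlow involved): for each row of scores, take the index of the largest score and look it up in a list of class names. The `class_names` list, the helper name `argmax_labels`, and the score values are made up for illustration; in the tutorial the names come from `raw_train_ds.class_names`.

```python
# Hypothetical class names and scores, for illustration only.
class_names = ["csharp", "java", "javascript", "python"]


def argmax_labels(scores_batch, names):
    """Return the name with the highest score for each row of scores."""
    labels = []
    for scores in scores_batch:
        # Index of the largest score in this row (the "argmax").
        best_index = max(range(len(scores)), key=lambda i: scores[i])
        labels.append(names[best_index])
    return labels


scores_batch = [
    [0.10, 0.20, 0.05, 0.90],  # largest score at index 3
    [0.30, 0.80, 0.40, 0.10],  # largest score at index 1
]
print(argmax_labels(scores_batch, class_names))
```

Only the position of the largest score matters here, so the result is the same whether the scores are raw logits or have been passed through a monotonic activation first.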
Run inference on new data
inputs = [
"how do I extract keys from a dict into a list?", # python
"debug public static void main(string[] args) {...}", # java
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())
Question: how do I extract keys from a dict into a list? Predicted label: b'python' Question: debug public static void main(string[] args) {...} Predicted label: b'java'
Including the text preprocessing logic inside your model enables you to export a model for production that simplifies deployment, and reduces the potential for train/test skew.
There is a performance difference to keep in mind when choosing where to apply your TextVectorization
layer. Using it outside of your model enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So, if you're training your model on the GPU, you probably want to go with this option to get the best performance while developing your model, then switch to including the TextVectorization layer inside your model when you're ready to prepare for deployment.
Visit this tutorial to learn more about saving models.
Example 2: Predict the author of Iliad translations
The following provides an example of using tf.data.TextLineDataset
to load examples from text files, and tf.text
to preprocess the data. In this example, you will use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.
Download and explore the dataset
The texts of the three translations are by William Cowper, Edward, Earl of Derby, and Samuel Butler.
The text files used in this tutorial have undergone some typical preprocessing tasks like removing document header and footer, line numbers and chapter titles. Download these lightly munged files locally.
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']
for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)
parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt 819200/815980 [==============================] - 0s 0us/step Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt 811008/809730 [==============================] - 0s 0us/step Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt 811008/807992 [==============================] - 0s 0us/step [PosixPath('/home/kbuilder/.keras/datasets/Giant Panda'), PosixPath('/home/kbuilder/.keras/datasets/derby.txt'), PosixPath('/home/kbuilder/.keras/datasets/flower_photos.tar.gz'), PosixPath('/home/kbuilder/.keras/datasets/spa-eng'), PosixPath('/home/kbuilder/.keras/datasets/heart.csv'), PosixPath('/home/kbuilder/.keras/datasets/iris_test.csv'), PosixPath('/home/kbuilder/.keras/datasets/train.csv'), PosixPath('/home/kbuilder/.keras/datasets/butler.txt'), PosixPath('/home/kbuilder/.keras/datasets/flower_photos'), PosixPath('/home/kbuilder/.keras/datasets/image.jpg'), PosixPath('/home/kbuilder/.keras/datasets/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg'), PosixPath('/home/kbuilder/.keras/datasets/shakespeare.txt'), PosixPath('/home/kbuilder/.keras/datasets/Fireboat'), PosixPath('/home/kbuilder/.keras/datasets/iris_training.csv'), PosixPath('/home/kbuilder/.keras/datasets/cowper.txt'), PosixPath('/home/kbuilder/.keras/datasets/320px-Felis_catus-cat_on_snow.jpg'), PosixPath('/home/kbuilder/.keras/datasets/jena_climate_2009_2016.csv.zip'), PosixPath('/home/kbuilder/.keras/datasets/fashion-mnist'), PosixPath('/home/kbuilder/.keras/datasets/ImageNetLabels.txt'), PosixPath('/home/kbuilder/.keras/datasets/mnist.npz'), PosixPath('/home/kbuilder/.keras/datasets/jena_climate_2009_2016.csv'), PosixPath('/home/kbuilder/.keras/datasets/spa-eng.zip')]
Load the dataset
You will use TextLineDataset
, which is designed to create a tf.data.Dataset
from a text file in which each example is a line of text from the original file, whereas text_dataset_from_directory
treats all contents of a file as a single example. TextLineDataset
is useful for text data that is primarily line-based (for example, poetry or error logs).
Iterate through these files, loading each one into its own dataset. Each example needs to be individually labeled, so use tf.data.Dataset.map
to apply a labeler function to each one. This will iterate over every example in the dataset, returning (example, label
) pairs.
def labeler(example, index):
  return example, tf.cast(index, tf.int64)
labeled_data_sets = []
for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)
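The labeling step can be pictured without TensorFlow. The following plain-Python sketch treats each line of a text as one example and pairs it with the index of the file it came from. The in-memory `files` dictionary and the helper name `label_lines` are hypothetical stand-ins, not part of the tutorial's code.

```python
# Hypothetical in-memory stand-ins for the downloaded text files.
files = {
    "cowper.txt": "line one\nline two\n",
    "derby.txt": "another line\n",
}


def label_lines(named_texts):
    """Yield (line, label) pairs, where label is the index of the source file."""
    for index, text in enumerate(named_texts.values()):
        for line in text.splitlines():
            yield line, index


pairs = list(label_lines(files))
print(pairs)
```

Because the label is just the position of the file in the list, the order of `FILE_NAMES` determines which integer stands for which translator.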
Next, you'll combine these labeled datasets into a single dataset, and shuffle it.
BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)
all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
Print out a few examples as before. The dataset hasn't been batched yet, hence each entry in all_labeled_data
corresponds to one data point:
for text, label in all_labeled_data.take(10):
  print("Sentence: ", text.numpy())
  print("Label:", label.numpy())
Sentence: b'To chariot driven, thou maim thyself and me.' Label: 0 Sentence: b'On choicest marrow, and the fat of lambs;' Label: 1 Sentence: b'And through the gorgeous breastplate, and within' Label: 1 Sentence: b'To visit there the parent of the Gods' Label: 0 Sentence: b'For safe escape from danger and from death.' Label: 0 Sentence: b'Achilles, ye at least the fight decline' Label: 0 Sentence: b"Which done, Achilles portion'd out to each" Label: 0 Sentence: b'Whom therefore thou devourest; else themselves' Label: 0 Sentence: b'Drove them afar into the host of Greece.' Label: 0 Sentence: b"Their succour; then I warn thee, while 'tis time," Label: 1
Prepare the dataset for training
Instead of using the Keras TextVectorization
layer to preprocess our text dataset, you will now use the tf.text
API to standardize and tokenize the data, build a vocabulary and use StaticVocabularyTable
to map tokens to integers to feed to the model.
While tf.text provides various tokenizers, you will use the UnicodeScriptTokenizer
to tokenize our dataset. Define a function to convert the text to lower-case and tokenize it. You will use tf.data.Dataset.map
to apply the tokenization to the dataset.
tokenizer = tf_text.UnicodeScriptTokenizer()
def tokenize(text, unused_label):
  lower_case = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)
tokenized_ds = all_labeled_data.map(tokenize)
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py:201: batch_gather (from tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25. Instructions for updating: `tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.
You can iterate over the dataset and print out a few tokenized examples.
for text_batch in tokenized_ds.take(5):
  print("Tokens: ", text_batch.numpy())
Tokens: [b'to' b'chariot' b'driven' b',' b'thou' b'maim' b'thyself' b'and' b'me' b'.'] Tokens: [b'on' b'choicest' b'marrow' b',' b'and' b'the' b'fat' b'of' b'lambs' b';'] Tokens: [b'and' b'through' b'the' b'gorgeous' b'breastplate' b',' b'and' b'within'] Tokens: [b'to' b'visit' b'there' b'the' b'parent' b'of' b'the' b'gods'] Tokens: [b'for' b'safe' b'escape' b'from' b'danger' b'and' b'from' b'death' b'.']
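To get a feel for this kind of tokenization without TensorFlow, here is a rough plain-Python approximation: it case-folds the text and splits it into runs of word characters and runs of punctuation. This is only a sketch under that assumption. It is not the algorithm used by `UnicodeScriptTokenizer`, and the name `simple_tokenize` is hypothetical.

```python
import re

# Rough approximation only: runs of word characters, or runs of punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_tokenize(text):
    """Case-fold the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.casefold())


print(simple_tokenize("To chariot driven, thou maim thyself and me."))
```

On this sentence the sketch separates the comma and the full stop into their own tokens, as in the first tokenized example printed above.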
Next, you will build a vocabulary by sorting tokens by frequency and keeping the top VOCAB_SIZE
tokens.
tokenized_ds = configure_dataset(tokenized_ds)
vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok] += 1
vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries:", vocab[:5])
Vocab size: 10000 First five vocab entries: [b',', b'the', b'and', b"'", b'of']
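The same count, sort, and truncate idea can be sketched with `collections.Counter` from the standard library. The helper name `build_vocab` and the toy token lists are illustrative assumptions, not part of the tutorial.

```python
import collections


def build_vocab(tokenized_lines, max_size):
    """Return up to max_size tokens, most frequent first."""
    counts = collections.Counter()
    for tokens in tokenized_lines:
        counts.update(tokens)
    return [token for token, _ in counts.most_common(max_size)]


# Toy data, for illustration only.
lines = [
    ["the", "wrath", "of", "achilles"],
    ["the", "host", "of", "greece"],
    ["the", "gods"],
]
print(build_vocab(lines, max_size=2))
```

Tokens that fall outside the top `max_size` are simply dropped from the vocabulary and will later be treated as out-of-vocabulary.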
To convert the tokens into integers, use the vocab
set to create a StaticVocabularyTable
. You will map tokens to integers in the half-open range [2
, vocab_size + 2
). As with the TextVectorization
layer, 0
is reserved to denote padding and 1
is reserved to denote an out-of-vocabulary (OOV) token.
keys = vocab
values = range(2, len(vocab) + 2) # reserve 0 for padding, 1 for OOV
init = tf.lookup.KeyValueTensorInitializer(
keys, values, key_dtype=tf.string, value_dtype=tf.int64)
num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)
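The id convention described above (0 for padding, 1 for OOV, vocabulary tokens from 2 upward) can be sketched with an ordinary dictionary. This illustrates the convention only and makes no claim about how `StaticVocabularyTable` assigns ids internally. The names `make_lookup` and `vectorize` are hypothetical.

```python
PAD_ID = 0  # reserved for padding
OOV_ID = 1  # reserved for out-of-vocabulary tokens


def make_lookup(vocab):
    """Map each vocabulary token to an id, starting at 2."""
    return {token: index for index, token in enumerate(vocab, start=2)}


def vectorize(tokens, lookup):
    """Replace each token by its id, or by OOV_ID if it is not in the vocabulary."""
    return [lookup.get(token, OOV_ID) for token in tokens]


# Toy vocabulary, for illustration only.
lookup = make_lookup(["the", "and", "of"])
print(vectorize(["the", "wrath", "of", "achilles"], lookup))
```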
Finally, define a function to standardize, tokenize and vectorize the dataset using the tokenizer and lookup table:
def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label
You can try this on a single example to see the output:
example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())
Sentence: b'To chariot driven, thou maim thyself and me.' Vectorized sentence: [ 8 195 716 2 47 5605 552 4 40 7]
Now run the preprocess function on the dataset using tf.data.Dataset.map
.
all_encoded_data = all_labeled_data.map(preprocess_text)
Split the dataset into train and test
The Keras TextVectorization
layer also batches and pads the vectorized data. Padding is required because the examples inside of a batch need to be the same size and shape, but the examples in these datasets are not all the same size — each line of text has a different number of words. tf.data.Dataset
supports splitting and padded-batching datasets:
train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)
Now, validation_data
and train_data
are not collections of (example, label
) pairs, but collections of batches. Each batch is a pair of (many examples, many labels) represented as arrays. To illustrate:
sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])
Text batch shape: (64, 16) Label batch shape: (64,) First text example: tf.Tensor( [ 8 195 716 2 47 5605 552 4 40 7 0 0 0 0 0 0], shape=(16,), dtype=int64) First label example: tf.Tensor(0, shape=(), dtype=int64)
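The trailing zeros in the first example come from padding. As a minimal plain-Python sketch of the idea, each sequence in a batch is right-padded with 0 up to the length of the longest sequence in that batch. The helper name `pad_batch` and the toy ids are assumptions for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Toy batch of token ids with different lengths, for illustration only.
batch = [[8, 195, 716], [2, 47], [5605]]
print(pad_batch(batch))
```

Since the target length is taken per batch, different batches can end up with different widths.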
Since we use 0
for padding and 1
for out-of-vocabulary (OOV) tokens, the vocabulary size has increased by two.
vocab_size += 2
Configure the datasets for better performance as before.
train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)
Train the model
You can train a model on this dataset as before.
model = create_model(vocab_size=vocab_size, num_labels=3)
model.compile(
optimizer='adam',
loss=losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
history = model.fit(train_data, validation_data=validation_data, epochs=3)
Epoch 1/3 697/697 [==============================] - 30s 12ms/step - loss: 0.6900 - accuracy: 0.6660 - val_loss: 0.3815 - val_accuracy: 0.8368 Epoch 2/3 697/697 [==============================] - 5s 7ms/step - loss: 0.3173 - accuracy: 0.8705 - val_loss: 0.3622 - val_accuracy: 0.8460 Epoch 3/3 697/697 [==============================] - 4s 6ms/step - loss: 0.2159 - accuracy: 0.9167 - val_loss: 0.3895 - val_accuracy: 0.8466
loss, accuracy = model.evaluate(validation_data)
print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
79/79 [==============================] - 1s 2ms/step - loss: 0.3895 - accuracy: 0.8466 Loss: 0.3894515335559845 Accuracy: 84.66%
Export the model
To make your model capable of taking raw strings as input, you will create a TextVectorization
layer that performs the same steps as your custom preprocessing function. Since you already trained a vocabulary, you can use set_vocabulary
instead of adapt
, which trains a new vocabulary.
preprocess_layer = TextVectorization(
max_tokens=vocab_size,
standardize=tf_text.case_fold_utf8,
split=tokenizer.tokenize,
output_mode='int',
output_sequence_length=MAX_SEQUENCE_LENGTH)
preprocess_layer.set_vocabulary(vocab)
export_model = tf.keras.Sequential(
[preprocess_layer, model,
layers.Activation('sigmoid')])
export_model.compile(
loss=losses.SparseCategoricalCrossentropy(from_logits=False),
optimizer='adam',
metrics=['accuracy'])
# Create a test dataset of raw strings
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)
loss, accuracy = export_model.evaluate(test_ds)
print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
79/79 [==============================] - 7s 11ms/step - loss: 0.4626 - accuracy: 0.8128 Loss: 0.4913882315158844 Accuracy: 80.50%
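The loss and accuracy for the model on the encoded validation set and for the exported model on the raw validation set are expected to be close. In the run shown here they differ (84.66% versus 80.50% accuracy), which suggests that the two preprocessing paths do not produce exactly identical inputs.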
Run inference on new data
inputs = [
"Join'd to th' Ionians with their flowing robes,", # Label: 1
"the allies, and his armour flashed about him so that he seemed to all", # Label: 2
"And with loud clangor of his arms he fell.", # Label: 0
]
predicted_scores = export_model.predict(inputs)
predicted_labels = tf.argmax(predicted_scores, axis=1)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())
Question: Join'd to th' Ionians with their flowing robes, Predicted label: 1 Question: the allies, and his armour flashed about him so that he seemed to all Predicted label: 2 Question: And with loud clangor of his arms he fell. Predicted label: 0
Downloading more datasets using TensorFlow Datasets (TFDS)
You can download many more datasets from TensorFlow Datasets. As an example, you will download the IMDB Large Movie Review dataset, and use it to train a model for sentiment classification.
train_ds = tfds.load(
'imdb_reviews',
split='train',
batch_size=BATCH_SIZE,
shuffle_files=True,
as_supervised=True)
val_ds = tfds.load(
'imdb_reviews',
split='test',
batch_size=BATCH_SIZE,
shuffle_files=True,
as_supervised=True)
Print a few examples.
for review_batch, label_batch in val_ds.take(1):
  for i in range(5):
    print("Review: ", review_batch[i].numpy())
    print("Label: ", label_batch[i].numpy())
Review: b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it." Label: 0 Review: b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.' Label: 0 Review: b'Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Mountie telling the people of Dawson City, Yukon to elect themselves a marshal (yes a marshal!) and to enforce the law themselves, then gunfighters battling it out on the streets for control of the town? <br /><br />Nothing even remotely resembling that happened on the Canadian side of the border during the Klondike gold rush. Mr. 
Mann and company appear to have mistaken Dawson City for Deadwood, the Canadian North for the American Wild West.<br /><br />Canadian viewers be prepared for a Reefer Madness type of enjoyable howl with this ludicrous plot, or, to shake your head in disgust.' Label: 0 Review: b'This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas Cage (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through New York life at its best. A family film in every sense and one that deserves the praise it received.' Label: 1 Review: b'As others have mentioned, all the women that go nude in this film are mostly absolutely gorgeous. The plot very ably shows the hypocrisy of the female libido. When men are around they want to be pursued, but when no "men" are around, they become the pursuers of a 14 year old boy. And the boy becomes a man really fast (we should all be so lucky at this age!). He then gets up the courage to pursue his true love.' Label: 1
You can now preprocess the data and train a model as before.
Prepare the dataset for training
vectorize_layer = TextVectorization(
max_tokens=VOCAB_SIZE,
output_mode='int',
output_sequence_length=MAX_SEQUENCE_LENGTH)
# Make a text-only dataset (without labels), then call adapt
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label
train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)
# Configure datasets for performance as before
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)
Train the model
model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()
Model: "sequential_5" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_2 (Embedding) (None, None, 64) 640064 _________________________________________________________________ conv1d_2 (Conv1D) (None, None, 64) 20544 _________________________________________________________________ global_max_pooling1d_2 (Glob (None, 64) 0 _________________________________________________________________ dense_3 (Dense) (None, 1) 65 ================================================================= Total params: 660,673 Trainable params: 660,673 Non-trainable params: 0 _________________________________________________________________
model.compile(
loss=losses.BinaryCrossentropy(from_logits=True),
optimizer='adam',
metrics=['accuracy'])
history = model.fit(train_ds, validation_data=val_ds, epochs=3)
Epoch 1/3 391/391 [==============================] - 6s 13ms/step - loss: 0.6123 - accuracy: 0.5805 - val_loss: 0.2976 - val_accuracy: 0.8807 Epoch 2/3 391/391 [==============================] - 4s 10ms/step - loss: 0.3141 - accuracy: 0.8609 - val_loss: 0.1708 - val_accuracy: 0.9423 Epoch 3/3 391/391 [==============================] - 4s 10ms/step - loss: 0.1977 - accuracy: 0.9211 - val_loss: 0.0944 - val_accuracy: 0.9776
loss, accuracy = model.evaluate(val_ds)
print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
391/391 [==============================] - 1s 3ms/step - loss: 0.0944 - accuracy: 0.9776 Loss: 0.09437894821166992 Accuracy: 97.76%
Export the model
export_model = tf.keras.Sequential(
[vectorize_layer, model,
layers.Activation('sigmoid')])
export_model.compile(
loss=losses.BinaryCrossentropy(from_logits=False),
optimizer='adam',
metrics=['accuracy'])
# 0 --> negative review
# 1 --> positive review
inputs = [
"This is a fantastic movie.",
"This is a bad movie.",
"This movie was so bad that it was good.",
"I will never say yes to watching this movie.",
]
predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label)
Question: This is a fantastic movie. Predicted label: 1 Question: This is a bad movie. Predicted label: 0 Question: This movie was so bad that it was good. Predicted label: 0 Question: I will never say yes to watching this movie. Predicted label: 0
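The exported model ends in a sigmoid, so each prediction is a single number between 0 and 1, and rounding it amounts to thresholding at 0.5. The plain-Python sketch below shows that mapping from a raw logit to a 0/1 label. The logit values and the helper names `sigmoid` and `to_label` are made up for illustration.

```python
import math


def sigmoid(logit):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def to_label(logit, threshold=0.5):
    """Return 1 (positive) if the sigmoid probability exceeds the threshold, else 0."""
    return int(sigmoid(logit) > threshold)


# Made-up logits, for illustration only.
for logit in [2.3, -1.7, 0.0]:
    print(logit, round(sigmoid(logit), 3), to_label(logit))
```

With the default threshold, a positive logit maps to label 1 and a negative or zero logit maps to label 0.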
Conclusion
This tutorial demonstrated several ways to load and preprocess text. As a next step, you can explore additional tutorials on the website, or download new datasets from TensorFlow Datasets.