Load text


This tutorial demonstrates two ways to load and preprocess text.

  • First, you will use Keras utilities and layers. If you are new to TensorFlow, you should start with these.

  • Next, you will use lower-level utilities like tf.data.TextLineDataset to load text files and tf.text to preprocess the data for finer-grained control.

# Be sure you're using the stable versions of both tf and tf-text, for binary compatibility.
pip install -q -U tensorflow
pip install -q -U tensorflow-text
import collections
import pathlib
import re
import string

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

Example 1: Predict the tag for a Stack Overflow question

As a first example, you will download a dataset of programming questions from Stack Overflow. Each question ("How do I sort a dictionary by value?") is labeled with exactly one tag (Python, CSharp, JavaScript, or Java). Your task is to develop a model that predicts the tag for a question. This is an example of multi-class classification, an important and widely applicable kind of machine learning problem.

Download and explore the dataset

Begin by downloading the dataset and exploring the directory structure.

data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
dataset = utils.get_file(
    'stack_overflow_16k.tar.gz',
    data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir='')
dataset_dir = pathlib.Path(dataset).parent
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz
6053888/6053168 [==============================] - 0s 0us/step
list(dataset_dir.iterdir())
[PosixPath('/tmp/.keras/train'),
 PosixPath('/tmp/.keras/README.md'),
 PosixPath('/tmp/.keras/test'),
 PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz.tar.gz')]
train_dir = dataset_dir/'train'
list(train_dir.iterdir())
[PosixPath('/tmp/.keras/train/java'),
 PosixPath('/tmp/.keras/train/csharp'),
 PosixPath('/tmp/.keras/train/javascript'),
 PosixPath('/tmp/.keras/train/python')]

The train/csharp, train/java, train/python and train/javascript directories contain many text files, each of which is a Stack Overflow question. Print one file and inspect the data.

sample_file = train_dir/'python/1755.txt'
with open(sample_file) as f:
  print(f.read())
why does this blank program print true x=true.def stupid():.    x=false.stupid().print x

Load the dataset

Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the text_dataset_from_directory utility to create a labeled tf.data.Dataset. If you are new to tf.data, it is a powerful collection of tools for building input pipelines.

The preprocessing.text_dataset_from_directory utility expects a directory structure as follows.

train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt

When running a machine learning experiment, it is a best practice to divide your dataset into three splits: train, validation, and test. The Stack Overflow dataset has already been divided into train and test, but it lacks a validation set. Create a validation set using an 80:20 split of the training data with the validation_split argument below.

batch_size = 32
seed = 42

raw_train_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)
Found 8000 files belonging to 4 classes.
Using 6400 files for training.

As you can see above, there are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training. As you will see in a moment, you can train a model by passing a tf.data.Dataset directly to model.fit. First, iterate over the dataset and print out a few examples to get a feel for the data.

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])
Question:  b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default constructor, which therefor ruins the whole rest of the program. can somebody help me?..for those of you who want to see more of my code: here you go..public double vertexangle().    {.        system.out.println(""the vertex angle method: "" + mynumsides);// prints out 5.        system.out.println(""the vertex angle method: "" + mysidelength); // prints out 30..        double vertexangle;.        vertexangle = ((mynumsides - 2.0) / mynumsides) * 180.0;.        return vertexangle;.    }//end method vertexangle..public void menu().{.    system.out.println(mynumsides); // prints out what the user puts in.    system.out.println(mysidelength); // prints out what the user puts in.    gotographic();.    calcr(mynumsides, mysidelength);.    calcr(mynumsides, mysidelength);.    print(); .}// end menu...this is my entire tester class:..public static void main(string[] arg).{.    int numsides;.    double sidelength;.    scanner keyboard = new scanner(system.in);..    system.out.println(""welcome to the regular polygon program!"");.    system.out.println();..    system.out.print(""enter the number of sides of the polygon ==> "");.    numsides = keyboard.nextint();.    system.out.println();..    system.out.print(""enter the side length of each side ==> "");.    sidelength = keyboard.nextdouble();.    system.out.println();..    regularpolygon shape = new regularpolygon(numsides, sidelength);.    shape.menu();.}//end main...for testing it i sent it numsides 4 and sidelength 100."\n'
Label: 1
Question:  b'"blank code slow skin detection this code changes the color space to lab and using a threshold finds the skin area of an image. but it\'s ridiculously slow. i don\'t know how to make it faster ?    ..from colormath.color_objects import *..def skindetection(img, treshold=80, color=[255,20,147]):..    print img.shape.    res=img.copy().    for x in range(img.shape[0]):.        for y in range(img.shape[1]):.            rgbimg=rgbcolor(img[x,y,0],img[x,y,1],img[x,y,2]).            labimg=rgbimg.convert_to(\'lab\', debug=false).            if (labimg.lab_l > treshold):.                res[x,y,:]=color.            else: .                res[x,y,:]=img[x,y,:]..    return res"\n'
Label: 3
Question:  b'"option and validation in blank i want to add a new option on my system where i want to add two text files, both rental.txt and customer.txt. inside each text are id numbers of the customer, the videotape they need and the price...i want to place it as an option on my code. right now i have:...add customer.rent return.view list.search.exit...i want to add this as my sixth option. say for example i ordered a video, it would display the price and would let me confirm the price and if i am going to buy it or not...here is my current code:..  import blank.io.*;.    import blank.util.arraylist;.    import static blank.lang.system.out;..    public class rentalsystem{.    static bufferedreader input = new bufferedreader(new inputstreamreader(system.in));.    static file file = new file(""file.txt"");.    static arraylist<string> list = new arraylist<string>();.    static int rows;..    public static void main(string[] args) throws exception{.        introduction();.        system.out.print(""nn"");.        login();.        system.out.print(""nnnnnnnnnnnnnnnnnnnnnn"");.        introduction();.        string repeat;.        do{.            loadfile();.            system.out.print(""nwhat do you want to do?nn"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                    |     1. add customer    |   2. rent return |n"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                    |     3. view list       |   4. search      |n"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                                             |   5. exit        |n"");.            system.out.print(""n                                              - - - - - - - - - -"");.            system.out.print(""nnchoice:"");.            int choice = integer.parseint(input.readline());.            switch(choice){.                case 1:.                    writedata();.                    break;.                case 2:.                    rentdata();.                    break;.                case 3:.                    viewlist();.                    break;.                case 4:.                    search();.                    break;.                case 5:.                    system.out.println(""goodbye!"");.                    system.exit(0);.                default:.                    system.out.print(""invalid choice: "");.                    break;.            }.            system.out.print(""ndo another task? [y/n] "");.            repeat = input.readline();.        }while(repeat.equals(""y""));..        if(repeat!=""y"") system.out.println(""ngoodbye!"");..    }..    public static void writedata() throws exception{.        system.out.print(""nname: "");.        string cname = input.readline();.        system.out.print(""address: "");.        string add = input.readline();.        system.out.print(""phone no.: "");.        string pno = input.readline();.        system.out.print(""rental amount: "");.        string ramount = input.readline();.        system.out.print(""tapenumber: "");.        string tno = input.readline();.        system.out.print(""title: "");.        string title = input.readline();.        system.out.print(""date borrowed: "");.        string dborrowed = input.readline();.        system.out.print(""due date: "");.     
   string ddate = input.readline();.        createline(cname, add, pno, ramount,tno, title, dborrowed, ddate);.        rentdata();.    }..    public static void createline(string name, string address, string phone , string rental, string tapenumber, string title, string borrowed, string due) throws exception{.        filewriter fw = new filewriter(file, true);.        fw.write(""nname: ""+name + ""naddress: "" + address +""nphone no.: ""+ phone+""nrentalamount: ""+rental+""ntape no.: ""+ tapenumber+""ntitle: ""+ title+""ndate borrowed: ""+borrowed +""ndue date: ""+ due+"":rn"");.        fw.close();.    }..    public static void loadfile() throws exception{.        try{.            list.clear();.            fileinputstream fstream = new fileinputstream(file);.            bufferedreader br = new bufferedreader(new inputstreamreader(fstream));.            rows = 0;.            while( br.ready()).            {.                list.add(br.readline());.                rows++;.            }.            br.close();.        } catch(exception e){.            system.out.println(""list not yet loaded."");.        }.    }..    public static void viewlist(){.        system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |list of all costumers|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        for(int i = 0; i <rows; i++){.            system.out.println(list.get(i));.        }.    }.        public static void rentdata()throws exception.    {   system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |rent data list|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print(""nenter customer name: "");.        string cname = input.readline();.        system.out.print(""date borrowed: "");.        string dborrowed = input.readline();.        system.out.print(""due date: "");.        string ddate = input.readline();.        system.out.print(""return date: "");.        string rdate = input.readline();.        system.out.print(""rent amount: "");.        string ramount = input.readline();..        system.out.print(""you pay:""+ramount);...    }.    public static void search()throws exception.    {   system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |search costumers|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print(""nenter costumer name: "");.        string cname = input.readline();.        boolean found = false;..        for(int i=0; i < rows; i++){.            string temp[] = list.get(i).split("","");..            if(cname.equals(temp[0])){.            system.out.println(""search result:nyou are "" + temp[0] + "" from "" + temp[1] + "".""+ temp[2] + "".""+ temp[3] + "".""+ temp[4] + "".""+ temp[5] + "" is "" + temp[6] + "".""+ temp[7] + "" is "" + temp[8] + ""."");.                found = true;.            }.        }..        if(!found){.            system.out.print(""no results."");.        }..    }..        public static boolean evaluate(string uname, string pass){.        if (uname.equals(""admin"")&&pass.equals(""12345"")) return true;.        else return false;.    }..    public static string login()throws exception{.        bufferedreader input=new bufferedreader(new inputstreamreader(system.in));.        int counter=0;.        do{.            system.out.print(""username:"");.            string uname =input.readline();.            system.out.print(""password:"");.            string pass =input.readline();..            boolean accept= evaluate(uname,pass);..            
if(accept){.                break;.                }else{.                    system.out.println(""incorrect username or password!"");.                    counter ++;.                    }.        }while(counter<3);..            if(counter !=3) return ""login successful"";.            else return ""login failed"";.            }.        public static void introduction() throws exception{..        system.out.println(""                  - - - - - - - - - - - - - - - - - - - - - - - - -"");.        system.out.println(""                  !                  r e n t a l                  !"");.        system.out.println(""                   ! ~ ~ ~ ~ ~ !  =================  ! ~ ~ ~ ~ ~ !"");.        system.out.println(""                  !                  s y s t e m                  !"");.        system.out.println(""                  - - - - - - - - - - - - - - - - - - - - - - - - -"");.        }..}"\n'
Label: 1
Question:  b'"exception: dynamic sql generation for the updatecommand is not supported against a selectcommand that does not return any key i dont know what is the problem this my code : ..string nomtable;..datatable listeetablissementtable = new datatable();.datatable listeinteretstable = new datatable();.dataset ds = new dataset();.sqldataadapter da;.sqlcommandbuilder cmdb;..private void listeinterets_click(object sender, eventargs e).{.    nomtable = ""listeinteretstable"";.    d.cnx.open();.    da = new sqldataadapter(""select nome from offices"", d.cnx);.    ds = new dataset();.    da.fill(ds, nomtable);.    datagridview1.datasource = ds.tables[nomtable];.}..private void sauvgarder_click(object sender, eventargs e).{.    d.cnx.open();.    cmdb = new sqlcommandbuilder(da);.    da.update(ds, nomtable);.    d.cnx.close();.}"\n'
Label: 0
Question:  b'"parameter with question mark and super in blank, i\'ve come across a method that is formatted like this:..public final subscription subscribe(final action1<? super t> onnext, final action1<throwable> onerror) {.}...in the first parameter, what does the question mark and super mean?"\n'
Label: 1
Question:  b'call two objects wsdl the first time i got a very strange wsdl. ..i would like to call the object (interface - invoicecheck_out) do you know how?....i would like to call the object (variable) do you know how?..try to call (it`s ok)....try to call (how call this?)\n'
Label: 0
Question:  b"how to correctly make the icon for systemtray in blank using icon sizes of any dimension for systemtray doesn't look good overall. .what is the correct way of making icons for windows system tray?..screenshots: http://imgur.com/zsibwn9..icon: http://imgur.com/vsh4zo8\n"
Label: 0
Question:  b'"is there a way to check a variable that exists in a different script than the original one? i\'m trying to check if a variable, which was previously set to true in 2.py in 1.py, as 1.py is only supposed to continue if the variable is true...2.py..import os..completed = false..#some stuff here..completed = true...1.py..import 2 ..if completed == true.   #do things...however i get a syntax error at ..if completed == true"\n'
Label: 3
Question:  b'"blank control flow i made a number which asks for 2 numbers with blank and responds with  the corresponding message for the case. how come it doesnt work  for the second number ? .regardless what i enter for the second number , i am getting the message ""your number is in the range 0-10""...using system;.using system.collections.generic;.using system.linq;.using system.text;..namespace consoleapplication1.{.    class program.    {.        static void main(string[] args).        {.            string myinput;  // declaring the type of the variables.            int myint;..            string number1;.            int number;...            console.writeline(""enter a number"");.            myinput = console.readline(); //muyinput is a string  which is entry input.            myint = int32.parse(myinput); // myint converts the string into an integer..            if (myint > 0).                console.writeline(""your number {0} is greater than zero."", myint);.            else if (myint < 0).                console.writeline(""your number {0} is  less  than zero."", myint);.            else.                console.writeline(""your number {0} is equal zero."", myint);..            console.writeline(""enter another number"");.            number1 = console.readline(); .            number = int32.parse(myinput); ..            if (number < 0 || number == 0).                console.writeline(""your number {0} is  less  than zero or equal zero."", number);.            else if (number > 0 && number <= 10).                console.writeline(""your number {0} is  in the range from 0 to 10."", number);.            else.                console.writeline(""your number {0} is greater than 10."", number);..            console.writeline(""enter another number"");..        }.    }    .}"\n'
Label: 0
Question:  b'"credentials cannot be used for ntlm authentication i am getting org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials cannot be used for ntlm authentication: exception in eclipse..whether it is possible mention eclipse to take system proxy settings directly?..public class httpgetproxy {.    private static final string proxy_host = ""proxy.****.com"";.    private static final int proxy_port = 6050;..    public static void main(string[] args) {.        httpclient client = new httpclient();.        httpmethod method = new getmethod(""https://kodeblank.org"");..        hostconfiguration config = client.gethostconfiguration();.        config.setproxy(proxy_host, proxy_port);..        string username = ""*****"";.        string password = ""*****"";.        credentials credentials = new usernamepasswordcredentials(username, password);.        authscope authscope = new authscope(proxy_host, proxy_port);..        client.getstate().setproxycredentials(authscope, credentials);..        try {.            client.executemethod(method);..            if (method.getstatuscode() == httpstatus.sc_ok) {.                string response = method.getresponsebodyasstring();.                system.out.println(""response = "" + response);.            }.        } catch (ioexception e) {.            e.printstacktrace();.        } finally {.            method.releaseconnection();.        }.    }.}...exception:...  dec 08, 2017 1:41:39 pm .          org.apache.commons.httpclient.auth.authchallengeprocessor selectauthscheme.         info: ntlm authentication scheme selected.       dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector executeconnect.         severe: credentials cannot be used for ntlm authentication: .           org.apache.commons.httpclient.usernamepasswordcredentials.           org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials .         cannot be used for ntlm authentication: .        enter code here .          org.apache.commons.httpclient.usernamepasswordcredentials.      at org.apache.commons.httpclient.auth.ntlmscheme.authenticate(ntlmscheme.blank:332).        at org.apache.commons.httpclient.httpmethoddirector.authenticateproxy(httpmethoddirector.blank:320).      at org.apache.commons.httpclient.httpmethoddirector.executeconnect(httpmethoddirector.blank:491).      at org.apache.commons.httpclient.httpmethoddirector.executewithretry(httpmethoddirector.blank:391).      at org.apache.commons.httpclient.httpmethoddirector.executemethod(httpmethoddirector.blank:171).      at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:397).      at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:323).      at httpgetproxy.main(httpgetproxy.blank:31).  dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector processproxyauthchallenge.  info: failure authenticating with ntlm @proxy.****.com:6050"\n'
Label: 1

The labels are 0, 1, 2 or 3. To see which of these corresponds to which string label, you can inspect the class_names property on the dataset.

for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)
Label 0 corresponds to csharp
Label 1 corresponds to java
Label 2 corresponds to javascript
Label 3 corresponds to python

Next, you will create validation and test datasets. You will use the remaining 1,600 examples from the training set for validation.

raw_val_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
test_dir = dataset_dir/'test'
raw_test_ds = preprocessing.text_dataset_from_directory(
    test_dir, batch_size=batch_size)
Found 8000 files belonging to 4 classes.

Prepare the dataset for training

Next, you will standardize, tokenize, and vectorize the data using the preprocessing.TextVectorization layer.

  • Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.

  • Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).

  • Vectorization refers to converting tokens into numbers so they can be fed into a neural network.

All of these tasks can be accomplished with this layer. You can learn more about each of them in the API doc.

  • The default standardization converts text to lowercase and removes punctuation.

  • The default tokenizer splits on whitespace.

  • The default vectorization mode is int. This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes, like binary, to build bag-of-words models.

You will build two models to learn more about these modes. First, you will use the binary mode to build a bag-of-words model. Then, you will use the int mode with a 1D ConvNet.

VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')

For the int mode, in addition to the maximum vocabulary size, you need to set an explicit maximum sequence length, which will cause the layer to pad or truncate sequences to exactly output_sequence_length values.

MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

Next, you will call adapt to fit the state of the preprocessing layer to the dataset. This will cause the layer to build an index mapping strings to integers.

# Make a text-only dataset (without labels), then call adapt
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

See the result of using these layers to preprocess data:

def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label
def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label
# Retrieve a batch (of 32 reviews and labels) from the dataset
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)
Question tf.Tensor(b'"function expected error in blank for dynamically created check box when it is clicked i want to grab the attribute value.it is working in ie 8,9,10 but not working in ie 11,chrome shows function expected error..<input type=checkbox checked=\'checked\' id=\'symptomfailurecodeid\' tabindex=\'54\' style=\'cursor:pointer;\' onclick=chkclickevt(this);  failurecodeid=""1"" >...function chkclickevt(obj) { .    alert(obj.attributes(""failurecodeid""));.}"\n', shape=(), dtype=string)
Label tf.Tensor(2, shape=(), dtype=int32)
print("'binary' vectorized question:", 
      binary_vectorize_text(first_question, first_label)[0])
'binary' vectorized question: tf.Tensor([[1. 1. 1. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)
print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])
'int' vectorized question: tf.Tensor(
[[  38  450   65    7   16   12  892  265  186  451   44   11    6  685
     3   46    4 2062    2  485    1    6  158    7  479    1   26   20
   158    7  479    1  502   38  450    1 1767 1763    1    1    1    1
     1    1    1    1    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0]], shape=(1, 250), dtype=int64)

As you can see above, the binary mode returns an array denoting which tokens exist at least once in the input, while the int mode replaces each token with an integer, thus preserving their order. You can look up the token (string) that each integer corresponds to by calling .get_vocabulary() on the layer.

print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))
1289 --->  roman
313 --->  source
Vocabulary size: 10000

You are nearly ready to train your model. As a final preprocessing step, you will apply the TextVectorization layers you created earlier to the train, validation, and test datasets.

binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

Configure the dataset for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

.cache() keeps data in memory after it is loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.

.prefetch() overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk, in the data performance guide; a short sketch of on-disk caching follows the code below.

AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)
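
If the vectorized dataset were too large to fit in memory, you could point .cache() at a file instead of caching in memory. A minimal sketch, assuming a hypothetical scratch path:

# On-disk cache (sketch): elements are written to '/tmp/text_cache.*' files on
# the first epoch and read back on later epochs; the path is an assumed example.
disk_cached_train_ds = raw_train_ds.map(binary_vectorize_text).cache(
    '/tmp/text_cache').prefetch(buffer_size=AUTOTUNE)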

Train the model

It's time to create your neural network. For the binary vectorized data, train a simple bag-of-words linear model:

binary_model = tf.keras.Sequential([layers.Dense(4)])
binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)
Epoch 1/10
200/200 [==============================] - 2s 9ms/step - loss: 1.2359 - accuracy: 0.5427 - val_loss: 0.9108 - val_accuracy: 0.7744
Epoch 2/10
200/200 [==============================] - 1s 3ms/step - loss: 0.8149 - accuracy: 0.8277 - val_loss: 0.7481 - val_accuracy: 0.8031
Epoch 3/10
200/200 [==============================] - 1s 3ms/step - loss: 0.6482 - accuracy: 0.8616 - val_loss: 0.6631 - val_accuracy: 0.8125
Epoch 4/10
200/200 [==============================] - 1s 3ms/step - loss: 0.5492 - accuracy: 0.8832 - val_loss: 0.6100 - val_accuracy: 0.8225
Epoch 5/10
200/200 [==============================] - 1s 3ms/step - loss: 0.4805 - accuracy: 0.9055 - val_loss: 0.5735 - val_accuracy: 0.8294
Epoch 6/10
200/200 [==============================] - 1s 3ms/step - loss: 0.4287 - accuracy: 0.9177 - val_loss: 0.5470 - val_accuracy: 0.8369
Epoch 7/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3876 - accuracy: 0.9286 - val_loss: 0.5270 - val_accuracy: 0.8363
Epoch 8/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3537 - accuracy: 0.9332 - val_loss: 0.5115 - val_accuracy: 0.8394
Epoch 9/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3250 - accuracy: 0.9396 - val_loss: 0.4993 - val_accuracy: 0.8419
Epoch 10/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3003 - accuracy: 0.9479 - val_loss: 0.4896 - val_accuracy: 0.8438

Next, you will use the int vectorized data to build a 1D ConvNet.

def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model
# vocab_size is VOCAB_SIZE + 1 since 0 is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)
Epoch 1/5
200/200 [==============================] - 4s 8ms/step - loss: 1.3016 - accuracy: 0.3903 - val_loss: 0.7395 - val_accuracy: 0.6950
Epoch 2/5
200/200 [==============================] - 1s 6ms/step - loss: 0.6901 - accuracy: 0.7170 - val_loss: 0.5435 - val_accuracy: 0.7906
Epoch 3/5
200/200 [==============================] - 1s 6ms/step - loss: 0.4277 - accuracy: 0.8562 - val_loss: 0.4766 - val_accuracy: 0.8194
Epoch 4/5
200/200 [==============================] - 1s 6ms/step - loss: 0.2419 - accuracy: 0.9402 - val_loss: 0.4701 - val_accuracy: 0.8188
Epoch 5/5
200/200 [==============================] - 1s 6ms/step - loss: 0.1218 - accuracy: 0.9767 - val_loss: 0.4932 - val_accuracy: 0.8163

Compare the two models:

print("Linear model on binary vectorized data:")
print(binary_model.summary())
Linear model on binary vectorized data:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 4)                 40004     
=================================================================
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________
None
print("ConvNet model on int vectorized data:")
print(int_model.summary())
ConvNet model on int vectorized data:
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 64)          640064    
_________________________________________________________________
conv1d (Conv1D)              (None, None, 64)          20544     
_________________________________________________________________
global_max_pooling1d (Global (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 260       
=================================================================
Total params: 660,868
Trainable params: 660,868
Non-trainable params: 0
_________________________________________________________________
None

Evaluate both models on the test data:

binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))
250/250 [==============================] - 1s 5ms/step - loss: 0.5166 - accuracy: 0.8139
250/250 [==============================] - 1s 4ms/step - loss: 0.5116 - accuracy: 0.8117
Binary model accuracy: 81.39%
Int model accuracy: 81.17%

Export the model

In the code above, you applied the TextVectorization layer to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the TextVectorization layer inside your model. To do so, you can create a new model using the weights you just trained.

export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(binary_accuracy))
250/250 [==============================] - 2s 5ms/step - loss: 0.5187 - accuracy: 0.8138
Accuracy: 81.39%

Now your model can take raw strings as input and predict a score for each label using model.predict. Define a function to find the label with the maximum score:

def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels

Run inference on new data

inputs = [
    "how do I extract keys from a dict into a list?",  # python
    "debug public static void main(string[] args) {...}",  # java
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())
Question:  how do I extract keys from a dict into a list?
Predicted label:  b'python'
Question:  debug public static void main(string[] args) {...}
Predicted label:  b'java'

Including the text preprocessing logic inside your model enables you to export a model for production that simplifies deployment and reduces the potential for training/serving skew.

There is a performance difference to keep in mind when choosing where to apply your TextVectorization layer. Using it outside of your model enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So, if you are training your model on the GPU, you probably want to go with this option to get the best performance while developing your model, then switch to including the TextVectorization layer inside your model when you are ready to prepare for deployment.
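
As a sketch of the two placements (the dataset and model names below are illustrative; the layers and model are the ones defined above):

# Option 1 (training): vectorize inside the tf.data pipeline, so batches are
# prepared asynchronously on the CPU while the accelerator trains.
pipeline_ds = raw_train_ds.map(binary_vectorize_text).prefetch(tf.data.AUTOTUNE)

# Option 2 (deployment): vectorize inside the model, so it accepts raw strings.
deployable_model = tf.keras.Sequential([binary_vectorize_layer, binary_model])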

Visit this tutorial to learn more about saving models.
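
For example, the end-to-end model (including its TextVectorization layer) can be written out in the SavedModel format and reloaded for serving. A minimal sketch; the path is a hypothetical scratch location:

# Save and reload the export model; '/tmp/so_classifier' is an assumed path.
export_model.save('/tmp/so_classifier')
reloaded_model = tf.keras.models.load_model('/tmp/so_classifier')
print(get_string_labels(reloaded_model.predict(inputs)))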

Example 2: Predict the author of Iliad translations

Next is an example of using tf.data.TextLineDataset to load examples from text files, and tf.text to preprocess the data. In this example, you will use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.

Download and explore the dataset

The texts of the three translations are by:

  • William Cowper (cowper.txt)

  • Edward, Earl of Derby (derby.txt)

  • Samuel Butler (butler.txt)

The text files used in this tutorial have undergone some typical preprocessing tasks like removing document header and footer, line numbers, and chapter titles. Download these lightly munged files locally.

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
819200/815980 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
811008/809730 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt
811008/807992 [==============================] - 0s 0us/step
[PosixPath('/home/kbuilder/.keras/datasets/Giant Panda'),
 PosixPath('/home/kbuilder/.keras/datasets/derby.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/flower_photos.tar.gz'),
 PosixPath('/home/kbuilder/.keras/datasets/spa-eng'),
 PosixPath('/home/kbuilder/.keras/datasets/heart.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/iris_test.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/train.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/butler.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/flower_photos'),
 PosixPath('/home/kbuilder/.keras/datasets/image.jpg'),
 PosixPath('/home/kbuilder/.keras/datasets/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg'),
 PosixPath('/home/kbuilder/.keras/datasets/shakespeare.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/Fireboat'),
 PosixPath('/home/kbuilder/.keras/datasets/iris_training.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/cowper.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/320px-Felis_catus-cat_on_snow.jpg'),
 PosixPath('/home/kbuilder/.keras/datasets/jena_climate_2009_2016.csv.zip'),
 PosixPath('/home/kbuilder/.keras/datasets/fashion-mnist'),
 PosixPath('/home/kbuilder/.keras/datasets/ImageNetLabels.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/mnist.npz'),
 PosixPath('/home/kbuilder/.keras/datasets/jena_climate_2009_2016.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/spa-eng.zip')]

Load the dataset

You will use TextLineDataset, which is designed to create a tf.data.Dataset from a text file where each example is a line of text from the original file, whereas text_dataset_from_directory treats all contents of a file as a single example. TextLineDataset is useful for text data that is primarily line-based (for example, poetry or error logs).

Iterate through these files, loading each one into its own dataset. Each example needs to be individually labeled, so use tf.data.Dataset.map to apply a labeler function to each one. This will iterate over every example in the dataset, returning (example, label) pairs.

def labeler(example, index):
  return example, tf.cast(index, tf.int64)
labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

Next, you will combine these labeled datasets into a single dataset, and shuffle it.

BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

Print out a few examples as before. The dataset has not been batched yet, hence each entry in all_labeled_data corresponds to one data point:

for text, label in all_labeled_data.take(10):
  print("Sentence: ", text.numpy())
  print("Label:", label.numpy())
Sentence:  b'To chariot driven, thou maim thyself and me.'
Label: 0
Sentence:  b'On choicest marrow, and the fat of lambs;'
Label: 1
Sentence:  b'And through the gorgeous breastplate, and within'
Label: 1
Sentence:  b'To visit there the parent of the Gods'
Label: 0
Sentence:  b'For safe escape from danger and from death.'
Label: 0
Sentence:  b'Achilles, ye at least the fight decline'
Label: 0
Sentence:  b"Which done, Achilles portion'd out to each"
Label: 0
Sentence:  b'Whom therefore thou devourest; else themselves'
Label: 0
Sentence:  b'Drove them afar into the host of Greece.'
Label: 0
Sentence:  b"Their succour; then I warn thee, while 'tis time,"
Label: 1

Prepare the dataset for training

Instead of using the Keras TextVectorization layer to preprocess the text dataset, you will now use the tf.text APIs to standardize and tokenize the data, build a vocabulary, and use StaticVocabularyTable to map tokens to integers to feed to the model.

While tf.text provides various tokenizers, you will use the UnicodeScriptTokenizer to tokenize the dataset. Define a function to convert the text to lowercase and tokenize it. You will use tf.data.Dataset.map to apply the tokenization to the dataset.

tokenizer = tf_text.UnicodeScriptTokenizer()
def tokenize(text, unused_label):
  lower_case = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)
tokenized_ds = all_labeled_data.map(tokenize)
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py:201: batch_gather (from tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25.
Instructions for updating:
`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.

You can iterate over the dataset and print out a few tokenized examples.

for text_batch in tokenized_ds.take(5):
  print("Tokens: ", text_batch.numpy())
Tokens:  [b'to' b'chariot' b'driven' b',' b'thou' b'maim' b'thyself' b'and' b'me'
 b'.']
Tokens:  [b'on' b'choicest' b'marrow' b',' b'and' b'the' b'fat' b'of' b'lambs' b';']
Tokens:  [b'and' b'through' b'the' b'gorgeous' b'breastplate' b',' b'and' b'within']
Tokens:  [b'to' b'visit' b'there' b'the' b'parent' b'of' b'the' b'gods']
Tokens:  [b'for' b'safe' b'escape' b'from' b'danger' b'and' b'from' b'death' b'.']

Next, you will build a vocabulary by sorting the tokens by frequency and keeping the top VOCAB_SIZE tokens.

tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok] += 1

vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries:", vocab[:5])
Vocab size:  10000
First five vocab entries: [b',', b'the', b'and', b"'", b'of']

To convert the tokens into integers, use the vocab set to create a StaticVocabularyTable. You will map tokens to integers in the range [2, vocab_size + 2]. As with the TextVectorization layer, 0 is reserved to denote padding and 1 is reserved to denote an out-of-vocabulary (OOV) token.

keys = vocab
values = range(2, len(vocab) + 2)  # reserve 0 for padding, 1 for OOV

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)
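
You can sanity-check the table with a quick lookup. Tokens near the top of the vocabulary printed earlier should map to small ids starting at 2 (an illustrative check, not part of the original notebook):

# In-vocabulary tokens map to their assigned ids; unseen tokens fall into
# the single OOV bucket.
print(vocab_table.lookup(tf.constant([b'the', b'and', b'of'])))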

Finally, define a function to standardize, tokenize and vectorize the dataset using the tokenizer and lookup table:

def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label

You can try this on a single example to see the output:

example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())
Sentence:  b'To chariot driven, thou maim thyself and me.'
Vectorized sentence:  [   8  195  716    2   47 5605  552    4   40    7]

Now run the preprocess function on the dataset using tf.data.Dataset.map.

all_encoded_data = all_labeled_data.map(preprocess_text)

Split the dataset into train and test

The Keras TextVectorization layer also batches and pads the vectorized data. Padding is required because the examples inside a batch need to be the same size and shape, but the examples in these datasets are not all the same size: each line of text has a different number of words. tf.data.Dataset supports splitting and padded-batching datasets:

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

Now, validation_data and train_data are not collections of (example, label) pairs, but collections of batches. Each batch is a pair of (many examples, many labels) represented as arrays. To illustrate:

sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])
Text batch shape:  (64, 16)
Label batch shape:  (64,)
First text example:  tf.Tensor(
[   8  195  716    2   47 5605  552    4   40    7    0    0    0    0
    0    0], shape=(16,), dtype=int64)
First label example:  tf.Tensor(0, shape=(), dtype=int64)

Since you use 0 for padding and 1 for out-of-vocabulary (OOV) tokens, the vocabulary size has increased by two.

vocab_size += 2

Configure the datasets for better performance as before.

train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

Train the model

You can train a model on this dataset as before.

model = create_model(vocab_size=vocab_size, num_labels=3)
model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
history = model.fit(train_data, validation_data=validation_data, epochs=3)
Epoch 1/3
697/697 [==============================] - 30s 12ms/step - loss: 0.6900 - accuracy: 0.6660 - val_loss: 0.3815 - val_accuracy: 0.8368
Epoch 2/3
697/697 [==============================] - 5s 7ms/step - loss: 0.3173 - accuracy: 0.8705 - val_loss: 0.3622 - val_accuracy: 0.8460
Epoch 3/3
697/697 [==============================] - 4s 6ms/step - loss: 0.2159 - accuracy: 0.9167 - val_loss: 0.3895 - val_accuracy: 0.8466
loss, accuracy = model.evaluate(validation_data)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
79/79 [==============================] - 1s 2ms/step - loss: 0.3895 - accuracy: 0.8466
Loss:  0.3894515335559845
Accuracy: 84.66%

Export the model

To make the model capable of taking raw strings as input, you will create a TextVectorization layer that performs the same steps as your custom preprocessing function. Since you already trained a vocabulary, you can use set_vocabulary instead of adapt, which trains a new vocabulary.

preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
preprocess_layer.set_vocabulary(vocab)
export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])
# Create a test dataset of raw strings
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)
loss, accuracy = export_model.evaluate(test_ds)
print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
79/79 [==============================] - 7s 11ms/step - loss: 0.4626 - accuracy: 0.8128
Loss:  0.4913882315158844
Accuracy: 80.50%

The loss and accuracy of the exported model on the raw validation set are close to those of the original model on the encoded validation set, as expected; small differences can arise from how the two pipelines handle out-of-vocabulary tokens and padding.
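
You can spot-check the equivalence of the two pipelines on the single example from earlier. The layer pads to MAX_SEQUENCE_LENGTH, so compare only the leading ids (a quick sanity check, assuming the variables defined above are still in scope):

# The TextVectorization layer should reproduce the manual pipeline's ids for
# in-vocabulary tokens, followed by trailing zero padding.
layer_ids = preprocess_layer(tf.expand_dims(example_text, -1))[0]
print("Manual pipeline:   ", vectorized_text.numpy())
print("TextVectorization: ", layer_ids.numpy()[:len(vectorized_text)])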

Run inference on new data

inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]
predicted_scores = export_model.predict(inputs)
predicted_labels = tf.argmax(predicted_scores, axis=1)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())
Question:  Join'd to th' Ionians with their flowing robes,
Predicted label:  1
Question:  the allies, and his armour flashed about him so that he seemed to all
Predicted label:  2
Question:  And with loud clangor of his arms he fell.
Predicted label:  0

Downloading more datasets using TensorFlow Datasets (TFDS)

You can download many more datasets from TensorFlow Datasets. As an example, you will download the IMDB Large Movie Review dataset, and use it to train a model for sentiment classification.

train_ds = tfds.load(
    'imdb_reviews',
    split='train',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)
# Use the held-out 'test' split for validation, rather than reusing 'train'.
val_ds = tfds.load(
    'imdb_reviews',
    split='test',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

Print a few examples.

for review_batch, label_batch in val_ds.take(1):
  for i in range(5):
    print("Review: ", review_batch[i].numpy())
    print("Label: ", label_batch[i].numpy())
Review:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Label:  0
Review:  b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.'
Label:  0
Review:  b'Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Mountie telling the people of Dawson City, Yukon to elect themselves a marshal (yes a marshal!) and to enforce the law themselves, then gunfighters battling it out on the streets for control of the town? <br /><br />Nothing even remotely resembling that happened on the Canadian side of the border during the Klondike gold rush. Mr. Mann and company appear to have mistaken Dawson City for Deadwood, the Canadian North for the American Wild West.<br /><br />Canadian viewers be prepared for a Reefer Madness type of enjoyable howl with this ludicrous plot, or, to shake your head in disgust.'
Label:  0
Review:  b'This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas Cage (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through New York life at its best. A family film in every sense and one that deserves the praise it received.'
Label:  1
Review:  b'As others have mentioned, all the women that go nude in this film are mostly absolutely gorgeous. The plot very ably shows the hypocrisy of the female libido. When men are around they want to be pursued, but when no "men" are around, they become the pursuers of a 14 year old boy. And the boy becomes a man really fast (we should all be so lucky at this age!). He then gets up the courage to pursue his true love.'
Label:  1

You can now preprocess the data and train a model as before.

Prepare the dataset for training

vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (without labels), then call adapt
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label
train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)
# Configure datasets for performance as before
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)

Train the model

model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, None, 64)          640064    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, None, 64)          20544     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 65        
=================================================================
Total params: 660,673
Trainable params: 660,673
Non-trainable params: 0
_________________________________________________________________
model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = model.fit(train_ds, validation_data=val_ds, epochs=3)
Epoch 1/3
391/391 [==============================] - 6s 13ms/step - loss: 0.6123 - accuracy: 0.5805 - val_loss: 0.2976 - val_accuracy: 0.8807
Epoch 2/3
391/391 [==============================] - 4s 10ms/step - loss: 0.3141 - accuracy: 0.8609 - val_loss: 0.1708 - val_accuracy: 0.9423
Epoch 3/3
391/391 [==============================] - 4s 10ms/step - loss: 0.1977 - accuracy: 0.9211 - val_loss: 0.0944 - val_accuracy: 0.9776
loss, accuracy = model.evaluate(val_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
391/391 [==============================] - 1s 3ms/step - loss: 0.0944 - accuracy: 0.9776
Loss:  0.09437894821166992
Accuracy: 97.76%

Export the model

export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    # This model has a single sigmoid output, so binary cross-entropy is the
    # appropriate loss here.
    loss=losses.BinaryCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])
# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]
predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label)
Question:  This is a fantastic movie.
Predicted label:  1
Question:  This is a bad movie.
Predicted label:  0
Question:  This movie was so bad that it was good.
Predicted label:  0
Question:  I will never say yes to watching this movie.
Predicted label:  0

Conclusion

This tutorial demonstrated several ways to load and preprocess text. As a next step, you can explore additional tutorials on the TensorFlow website, or download new datasets from TensorFlow Datasets.