Load text

This tutorial demonstrates two ways to load and preprocess text.

First, you will use Keras utilities and preprocessing layers. These include tf.keras.utils.text_dataset_from_directory to turn data into a tf.data.Dataset and tf.keras.layers.TextVectorization for data standardization, tokenization, and vectorization. If you are new to TensorFlow, you should start with these.
Then, you will use lower-level utilities such as tf.data.TextLineDataset to load text files, and TensorFlow Text APIs, such as text.UnicodeScriptTokenizer and text.case_fold_utf8, to preprocess the data for finer-grained control.

# Be sure you're using the stable versions of both `tensorflow` and
# `tensorflow-text`, for binary compatibility.
pip uninstall -y tf-nightly keras-nightly
pip install tensorflow
pip install tensorflow-text

import collections
import pathlib

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

Example 1: Predict the tag for a Stack Overflow question

As a first example, you will download a dataset of programming questions from Stack Overflow. Each question ("How do I sort a dictionary by value?") is labeled with exactly one tag (Python, CSharp, JavaScript, or Java). Your task is to develop a model that predicts the tag for a question. This is an example of multi-class classification, an important and widely applicable kind of machine learning problem.

Download and explore the dataset

Begin by downloading the Stack Overflow dataset using tf.keras.utils.get_file, and explore the directory structure:

data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

dataset_dir = utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz
6053888/6053168 [==============================] - 0s 0us/step
6062080/6053168 [==============================] - 0s 0us/step

list(dataset_dir.iterdir())

[PosixPath('/tmp/.keras/train'),
 PosixPath('/tmp/.keras/README.md'),
 PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz'),
 PosixPath('/tmp/.keras/test')]

train_dir = dataset_dir/'train'
list(train_dir.iterdir())

[PosixPath('/tmp/.keras/train/java'),
 PosixPath('/tmp/.keras/train/csharp'),
 PosixPath('/tmp/.keras/train/javascript'),
 PosixPath('/tmp/.keras/train/python')]

The train/csharp, train/java, train/python and train/javascript directories contain many text files, each of which is a Stack Overflow question.

Print an example file and inspect the data:

sample_file = train_dir/'python/1755.txt'

with open(sample_file) as f:
  print(f.read())

why does this blank program print true x=true.def stupid():.    x=false.stupid().print x

Load the dataset

Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the tf.keras.utils.text_dataset_from_directory utility to create a labeled tf.data.Dataset. If you are new to tf.data, it is a powerful collection of tools for building input pipelines. (Learn more in the tf.data: Build TensorFlow input pipelines guide.)

The tf.keras.utils.text_dataset_from_directory API expects a directory structure as follows:

train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt
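
As a rough illustration of the layout above, the following sketch builds a tiny throwaway tree with the standard library and derives integer labels from the sorted subdirectory names. The helper name make_toy_tree and the file contents are hypothetical, and the code only mimics the idea of one class per subdirectory; it is not the Keras implementation.

```python
import pathlib
import tempfile

def make_toy_tree(root, class_names, files_per_class=2):
    """Create root/<class>/<n>.txt files (hypothetical helper, for illustration only)."""
    for name in class_names:
        class_dir = root / name
        class_dir.mkdir(parents=True, exist_ok=True)
        for n in range(1, files_per_class + 1):
            (class_dir / f"{n}.txt").write_text(f"placeholder question {n} about {name}\n")

with tempfile.TemporaryDirectory() as tmp:
    train_root = pathlib.Path(tmp) / "train"
    make_toy_tree(train_root, ["python", "csharp", "javascript", "java"])

    # One class per subdirectory; sorting the names gives a stable label index.
    toy_class_names = sorted(p.name for p in train_root.iterdir() if p.is_dir())
    label_of = {name: i for i, name in enumerate(toy_class_names)}
    n_files = sum(1 for _ in train_root.glob("*/*.txt"))

print(toy_class_names)  # ['csharp', 'java', 'javascript', 'python']
print(label_of)
print(n_files)          # 8
```

Sorting the directory names is what makes the label index reproducible across runs, regardless of the order in which the file system lists them.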

When running a machine learning experiment, it is a best practice to divide your dataset into three splits: training, validation, and test.

The Stack Overflow dataset has already been divided into training and test sets, but it lacks a validation set.

Create a validation set using an 80:20 split of the training data by calling tf.keras.utils.text_dataset_from_directory with validation_split set to 0.2 (i.e. 20%):

batch_size = 32
seed = 42

raw_train_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.

As the previous cell output suggests, there are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training. You will learn in a moment that you can train a model by passing a tf.data.Dataset directly to Model.fit.

First, iterate over the dataset and print out a few examples, to get a feel for the data.

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])

Question:  b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default constructor, which therefor ruins the whole rest of the program. can somebody help me?..for those of you who want to see more of my code: here you go..public double vertexangle().    {.        system.out.println(""the vertex angle method: "" + mynumsides);// prints out 5.        system.out.println(""the vertex angle method: "" + mysidelength); // prints out 30..        double vertexangle;.        vertexangle = ((mynumsides - 2.0) / mynumsides) * 180.0;.        return vertexangle;.    }//end method vertexangle..public void menu().{.    system.out.println(mynumsides); // prints out what the user puts in.    system.out.println(mysidelength); // prints out what the user puts in.    gotographic();.    calcr(mynumsides, mysidelength);.    calcr(mynumsides, mysidelength);.    print(); .}// end menu...this is my entire tester class:..public static void main(string[] arg).{.    int numsides;.    double sidelength;.    scanner keyboard = new scanner(system.in);..    
system.out.println(""welcome to the regular polygon program!"");.    system.out.println();..    system.out.print(""enter the number of sides of the polygon ==&gt; "");.    numsides = keyboard.nextint();.    system.out.println();..    system.out.print(""enter the side length of each side ==&gt; "");.    sidelength = keyboard.nextdouble();.    system.out.println();..    regularpolygon shape = new regularpolygon(numsides, sidelength);.    shape.menu();.}//end main...for testing it i sent it numsides 4 and sidelength 100."\n'
Label: 1
Question:  b'"blank code slow skin detection this code changes the color space to lab and using a threshold finds the skin area of an image. but it\'s ridiculously slow. i don\'t know how to make it faster ?    ..from colormath.color_objects import *..def skindetection(img, treshold=80, color=[255,20,147]):..    print img.shape.    res=img.copy().    for x in range(img.shape[0]):.        for y in range(img.shape[1]):.            rgbimg=rgbcolor(img[x,y,0],img[x,y,1],img[x,y,2]).            labimg=rgbimg.convert_to(\'lab\', debug=false).            if (labimg.lab_l &gt; treshold):.                res[x,y,:]=color.            else: .                res[x,y,:]=img[x,y,:]..    return res"\n'
Label: 3
Question:  b'"option and validation in blank i want to add a new option on my system where i want to add two text files, both rental.txt and customer.txt. inside each text are id numbers of the customer, the videotape they need and the price...i want to place it as an option on my code. right now i have:...add customer.rent return.view list.search.exit...i want to add this as my sixth option. say for example i ordered a video, it would display the price and would let me confirm the price and if i am going to buy it or not...here is my current code:..  import blank.io.*;.    import blank.util.arraylist;.    import static blank.lang.system.out;..    public class rentalsystem{.    static bufferedreader input = new bufferedreader(new inputstreamreader(system.in));.    static file file = new file(""file.txt"");.    static arraylist&lt;string&gt; list = new arraylist&lt;string&gt;();.    static int rows;..    public static void main(string[] args) throws exception{.        introduction();.        system.out.print(""nn"");.        login();.        system.out.print(""nnnnnnnnnnnnnnnnnnnnnn"");.        introduction();.        string repeat;.        do{.            loadfile();.            system.out.print(""nwhat do you want to do?nn"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                    |     1. add customer    |   2. rent return |n"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                    |     3. view list       |   4. search      |n"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                                             |   5. exit        |n"");.            system.out.print(""n                                              - - - - - - - - - -"");.            
system.out.print(""nnchoice:"");.            int choice = integer.parseint(input.readline());.            switch(choice){.                case 1:.                    writedata();.                    break;.                case 2:.                    rentdata();.                    break;.                case 3:.                    viewlist();.                    break;.                case 4:.                    search();.                    break;.                case 5:.                    system.out.println(""goodbye!"");.                    system.exit(0);.                default:.                    system.out.print(""invalid choice: "");.                    break;.            }.            system.out.print(""ndo another task? [y/n] "");.            repeat = input.readline();.        }while(repeat.equals(""y""));..        if(repeat!=""y"") system.out.println(""ngoodbye!"");..    }..    public static void writedata() throws exception{.        system.out.print(""nname: "");.        string cname = input.readline();.        system.out.print(""address: "");.        string add = input.readline();.        system.out.print(""phone no.: "");.        string pno = input.readline();.        system.out.print(""rental amount: "");.        string ramount = input.readline();.        system.out.print(""tapenumber: "");.        string tno = input.readline();.        system.out.print(""title: "");.        string title = input.readline();.        system.out.print(""date borrowed: "");.        string dborrowed = input.readline();.        system.out.print(""due date: "");.        string ddate = input.readline();.        createline(cname, add, pno, ramount,tno, title, dborrowed, ddate);.        rentdata();.    }..    public static void createline(string name, string address, string phone , string rental, string tapenumber, string title, string borrowed, string due) throws exception{.        filewriter fw = new filewriter(file, true);.        
fw.write(""nname: ""+name + ""naddress: "" + address +""nphone no.: ""+ phone+""nrentalamount: ""+rental+""ntape no.: ""+ tapenumber+""ntitle: ""+ title+""ndate borrowed: ""+borrowed +""ndue date: ""+ due+"":rn"");.        fw.close();.    }..    public static void loadfile() throws exception{.        try{.            list.clear();.            fileinputstream fstream = new fileinputstream(file);.            bufferedreader br = new bufferedreader(new inputstreamreader(fstream));.            rows = 0;.            while( br.ready()).            {.                list.add(br.readline());.                rows++;.            }.            br.close();.        } catch(exception e){.            system.out.println(""list not yet loaded."");.        }.    }..    public static void viewlist(){.        system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |list of all costumers|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        for(int i = 0; i &lt;rows; i++){.            system.out.println(list.get(i));.        }.    }.        public static void rentdata()throws exception.    {   system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |rent data list|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print(""nenter customer name: "");.        string cname = input.readline();.        system.out.print(""date borrowed: "");.        string dborrowed = input.readline();.        system.out.print(""due date: "");.        string ddate = input.readline();.        system.out.print(""return date: "");.        string rdate = input.readline();.        system.out.print(""rent amount: "");.        string ramount = input.readline();..        system.out.print(""you pay:""+ramount);...    }.    public static void search()throws exception.    {   system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |search costumers|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        
system.out.print(""nenter costumer name: "");.        string cname = input.readline();.        boolean found = false;..        for(int i=0; i &lt; rows; i++){.            string temp[] = list.get(i).split("","");..            if(cname.equals(temp[0])){.            system.out.println(""search result:nyou are "" + temp[0] + "" from "" + temp[1] + "".""+ temp[2] + "".""+ temp[3] + "".""+ temp[4] + "".""+ temp[5] + "" is "" + temp[6] + "".""+ temp[7] + "" is "" + temp[8] + ""."");.                found = true;.            }.        }..        if(!found){.            system.out.print(""no results."");.        }..    }..        public static boolean evaluate(string uname, string pass){.        if (uname.equals(""admin"")&amp;&amp;pass.equals(""12345"")) return true;.        else return false;.    }..    public static string login()throws exception{.        bufferedreader input=new bufferedreader(new inputstreamreader(system.in));.        int counter=0;.        do{.            system.out.print(""username:"");.            string uname =input.readline();.            system.out.print(""password:"");.            string pass =input.readline();..            boolean accept= evaluate(uname,pass);..            if(accept){.                break;.                }else{.                    system.out.println(""incorrect username or password!"");.                    counter ++;.                    }.        }while(counter&lt;3);..            if(counter !=3) return ""login successful"";.            else return ""login failed"";.            }.        public static void introduction() throws exception{..        system.out.println(""                  - - - - - - - - - - - - - - - - - - - - - - - - -"");.        system.out.println(""                  !                  r e n t a l                  !"");.        system.out.println(""                   ! ~ ~ ~ ~ ~ !  =================  ! ~ ~ ~ ~ ~ !"");.        system.out.println(""                  !                  
s y s t e m                  !"");.        system.out.println(""                  - - - - - - - - - - - - - - - - - - - - - - - - -"");.        }..}"\n'
Label: 1
Question:  b'"exception: dynamic sql generation for the updatecommand is not supported against a selectcommand that does not return any key i dont know what is the problem this my code : ..string nomtable;..datatable listeetablissementtable = new datatable();.datatable listeinteretstable = new datatable();.dataset ds = new dataset();.sqldataadapter da;.sqlcommandbuilder cmdb;..private void listeinterets_click(object sender, eventargs e).{.    nomtable = ""listeinteretstable"";.    d.cnx.open();.    da = new sqldataadapter(""select nome from offices"", d.cnx);.    ds = new dataset();.    da.fill(ds, nomtable);.    datagridview1.datasource = ds.tables[nomtable];.}..private void sauvgarder_click(object sender, eventargs e).{.    d.cnx.open();.    cmdb = new sqlcommandbuilder(da);.    da.update(ds, nomtable);.    d.cnx.close();.}"\n'
Label: 0
Question:  b'"parameter with question mark and super in blank, i\'ve come across a method that is formatted like this:..public final subscription subscribe(final action1&lt;? super t&gt; onnext, final action1&lt;throwable&gt; onerror) {.}...in the first parameter, what does the question mark and super mean?"\n'
Label: 1
Question:  b'call two objects wsdl the first time i got a very strange wsdl. ..i would like to call the object (interface - invoicecheck_out) do you know how?....i would like to call the object (variable) do you know how?..try to call (it`s ok)....try to call (how call this?)\n'
Label: 0
Question:  b"how to correctly make the icon for systemtray in blank using icon sizes of any dimension for systemtray doesn't look good overall. .what is the correct way of making icons for windows system tray?..screenshots: http://imgur.com/zsibwn9..icon: http://imgur.com/vsh4zo8\n"
Label: 0
Question:  b'"is there a way to check a variable that exists in a different script than the original one? i\'m trying to check if a variable, which was previously set to true in 2.py in 1.py, as 1.py is only supposed to continue if the variable is true...2.py..import os..completed = false..#some stuff here..completed = true...1.py..import 2 ..if completed == true.   #do things...however i get a syntax error at ..if completed == true"\n'
Label: 3
Question:  b'"blank control flow i made a number which asks for 2 numbers with blank and responds with  the corresponding message for the case. how come it doesnt work  for the second number ? .regardless what i enter for the second number , i am getting the message ""your number is in the range 0-10""...using system;.using system.collections.generic;.using system.linq;.using system.text;..namespace consoleapplication1.{.    class program.    {.        static void main(string[] args).        {.            string myinput;  // declaring the type of the variables.            int myint;..            string number1;.            int number;...            console.writeline(""enter a number"");.            myinput = console.readline(); //muyinput is a string  which is entry input.            myint = int32.parse(myinput); // myint converts the string into an integer..            if (myint &gt; 0).                console.writeline(""your number {0} is greater than zero."", myint);.            else if (myint &lt; 0).                console.writeline(""your number {0} is  less  than zero."", myint);.            else.                console.writeline(""your number {0} is equal zero."", myint);..            console.writeline(""enter another number"");.            number1 = console.readline(); .            number = int32.parse(myinput); ..            if (number &lt; 0 || number == 0).                console.writeline(""your number {0} is  less  than zero or equal zero."", number);.            else if (number &gt; 0 &amp;&amp; number &lt;= 10).                console.writeline(""your number {0} is  in the range from 0 to 10."", number);.            else.                console.writeline(""your number {0} is greater than 10."", number);..            console.writeline(""enter another number"");..        }.    }    .}"\n'
Label: 0
Question:  b'"credentials cannot be used for ntlm authentication i am getting org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials cannot be used for ntlm authentication: exception in eclipse..whether it is possible mention eclipse to take system proxy settings directly?..public class httpgetproxy {.    private static final string proxy_host = ""proxy.****.com"";.    private static final int proxy_port = 6050;..    public static void main(string[] args) {.        httpclient client = new httpclient();.        httpmethod method = new getmethod(""https://kodeblank.org"");..        hostconfiguration config = client.gethostconfiguration();.        config.setproxy(proxy_host, proxy_port);..        string username = ""*****"";.        string password = ""*****"";.        credentials credentials = new usernamepasswordcredentials(username, password);.        authscope authscope = new authscope(proxy_host, proxy_port);..        client.getstate().setproxycredentials(authscope, credentials);..        try {.            client.executemethod(method);..            if (method.getstatuscode() == httpstatus.sc_ok) {.                string response = method.getresponsebodyasstring();.                system.out.println(""response = "" + response);.            }.        } catch (ioexception e) {.            e.printstacktrace();.        } finally {.            method.releaseconnection();.        }.    }.}...exception:...  dec 08, 2017 1:41:39 pm .          org.apache.commons.httpclient.auth.authchallengeprocessor selectauthscheme.         info: ntlm authentication scheme selected.       dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector executeconnect.         severe: credentials cannot be used for ntlm authentication: .           org.apache.commons.httpclient.usernamepasswordcredentials.           org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials .         cannot be used for ntlm authentication: .        
enter code here .          org.apache.commons.httpclient.usernamepasswordcredentials.      at org.apache.commons.httpclient.auth.ntlmscheme.authenticate(ntlmscheme.blank:332).        at org.apache.commons.httpclient.httpmethoddirector.authenticateproxy(httpmethoddirector.blank:320).      at org.apache.commons.httpclient.httpmethoddirector.executeconnect(httpmethoddirector.blank:491).      at org.apache.commons.httpclient.httpmethoddirector.executewithretry(httpmethoddirector.blank:391).      at org.apache.commons.httpclient.httpmethoddirector.executemethod(httpmethoddirector.blank:171).      at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:397).      at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:323).      at httpgetproxy.main(httpgetproxy.blank:31).  dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector processproxyauthchallenge.  info: failure authenticating with ntlm @proxy.****.com:6050"\n'
Label: 1

The labels are 0, 1, 2 or 3. To check which of these corresponds to which string label, you can inspect the class_names property on the dataset:

for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)

Label 0 corresponds to csharp
Label 1 corresponds to java
Label 2 corresponds to javascript
Label 3 corresponds to python

Next, you will create a validation set and a test set using tf.keras.utils.text_dataset_from_directory. You will use the remaining 1,600 questions from the training set for validation.

# Create a validation set.
raw_val_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
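
The training and validation calls share the same seed and validation_split so that the two subsets come from one identical shuffle and therefore do not overlap. The sketch below illustrates that idea in plain Python. The function split_indices is hypothetical, and the shuffle it uses is not claimed to match the one Keras applies internally.

```python
import random

def split_indices(num_examples, validation_split, seed):
    """Shuffle indices with a fixed seed, then reserve the tail for validation."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_val = round(num_examples * validation_split)
    cut = num_examples - num_val
    return indices[:cut], indices[cut:]

# Two independent calls with the same seed reproduce the same partition.
train_idx, val_idx = split_indices(8000, 0.2, seed=42)
train_idx_again, val_idx_again = split_indices(8000, 0.2, seed=42)

print(len(train_idx), len(val_idx))        # 6400 1600
print(set(train_idx).isdisjoint(val_idx))  # True
print(train_idx == train_idx_again)        # True
```

With different seeds the two calls would shuffle differently, and some examples could end up in both subsets or in neither.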

test_dir = dataset_dir/'test'

# Create a test set.
raw_test_ds = utils.text_dataset_from_directory(
    test_dir,
    batch_size=batch_size)

Found 8000 files belonging to 4 classes.

Prepare the dataset for training

Next, you will standardize, tokenize, and vectorize the data using the tf.keras.layers.TextVectorization layer.

Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.
Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).
Vectorization refers to converting tokens into numbers so they can be fed into a neural network.

All of these tasks can be accomplished with this layer. (You can learn more about each of these in the tf.keras.layers.TextVectorization API docs.)
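
To make the three steps concrete, here is a minimal plain-Python sketch of standardization, tokenization, and vectorization applied to one sentence. The function names, the toy vocabulary, and the choice of index 1 for unknown tokens are assumptions made for this illustration; the layer itself does this work on tensors and is not implemented this way.

```python
import re
import string

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def tokenize(text):
    """Split on runs of whitespace."""
    return text.split()

def vectorize(tokens, token_to_index, unknown_index=1):
    """Replace each token by its index; unseen tokens get `unknown_index` (assumed)."""
    return [token_to_index.get(tok, unknown_index) for tok in tokens]

# Hypothetical toy vocabulary: index 0 for padding, index 1 for unknown tokens.
toy_index = {"": 0, "[UNK]": 1, "how": 2, "to": 3, "sort": 4, "a": 5, "dictionary": 6}

tokens = tokenize(standardize("How to sort a Dictionary, by value?"))
ids = vectorize(tokens, toy_index)

print(tokens)  # ['how', 'to', 'sort', 'a', 'dictionary', 'by', 'value']
print(ids)     # [2, 3, 4, 5, 6, 1, 1]
```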

Note that:

The default standardization converts text to lowercase and removes punctuation (standardize='lower_and_strip_punctuation').
The default tokenizer splits on whitespace (split='whitespace').
The default vectorization mode is 'int' (output_mode='int'). This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes, like 'binary', to build bag-of-words models.

You will build two models to learn more about standardization, tokenization, and vectorization with TextVectorization:

First, you will use the 'binary' vectorization mode to build a bag-of-words model.
Then, you will use the 'int' mode with a 1D ConvNet.

VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')

For the 'int' mode, in addition to the maximum vocabulary size, you need to set an explicit maximum sequence length (MAX_SEQUENCE_LENGTH), which causes the layer to pad or truncate sequences to exactly output_sequence_length values:

MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
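
The effect of a fixed output length can be sketched in a few lines of plain Python. The helper pad_or_truncate is hypothetical and works on ordinary lists; it assumes zeros are appended on the right, which is a choice made for this sketch.

```python
def pad_or_truncate(token_ids, length, pad_value=0):
    """Return exactly `length` ids: cut long sequences, right-pad short ones."""
    clipped = token_ids[:length]
    return clipped + [pad_value] * (length - len(clipped))

short = pad_or_truncate([55, 6, 2], length=5)
long = pad_or_truncate(list(range(1, 9)), length=5)

print(short)  # [55, 6, 2, 0, 0]
print(long)   # [1, 2, 3, 4, 5]
```

Giving every example the same length is what allows a batch of questions to be stacked into a single rectangular tensor.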

Next, call TextVectorization.adapt to fit the state of the preprocessing layer to the dataset. This causes the layer to build an index of strings to integers.

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)
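
Conceptually, adapting amounts to counting tokens in the training text and keeping the most frequent ones up to the vocabulary limit. The sketch below shows that idea with collections.Counter. The function build_vocabulary, the two reserved entries, and the alphabetical tie-breaking are assumptions made for this illustration, not a description of the layer's internals.

```python
import collections

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens, after two reserved entries (assumed layout)."""
    counts = collections.Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, _ in ranked[:max_tokens - 2]]
    return ["", "[UNK]"] + kept

corpus = [
    "how to sort a list",
    "how to sort a dictionary",
    "how to parse json",
]
vocab = build_vocabulary(corpus, max_tokens=6)
token_to_index = {tok: i for i, tok in enumerate(vocab)}

print(vocab)                   # ['', '[UNK]', 'how', 'to', 'a', 'sort']
print(token_to_index["sort"])  # 5
```

Only the training text is used to build the index, so tokens that appear solely in validation or test data fall back to the unknown entry.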

Print the result of using these layers to preprocess the data:

def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

# Retrieve a batch (of 32 questions and labels) from the dataset.
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)

Question tf.Tensor(b'"what is the difference between these two ways to create an element? var a = document.createelement(\'div\');..a.id = ""mydiv"";...and..var a = document.createelement(\'div\').id = ""mydiv"";...what is the difference between them such that the first one works and the second one doesn\'t?"\n', shape=(), dtype=string)
Label tf.Tensor(2, shape=(), dtype=int32)

print("'binary' vectorized question:",
      binary_vectorize_text(first_question, first_label)[0])

'binary' vectorized question: tf.Tensor([[1. 1. 0. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)

print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])

'int' vectorized question: tf.Tensor(
[[ 55   6   2 410 211 229 121 895   4 124  32 245  43   5   1   1   5   1
    1   6   2 410 211 191 318  14   2  98  71 188   8   2 199  71 178   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]], shape=(1, 250), dtype=int64)

As shown above, TextVectorization's 'binary' mode returns an array denoting which tokens exist at least once in the input, while the 'int' mode replaces each token by an integer, thus preserving their order.
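
The difference between the two encodings can be seen on a toy example. The sketch below uses plain lists and a hypothetical six-entry vocabulary whose first entry stands for unknown tokens; the function names and the shared vocabulary are simplifications for illustration and do not reproduce the layer's exact index layout.

```python
def to_multi_hot(tokens, vocabulary, unknown_token="[UNK]"):
    """One slot per vocabulary entry: 1.0 if it occurs at least once, else 0.0."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    row = [0.0] * len(vocabulary)
    for tok in tokens:
        row[index.get(tok, index[unknown_token])] = 1.0
    return row

def to_int_sequence(tokens, vocabulary, unknown_token="[UNK]"):
    """One integer per token, in the order the tokens appear."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    return [index.get(tok, index[unknown_token]) for tok in tokens]

toy_vocab = ["[UNK]", "the", "is", "what", "difference", "between"]
tokens = ["what", "is", "the", "difference", "the", "element"]

multi_hot = to_multi_hot(tokens, toy_vocab)
int_ids = to_int_sequence(tokens, toy_vocab)

print(multi_hot)  # [1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
print(int_ids)    # [3, 2, 1, 4, 1, 0]
```

The multi-hot row has a fixed width and discards both order and repetition ("the" occurs twice but is marked once), whereas the integer sequence keeps both.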

You can look up the token (string) that each integer corresponds to by calling TextVectorization.get_vocabulary on the layer:

print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

1289 --->  roman
313 --->  source
Vocabulary size: 10000

You are nearly ready to train your model.

As a final preprocessing step, you will apply the TextVectorization layers you created earlier to the training, validation, and test sets:

binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

Configure the dataset for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

Dataset.cache keeps data in memory after it is loaded off disk. This ensures the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
Dataset.prefetch overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk, in the Prefetching section of the Better performance with the tf.data API guide.

AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)

binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)
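To make the idea behind caching and prefetching concrete, here is a minimal standard-library sketch. It is not how tf.data implements either method; the `Cached` class and `prefetch` function are hypothetical names invented for this illustration.

```python
import queue
import threading


class Cached:
    """Load the data once, then replay it from memory on later passes."""

    def __init__(self, load):
        self._load = load      # callable returning an iterable of examples
        self._items = None
        self.loads = 0         # how many times the slow source was read

    def __iter__(self):
        if self._items is None:
            self._items = list(self._load())
            self.loads += 1
        return iter(self._items)


def prefetch(iterable, buffer_size=2):
    """Yield items while a background thread prepares the next ones."""
    sentinel = object()
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            buffer.put(item)   # blocks when the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def slow_source():
    # Stand-in for reading and vectorizing examples from disk.
    for i in range(5):
        yield i * i


dataset = Cached(slow_source)
epoch_1 = list(prefetch(dataset))
epoch_2 = list(prefetch(dataset))
print(epoch_1, epoch_2, dataset.loads)
```

The consumer (the training loop, in the tutorial) pulls from the buffer while the producer keeps filling it, and the second pass over the data never touches the slow source again.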

Train the model

It is time to create your neural network.

For the 'binary' vectorized data, define a simple bag-of-words linear model, then configure and train it:

binary_model = tf.keras.Sequential([layers.Dense(4)])

binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)

Epoch 1/10
200/200 [==============================] - 2s 4ms/step - loss: 1.1170 - accuracy: 0.6509 - val_loss: 0.9165 - val_accuracy: 0.7844
Epoch 2/10
200/200 [==============================] - 1s 3ms/step - loss: 0.7781 - accuracy: 0.8169 - val_loss: 0.7522 - val_accuracy: 0.8050
Epoch 3/10
200/200 [==============================] - 1s 3ms/step - loss: 0.6274 - accuracy: 0.8591 - val_loss: 0.6664 - val_accuracy: 0.8163
Epoch 4/10
200/200 [==============================] - 1s 3ms/step - loss: 0.5342 - accuracy: 0.8866 - val_loss: 0.6129 - val_accuracy: 0.8188
Epoch 5/10
200/200 [==============================] - 1s 3ms/step - loss: 0.4683 - accuracy: 0.9038 - val_loss: 0.5761 - val_accuracy: 0.8281
Epoch 6/10
200/200 [==============================] - 1s 3ms/step - loss: 0.4181 - accuracy: 0.9181 - val_loss: 0.5494 - val_accuracy: 0.8331
Epoch 7/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3779 - accuracy: 0.9287 - val_loss: 0.5293 - val_accuracy: 0.8388
Epoch 8/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3446 - accuracy: 0.9361 - val_loss: 0.5137 - val_accuracy: 0.8400
Epoch 9/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3164 - accuracy: 0.9430 - val_loss: 0.5014 - val_accuracy: 0.8381
Epoch 10/10
200/200 [==============================] - 1s 3ms/step - loss: 0.2920 - accuracy: 0.9495 - val_loss: 0.4916 - val_accuracy: 0.8388
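As a rough illustration of the loss used above, the following sketch computes sparse categorical cross-entropy from logits for a single example in plain Python. The function name and the sample logits are made up for the example; Keras computes the same quantity over whole batches.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Return -log(softmax(logits)[label]) using the log-sum-exp trick."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[label]


# An untrained model that scores all four tags equally:
uniform_loss = sparse_categorical_crossentropy_from_logits([0.0, 0.0, 0.0, 0.0], 2)

# A model that is confident in the correct tag (hypothetical logits):
confident_loss = sparse_categorical_crossentropy_from_logits([0.0, 0.0, 5.0, 0.0], 2)

print(uniform_loss, confident_loss)
```

With four equally likely classes the loss is ln 4, roughly 1.386, so a first-epoch training loss of about 1.12 already indicates the linear model is doing better than chance.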

Next, you will use the 'int' vectorized layer to build a 1D ConvNet:

def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model

# `vocab_size` is `VOCAB_SIZE + 1` since `0` is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)

Epoch 1/5
200/200 [==============================] - 9s 5ms/step - loss: 1.1471 - accuracy: 0.5016 - val_loss: 0.7856 - val_accuracy: 0.6913
Epoch 2/5
200/200 [==============================] - 1s 3ms/step - loss: 0.6378 - accuracy: 0.7550 - val_loss: 0.5494 - val_accuracy: 0.8056
Epoch 3/5
200/200 [==============================] - 1s 3ms/step - loss: 0.3900 - accuracy: 0.8764 - val_loss: 0.4845 - val_accuracy: 0.8206
Epoch 4/5
200/200 [==============================] - 1s 3ms/step - loss: 0.2234 - accuracy: 0.9447 - val_loss: 0.4819 - val_accuracy: 0.8188
Epoch 5/5
200/200 [==============================] - 1s 3ms/step - loss: 0.1146 - accuracy: 0.9809 - val_loss: 0.5038 - val_accuracy: 0.8150
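The Conv1D layer in create_model uses a kernel of size 5, "valid" padding, and a stride of 2, so it shortens the sequence before GlobalMaxPooling1D collapses it to one vector. The sketch below works out the number of output steps; the input length of 250 is an assumption chosen for illustration, since the value of MAX_SEQUENCE_LENGTH is not stated in this section.

```python
def conv1d_output_length(length, kernel_size=5, strides=2):
    """Number of output steps of a 1D convolution with 'valid' padding."""
    if length < kernel_size:
        return 0  # the kernel does not fit even once
    return (length - kernel_size) // strides + 1


ASSUMED_SEQUENCE_LENGTH = 250  # hypothetical input length

steps = conv1d_output_length(ASSUMED_SEQUENCE_LENGTH)
print(steps)
```

Each of those steps carries 64 channels, and the pooling layer then keeps the maximum of each channel across all steps.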

Compare the two models:

print("Linear model on binary vectorized data:")
print(binary_model.summary())

Linear model on binary vectorized data:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 4)                 40004     
                                                                 
=================================================================
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________
None

print("ConvNet model on int vectorized data:")
print(int_model.summary())

ConvNet model on int vectorized data:
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, None, 64)          640064    
                                                                 
 conv1d (Conv1D)             (None, None, 64)          20544     
                                                                 
 global_max_pooling1d (Globa  (None, 64)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense_1 (Dense)             (None, 4)                 260       
                                                                 
=================================================================
Total params: 660,868
Trainable params: 660,868
Non-trainable params: 0
_________________________________________________________________
None
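The parameter counts in the two summaries can be checked by hand. The sketch below redoes the arithmetic with small helper functions (hypothetical names, written for this example) using the layer sizes from the code above and the vocabulary size of 10000 printed earlier.

```python
VOCAB_SIZE = 10000  # vocabulary size printed earlier in the tutorial
NUM_LABELS = 4


def dense_params(inputs, units):
    """One weight per input per unit, plus one bias per unit."""
    return inputs * units + units


def embedding_params(vocab, dim):
    """One vector of length `dim` per vocabulary index."""
    return vocab * dim


def conv1d_params(in_channels, filters, kernel_size):
    """One kernel per filter, plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


# Linear model: a single Dense(4) layer on a 10000-wide multi-hot vector.
linear_total = dense_params(VOCAB_SIZE, NUM_LABELS)

# ConvNet: Embedding(VOCAB_SIZE + 1, 64) -> Conv1D(64, 5) -> pooling -> Dense(4).
conv_total = (
    embedding_params(VOCAB_SIZE + 1, 64)
    + conv1d_params(64, 64, 5)
    + dense_params(64, NUM_LABELS)
)

print(linear_total, conv_total)
```

The totals agree with the summaries: 40,004 parameters for the linear model and 660,868 for the ConvNet, most of which sit in the embedding table.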

Evaluate both models on the test data:

binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))

250/250 [==============================] - 1s 3ms/step - loss: 0.5178 - accuracy: 0.8151
250/250 [==============================] - 1s 2ms/step - loss: 0.5262 - accuracy: 0.8073
Binary model accuracy: 81.51%
Int model accuracy: 80.73%

Export the model

In the code above, you applied tf.keras.layers.TextVectorization to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the TextVectorization layer inside your model.

To do so, you can create a new model using the weights you have just trained:

export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
# Report the accuracy just measured on the exported model.
print("Accuracy: {:2.2%}".format(accuracy))

250/250 [==============================] - 1s 4ms/step - loss: 0.5178 - accuracy: 0.8151
Accuracy: 81.51%

Now, your model can take raw strings as input and predict a score for each label using Model.predict. Define a function to find the label with the maximum score:

def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels
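The same label lookup can be sketched without TensorFlow. The class-name order below is an assumption (alphabetical, as a directory-based loader would typically produce) and the logits are invented; the point is that the sigmoid activation added to the exported model is monotonic, so it does not change which label wins the argmax.

```python
import math

# Assumed class order (hypothetical for this sketch).
CLASS_NAMES = ["csharp", "java", "javascript", "python"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def argmax_labels(score_rows, class_names):
    """Pick, for each row of scores, the class name with the highest score."""
    labels = []
    for row in score_rows:
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(class_names[best])
    return labels


# Made-up logits for two questions.
logits = [[-1.2, 0.3, -0.5, 2.1], [0.1, 1.7, -0.4, -2.0]]
scores = [[sigmoid(v) for v in row] for row in logits]

labels_from_logits = argmax_labels(logits, CLASS_NAMES)
labels_from_scores = argmax_labels(scores, CLASS_NAMES)
print(labels_from_logits, labels_from_scores)
```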

Run inference on new data

inputs = [
    "how do I extract keys from a dict into a list?",  # 'python'
    "debug public static void main(string[] args) {...}",  # 'java'
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())

Question:  how do I extract keys from a dict into a list?
Predicted label:  b'python'
Question:  debug public static void main(string[] args) {...}
Predicted label:  b'java'

Including the text preprocessing logic inside your model lets you export a model for production that simplifies deployment and reduces the potential for train/test skew.

There is a performance difference to keep in mind when choosing where to apply tf.keras.layers.TextVectorization. Using it outside of your model lets you do asynchronous CPU processing and buffering of your data when training on GPU. So, if you are training your model on the GPU, you probably want to use this option to get the best performance while developing your model, then switch to including the TextVectorization layer inside your model when you are ready to prepare for deployment.

Visit the Save and load models tutorial to learn more about saving models.

Example 2: Predict the author of Iliad translations

The following is an example of using tf.data.TextLineDataset to load examples from text files, and TensorFlow Text to preprocess the data. You will use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.

Download and explore the dataset

The texts of the three translations are by William Cowper, Edward, Earl of Derby, and Samuel Butler.

The text files used in this tutorial have undergone some typical preprocessing tasks, such as removing the document header and footer, line numbers, and chapter titles.

Download these lightly modified files locally:

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
819200/815980 [==============================] - 0s 0us/step
827392/815980 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
811008/809730 [==============================] - 0s 0us/step
819200/809730 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt
811008/807992 [==============================] - 0s 0us/step
819200/807992 [==============================] - 0s 0us/step
[PosixPath('/home/kbuilder/.keras/datasets/derby.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/butler.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/cowper.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/fashion-mnist'),
 PosixPath('/home/kbuilder/.keras/datasets/mnist.npz')]

Load the dataset

Previously, with tf.keras.utils.text_dataset_from_directory, all contents of a file were treated as a single example. Here, you will use tf.data.TextLineDataset, which is designed to create a tf.data.Dataset from a text file where each example is a line of text from the original file. TextLineDataset is useful for text data that is primarily line-based (for example, poetry or error logs).

Iterate through these files, loading each one into its own dataset. Each example needs to be labeled individually, so use Dataset.map to apply a labeler function to each one. This iterates over every example in the dataset, returning (example, label) pairs.

def labeler(example, index):
  return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

Next, you will combine these labeled datasets into a single dataset using Dataset.concatenate, and shuffle it with Dataset.shuffle:

BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
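The labeling, concatenation, and buffer-based shuffling steps can be sketched in plain Python as follows. This is an illustration of the general technique, not the tf.data implementation; the sample lines, `label_lines`, and `buffered_shuffle` are invented for the example.

```python
import random


def label_lines(lines, index):
    """Attach the same integer label to every line of one file."""
    return [(line, index) for line in lines]


def buffered_shuffle(items, buffer_size, seed=0):
    """Shuffle a stream using a fixed-size buffer (buffer_size >= 1)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # still filling the buffer
            continue
        i = rng.randrange(buffer_size)
        yield buffer[i]              # emit a random buffered item...
        buffer[i] = item             # ...and replace it with the new one
    rng.shuffle(buffer)              # drain what is left at the end
    yield from buffer


# Hypothetical stand-ins for the three text files.
files = [["line a1", "line a2"], ["line b1"], ["line c1", "line c2"]]

all_labeled = []
for index, lines in enumerate(files):
    all_labeled.extend(label_lines(lines, index))   # concatenate

shuffled = list(buffered_shuffle(all_labeled, buffer_size=len(all_labeled)))
print(shuffled)
```

A buffer at least as large as the dataset gives a uniform shuffle, while a buffer of size 1 leaves the order unchanged; smaller buffers trade shuffle quality for memory.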

Print out a few examples as before. The dataset has not been batched yet, so each entry in all_labeled_data corresponds to one data point:

for text, label in all_labeled_data.take(10):
  print("Sentence: ", text.numpy())
  print("Label:", label.numpy())

Sentence:  b'Beneath the yoke the flying coursers led.'
Label: 1
Sentence:  b'Too free a range, and watchest all I do;'
Label: 1
Sentence:  b'defence of their ships. Thus would any seer who was expert in these'
Label: 2
Sentence:  b'"From morn to eve I fell, a summer\'s day,"'
Label: 0
Sentence:  b'went to the city bearing a message of peace to the Cadmeians; on his'
Label: 2
Sentence:  b'darkness of the flying night, and tell it to Agamemnon. This might'
Label: 2
Sentence:  b"To that distinction, Nestor's son, whom yet"
Label: 0
Sentence:  b'A sounder judge of honour and disgrace:'
Label: 1
Sentence:  b'He wept as he spoke, and the elders sighed in concert as each thought'
Label: 2
Sentence:  b'to gather his bones for the silt in which I shall have hidden him, and'
Label: 2

Prepare the dataset for training

Instead of using tf.keras.layers.TextVectorization to preprocess the text dataset, you will now use the TensorFlow Text APIs to standardize and tokenize the data, build a vocabulary, and use tf.lookup.StaticVocabularyTable to map tokens to integers to feed to the model. (Learn more about TensorFlow Text.)

Define a function to convert the text to lowercase and tokenize it:

TensorFlow Text provides various tokenizers. In this example, you will use text.UnicodeScriptTokenizer to tokenize the dataset.
You will use Dataset.map to apply the tokenization to the dataset.

tokenizer = tf_text.UnicodeScriptTokenizer()

def tokenize(text, unused_label):
  lower_case = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)

tokenized_ds = all_labeled_data.map(tokenize)

You can iterate over the dataset and print out a few tokenized examples:

for text_batch in tokenized_ds.take(5):
  print("Tokens: ", text_batch.numpy())

Tokens:  [b'beneath' b'the' b'yoke' b'the' b'flying' b'coursers' b'led' b'.']
Tokens:  [b'too' b'free' b'a' b'range' b',' b'and' b'watchest' b'all' b'i' b'do'
 b';']
Tokens:  [b'defence' b'of' b'their' b'ships' b'.' b'thus' b'would' b'any' b'seer'
 b'who' b'was' b'expert' b'in' b'these']
Tokens:  [b'"' b'from' b'morn' b'to' b'eve' b'i' b'fell' b',' b'a' b'summer' b"'"
 b's' b'day' b',"']
Tokens:  [b'went' b'to' b'the' b'city' b'bearing' b'a' b'message' b'of' b'peace'
 b'to' b'the' b'cadmeians' b';' b'on' b'his']
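For intuition, the token boundaries shown above can be approximated with a regular expression in plain Python. This is only an approximation for English text like these lines: the real tokenizer splits on Unicode script boundaries, and the pattern and function below are assumptions made for this sketch.

```python
import re

# Runs of letters, runs of digits, or runs of punctuation (whitespace is dropped).
_TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d+|(?:[^\w\s]|_)+")


def tokenize(text):
    """Case-fold the text, then split it into letter, digit, and punctuation runs."""
    return _TOKEN_PATTERN.findall(text.casefold())


tokens_1 = tokenize("Too free a range, and watchest all I do;")
tokens_2 = tokenize('"From morn to eve I fell, a summer\'s day,"')
print(tokens_1)
print(tokens_2)
```

As in the output above, adjacent punctuation marks stay together in one token, and an apostrophe splits a word into three tokens.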

Next, you will build a vocabulary by sorting tokens by frequency and keeping the top VOCAB_SIZE tokens:

tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok] += 1

vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries:", vocab[:5])

Vocab size:  10000
First five vocab entries: [b',', b'the', b'and', b"'", b'of']
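The same counting can be written more compactly with collections.Counter. The sketch below uses a handful of made-up tokenized lines and a hypothetical `build_vocab` helper; it is an alternative formulation for illustration, not the tutorial's own code.

```python
from collections import Counter


def build_vocab(tokenized_lines, max_size):
    """Return the `max_size` most frequent tokens, most frequent first."""
    counts = Counter(tok for line in tokenized_lines for tok in line)
    return [tok for tok, _ in counts.most_common(max_size)]


# Made-up tokenized lines.
lines = [
    ["the", "ships", ","],
    ["the", "flying", "ships", ","],
    ["the", ","],
    ["the"],
]

top_tokens = build_vocab(lines, max_size=3)
print(top_tokens)
```

Tokens that fall outside the top `max_size` (here, "flying") are left out of the vocabulary and must be handled as out-of-vocabulary at lookup time.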

To convert the tokens into integers, use the vocab list to create a tf.lookup.StaticVocabularyTable. You will map tokens to integers in the range [2, vocab_size + 2), that is, from 2 up to vocab_size + 1. As with the TextVectorization layer, 0 is reserved to denote padding and 1 is reserved to denote an out-of-vocabulary (OOV) token.

keys = vocab
values = range(2, len(vocab) + 2)  # Reserve `0` for padding, `1` for OOV tokens.

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)
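The numbering scheme described in the text (0 for padding, 1 for OOV, vocabulary ids starting at 2) can be sketched with a plain dictionary. This mirrors the description only; it does not attempt to reproduce how tf.lookup.StaticVocabularyTable assigns ids to its hashed OOV buckets, and `make_lookup` is a hypothetical helper.

```python
PAD_ID = 0  # reserved for padding
OOV_ID = 1  # reserved for out-of-vocabulary tokens


def make_lookup(vocab):
    """Build a token -> id function with ids starting at 2."""
    table = {tok: i for i, tok in enumerate(vocab, start=2)}

    def lookup(tokens):
        return [table.get(tok, OOV_ID) for tok in tokens]

    return lookup


# First three entries of the vocabulary printed above, as plain strings.
lookup = make_lookup([",", "the", "and"])
ids = lookup(["the", "zeus", ","])
print(ids)
```

With this scheme the most frequent token gets id 2, the second most frequent id 3, and so on, while any unknown token is mapped to 1.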

Finally, define a function to standardize, tokenize, and vectorize the dataset using the tokenizer and lookup table:

def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label

You can try this on a single example to print the output:

example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())

Sentence:  b'Beneath the yoke the flying coursers led.'
Vectorized sentence:  [234   3 811   3 446 749 248   7]

Now run the preprocess function on the dataset using Dataset.map:

all_encoded_data = all_labeled_data.map(preprocess_text)

Split the dataset into training and test sets

The Keras TextVectorization layer also batches and pads the vectorized data. Padding is required because the examples inside a batch need to be the same size and shape, but the examples in these datasets are not all the same size: each line of text has a different number of words.

tf.data.Dataset supports splitting and padded-batching datasets:

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

Now, validation_data and train_data are not collections of (example, label) pairs, but collections of batches. Each batch is a pair of (many examples, many labels) represented as arrays.

To illustrate this:

sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])

Text batch shape:  (64, 18)
Label batch shape:  (64,)
First text example:  tf.Tensor([234   3 811   3 446 749 248   7   0   0   0   0   0   0   0   0   0   0], shape=(18,), dtype=int64)
First label example:  tf.Tensor(1, shape=(), dtype=int64)
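The split and the per-batch padding can be sketched in plain Python. The helper names and the short id sequences below are made up; the sketch only illustrates that each batch is padded with zeros to the length of its own longest sequence, which is why the batch above has width 18.

```python
def split(examples, validation_size):
    """First `validation_size` examples for validation, the rest for training."""
    return examples[validation_size:], examples[:validation_size]


def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its longest member."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (width - len(seq)) for seq in batch]


# Made-up vectorized lines of different lengths.
encoded = [[234, 3, 811], [5], [7, 8], [9, 9, 9, 9], [6, 2]]

train, validation = split(encoded, validation_size=2)
validation_batches = list(padded_batches(validation, batch_size=2))
train_batches = list(padded_batches(train, batch_size=2))
print(validation_batches)
print(train_batches)
```

Different batches can therefore have different widths, which is fine for the model above because its pooling layer reduces any sequence length to a fixed-size vector.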

Since you use 0 for padding and 1 for out-of-vocabulary (OOV) tokens, the vocabulary size has increased by two:

vocab_size += 2

Configure the datasets for better performance as before:

train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

Train the model

You can train a model on this dataset as before:

model = create_model(vocab_size=vocab_size, num_labels=3)

model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

history = model.fit(train_data, validation_data=validation_data, epochs=3)

Epoch 1/3
697/697 [==============================] - 27s 9ms/step - loss: 0.5238 - accuracy: 0.7658 - val_loss: 0.3814 - val_accuracy: 0.8306
Epoch 2/3
697/697 [==============================] - 3s 4ms/step - loss: 0.2852 - accuracy: 0.8847 - val_loss: 0.3697 - val_accuracy: 0.8428
Epoch 3/3
697/697 [==============================] - 3s 4ms/step - loss: 0.1924 - accuracy: 0.9279 - val_loss: 0.3917 - val_accuracy: 0.8424

loss, accuracy = model.evaluate(validation_data)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

79/79 [==============================] - 1s 2ms/step - loss: 0.3917 - accuracy: 0.8424
Loss:  0.391705721616745
Accuracy: 84.24%

Export the model

To make the model capable of taking raw strings as input, you will create a Keras TextVectorization layer that performs the same steps as your custom preprocessing function. Since you have already built a vocabulary, you can load it with TextVectorization.set_vocabulary instead of calling TextVectorization.adapt, which would train a new vocabulary.

preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

preprocess_layer.set_vocabulary(vocab)

export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Create a test dataset of raw strings.
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)

loss, accuracy = export_model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

2022-02-05 02:26:40.203675: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: sequential_4/text_vectorization_2/UnicodeScriptTokenize/Assert_1/AssertGuard/branch_executed/_185
79/79 [==============================] - 6s 8ms/step - loss: 0.4955 - accuracy: 0.7964
Loss:  0.4955357015132904
Accuracy: 79.64%

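The loss and accuracy of the model on the encoded validation set and of the exported model on the raw validation set are expected to match. In the run logged above they do not (loss 0.3917 and accuracy 84.24% versus loss 0.4955 and accuracy 79.64%), which suggests the two preprocessing paths are not strictly equivalent, for example in how out-of-vocabulary tokens are indexed or in how sequences are padded and truncated.

Run inference on new data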

inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]

predicted_scores = export_model.predict(inputs)
predicted_labels = tf.argmax(predicted_scores, axis=1)

for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())

2022-02-05 02:26:43.328949: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: sequential_4/text_vectorization_2/UnicodeScriptTokenize/Assert_1/AssertGuard/branch_executed/_185
Question:  Join'd to th' Ionians with their flowing robes,
Predicted label:  1
Question:  the allies, and his armour flashed about him so that he seemed to all
Predicted label:  2
Question:  And with loud clangor of his arms he fell.
Predicted label:  0

Download more datasets using TensorFlow Datasets (TFDS)

You can download many more datasets from TensorFlow Datasets.

In this example, you will use the IMDB Large Movie Review dataset to train a model for sentiment classification:

# Training set.
train_ds = tfds.load(
    'imdb_reviews',
    split='train[:80%]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

# Validation set.
val_ds = tfds.load(
    'imdb_reviews',
    split='train[80%:]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)
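The effect of the two split strings can be checked with a little arithmetic. The sketch assumes the IMDB training split holds 25,000 reviews and reuses BATCH_SIZE = 64 from earlier; the helper names are invented for the example.

```python
import math

TRAIN_EXAMPLES = 25000  # assumed size of the IMDB 'train' split
BATCH_SIZE = 64         # same value as defined earlier in the tutorial


def split_sizes(total, percent):
    """Sizes of the 'train[:percent%]' and 'train[percent%:]' slices."""
    cut = total * percent // 100
    return cut, total - cut


def steps_per_epoch(examples, batch_size):
    """Number of batches, counting a final partial batch."""
    return math.ceil(examples / batch_size)


train_examples, val_examples = split_sizes(TRAIN_EXAMPLES, 80)
train_steps = steps_per_epoch(train_examples, BATCH_SIZE)
val_steps = steps_per_epoch(val_examples, BATCH_SIZE)
print(train_examples, val_examples, train_steps, val_steps)
```

Under that assumption the split gives 20,000 training and 5,000 validation reviews, that is, 313 and 79 batches per epoch, which matches the step counts in the training logs further down.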

Print a few examples:

for review_batch, label_batch in val_ds.take(1):
  for i in range(5):
    print("Review: ", review_batch[i].numpy())
    print("Label: ", label_batch[i].numpy())

Review: b"Instead, go to the zoo, buy some peanuts and feed 'em to the monkeys. Monkeys are funny. People with amnesia who don't say much, just sit there with vacant eyes are not all that funny. Black comedy? There isn't a black person in it, and there isn't one funny thing in it either. Walmart buys these things up somehow and puts them on their dollar rack. It's labeled Unrated. I think they took out the topless scene. They may have taken out other stuff too, who knows? All we know is that whatever they took out, isn't there any more. The acting seemed OK to me. There's a lot of unfathomables tho. It's supposed to be a city? It's supposed to be a big lake? If it's so hot in the church people are fanning themselves, why are they all wearing coats?"
Label: 0
Review: b'Well, was Morgan Freeman any more unusual as God than George Burns? This film sure was better than that bore, "Oh, God". I was totally engrossed and LMAO all the way through. Carrey was perfect as the out of sorts anchorman wannabe, and Aniston carried off her part as the frustrated girlfriend in her usual well played performance. I, for one, don\'t consider her to be either ugly or untalented. I think my favorite scene was when Carrey opened up the file cabinet thinking it could never hold his life history. See if you can spot the file in the cabinet that holds the events of his bathroom humor: I was rolling over this one. Well written and even better played out, this comedy will go down as one of this funnyman\'s best.'
Label: 1
Review: b'I remember stumbling upon this special while channel-surfing in 1965. I had never heard of Barbra before. When the show was over, I thought "This is probably the best thing on TV I will ever see in my life." 42 years later, that has held true. There is still nothing so amazing, so honestly astonishing as the talent that was displayed here. You can talk about all the super-stars you want to, this is the most superlative of them all! You name it, she can do it. Comedy, pathos, sultry seduction, ballads, Barbra is truly a story-teller. Her ability to pull off anything she attempts is legendary. But this special was made in the beginning, and helped to create the legend that she quickly became. In spite of rising so far in such a short time, she has fulfilled the promise, revealing more of her talents as she went along. But they are all here from the very beginning. You will not be disappointed in viewing this.'
Label: 1
Review: b"Firstly, I would like to point out that people who have criticised this film have made some glaring errors. Anything that has a rating below 6/10 is clearly utter nonsense. Creep is an absolutely fantastic film with amazing film effects. The actors are highly believable, the narrative thought provoking and the horror and graphical content extremely disturbing. There is much mystique in this film. Many questions arise as the audience are revealed to the strange and freakish creature that makes habitat in the dark rat ridden tunnels. How was 'Craig' created and what happened to him? A fantastic film with a large chill factor. A film with so many unanswered questions and a film that needs to be appreciated along with others like 28 Days Later, The Bunker, Dog Soldiers and Deathwatch. Look forward to more of these fantastic films!!"
Label: 1
Review: b"I'm sorry but I didn't like this doc very much. I can think of a million ways it could have been better. The people who made it obviously don't have much imagination. The interviews aren't very interesting and no real insight is offered. The footage isn't assembled in a very informative way, either. It's too bad because this is a movie that really deserves spellbinding special features. One thing I'll say is that Isabella Rosselini gets more beautiful the older she gets. All considered, this only gets a '4.'"
Label: 0

You can now preprocess the data and train a model as before.

Prepare the dataset for training

vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)

def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)

# Configure datasets for performance as before.
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)

Create, configure, and train the model

model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_2 (Embedding)     (None, None, 64)          640064    
                                                                 
 conv1d_2 (Conv1D)           (None, None, 64)          20544     
                                                                 
 global_max_pooling1d_2 (Glo  (None, 64)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 660,673
Trainable params: 660,673
Non-trainable params: 0
_________________________________________________________________

model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

history = model.fit(train_ds, validation_data=val_ds, epochs=3)

Epoch 1/3
313/313 [==============================] - 3s 7ms/step - loss: 0.5417 - accuracy: 0.6618 - val_loss: 0.3752 - val_accuracy: 0.8244
Epoch 2/3
313/313 [==============================] - 1s 4ms/step - loss: 0.2996 - accuracy: 0.8680 - val_loss: 0.3165 - val_accuracy: 0.8632
Epoch 3/3
313/313 [==============================] - 1s 4ms/step - loss: 0.1845 - accuracy: 0.9276 - val_loss: 0.3217 - val_accuracy: 0.8674

loss, accuracy = model.evaluate(val_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

79/79 [==============================] - 0s 2ms/step - loss: 0.3217 - accuracy: 0.8674
Loss:  0.32172858715057373
Accuracy: 86.74%

Export the model

export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    # The model has a single sigmoid output, so use a binary loss on probabilities.
    loss=losses.BinaryCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]

predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]

for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label)

Question:  This is a fantastic movie.
Predicted label:  1
Question:  This is a bad movie.
Predicted label:  0
Question:  This movie was so bad that it was good.
Predicted label:  0
Question:  I will never say yes to watching this movie.
Predicted label:  0
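The rounding step used to turn scores into labels amounts to thresholding the sigmoid output at 0.5. A minimal sketch with made-up logits (the function names are invented for this example):

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_label(logit):
    """1 (positive review) if the sigmoid score rounds up, else 0 (negative)."""
    return int(round(sigmoid(logit)))


# Hypothetical logits for three reviews.
logits = [3.2, -2.4, -0.3]
labels = [predict_label(v) for v in logits]
print(labels)
```

A positive logit gives a score above 0.5 and therefore label 1; a negative logit gives label 0.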

Conclusion

This tutorial demonstrated several ways to load and preprocess text. As a next step, you can explore the other text preprocessing tutorials for TensorFlow Text.

You can also find new datasets on TensorFlow Datasets. And, to learn more about tf.data, check out the guide on building input pipelines.
