
Load text


This tutorial demonstrates two ways to load and preprocess text.

  • First, you will use Keras utilities and layers. If you are new to TensorFlow, you should start with these.

  • Next, you will use lower-level utilities like tf.data.TextLineDataset to load text files, and tf.text to preprocess the data for finer-grained control.

# Be sure you're using the stable versions of both tf and tf-text, for binary compatibility.
pip install -q -U tensorflow
pip install -q -U tensorflow-text
import collections
import pathlib
import re
import string

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

Example 1: Predict the tag for a Stack Overflow question

As a first example, you will download a dataset of programming questions from Stack Overflow. Each question ("How do I sort a dictionary by value?") is labeled with exactly one tag (Python, CSharp, JavaScript, or Java). Your task is to develop a model that predicts the tag for a question. This is an example of multi-class classification, an important and widely applicable kind of machine learning problem.

Download and explore the dataset

Next, you will download the dataset and explore the directory structure.

data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
dataset = utils.get_file(
    'stack_overflow_16k.tar.gz',
    data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir='')
dataset_dir = pathlib.Path(dataset).parent
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz
6053888/6053168 [==============================] - 0s 0us/step
list(dataset_dir.iterdir())
[PosixPath('/tmp/.keras/train'),
 PosixPath('/tmp/.keras/README.md'),
 PosixPath('/tmp/.keras/test'),
 PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz.tar.gz')]
train_dir = dataset_dir/'train'
list(train_dir.iterdir())
[PosixPath('/tmp/.keras/train/java'),
 PosixPath('/tmp/.keras/train/csharp'),
 PosixPath('/tmp/.keras/train/javascript'),
 PosixPath('/tmp/.keras/train/python')]

The train/csharp, train/java, train/python and train/javascript directories contain many text files, each of which is a Stack Overflow question. Print one of the files and inspect the data.

sample_file = train_dir/'python/1755.txt'
with open(sample_file) as f:
  print(f.read())
why does this blank program print true x=true.def stupid():.    x=false.stupid().print x

Load the dataset

Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the text_dataset_from_directory utility to create a labeled tf.data.Dataset. If you are new to tf.data, it is a powerful collection of tools for building input pipelines.

preprocessing.text_dataset_from_directory expects a directory structure as follows.

train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt

When running a machine learning experiment, it is a best practice to divide your dataset into three splits: train, validation, and test. The Stack Overflow dataset has already been divided into train and test, but it lacks a validation set. Create a validation set using an 80:20 split of the training data with the validation_split argument below.

batch_size = 32
seed = 42

raw_train_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)
Found 8000 files belonging to 4 classes.
Using 6400 files for training.

As you can see above, there are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training. As you will see in a moment, you can train a model by passing a tf.data.Dataset directly to model.fit. First, iterate over the dataset and print out a few examples to get a feel for the data.

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])
Question:  b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default constructor, which therefor ruins the whole rest of the program. can somebody help me?..for those of you who want to see more of my code: here you go..public double vertexangle().    {.        system.out.println(""the vertex angle method: "" + mynumsides);// prints out 5.        system.out.println(""the vertex angle method: "" + mysidelength); // prints out 30..        double vertexangle;.        vertexangle = ((mynumsides - 2.0) / mynumsides) * 180.0;.        return vertexangle;.    }//end method vertexangle..public void menu().{.    system.out.println(mynumsides); // prints out what the user puts in.    system.out.println(mysidelength); // prints out what the user puts in.    gotographic();.    calcr(mynumsides, mysidelength);.    calcr(mynumsides, mysidelength);.    print(); .}// end menu...this is my entire tester class:..public static void main(string[] arg).{.    int numsides;.    double sidelength;.    scanner keyboard = new scanner(system.in);..    system.out.println(""welcome to the regular polygon program!"");.    system.out.println();..    system.out.print(""enter the number of sides of the polygon ==> "");.    numsides = keyboard.nextint();.    system.out.println();..    system.out.print(""enter the side length of each side ==> "");.    sidelength = keyboard.nextdouble();.    system.out.println();..    regularpolygon shape = new regularpolygon(numsides, sidelength);.    shape.menu();.}//end main...for testing it i sent it numsides 4 and sidelength 100."\n'
Label: 1
Question:  b'"blank code slow skin detection this code changes the color space to lab and using a threshold finds the skin area of an image. but it\'s ridiculously slow. i don\'t know how to make it faster ?    ..from colormath.color_objects import *..def skindetection(img, treshold=80, color=[255,20,147]):..    print img.shape.    res=img.copy().    for x in range(img.shape[0]):.        for y in range(img.shape[1]):.            rgbimg=rgbcolor(img[x,y,0],img[x,y,1],img[x,y,2]).            labimg=rgbimg.convert_to(\'lab\', debug=false).            if (labimg.lab_l > treshold):.                res[x,y,:]=color.            else: .                res[x,y,:]=img[x,y,:]..    return res"\n'
Label: 3
Question:  b'"option and validation in blank i want to add a new option on my system where i want to add two text files, both rental.txt and customer.txt. inside each text are id numbers of the customer, the videotape they need and the price...i want to place it as an option on my code. right now i have:...add customer.rent return.view list.search.exit...i want to add this as my sixth option. say for example i ordered a video, it would display the price and would let me confirm the price and if i am going to buy it or not...here is my current code:..  import blank.io.*;.    import blank.util.arraylist;.    import static blank.lang.system.out;..    public class rentalsystem{.    static bufferedreader input = new bufferedreader(new inputstreamreader(system.in));.    static file file = new file(""file.txt"");.    static arraylist<string> list = new arraylist<string>();.    static int rows;..    public static void main(string[] args) throws exception{.        introduction();.        system.out.print(""nn"");.        login();.        system.out.print(""nnnnnnnnnnnnnnnnnnnnnn"");.        introduction();.        string repeat;.        do{.            loadfile();.            system.out.print(""nwhat do you want to do?nn"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                    |     1. add customer    |   2. rent return |n"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                    |     3. view list       |   4. search      |n"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                                             |   5. exit        |n"");.            system.out.print(""n                                              - - - - - - - - - -"");.            system.out.print(""nnchoice:"");.            int choice = integer.parseint(input.readline());.            switch(choice){.                case 1:.                    writedata();.                    break;.                case 2:.                    rentdata();.                    break;.                case 3:.                    viewlist();.                    break;.                case 4:.                    search();.                    break;.                case 5:.                    system.out.println(""goodbye!"");.                    system.exit(0);.                default:.                    system.out.print(""invalid choice: "");.                    break;.            }.            system.out.print(""ndo another task? [y/n] "");.            repeat = input.readline();.        }while(repeat.equals(""y""));..        if(repeat!=""y"") system.out.println(""ngoodbye!"");..    }..    public static void writedata() throws exception{.        system.out.print(""nname: "");.        string cname = input.readline();.        system.out.print(""address: "");.        string add = input.readline();.        system.out.print(""phone no.: "");.        string pno = input.readline();.        system.out.print(""rental amount: "");.        string ramount = input.readline();.        system.out.print(""tapenumber: "");.        string tno = input.readline();.        system.out.print(""title: "");.        string title = input.readline();.        system.out.print(""date borrowed: "");.        string dborrowed = input.readline();.        system.out.print(""due date: "");.     
   string ddate = input.readline();.        createline(cname, add, pno, ramount,tno, title, dborrowed, ddate);.        rentdata();.    }..    public static void createline(string name, string address, string phone , string rental, string tapenumber, string title, string borrowed, string due) throws exception{.        filewriter fw = new filewriter(file, true);.        fw.write(""nname: ""+name + ""naddress: "" + address +""nphone no.: ""+ phone+""nrentalamount: ""+rental+""ntape no.: ""+ tapenumber+""ntitle: ""+ title+""ndate borrowed: ""+borrowed +""ndue date: ""+ due+"":rn"");.        fw.close();.    }..    public static void loadfile() throws exception{.        try{.            list.clear();.            fileinputstream fstream = new fileinputstream(file);.            bufferedreader br = new bufferedreader(new inputstreamreader(fstream));.            rows = 0;.            while( br.ready()).            {.                list.add(br.readline());.                rows++;.            }.            br.close();.        } catch(exception e){.            system.out.println(""list not yet loaded."");.        }.    }..    public static void viewlist(){.        system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |list of all costumers|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        for(int i = 0; i <rows; i++){.            system.out.println(list.get(i));.        }.    }.        public static void rentdata()throws exception.    {   system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |rent data list|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print(""nenter customer name: "");.        string cname = input.readline();.        system.out.print(""date borrowed: "");.        string dborrowed = input.readline();.        system.out.print(""due date: "");.        string ddate = input.readline();.        system.out.print(""return date: "");.        string rdate = input.readline();.        system.out.print(""rent amount: "");.        string ramount = input.readline();..        system.out.print(""you pay:""+ramount);...    }.    public static void search()throws exception.    {   system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |search costumers|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print(""nenter costumer name: "");.        string cname = input.readline();.        boolean found = false;..        for(int i=0; i < rows; i++){.            string temp[] = list.get(i).split("","");..            if(cname.equals(temp[0])){.            system.out.println(""search result:nyou are "" + temp[0] + "" from "" + temp[1] + "".""+ temp[2] + "".""+ temp[3] + "".""+ temp[4] + "".""+ temp[5] + "" is "" + temp[6] + "".""+ temp[7] + "" is "" + temp[8] + ""."");.                found = true;.            }.        }..        if(!found){.            system.out.print(""no results."");.        }..    }..        public static boolean evaluate(string uname, string pass){.        if (uname.equals(""admin"")&&pass.equals(""12345"")) return true;.        else return false;.    }..    public static string login()throws exception{.        bufferedreader input=new bufferedreader(new inputstreamreader(system.in));.        int counter=0;.        do{.            system.out.print(""username:"");.            string uname =input.readline();.            system.out.print(""password:"");.            string pass =input.readline();..            boolean accept= evaluate(uname,pass);..            
if(accept){.                break;.                }else{.                    system.out.println(""incorrect username or password!"");.                    counter ++;.                    }.        }while(counter<3);..            if(counter !=3) return ""login successful"";.            else return ""login failed"";.            }.        public static void introduction() throws exception{..        system.out.println(""                  - - - - - - - - - - - - - - - - - - - - - - - - -"");.        system.out.println(""                  !                  r e n t a l                  !"");.        system.out.println(""                   ! ~ ~ ~ ~ ~ !  =================  ! ~ ~ ~ ~ ~ !"");.        system.out.println(""                  !                  s y s t e m                  !"");.        system.out.println(""                  - - - - - - - - - - - - - - - - - - - - - - - - -"");.        }..}"\n'
Label: 1
Question:  b'"exception: dynamic sql generation for the updatecommand is not supported against a selectcommand that does not return any key i dont know what is the problem this my code : ..string nomtable;..datatable listeetablissementtable = new datatable();.datatable listeinteretstable = new datatable();.dataset ds = new dataset();.sqldataadapter da;.sqlcommandbuilder cmdb;..private void listeinterets_click(object sender, eventargs e).{.    nomtable = ""listeinteretstable"";.    d.cnx.open();.    da = new sqldataadapter(""select nome from offices"", d.cnx);.    ds = new dataset();.    da.fill(ds, nomtable);.    datagridview1.datasource = ds.tables[nomtable];.}..private void sauvgarder_click(object sender, eventargs e).{.    d.cnx.open();.    cmdb = new sqlcommandbuilder(da);.    da.update(ds, nomtable);.    d.cnx.close();.}"\n'
Label: 0
Question:  b'"parameter with question mark and super in blank, i\'ve come across a method that is formatted like this:..public final subscription subscribe(final action1<? super t> onnext, final action1<throwable> onerror) {.}...in the first parameter, what does the question mark and super mean?"\n'
Label: 1
Question:  b'call two objects wsdl the first time i got a very strange wsdl. ..i would like to call the object (interface - invoicecheck_out) do you know how?....i would like to call the object (variable) do you know how?..try to call (it`s ok)....try to call (how call this?)\n'
Label: 0
Question:  b"how to correctly make the icon for systemtray in blank using icon sizes of any dimension for systemtray doesn't look good overall. .what is the correct way of making icons for windows system tray?..screenshots: http://imgur.com/zsibwn9..icon: http://imgur.com/vsh4zo8\n"
Label: 0
Question:  b'"is there a way to check a variable that exists in a different script than the original one? i\'m trying to check if a variable, which was previously set to true in 2.py in 1.py, as 1.py is only supposed to continue if the variable is true...2.py..import os..completed = false..#some stuff here..completed = true...1.py..import 2 ..if completed == true.   #do things...however i get a syntax error at ..if completed == true"\n'
Label: 3
Question:  b'"blank control flow i made a number which asks for 2 numbers with blank and responds with  the corresponding message for the case. how come it doesnt work  for the second number ? .regardless what i enter for the second number , i am getting the message ""your number is in the range 0-10""...using system;.using system.collections.generic;.using system.linq;.using system.text;..namespace consoleapplication1.{.    class program.    {.        static void main(string[] args).        {.            string myinput;  // declaring the type of the variables.            int myint;..            string number1;.            int number;...            console.writeline(""enter a number"");.            myinput = console.readline(); //muyinput is a string  which is entry input.            myint = int32.parse(myinput); // myint converts the string into an integer..            if (myint > 0).                console.writeline(""your number {0} is greater than zero."", myint);.            else if (myint < 0).                console.writeline(""your number {0} is  less  than zero."", myint);.            else.                console.writeline(""your number {0} is equal zero."", myint);..            console.writeline(""enter another number"");.            number1 = console.readline(); .            number = int32.parse(myinput); ..            if (number < 0 || number == 0).                console.writeline(""your number {0} is  less  than zero or equal zero."", number);.            else if (number > 0 && number <= 10).                console.writeline(""your number {0} is  in the range from 0 to 10."", number);.            else.                console.writeline(""your number {0} is greater than 10."", number);..            console.writeline(""enter another number"");..        }.    }    .}"\n'
Label: 0
Question:  b'"credentials cannot be used for ntlm authentication i am getting org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials cannot be used for ntlm authentication: exception in eclipse..whether it is possible mention eclipse to take system proxy settings directly?..public class httpgetproxy {.    private static final string proxy_host = ""proxy.****.com"";.    private static final int proxy_port = 6050;..    public static void main(string[] args) {.        httpclient client = new httpclient();.        httpmethod method = new getmethod(""https://kodeblank.org"");..        hostconfiguration config = client.gethostconfiguration();.        config.setproxy(proxy_host, proxy_port);..        string username = ""*****"";.        string password = ""*****"";.        credentials credentials = new usernamepasswordcredentials(username, password);.        authscope authscope = new authscope(proxy_host, proxy_port);..        client.getstate().setproxycredentials(authscope, credentials);..        try {.            client.executemethod(method);..            if (method.getstatuscode() == httpstatus.sc_ok) {.                string response = method.getresponsebodyasstring();.                system.out.println(""response = "" + response);.            }.        } catch (ioexception e) {.            e.printstacktrace();.        } finally {.            method.releaseconnection();.        }.    }.}...exception:...  dec 08, 2017 1:41:39 pm .          org.apache.commons.httpclient.auth.authchallengeprocessor selectauthscheme.         info: ntlm authentication scheme selected.       dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector executeconnect.         severe: credentials cannot be used for ntlm authentication: .           org.apache.commons.httpclient.usernamepasswordcredentials.           org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials .         cannot be used for ntlm authentication: .        enter code here .          org.apache.commons.httpclient.usernamepasswordcredentials.      at org.apache.commons.httpclient.auth.ntlmscheme.authenticate(ntlmscheme.blank:332).        at org.apache.commons.httpclient.httpmethoddirector.authenticateproxy(httpmethoddirector.blank:320).      at org.apache.commons.httpclient.httpmethoddirector.executeconnect(httpmethoddirector.blank:491).      at org.apache.commons.httpclient.httpmethoddirector.executewithretry(httpmethoddirector.blank:391).      at org.apache.commons.httpclient.httpmethoddirector.executemethod(httpmethoddirector.blank:171).      at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:397).      at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:323).      at httpgetproxy.main(httpgetproxy.blank:31).  dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector processproxyauthchallenge.  info: failure authenticating with ntlm @proxy.****.com:6050"\n'
Label: 1

The labels are 0, 1, 2 or 3. To see which of these correspond to which string label, you can check the class_names property on the dataset.

for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)
Label 0 corresponds to csharp
Label 1 corresponds to java
Label 2 corresponds to javascript
Label 3 corresponds to python

Next, you will create validation and test datasets. You will use the remaining 1,600 examples from the training set for validation.

raw_val_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
test_dir = dataset_dir/'test'
raw_test_ds = preprocessing.text_dataset_from_directory(
    test_dir, batch_size=batch_size)
Found 8000 files belonging to 4 classes.

Prepare the dataset for training

Next, you will standardize, tokenize, and vectorize the data using the preprocessing.TextVectorization layer.

  • Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements in order to simplify the dataset.

  • Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).

  • Vectorization refers to converting tokens into numbers so they can be fed into a neural network.

All of these tasks can be accomplished with this layer. You can learn more about each of these in the API docs. (A short sketch of overriding these defaults with custom callables follows the list below.)

  • The default standardization converts text to lowercase and removes punctuation.

  • The default tokenizer splits on whitespace.

  • The default vectorization mode is int. This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes, like binary, to build bag-of-words models.
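If the defaults do not fit your data, TextVectorization also accepts custom callables for standardize and split. The snippet below is only an illustrative sketch and is not used in the rest of this tutorial; the name custom_standardization and the max_tokens value are arbitrary choices.

# Optional illustration: a custom standardization callable that lowercases text
# and strips punctuation (`re` and `string` were imported at the top).
def custom_standardization(input_text):
  lowercase = tf.strings.lower(input_text)
  return tf.strings.regex_replace(
      lowercase, '[%s]' % re.escape(string.punctuation), '')

custom_vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=10000,
    output_mode='int')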

You will build two models to learn more about these modes. First, you will use the binary mode to build a bag-of-words model. Next, you will use the int mode with a 1D ConvNet.

VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')

For the int mode, in addition to the maximum vocabulary size, you need to set an explicit maximum sequence length, which will cause the layer to pad or truncate sequences to exactly output_sequence_length values.

MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

Next, you will call adapt to fit the state of the preprocessing layers to the dataset. This will cause each layer to build an index of strings to integers.

# Make a text-only dataset (without labels), then call adapt
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

See the result of using these layers to preprocess data:

def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label
def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label
# Retrieve a batch (of 32 reviews and labels) from the dataset
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)
Question tf.Tensor(b'"function expected error in blank for dynamically created check box when it is clicked i want to grab the attribute value.it is working in ie 8,9,10 but not working in ie 11,chrome shows function expected error..<input type=checkbox checked=\'checked\' id=\'symptomfailurecodeid\' tabindex=\'54\' style=\'cursor:pointer;\' onclick=chkclickevt(this);  failurecodeid=""1"" >...function chkclickevt(obj) { .    alert(obj.attributes(""failurecodeid""));.}"\n', shape=(), dtype=string)
Label tf.Tensor(2, shape=(), dtype=int32)
print("'binary' vectorized question:", 
      binary_vectorize_text(first_question, first_label)[0])
'binary' vectorized question: tf.Tensor([[1. 1. 1. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)
print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])
'int' vectorized question: tf.Tensor(
[[  38  450   65    7   16   12  892  265  186  451   44   11    6  685
     3   46    4 2062    2  485    1    6  158    7  479    1   26   20
   158    7  479    1  502   38  450    1 1767 1763    1    1    1    1
     1    1    1    1    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0]], shape=(1, 250), dtype=int64)

As you can see above, binary mode returns an array denoting which tokens exist at least once in the input, while int mode replaces each token with an integer, thus preserving their order. You can look up the token (string) that each integer corresponds to by calling .get_vocabulary() on the layer.

print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))
1289 --->  roman
313 --->  source
Vocabulary size: 10000
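As an optional sanity check (not part of the original pipeline), you can decode the first few ids of the vectorized question above back into tokens; in the layer's vocabulary, index 0 is the padding token and index 1 is the out-of-vocabulary ('[UNK]') token.

# Optional round trip: decode the first few ids of the vectorized question above.
int_vocab = int_vectorize_layer.get_vocabulary()
first_ids = int_vectorize_text(first_question, first_label)[0][0][:12]
print(" ".join(int_vocab[int(i)] for i in first_ids))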

You are nearly ready to train your model. As a final preprocessing step, you will apply the TextVectorization layers you created earlier to the train, validation, and test datasets.

binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

Configure the dataset for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

.cache() keeps data in memory after it is loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit in memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.

.prefetch() overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk, in the data performance guide.
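If the vectorized datasets were too large to cache in memory, the same call can point at a file instead. The one-liner below is just a sketch; the path '/tmp/int_train_cache' is a hypothetical location.

# Hypothetical on-disk cache: pass a filename to .cache() instead of caching in memory.
disk_cached_ds = int_train_ds.cache('/tmp/int_train_cache').prefetch(
    buffer_size=tf.data.AUTOTUNE)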

AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

Train the model

It's time to create your neural network. For the binary vectorized data, train a simple bag-of-words linear model:

binary_model = tf.keras.Sequential([layers.Dense(4)])
binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)
Epoch 1/10
200/200 [==============================] - 2s 9ms/step - loss: 1.2359 - accuracy: 0.5427 - val_loss: 0.9108 - val_accuracy: 0.7744
Epoch 2/10
200/200 [==============================] - 1s 3ms/step - loss: 0.8149 - accuracy: 0.8277 - val_loss: 0.7481 - val_accuracy: 0.8031
Epoch 3/10
200/200 [==============================] - 1s 3ms/step - loss: 0.6482 - accuracy: 0.8616 - val_loss: 0.6631 - val_accuracy: 0.8125
Epoch 4/10
200/200 [==============================] - 1s 3ms/step - loss: 0.5492 - accuracy: 0.8832 - val_loss: 0.6100 - val_accuracy: 0.8225
Epoch 5/10
200/200 [==============================] - 1s 3ms/step - loss: 0.4805 - accuracy: 0.9055 - val_loss: 0.5735 - val_accuracy: 0.8294
Epoch 6/10
200/200 [==============================] - 1s 3ms/step - loss: 0.4287 - accuracy: 0.9177 - val_loss: 0.5470 - val_accuracy: 0.8369
Epoch 7/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3876 - accuracy: 0.9286 - val_loss: 0.5270 - val_accuracy: 0.8363
Epoch 8/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3537 - accuracy: 0.9332 - val_loss: 0.5115 - val_accuracy: 0.8394
Epoch 9/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3250 - accuracy: 0.9396 - val_loss: 0.4993 - val_accuracy: 0.8419
Epoch 10/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3003 - accuracy: 0.9479 - val_loss: 0.4896 - val_accuracy: 0.8438

Next, you will use the int vectorized layer to build a 1D ConvNet.

def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model
# vocab_size is VOCAB_SIZE + 1 since 0 is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)
Epoch 1/5
200/200 [==============================] - 4s 8ms/step - loss: 1.3016 - accuracy: 0.3903 - val_loss: 0.7395 - val_accuracy: 0.6950
Epoch 2/5
200/200 [==============================] - 1s 6ms/step - loss: 0.6901 - accuracy: 0.7170 - val_loss: 0.5435 - val_accuracy: 0.7906
Epoch 3/5
200/200 [==============================] - 1s 6ms/step - loss: 0.4277 - accuracy: 0.8562 - val_loss: 0.4766 - val_accuracy: 0.8194
Epoch 4/5
200/200 [==============================] - 1s 6ms/step - loss: 0.2419 - accuracy: 0.9402 - val_loss: 0.4701 - val_accuracy: 0.8188
Epoch 5/5
200/200 [==============================] - 1s 6ms/step - loss: 0.1218 - accuracy: 0.9767 - val_loss: 0.4932 - val_accuracy: 0.8163

Compare the two models:

print("Linear model on binary vectorized data:")
print(binary_model.summary())
Linear model on binary vectorized data:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 4)                 40004     
=================================================================
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________
None
print("ConvNet model on int vectorized data:")
print(int_model.summary())
ConvNet model on int vectorized data:
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 64)          640064    
_________________________________________________________________
conv1d (Conv1D)              (None, None, 64)          20544     
_________________________________________________________________
global_max_pooling1d (Global (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 260       
=================================================================
Total params: 660,868
Trainable params: 660,868
Non-trainable params: 0
_________________________________________________________________
None

Evaluate both models on the test data:

binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))
250/250 [==============================] - 1s 5ms/step - loss: 0.5166 - accuracy: 0.8139
250/250 [==============================] - 1s 4ms/step - loss: 0.5116 - accuracy: 0.8117
Binary model accuracy: 81.39%
Int model accuracy: 81.17%

Export the model

In the code above, you applied the TextVectorization layer to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the TextVectorization layer inside your model. To do so, you can create a new model using the weights you just trained.

export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(binary_accuracy))
250/250 [==============================] - 2s 5ms/step - loss: 0.5187 - accuracy: 0.8138
Accuracy: 81.39%

Now your model can take raw strings as input and predict a score for each label using model.predict. Define a function to find the label with the maximum score:

def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels

Run inference on new data

inputs = [
    "how do I extract keys from a dict into a list?",  # python
    "debug public static void main(string[] args) {...}",  # java
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())
Question:  how do I extract keys from a dict into a list?
Predicted label:  b'python'
Question:  debug public static void main(string[] args) {...}
Predicted label:  b'java'

Including the text preprocessing logic inside your model enables you to export a model for production that simplifies deployment and reduces the potential for train/test skew.

There is a performance difference to keep in mind when choosing where to apply your TextVectorization layer. Using it outside of your model enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So, if you're training your model on the GPU, you probably want to go with this option to get the best performance while developing your model, then switch to including the TextVectorization layer inside your model when you're ready to prepare for deployment.

Visit this tutorial to learn more about saving models.
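For instance, because export_model contains the TextVectorization layer, saving it in the SavedModel format captures the preprocessing together with the weights. The snippet below is a minimal sketch, assuming your TensorFlow version supports saving models that contain a TextVectorization layer; the directory name 'stack_overflow_model' is just an example.

# Minimal saving sketch: the SavedModel bundles the vectorization layer and the
# trained weights, so the reloaded model can be called on raw strings.
export_model.save('stack_overflow_model')
reloaded_model = tf.keras.models.load_model('stack_overflow_model')
print(reloaded_model.predict(["how do I sort a list in python?"]))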

Example 2: Predict the author of Iliad translations

The following is an example of using tf.data.TextLineDataset to load examples from text files, and tf.text to preprocess the data. In this example, you will use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.

Download and explore the dataset

The texts of the three translations are by William Cowper, Edward, Earl of Derby, and Samuel Butler.

The text files used in this tutorial have undergone some typical preprocessing tasks such as removing document headers and footers, line numbers, and chapter titles. Download these lightly munged files locally.

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
819200/815980 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
811008/809730 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt
811008/807992 [==============================] - 0s 0us/step
[PosixPath('/home/kbuilder/.keras/datasets/Giant Panda'),
 PosixPath('/home/kbuilder/.keras/datasets/derby.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/flower_photos.tar.gz'),
 PosixPath('/home/kbuilder/.keras/datasets/spa-eng'),
 PosixPath('/home/kbuilder/.keras/datasets/heart.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/iris_test.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/train.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/butler.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/flower_photos'),
 PosixPath('/home/kbuilder/.keras/datasets/image.jpg'),
 PosixPath('/home/kbuilder/.keras/datasets/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg'),
 PosixPath('/home/kbuilder/.keras/datasets/shakespeare.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/Fireboat'),
 PosixPath('/home/kbuilder/.keras/datasets/iris_training.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/cowper.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/320px-Felis_catus-cat_on_snow.jpg'),
 PosixPath('/home/kbuilder/.keras/datasets/jena_climate_2009_2016.csv.zip'),
 PosixPath('/home/kbuilder/.keras/datasets/fashion-mnist'),
 PosixPath('/home/kbuilder/.keras/datasets/ImageNetLabels.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/mnist.npz'),
 PosixPath('/home/kbuilder/.keras/datasets/jena_climate_2009_2016.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/spa-eng.zip')]

Load the dataset

You will use TextLineDataset, which is designed to create a tf.data.Dataset from a text file where each example is a line of text from the original file, whereas text_dataset_from_directory treats all of a file's contents as a single example. TextLineDataset is useful for text data that is primarily line-based (for example, poetry or error logs).

Iterate through these files, loading each one into its own dataset. Each example needs to be individually labeled, so use tf.data.Dataset.map to apply a labeler function to each one. This will iterate over every example in the dataset, returning (example, label) pairs.

def labeler(example, index):
  return example, tf.cast(index, tf.int64)
labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

Next, you will combine these labeled datasets into a single dataset, and shuffle it.

BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

Print out a few examples as before. The dataset hasn't been batched yet, hence each entry in all_labeled_data corresponds to one data point:

for text, label in all_labeled_data.take(10):
  print("Sentence: ", text.numpy())
  print("Label:", label.numpy())
Sentence:  b'To chariot driven, thou maim thyself and me.'
Label: 0
Sentence:  b'On choicest marrow, and the fat of lambs;'
Label: 1
Sentence:  b'And through the gorgeous breastplate, and within'
Label: 1
Sentence:  b'To visit there the parent of the Gods'
Label: 0
Sentence:  b'For safe escape from danger and from death.'
Label: 0
Sentence:  b'Achilles, ye at least the fight decline'
Label: 0
Sentence:  b"Which done, Achilles portion'd out to each"
Label: 0
Sentence:  b'Whom therefore thou devourest; else themselves'
Label: 0
Sentence:  b'Drove them afar into the host of Greece.'
Label: 0
Sentence:  b"Their succour; then I warn thee, while 'tis time,"
Label: 1

Prepare the dataset for training

Instead of using the Keras TextVectorization layer to preprocess the text dataset, you will now use the tf.text APIs to standardize and tokenize the data, build a vocabulary, and use a StaticVocabularyTable to map tokens to integers to feed to the model.

While tf.text provides various tokenizers, you will use the UnicodeScriptTokenizer to tokenize the dataset. Define a function to convert the text to lower-case and tokenize it. You will use tf.data.Dataset.map to apply the tokenization to the dataset.

tokenizer = tf_text.UnicodeScriptTokenizer()
def tokenize(text, unused_label):
  lower_case = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)
tokenized_ds = all_labeled_data.map(tokenize)
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py:201: batch_gather (from tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25.
Instructions for updating:
`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.

You can iterate over the dataset and print out a few tokenized examples.

for text_batch in tokenized_ds.take(5):
  print("Tokens: ", text_batch.numpy())
Tokens:  [b'to' b'chariot' b'driven' b',' b'thou' b'maim' b'thyself' b'and' b'me'
 b'.']
Tokens:  [b'on' b'choicest' b'marrow' b',' b'and' b'the' b'fat' b'of' b'lambs' b';']
Tokens:  [b'and' b'through' b'the' b'gorgeous' b'breastplate' b',' b'and' b'within']
Tokens:  [b'to' b'visit' b'there' b'the' b'parent' b'of' b'the' b'gods']
Tokens:  [b'for' b'safe' b'escape' b'from' b'danger' b'and' b'from' b'death' b'.']

Next, you will build a vocabulary by sorting tokens by frequency and keeping the top VOCAB_SIZE tokens.

tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok] += 1

vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries:", vocab[:5])
Vocab size:  10000
First five vocab entries: [b',', b'the', b'and', b"'", b'of']

To convert the tokens into integers, use the vocab set to create a StaticVocabularyTable. You will map tokens to integers in the range [2, vocab_size + 2]. As with the TextVectorization layer, 0 is reserved to denote padding and 1 is reserved to denote an out-of-vocabulary (OOV) token.

keys = vocab
values = range(2, len(vocab) + 2)  # reserve 0 for padding, 1 for OOV

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

Finally, define a function to standardize, tokenize and vectorize the dataset using the tokenizer and lookup table:

def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label

You can try this on a single example to see the output:

example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())
Sentence:  b'To chariot driven, thou maim thyself and me.'
Vectorized sentence:  [   8  195  716    2   47 5605  552    4   40    7]

Now run the preprocess function on the dataset using tf.data.Dataset.map.

all_encoded_data = all_labeled_data.map(preprocess_text)

Split the dataset into training and validation sets

The Keras TextVectorization layer also batches and pads the vectorized data. Padding is required because the examples inside a batch need to be the same size and shape, but the examples in these datasets are not all the same size: each line of text has a different number of words. tf.data.Dataset supports splitting and padded-batching datasets:

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

Now, validation_data and train_data are not collections of (example, label) pairs, but collections of batches. Each batch is a pair of (many examples, many labels) represented as arrays. To illustrate:

sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])
Text batch shape:  (64, 16)
Label batch shape:  (64,)
First text example:  tf.Tensor(
[   8  195  716    2   47 5605  552    4   40    7    0    0    0    0
    0    0], shape=(16,), dtype=int64)
First label example:  tf.Tensor(0, shape=(), dtype=int64)

Since you use 0 for padding and 1 for out-of-vocabulary (OOV) tokens, the vocabulary size has increased by two.

vocab_size += 2

Configure the datasets for better performance as before.

train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

Train the model

You can train a model on this dataset as before.

model = create_model(vocab_size=vocab_size, num_labels=3)
model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
history = model.fit(train_data, validation_data=validation_data, epochs=3)
Epoch 1/3
697/697 [==============================] - 30s 12ms/step - loss: 0.6900 - accuracy: 0.6660 - val_loss: 0.3815 - val_accuracy: 0.8368
Epoch 2/3
697/697 [==============================] - 5s 7ms/step - loss: 0.3173 - accuracy: 0.8705 - val_loss: 0.3622 - val_accuracy: 0.8460
Epoch 3/3
697/697 [==============================] - 4s 6ms/step - loss: 0.2159 - accuracy: 0.9167 - val_loss: 0.3895 - val_accuracy: 0.8466
loss, accuracy = model.evaluate(validation_data)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
79/79 [==============================] - 1s 2ms/step - loss: 0.3895 - accuracy: 0.8466
Loss:  0.3894515335559845
Accuracy: 84.66%

Export the model

To make your model capable of taking raw strings as input, you will create a TextVectorization layer that performs the same steps as your custom preprocessing function. Since you already trained a vocabulary, you can use set_vocabulary instead of adapt, which would train a new vocabulary.

preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
preprocess_layer.set_vocabulary(vocab)
export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])
# Create a test dataset of raw strings
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)
loss, accuracy = export_model.evaluate(test_ds)
print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
79/79 [==============================] - 7s 11ms/step - loss: 0.4626 - accuracy: 0.8128
Loss:  0.4913882315158844
Accuracy: 80.50%

The loss and accuracy for the model on the encoded validation set and for the exported model on the raw validation set are close, as expected.

Run inference on new data

inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]
predicted_scores = export_model.predict(inputs)
predicted_labels = tf.argmax(predicted_scores, axis=1)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())
Question:  Join'd to th' Ionians with their flowing robes,
Predicted label:  1
Question:  the allies, and his armour flashed about him so that he seemed to all
Predicted label:  2
Question:  And with loud clangor of his arms he fell.
Predicted label:  0
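Since each label i was assigned from enumerate(FILE_NAMES) when the datasets were built, you can optionally map the predicted integers back to translator names (a small convenience, not part of the original flow):

# Optional: recover the translator name from the integer label via FILE_NAMES.
for label in predicted_labels:
  print("Translator:", FILE_NAMES[int(label)].replace('.txt', ''))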

Downloading more datasets using TensorFlow Datasets (TFDS)

You can download many more datasets from TensorFlow Datasets. As an example, you will download the IMDB Large Movie Review dataset, and use it to train a model for sentiment classification.

train_ds = tfds.load(
    'imdb_reviews',
    split='train',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)
val_ds = tfds.load(
    'imdb_reviews',
    split='train',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

Print a few examples.

for review_batch, label_batch in val_ds.take(1):
  for i in range(5):
    print("Review: ", review_batch[i].numpy())
    print("Label: ", label_batch[i].numpy())
Review:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Label:  0
Review:  b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.'
Label:  0
Review:  b'Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Mountie telling the people of Dawson City, Yukon to elect themselves a marshal (yes a marshal!) and to enforce the law themselves, then gunfighters battling it out on the streets for control of the town? <br /><br />Nothing even remotely resembling that happened on the Canadian side of the border during the Klondike gold rush. Mr. Mann and company appear to have mistaken Dawson City for Deadwood, the Canadian North for the American Wild West.<br /><br />Canadian viewers be prepared for a Reefer Madness type of enjoyable howl with this ludicrous plot, or, to shake your head in disgust.'
Label:  0
Review:  b'This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas Cage (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through New York life at its best. A family film in every sense and one that deserves the praise it received.'
Label:  1
Review:  b'As others have mentioned, all the women that go nude in this film are mostly absolutely gorgeous. The plot very ably shows the hypocrisy of the female libido. When men are around they want to be pursued, but when no "men" are around, they become the pursuers of a 14 year old boy. And the boy becomes a man really fast (we should all be so lucky at this age!). He then gets up the courage to pursue his true love.'
Label:  1

You can now preprocess the data and train a model as before.

Prepare the dataset for training

vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (without labels), then call adapt
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label
train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)
# Configure datasets for performance as before
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)

Train the model

model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, None, 64)          640064    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, None, 64)          20544     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 65        
=================================================================
Total params: 660,673
Trainable params: 660,673
Non-trainable params: 0
_________________________________________________________________
model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = model.fit(train_ds, validation_data=val_ds, epochs=3)
Epoch 1/3
391/391 [==============================] - 6s 13ms/step - loss: 0.6123 - accuracy: 0.5805 - val_loss: 0.2976 - val_accuracy: 0.8807
Epoch 2/3
391/391 [==============================] - 4s 10ms/step - loss: 0.3141 - accuracy: 0.8609 - val_loss: 0.1708 - val_accuracy: 0.9423
Epoch 3/3
391/391 [==============================] - 4s 10ms/step - loss: 0.1977 - accuracy: 0.9211 - val_loss: 0.0944 - val_accuracy: 0.9776
loss, accuracy = model.evaluate(val_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
391/391 [==============================] - 1s 3ms/step - loss: 0.0944 - accuracy: 0.9776
Loss:  0.09437894821166992
Accuracy: 97.76%

Export the model

export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    # This model has a single sigmoid output, so use binary cross-entropy here.
    loss=losses.BinaryCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])
# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]
predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label)
Question:  This is a fantastic movie.
Predicted label:  1
Question:  This is a bad movie.
Predicted label:  0
Question:  This movie was so bad that it was good.
Predicted label:  0
Question:  I will never say yes to watching this movie.
Predicted label:  0

Conclusion

This tutorial demonstrated several ways to load and preprocess text. As a next step, you can explore the additional tutorials on the website, or download new datasets from TensorFlow Datasets.