テキストを読み込む

TensorFlow.org で表示

Google Colab で実行

GitHub でソースを表示

ノートブックをダウンロード

このチュートリアルでは、テキストを読み込んで前処理する 2 つの方法を紹介します。

まず、Keras ユーティリティと前処理レイヤーを使用します。これには、データを tf.data.Dataset に変換するための tf.keras.utils.text_dataset_from_directory とデータを標準化、トークン化、およびベクトル化するための tf.keras.layers.TextVectorization が含まれます。TensorFlow を初めて使用する場合は、これらから始める必要があります。
次に、tf.data.TextLineDataset などの低レベルのユーティリティを使用してテキストファイルを読み込み、text.UnicodeScriptTokenizer や text.case_fold_utf8 などの TensorFlow Text APIを使用して、よりきめ細かい制御のためにデータを前処理します。

pip install "tensorflow-text==2.8.*"

import collections
import pathlib

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

例 1: StackOverflow の質問のタグを予測する

最初の例として、StackOverflow からプログラミングの質問のデータセットをダウンロードします。それぞれの質問 (「ディクショナリを値で並べ替えるにはどうすればよいですか？」) は、1 つのタグ (Python、CSharp、JavaScript、またはJava) でラベルされています。このタスクでは、質問のタグを予測するモデルを開発します。これは、マルチクラス分類の例です。マルチクラス分類は、重要で広く適用できる機械学習の問題です。

データセットをダウンロードして調査する

まず、tf.keras.utils.get_file を使用して Stack Overflow データセットをダウンロードし、ディレクトリの構造を調べます。

data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

dataset_dir = utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz
6053888/6053168 [==============================] - 0s 0us/step
6062080/6053168 [==============================] - 0s 0us/step

list(dataset_dir.iterdir())

[PosixPath('/tmp/.keras/test'),
 PosixPath('/tmp/.keras/train'),
 PosixPath('/tmp/.keras/README.md'),
 PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz')]

train_dir = dataset_dir/'train'
list(train_dir.iterdir())

[PosixPath('/tmp/.keras/train/csharp'),
 PosixPath('/tmp/.keras/train/javascript'),
 PosixPath('/tmp/.keras/train/python'),
 PosixPath('/tmp/.keras/train/java')]

train/csharp、train/java、train/python および train/javascript ディレクトリには、多くのテキストファイルが含まれています。それぞれが Stack Overflow の質問です。

サンプルファイルを出力してデータを調べます。

sample_file = train_dir/'python/1755.txt'

with open(sample_file) as f:
  print(f.read())

why does this blank program print true x=true.def stupid():.    x=false.stupid().print x

データセットを読み込む

次に、データをディスクから読み込み、トレーニングに適した形式に準備します。これを行うには、tf.keras.utils.text_dataset_from_directory ユーティリティを使用して、ラベル付きの tf.data.Dataset を作成します。これは、入力パイプラインを構築するための強力なツールのコレクションです。tf.data を始めて使用する場合は、tf.data: TensorFlow 入力パイプラインを構築するを参照してください。

tf.keras.utils.text_dataset_from_directory API は、次のようなディレクトリ構造を想定しています。

train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt

機械学習実験を実行するときは、データセットをトレーニング、検証、および、テストの 3 つに分割することをお勧めします。

Stack Overflow データセットは、すでにトレーニングセットとテストセットに分割されていますが、検証セットはありません。

tf.keras.utils.text_dataset_from_directory を使用し、validation_split を 0.2 (20%) に設定し、トレーニングデータを 80:20 に分割して検証セットを作成します。

batch_size = 32
seed = 42

raw_train_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.

前のセル出力が示すように、トレーニングフォルダには 8,000 の例があり、そのうち 80％ (6,400) をトレーニングに使用します。tf.data.Dataset を Model.fit に直接渡すことで、モデルをトレーニングできます。詳細は、後ほど見ていきます。

まず、データセットを反復処理し、いくつかの例を出力して、データを確認します。

注意: 分類問題の難易度を上げるために、データセットの作成者は、プログラミングの質問で、Python、CSharp、JavaScript、Java という単語を blank に置き換えました。

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])

Question:  b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default constructor, which therefor ruins the whole rest of the program. can somebody help me?..for those of you who want to see more of my code: here you go..public double vertexangle().    {.        system.out.println(""the vertex angle method: "" + mynumsides);// prints out 5.        system.out.println(""the vertex angle method: "" + mysidelength); // prints out 30..        double vertexangle;.        vertexangle = ((mynumsides - 2.0) / mynumsides) * 180.0;.        return vertexangle;.    }//end method vertexangle..public void menu().{.    system.out.println(mynumsides); // prints out what the user puts in.    system.out.println(mysidelength); // prints out what the user puts in.    gotographic();.    calcr(mynumsides, mysidelength);.    calcr(mynumsides, mysidelength);.    print(); .}// end menu...this is my entire tester class:..public static void main(string[] arg).{.    int numsides;.    double sidelength;.    scanner keyboard = new scanner(system.in);..    system.out.println(""welcome to the regular polygon program!"");.    system.out.println();..    system.out.print(""enter the number of sides of the polygon ==&gt; "");.    numsides = keyboard.nextint();.    system.out.println();..    system.out.print(""enter the side length of each side ==&gt; "");.    sidelength = keyboard.nextdouble();.    system.out.println();..    regularpolygon shape = new regularpolygon(numsides, sidelength);.    shape.menu();.}//end main...for testing it i sent it numsides 4 and sidelength 100."\n'
Label: 1
Question:  b'"blank code slow skin detection this code changes the color space to lab and using a threshold finds the skin area of an image. but it\'s ridiculously slow. i don\'t know how to make it faster ?    ..from colormath.color_objects import *..def skindetection(img, treshold=80, color=[255,20,147]):..    print img.shape.    res=img.copy().    for x in range(img.shape[0]):.        for y in range(img.shape[1]):.            rgbimg=rgbcolor(img[x,y,0],img[x,y,1],img[x,y,2]).            labimg=rgbimg.convert_to(\'lab\', debug=false).            if (labimg.lab_l &gt; treshold):.                res[x,y,:]=color.            else: .                res[x,y,:]=img[x,y,:]..    return res"\n'
Label: 3
Question:  b'"option and validation in blank i want to add a new option on my system where i want to add two text files, both rental.txt and customer.txt. inside each text are id numbers of the customer, the videotape they need and the price...i want to place it as an option on my code. right now i have:...add customer.rent return.view list.search.exit...i want to add this as my sixth option. say for example i ordered a video, it would display the price and would let me confirm the price and if i am going to buy it or not...here is my current code:..  import blank.io.*;.    import blank.util.arraylist;.    import static blank.lang.system.out;..    public class rentalsystem{.    static bufferedreader input = new bufferedreader(new inputstreamreader(system.in));.    static file file = new file(""file.txt"");.    static arraylist&lt;string&gt; list = new arraylist&lt;string&gt;();.    static int rows;..    public static void main(string[] args) throws exception{.        introduction();.        system.out.print(""nn"");.        login();.        system.out.print(""nnnnnnnnnnnnnnnnnnnnnn"");.        introduction();.        string repeat;.        do{.            loadfile();.            system.out.print(""nwhat do you want to do?nn"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                    |     1. add customer    |   2. rent return |n"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                    |     3. view list       |   4. search      |n"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                                             |   5. exit        |n"");.            system.out.print(""n                                              - - - - - - - - - -"");.            system.out.print(""nnchoice:"");.            int choice = integer.parseint(input.readline());.            switch(choice){.                case 1:.                    writedata();.                    break;.                case 2:.                    rentdata();.                    break;.                case 3:.                    viewlist();.                    break;.                case 4:.                    search();.                    break;.                case 5:.                    system.out.println(""goodbye!"");.                    system.exit(0);.                default:.                    system.out.print(""invalid choice: "");.                    break;.            }.            system.out.print(""ndo another task? [y/n] "");.            repeat = input.readline();.        }while(repeat.equals(""y""));..        if(repeat!=""y"") system.out.println(""ngoodbye!"");..    }..    public static void writedata() throws exception{.        system.out.print(""nname: "");.        string cname = input.readline();.        system.out.print(""address: "");.        string add = input.readline();.        system.out.print(""phone no.: "");.        string pno = input.readline();.        system.out.print(""rental amount: "");.        string ramount = input.readline();.        system.out.print(""tapenumber: "");.        string tno = input.readline();.        system.out.print(""title: "");.        string title = input.readline();.        system.out.print(""date borrowed: "");.        string dborrowed = input.readline();.        system.out.print(""due date: "");.        string ddate = input.readline();.        createline(cname, add, pno, ramount,tno, title, dborrowed, ddate);.        rentdata();.    }..    public static void createline(string name, string address, string phone , string rental, string tapenumber, string title, string borrowed, string due) throws exception{.        filewriter fw = new filewriter(file, true);.        fw.write(""nname: ""+name + ""naddress: "" + address +""nphone no.: ""+ phone+""nrentalamount: ""+rental+""ntape no.: ""+ tapenumber+""ntitle: ""+ title+""ndate borrowed: ""+borrowed +""ndue date: ""+ due+"":rn"");.        fw.close();.    }..    public static void loadfile() throws exception{.        try{.            list.clear();.            fileinputstream fstream = new fileinputstream(file);.            bufferedreader br = new bufferedreader(new inputstreamreader(fstream));.            rows = 0;.            while( br.ready()).            {.                list.add(br.readline());.                rows++;.            }.            br.close();.        } catch(exception e){.            system.out.println(""list not yet loaded."");.        }.    }..    public static void viewlist(){.        system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |list of all costumers|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        for(int i = 0; i &lt;rows; i++){.            system.out.println(list.get(i));.        }.    }.        public static void rentdata()throws exception.    {   system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |rent data list|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print(""nenter customer name: "");.        string cname = input.readline();.        system.out.print(""date borrowed: "");.        string dborrowed = input.readline();.        system.out.print(""due date: "");.        string ddate = input.readline();.        system.out.print(""return date: "");.        string rdate = input.readline();.        system.out.print(""rent amount: "");.        string ramount = input.readline();..        system.out.print(""you pay:""+ramount);...    }.    public static void search()throws exception.    {   system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |search costumers|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print(""nenter costumer name: "");.        string cname = input.readline();.        boolean found = false;..        for(int i=0; i &lt; rows; i++){.            string temp[] = list.get(i).split("","");..            if(cname.equals(temp[0])){.            system.out.println(""search result:nyou are "" + temp[0] + "" from "" + temp[1] + "".""+ temp[2] + "".""+ temp[3] + "".""+ temp[4] + "".""+ temp[5] + "" is "" + temp[6] + "".""+ temp[7] + "" is "" + temp[8] + ""."");.                found = true;.            }.        }..        if(!found){.            system.out.print(""no results."");.        }..    }..        public static boolean evaluate(string uname, string pass){.        if (uname.equals(""admin"")&amp;&amp;pass.equals(""12345"")) return true;.        else return false;.    }..    public static string login()throws exception{.        bufferedreader input=new bufferedreader(new inputstreamreader(system.in));.        int counter=0;.        do{.            system.out.print(""username:"");.            string uname =input.readline();.            system.out.print(""password:"");.            string pass =input.readline();..            boolean accept= evaluate(uname,pass);..            if(accept){.                break;.                }else{.                    system.out.println(""incorrect username or password!"");.                    counter ++;.                    }.        }while(counter&lt;3);..            if(counter !=3) return ""login successful"";.            else return ""login failed"";.            }.        public static void introduction() throws exception{..        system.out.println(""                  - - - - - - - - - - - - - - - - - - - - - - - - -"");.        system.out.println(""                  !                  r e n t a l                  !"");.        system.out.println(""                   ! ~ ~ ~ ~ ~ !  =================  ! ~ ~ ~ ~ ~ !"");.        system.out.println(""                  !                  s y s t e m                  !"");.        system.out.println(""                  - - - - - - - - - - - - - - - - - - - - - - - - -"");.        }..}"\n'
Label: 1
Question:  b'"exception: dynamic sql generation for the updatecommand is not supported against a selectcommand that does not return any key i dont know what is the problem this my code : ..string nomtable;..datatable listeetablissementtable = new datatable();.datatable listeinteretstable = new datatable();.dataset ds = new dataset();.sqldataadapter da;.sqlcommandbuilder cmdb;..private void listeinterets_click(object sender, eventargs e).{.    nomtable = ""listeinteretstable"";.    d.cnx.open();.    da = new sqldataadapter(""select nome from offices"", d.cnx);.    ds = new dataset();.    da.fill(ds, nomtable);.    datagridview1.datasource = ds.tables[nomtable];.}..private void sauvgarder_click(object sender, eventargs e).{.    d.cnx.open();.    cmdb = new sqlcommandbuilder(da);.    da.update(ds, nomtable);.    d.cnx.close();.}"\n'
Label: 0
Question:  b'"parameter with question mark and super in blank, i\'ve come across a method that is formatted like this:..public final subscription subscribe(final action1&lt;? super t&gt; onnext, final action1&lt;throwable&gt; onerror) {.}...in the first parameter, what does the question mark and super mean?"\n'
Label: 1
Question:  b'call two objects wsdl the first time i got a very strange wsdl. ..i would like to call the object (interface - invoicecheck_out) do you know how?....i would like to call the object (variable) do you know how?..try to call (it`s ok)....try to call (how call this?)\n'
Label: 0
Question:  b"how to correctly make the icon for systemtray in blank using icon sizes of any dimension for systemtray doesn't look good overall. .what is the correct way of making icons for windows system tray?..screenshots: http://imgur.com/zsibwn9..icon: http://imgur.com/vsh4zo8\n"
Label: 0
Question:  b'"is there a way to check a variable that exists in a different script than the original one? i\'m trying to check if a variable, which was previously set to true in 2.py in 1.py, as 1.py is only supposed to continue if the variable is true...2.py..import os..completed = false..#some stuff here..completed = true...1.py..import 2 ..if completed == true.   #do things...however i get a syntax error at ..if completed == true"\n'
Label: 3
Question:  b'"blank control flow i made a number which asks for 2 numbers with blank and responds with  the corresponding message for the case. how come it doesnt work  for the second number ? .regardless what i enter for the second number , i am getting the message ""your number is in the range 0-10""...using system;.using system.collections.generic;.using system.linq;.using system.text;..namespace consoleapplication1.{.    class program.    {.        static void main(string[] args).        {.            string myinput;  // declaring the type of the variables.            int myint;..            string number1;.            int number;...            console.writeline(""enter a number"");.            myinput = console.readline(); //muyinput is a string  which is entry input.            myint = int32.parse(myinput); // myint converts the string into an integer..            if (myint &gt; 0).                console.writeline(""your number {0} is greater than zero."", myint);.            else if (myint &lt; 0).                console.writeline(""your number {0} is  less  than zero."", myint);.            else.                console.writeline(""your number {0} is equal zero."", myint);..            console.writeline(""enter another number"");.            number1 = console.readline(); .            number = int32.parse(myinput); ..            if (number &lt; 0 || number == 0).                console.writeline(""your number {0} is  less  than zero or equal zero."", number);.            else if (number &gt; 0 &amp;&amp; number &lt;= 10).                console.writeline(""your number {0} is  in the range from 0 to 10."", number);.            else.                console.writeline(""your number {0} is greater than 10."", number);..            console.writeline(""enter another number"");..        }.    }    .}"\n'
Label: 0
Question:  b'"credentials cannot be used for ntlm authentication i am getting org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials cannot be used for ntlm authentication: exception in eclipse..whether it is possible mention eclipse to take system proxy settings directly?..public class httpgetproxy {.    private static final string proxy_host = ""proxy.****.com"";.    private static final int proxy_port = 6050;..    public static void main(string[] args) {.        httpclient client = new httpclient();.        httpmethod method = new getmethod(""https://kodeblank.org"");..        hostconfiguration config = client.gethostconfiguration();.        config.setproxy(proxy_host, proxy_port);..        string username = ""*****"";.        string password = ""*****"";.        credentials credentials = new usernamepasswordcredentials(username, password);.        authscope authscope = new authscope(proxy_host, proxy_port);..        client.getstate().setproxycredentials(authscope, credentials);..        try {.            client.executemethod(method);..            if (method.getstatuscode() == httpstatus.sc_ok) {.                string response = method.getresponsebodyasstring();.                system.out.println(""response = "" + response);.            }.        } catch (ioexception e) {.            e.printstacktrace();.        } finally {.            method.releaseconnection();.        }.    }.}...exception:...  dec 08, 2017 1:41:39 pm .          org.apache.commons.httpclient.auth.authchallengeprocessor selectauthscheme.         info: ntlm authentication scheme selected.       dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector executeconnect.         severe: credentials cannot be used for ntlm authentication: .           org.apache.commons.httpclient.usernamepasswordcredentials.           org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials .         cannot be used for ntlm authentication: .        enter code here .          org.apache.commons.httpclient.usernamepasswordcredentials.      at org.apache.commons.httpclient.auth.ntlmscheme.authenticate(ntlmscheme.blank:332).        at org.apache.commons.httpclient.httpmethoddirector.authenticateproxy(httpmethoddirector.blank:320).      at org.apache.commons.httpclient.httpmethoddirector.executeconnect(httpmethoddirector.blank:491).      at org.apache.commons.httpclient.httpmethoddirector.executewithretry(httpmethoddirector.blank:391).      at org.apache.commons.httpclient.httpmethoddirector.executemethod(httpmethoddirector.blank:171).      at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:397).      at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:323).      at httpgetproxy.main(httpgetproxy.blank:31).  dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector processproxyauthchallenge.  info: failure authenticating with ntlm @proxy.****.com:6050"\n'
Label: 1

ラベルは、0、1、2 または 3 です。これらのどれがどの文字列ラベルに対応するかを確認するには、データセットの class_names プロパティを確認します。

for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)

Label 0 corresponds to csharp
Label 1 corresponds to java
Label 2 corresponds to javascript
Label 3 corresponds to python

次に、tf.keras.utils.text_dataset_from_directory を使って検証およびテスト用データセットを作成します。トレーニング用セットの残りの 1,600 件のレビューを検証に使用します。

注意: tf.keras.utils.text_dataset_from_directory の validation_split および subset 引数を使用する場合は、必ずランダムシードを指定するか、shuffle=Falseを渡して、検証とトレーニング分割に重複がないようにします。

# Create a validation set.
raw_val_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.

test_dir = dataset_dir/'test'

# Create a test set.
raw_test_ds = utils.text_dataset_from_directory(
    test_dir,
    batch_size=batch_size)

Found 8000 files belonging to 4 classes.

トレーニング用データセットを準備する

次に、tf.keras.layers.TextVectorization レイヤーを使用して、データを標準化、トークン化、およびベクトル化します。

標準化とは、テキストを前処理することを指します。通常、句読点や HTML 要素を削除して、データセットを簡素化します。
トークン化とは、文字列をトークンに分割することです（たとえば、空白で分割することにより、文を個々の単語に分割します）。
ベクトル化とは、トークンを数値に変換して、ニューラルネットワークに入力できるようにすることです。

これらのタスクはすべて、このレイヤーで実行できます。これらの詳細については、tf.keras.layers.TextVectorization API ドキュメントを参照してください。

注意点 :

デフォルトの標準化では、テキストが小文字に変換され、句読点が削除されます (standardize='lower_and_strip_punctuation')。
デフォルトのトークナイザーは空白で分割されます (split='whitespace')。
デフォルトのベクトル化モードは int です (output_mode='int')。これは整数インデックスを出力します（トークンごとに1つ）。このモードは、語順を考慮したモデルを構築するために使用できます。binary などの他のモードを使用して、bag-of-word モデルを構築することもできます。

TextVectorization を使用した標準化、トークン化、およびベクトル化について詳しくみるために、2 つのモデルを作成します。

まず、'binary' ベクトル化モードを使用して、bag-of-words モデルを構築します。
次に、1D ConvNet で 'int' モードを使用します。

VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')

'int' モードの場合、最大語彙サイズに加えて、明示的な最大シーケンス長 (MAX_SEQUENCE_LENGTH) を設定する必要があります。これにより、レイヤーはシーケンスを正確に output_sequence_length 値にパディングまたは切り捨てます。

MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

次に、TextVectorization.adapt を呼び出して、前処理レイヤーの状態をデータセットに適合させます。これにより、モデルは文字列から整数へのインデックスを作成します。

注意: TextVectorization.adapt を呼び出すときは、トレーニング用データのみを使用することが重要です (テスト用セットを使用すると情報が漏洩します)。

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

これらのレイヤーを使用してデータを前処理した結果を出力します。

def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

# Retrieve a batch (of 32 reviews and labels) from the dataset.
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)

Question tf.Tensor(b'"what is the difference between these two ways to create an element? var a = document.createelement(\'div\');..a.id = ""mydiv"";...and..var a = document.createelement(\'div\').id = ""mydiv"";...what is the difference between them such that the first one works and the second one doesn\'t?"\n', shape=(), dtype=string)
Label tf.Tensor(2, shape=(), dtype=int32)

print("'binary' vectorized question:",
      binary_vectorize_text(first_question, first_label)[0])

'binary' vectorized question: tf.Tensor([[1. 1. 0. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)

print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])

'int' vectorized question: tf.Tensor(
[[ 55   6   2 410 211 229 121 895   4 124  32 245  43   5   1   1   5   1
    1   6   2 410 211 191 318  14   2  98  71 188   8   2 199  71 178   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]], shape=(1, 250), dtype=int64)

上に示したように、TextVectorization の 'binary' モードは、入力に少なくとも 1 回存在するトークンを示す配列を返しますが、'int' モードでは、各トークンが整数に置き換えられるため、トークンの順序が保持されます。

レイヤーで TextVectorization.get_vocabulary を呼び出すことにより、各整数が対応するトークン (文字列) を検索できます。

print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

1289 --->  roman
313 --->  source
Vocabulary size: 10000

モデルをトレーニングする準備がほぼ整いました。

最後の前処理ステップとして、トレーニング、検証、およびデータセットのテストのために前に作成した TextVectorization レイヤーを適用します。

binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

パフォーマンスのためにデータセットを構成する

以下は、データを読み込むときに I/O がブロックされないようにするために使用する必要がある 2 つの重要な方法です。

Dataset.cache はデータをディスクから読み込んだ後、データをメモリに保持します。これにより、モデルのトレーニング中にデータセットがボトルネックになることを回避できます。データセットが大きすぎてメモリに収まらない場合は、この方法を使用して、パフォーマンスの高いオンディスクキャッシュを作成することもできます。これは、多くの小さなファイルを読み込むより効率的です。
Dataset.prefetch はトレーニング中にデータの前処理とモデルの実行をオーバーラップさせます。

以上の 2 つの方法とデータをディスクにキャッシュする方法についての詳細は、データパフォーマンスガイドの プリフェッチを参照してください。

AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)

binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

モデルをトレーニングする

ニューラルネットワークを作成します。

'binary' のベクトル化されたデータの場合、単純な bag-of-words 線形モデルを定義し、それを構成してトレーニングします。

binary_model = tf.keras.Sequential([layers.Dense(4)])

binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)

Epoch 1/10
200/200 [==============================] - 2s 4ms/step - loss: 1.1221 - accuracy: 0.6439 - val_loss: 0.9157 - val_accuracy: 0.7738
Epoch 2/10
200/200 [==============================] - 1s 3ms/step - loss: 0.7801 - accuracy: 0.8213 - val_loss: 0.7511 - val_accuracy: 0.7956
Epoch 3/10
200/200 [==============================] - 1s 3ms/step - loss: 0.6284 - accuracy: 0.8589 - val_loss: 0.6653 - val_accuracy: 0.8069
Epoch 4/10
200/200 [==============================] - 1s 3ms/step - loss: 0.5349 - accuracy: 0.8863 - val_loss: 0.6118 - val_accuracy: 0.8200
Epoch 5/10
200/200 [==============================] - 1s 3ms/step - loss: 0.4688 - accuracy: 0.9045 - val_loss: 0.5751 - val_accuracy: 0.8281
Epoch 6/10
200/200 [==============================] - 1s 3ms/step - loss: 0.4184 - accuracy: 0.9153 - val_loss: 0.5485 - val_accuracy: 0.8331
Epoch 7/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3782 - accuracy: 0.9281 - val_loss: 0.5284 - val_accuracy: 0.8363
Epoch 8/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3449 - accuracy: 0.9350 - val_loss: 0.5128 - val_accuracy: 0.8381
Epoch 9/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3166 - accuracy: 0.9425 - val_loss: 0.5005 - val_accuracy: 0.8425
Epoch 10/10
200/200 [==============================] - 1s 3ms/step - loss: 0.2923 - accuracy: 0.9486 - val_loss: 0.4908 - val_accuracy: 0.8413

次に、'int' ベクトル化レイヤーを使用して、1D ConvNet を構築します。

def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model

# `vocab_size` is `VOCAB_SIZE + 1` since `0` is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)

Epoch 1/5
200/200 [==============================] - 2s 4ms/step - loss: 1.1424 - accuracy: 0.5008 - val_loss: 0.7508 - val_accuracy: 0.7188
Epoch 2/5
200/200 [==============================] - 1s 4ms/step - loss: 0.6282 - accuracy: 0.7600 - val_loss: 0.5454 - val_accuracy: 0.8019
Epoch 3/5
200/200 [==============================] - 1s 4ms/step - loss: 0.3834 - accuracy: 0.8789 - val_loss: 0.4846 - val_accuracy: 0.8150
Epoch 4/5
200/200 [==============================] - 1s 4ms/step - loss: 0.2168 - accuracy: 0.9495 - val_loss: 0.4828 - val_accuracy: 0.8188
Epoch 5/5
200/200 [==============================] - 1s 4ms/step - loss: 0.1108 - accuracy: 0.9805 - val_loss: 0.5063 - val_accuracy: 0.8181

2 つのモデルを比較します。

print("Linear model on binary vectorized data:")
print(binary_model.summary())

Linear model on binary vectorized data:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 4)                 40004     
                                                                 
=================================================================
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________
None

print("ConvNet model on int vectorized data:")
print(int_model.summary())

ConvNet model on int vectorized data:
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, None, 64)          640064    
                                                                 
 conv1d (Conv1D)             (None, None, 64)          20544     
                                                                 
 global_max_pooling1d (Globa  (None, 64)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense_1 (Dense)             (None, 4)                 260       
                                                                 
=================================================================
Total params: 660,868
Trainable params: 660,868
Non-trainable params: 0
_________________________________________________________________
None

テストデータで両方のモデルを評価します。

binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))

250/250 [==============================] - 1s 3ms/step - loss: 0.5173 - accuracy: 0.8148
250/250 [==============================] - 1s 2ms/step - loss: 0.5244 - accuracy: 0.8070
Binary model accuracy: 81.48%
Int model accuracy: 80.70%

注意: このサンプルデータセットは、かなり単純な分類問題を表しています。より複雑なデータセットと問題は、前処理戦略とモデルアーキテクチャに微妙ながら重要な違いをもたらします。さまざまなアプローチを比較するために、さまざまなハイパーパラメータとエポックを試してみてください。

モデルをエクスポートする

上記のコードでは、モデルにテキストをフィードする前に、tf.keras.layers.TextVectorization レイヤーをデータセットに適用しました。モデルで生の文字列を処理できるようにする場合 (たとえば、展開を簡素化するため)、モデル内に TextVectorization レイヤーを含めることができます。

これを行うには、トレーニングしたばかりの重みを使用して新しいモデルを作成できます。

export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(binary_accuracy))

250/250 [==============================] - 1s 4ms/step - loss: 0.5173 - accuracy: 0.8148
Accuracy: 81.48%

これで、モデルは生の文字列を入力として受け取り、Model.predict を使用して各ラベルのスコアを予測できます。最大スコアのラベルを見つける関数を定義します。

def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.math.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels

新しいデータで推論を実行する

inputs = [
    "how do I extract keys from a dict into a list?",  # 'python'
    "debug public static void main(string[] args) {...}",  # 'java'
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())

Question:  how do I extract keys from a dict into a list?
Predicted label:  b'python'
Question:  debug public static void main(string[] args) {...}
Predicted label:  b'java'

モデル内にテキスト前処理ロジックを含めると、モデルを本番環境にエクスポートして展開を簡素化し、トレーニング/テストスキューの可能性を減らすことができます。

tf.keras.layers.TextVectorization を適用する場所を選択する際に性能の違いに留意する必要があります。モデルの外部で使用すると、GPU でトレーニングするときに非同期 CPU 処理とデータのバッファリングを行うことができます。したがって、GPU でモデルをトレーニングしている場合は、モデルの開発中に最高のパフォーマンスを得るためにこのオプションを使用し、デプロイの準備ができたらモデル内に TextVectorization レイヤーを含めるように切り替えることをお勧めします。

モデルの保存の詳細については、モデルの保存と読み込みチュートリアルをご覧ください。

例 2: イーリアスの翻訳者を予測する

以下に、tf.data.TextLineDataset を使用してテキストファイルから例を読み込み、TensorFlow Text を使用してデータを前処理する例を示します。この例では、ホーマーのイーリアスの 3 つの異なる英語翻訳を使用し、与えられた 1 行のテキストから翻訳者を識別するようにモデルをトレーニングします。

データセットをダウンロードして調査する

3 つのテキストの翻訳者は次のとおりです。

このチュートリアルで使われているテキストファイルは、ヘッダ、フッタ、行番号、章のタイトルの削除など、いくつかの典型的な前処理が行われています。

前処理後のファイルをローカルにダウンロードします。

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
819200/815980 [==============================] - 0s 0us/step
827392/815980 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
811008/809730 [==============================] - 0s 0us/step
819200/809730 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt
811008/807992 [==============================] - 0s 0us/step
819200/807992 [==============================] - 0s 0us/step
[PosixPath('/home/kbuilder/.keras/datasets/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg'),
 PosixPath('/home/kbuilder/.keras/datasets/spa-eng'),
 PosixPath('/home/kbuilder/.keras/datasets/jena_climate_2009_2016.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/facades'),
 PosixPath('/home/kbuilder/.keras/datasets/mnist.npz'),
 PosixPath('/home/kbuilder/.keras/datasets/flower_photos.tar.gz'),
 PosixPath('/home/kbuilder/.keras/datasets/kandinsky5.jpg'),
 PosixPath('/home/kbuilder/.keras/datasets/heart.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/ImageNetLabels.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/jena_climate_2009_2016.csv.zip'),
 PosixPath('/home/kbuilder/.keras/datasets/320px-Felis_catus-cat_on_snow.jpg'),
 PosixPath('/home/kbuilder/.keras/datasets/Giant Panda'),
 PosixPath('/home/kbuilder/.keras/datasets/flower_photos'),
 PosixPath('/home/kbuilder/.keras/datasets/shakespeare.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/facades.tar.gz'),
 PosixPath('/home/kbuilder/.keras/datasets/fashion-mnist'),
 PosixPath('/home/kbuilder/.keras/datasets/train.csv'),
 PosixPath('/home/kbuilder/.keras/datasets/derby.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/HIGGS.csv.gz'),
 PosixPath('/home/kbuilder/.keras/datasets/surf.jpg'),
 PosixPath('/home/kbuilder/.keras/datasets/bedroom_hrnet_tutorial.jpg'),
 PosixPath('/home/kbuilder/.keras/datasets/spa-eng.zip'),
 PosixPath('/home/kbuilder/.keras/datasets/butler.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/Fireboat'),
 PosixPath('/home/kbuilder/.keras/datasets/cowper.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/YellowLabradorLooking_new.jpg')]

データセットを読み込む

以前は、tf.keras.utils.text_dataset_from_directory では、ファイルのすべてのコンテンツが 1 つの例として扱われていました。ここでは、tf.data.TextLineDataset を使用します。これは、テキストファイルから tf.data.Dataset を作成するように設計されています。それぞれの例は、元のファイルからの行です。TextLineDataset は、主に行ベースのテキストデータ (詩やエラーログなど) に役立ちます。

これらのファイルを繰り返し処理し、各ファイルを独自のデータセットに読み込みます。各例には個別にラベルを付ける必要があるため、Dataset.map を使用して、それぞれにラベラー関数を適用します。これにより、データセット内のすべての例が繰り返され、 (example, label) ペアが返されます。

def labeler(example, index):
  return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

次に、Dataset.concatenate を使用し、これらのラベル付きデータセットを 1 つのデータセットに結合し、Dataset.shuffle を使用してシャッフルします。

BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

前述の手順でいくつかの例を出力します。データセットはまだバッチ処理されていないため、all_labeled_data の各エントリは 1 つのデータポイントに対応します。

for text, label in all_labeled_data.take(10):
  print("Sentence: ", text.numpy())
  print("Label:", label.numpy())

Sentence:  b"the man's right shoulder, and then stuck in the ground. He stood stock"
Label: 2
Sentence:  b'For yet a child he left me, when he fell'
Label: 1
Sentence:  b'Save me, my brother! Pity me! Thy steeds'
Label: 0
Sentence:  b'prepared the mess she bade them drink it. When they had done so and had'
Label: 2
Sentence:  b'exhorts him to return to the field of battle. An interview succeeds'
Label: 0
Sentence:  b'Then said Achilles, "Son of Atreus, king of men Agamemnon, see to these'
Label: 2
Sentence:  b'The hand of Menelaus, and while all'
Label: 0
Sentence:  b"A huge bull's hide, all drench'd and soak'd with grease;"
Label: 1
Sentence:  b'He sat, where sat the other Powers divine,'
Label: 0
Sentence:  b'And in my cause lies slain, of any Greek'
Label: 1

トレーニング用データセットを準備する

tf.keras.layers.TextVectorization を使用してテキストデータセットを前処理する代わりに、TensorFlow Text API を使用してデータを標準化およびトークン化し、語彙を作成し、tf.lookup.StaticVocabularyTable を使用してトークンを整数にマッピングし、モデルにフィードします。(詳細については TensorFlow Text を参照してください)。

テキストを小文字に変換してトークン化する関数を定義します。

TensorFlow Text は、さまざまなトークナイザーを提供します。この例では、text.UnicodeScriptTokenizer を使用してデータセットをトークン化します。
Dataset.map を使用して、トークン化をデータセットに適用します。

tokenizer = tf_text.UnicodeScriptTokenizer()

def tokenize(text, unused_label):
  lower_case = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)

tokenized_ds = all_labeled_data.map(tokenize)

データセットを反復処理して、トークン化されたいくつかの例を出力します。

for text_batch in tokenized_ds.take(5):
  print("Tokens: ", text_batch.numpy())

Tokens:  [b'the' b'man' b"'" b's' b'right' b'shoulder' b',' b'and' b'then' b'stuck'
 b'in' b'the' b'ground' b'.' b'he' b'stood' b'stock']
Tokens:  [b'for' b'yet' b'a' b'child' b'he' b'left' b'me' b',' b'when' b'he'
 b'fell']
Tokens:  [b'save' b'me' b',' b'my' b'brother' b'!' b'pity' b'me' b'!' b'thy'
 b'steeds']
Tokens:  [b'prepared' b'the' b'mess' b'she' b'bade' b'them' b'drink' b'it' b'.'
 b'when' b'they' b'had' b'done' b'so' b'and' b'had']
Tokens:  [b'exhorts' b'him' b'to' b'return' b'to' b'the' b'field' b'of' b'battle'
 b'.' b'an' b'interview' b'succeeds']

次に、トークンを頻度で並べ替え、上位の VOCAB_SIZE トークンを保持することにより、語彙を構築します。

tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok] += 1

vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries:", vocab[:5])

Vocab size:  10000
First five vocab entries: [b',', b'the', b'and', b"'", b'of']

トークンを整数に変換するには、vocab セットを使用して、tf.lookup.StaticVocabularyTable を作成します。トークンを [2, vocab_size + 2] の範囲の整数にマップします。TextVectorization レイヤーと同様に、0 はパディングを示すために予約されており、1 は語彙外 (OOV) トークンを示すために予約されています。

keys = vocab
values = range(2, len(vocab) + 2)  # Reserve `0` for padding, `1` for OOV tokens.

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

最後に、トークナイザーとルックアップテーブルを使用して、データセットを標準化、トークン化、およびベクトル化する関数を定義します。

def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label

1 つの例でこれを試して、出力を確認します。

example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())

Sentence:  b"the man's right shoulder, and then stuck in the ground. He stood stock"
Vectorized sentence:  [   3   86    5   29  274  527    2    4   33 2749   13    3  191    7
   12  108 3286]

次に、Dataset.map を使用して、データセットに対して前処理関数を実行します。

all_encoded_data = all_labeled_data.map(preprocess_text)

データセットをトレーニング用セットとテスト用セットに分割する

Keras TextVectorization レイヤーでも、ベクトル化されたデータをバッチ処理してパディングします。バッチ内の例は同じサイズと形状である必要があるため、パディングが必要です。これらのデータセットの例はすべて同じサイズではありません。テキストの各行には、異なる数の単語があります。

tf.data.Dataset は、データセットの分割とパディングのバッチ処理をサポートしています

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

validation_data および train_data は (example, label) ペアのコレクションではなく、バッチのコレクションです。各バッチは、配列として表される (多くの例、多くのラベル) のペアです。

以下に示します。

sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])

Text batch shape:  (64, 19)
Label batch shape:  (64,)
First text example:  tf.Tensor(
[   3   86    5   29  274  527    2    4   33 2749   13    3  191    7
   12  108 3286    0    0], shape=(19,), dtype=int64)
First label example:  tf.Tensor(2, shape=(), dtype=int64)

パディングに 0 を使用し、語彙外 (OOV) トークンに 1 を使用するため、語彙のサイズが 2 つ増えました。

vocab_size += 2

以前と同じように、パフォーマンスを向上させるためにデータセットを構成します。

train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

モデルをトレーニングする

以前と同じように、このデータセットでモデルをトレーニングできます。

model = create_model(vocab_size=vocab_size, num_labels=3)

model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

history = model.fit(train_data, validation_data=validation_data, epochs=3)

Epoch 1/3
697/697 [==============================] - 28s 9ms/step - loss: 0.5163 - accuracy: 0.7686 - val_loss: 0.3744 - val_accuracy: 0.8406
Epoch 2/3
697/697 [==============================] - 3s 4ms/step - loss: 0.2777 - accuracy: 0.8872 - val_loss: 0.3690 - val_accuracy: 0.8472
Epoch 3/3
697/697 [==============================] - 3s 4ms/step - loss: 0.1862 - accuracy: 0.9295 - val_loss: 0.4152 - val_accuracy: 0.8382

loss, accuracy = model.evaluate(validation_data)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

79/79 [==============================] - 1s 2ms/step - loss: 0.4152 - accuracy: 0.8382
Loss:  0.41523823142051697
Accuracy: 83.82%

モデルをエクスポートする

モデルが生の文字列を入力として受け取ることができるようにするには、カスタム前処理関数と同じ手順を実行する TextVectorization レイヤーを作成します。すでに語彙をトレーニングしているので、新しい語彙をトレーニングする TextVectorization.adapt の代わりに、TextVectorization.set_vocabulary を使用できます。

preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

preprocess_layer.set_vocabulary(vocab)

/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/numpy/core/numeric.py:2468: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  return bool(asarray(a1 == a2).all())
/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/keras/layers/preprocessing/index_lookup.py:458: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self.mask_token is not None and self.mask_token in tokens:

export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Create a test dataset of raw strings.
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)

loss, accuracy = export_model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

2022-12-15 01:18:17.427116: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: sequential_4/text_vectorization_2/UnicodeScriptTokenize/Assert_1/AssertGuard/branch_executed/_185
79/79 [==============================] - 6s 7ms/step - loss: 0.5736 - accuracy: 0.7764
Loss:  0.5736407041549683
Accuracy: 77.64%

エンコードされた検証セットのモデルと生の検証セットのエクスポートされたモデルの損失と正確度は、予想どおり同じです。

新しいデータで推論を実行する

inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]

predicted_scores = export_model.predict(inputs)
predicted_labels = tf.math.argmax(predicted_scores, axis=1)

for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())

2022-12-15 01:18:20.841752: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: sequential_4/text_vectorization_2/UnicodeScriptTokenize/Assert_1/AssertGuard/branch_executed/_185
Question:  Join'd to th' Ionians with their flowing robes,
Predicted label:  1
Question:  the allies, and his armour flashed about him so that he seemed to all
Predicted label:  2
Question:  And with loud clangor of his arms he fell.
Predicted label:  0

TensorFlow Datasets (TFDS) を使用して、より多くのデータセットをダウンロードする

TensorFlow Dataset からより多くのデータセットをダウンロードできます。

この例では、IMDB 大規模映画レビューデータセットを使用して、感情分類のモデルをトレーニングします。

# Training set.
train_ds = tfds.load(
    'imdb_reviews',
    split='train[:80%]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

# Validation set.
val_ds = tfds.load(
    'imdb_reviews',
    split='train[80%:]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

いくつかの例を出力します。

for review_batch, label_batch in val_ds.take(1):
  for i in range(5):
    print("Review: ", review_batch[i].numpy())
    print("Label: ", label_batch[i].numpy())

Review: b"Instead, go to the zoo, buy some peanuts and feed 'em to the monkeys. Monkeys are funny. People with amnesia who don't say much, just sit there with vacant eyes are not all that funny. Black comedy? There isn't a black person in it, and there isn't one funny thing in it either. Walmart buys these things up somehow and puts them on their dollar rack. It's labeled Unrated. I think they took out the topless scene. They may have taken out other stuff too, who knows? All we know is that whatever they took out, isn't there any more. The acting seemed OK to me. There's a lot of unfathomables tho. It's supposed to be a city? It's supposed to be a big lake? If it's so hot in the church people are fanning themselves, why are they all wearing coats?"
Label: 0
Review: b'Well, was Morgan Freeman any more unusual as God than George Burns? This film sure was better than that bore, "Oh, God". I was totally engrossed and LMAO all the way through. Carrey was perfect as the out of sorts anchorman wannabe, and Aniston carried off her part as the frustrated girlfriend in her usual well played performance. I, for one, don\'t consider her to be either ugly or untalented. I think my favorite scene was when Carrey opened up the file cabinet thinking it could never hold his life history. See if you can spot the file in the cabinet that holds the events of his bathroom humor: I was rolling over this one. Well written and even better played out, this comedy will go down as one of this funnyman\'s best.'
Label: 1
Review: b'I remember stumbling upon this special while channel-surfing in 1965. I had never heard of Barbra before. When the show was over, I thought "This is probably the best thing on TV I will ever see in my life." 42 years later, that has held true. There is still nothing so amazing, so honestly astonishing as the talent that was displayed here. You can talk about all the super-stars you want to, this is the most superlative of them all! You name it, she can do it. Comedy, pathos, sultry seduction, ballads, Barbra is truly a story-teller. Her ability to pull off anything she attempts is legendary. But this special was made in the beginning, and helped to create the legend that she quickly became. In spite of rising so far in such a short time, she has fulfilled the promise, revealing more of her talents as she went along. But they are all here from the very beginning. You will not be disappointed in viewing this.'
Label: 1
Review: b"Firstly, I would like to point out that people who have criticised this film have made some glaring errors. Anything that has a rating below 6/10 is clearly utter nonsense. Creep is an absolutely fantastic film with amazing film effects. The actors are highly believable, the narrative thought provoking and the horror and graphical content extremely disturbing. There is much mystique in this film. Many questions arise as the audience are revealed to the strange and freakish creature that makes habitat in the dark rat ridden tunnels. How was 'Craig' created and what happened to him? A fantastic film with a large chill factor. A film with so many unanswered questions and a film that needs to be appreciated along with others like 28 Days Later, The Bunker, Dog Soldiers and Deathwatch. Look forward to more of these fantastic films!!"
Label: 1
Review: b"I'm sorry but I didn't like this doc very much. I can think of a million ways it could have been better. The people who made it obviously don't have much imagination. The interviews aren't very interesting and no real insight is offered. The footage isn't assembled in a very informative way, either. It's too bad because this is a movie that really deserves spellbinding special features. One thing I'll say is that Isabella Rosselini gets more beautiful the older she gets. All considered, this only gets a '4.'"
Label: 0

これで、以前と同じようにデータを前処理してモデルをトレーニングできます。

注意: これは二項分類の問題であるため、モデルには tf.keras.losses.SparseCategoricalCrossentropy の代わりに tf.keras.losses.BinaryCrossentropy を使用します。

トレーニング用データセットを準備する

vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)

def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)

# Configure datasets for performance as before.
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)

モデルを作成、構成、およびトレーニングする

model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_2 (Embedding)     (None, None, 64)          640064    
                                                                 
 conv1d_2 (Conv1D)           (None, None, 64)          20544     
                                                                 
 global_max_pooling1d_2 (Glo  (None, 64)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 660,673
Trainable params: 660,673
Non-trainable params: 0
_________________________________________________________________

model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

history = model.fit(train_ds, validation_data=val_ds, epochs=3)

Epoch 1/3
313/313 [==============================] - 7s 6ms/step - loss: 0.5375 - accuracy: 0.6665 - val_loss: 0.3738 - val_accuracy: 0.8266
Epoch 2/3
313/313 [==============================] - 1s 4ms/step - loss: 0.3038 - accuracy: 0.8673 - val_loss: 0.3151 - val_accuracy: 0.8632
Epoch 3/3
313/313 [==============================] - 1s 4ms/step - loss: 0.1885 - accuracy: 0.9276 - val_loss: 0.3167 - val_accuracy: 0.8624

loss, accuracy = model.evaluate(val_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

79/79 [==============================] - 0s 2ms/step - loss: 0.3167 - accuracy: 0.8624
Loss:  0.3166657090187073
Accuracy: 86.24%

モデルをエクスポートする

export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]

predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]

for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label)

Question:  This is a fantastic movie.
Predicted label:  1
Question:  This is a bad movie.
Predicted label:  0
Question:  This movie was so bad that it was good.
Predicted label:  0
Question:  I will never say yes to watching this movie.
Predicted label:  1

まとめ

このチュートリアルでは、テキストを読み込んで前処理するいくつかの方法を示しました。次のステップとして、以下のような TensorFlow Text テキスト前処理チュートリアルをご覧ください。

TensorFlow Datasets でも新しいデータセットを見つけることができます。また、tf.data の詳細については、入力パイプラインの構築に関するガイドをご覧ください。