Load text

This tutorial demonstrates two ways to load and preprocess text.

First, you will use Keras utilities and preprocessing layers. These include tf.keras.utils.text_dataset_from_directory to turn data into a tf.data.Dataset and tf.keras.layers.TextVectorization for data standardization, tokenization, and vectorization. If you are new to TensorFlow, you should start with these.
Then, you will use lower-level utilities like tf.data.TextLineDataset to load text files, and TensorFlow Text APIs, such as text.UnicodeScriptTokenizer and text.case_fold_utf8, to preprocess the data for finer-grained control.

# Be sure you're using the stable versions of both `tensorflow` and
# `tensorflow-text`, for binary compatibility.
pip uninstall -y tf-nightly keras-nightly
pip install tensorflow
pip install tensorflow-text

import collections
import pathlib

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

Example 1: Predict the tag for a Stack Overflow question

As a first example, you will download a dataset of programming questions from Stack Overflow. Each question ("How do I sort a dictionary by value?") is labeled with exactly one tag (Python, CSharp, JavaScript, or Java). Your task is to develop a model that predicts the tag for a question. This is an example of multi-class classification, an important and widely applicable kind of machine learning problem.

Download and explore the dataset

Begin by downloading the Stack Overflow dataset using tf.keras.utils.get_file, and exploring the directory structure:

data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

dataset_dir = utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz
6053888/6053168 [==============================] - 0s 0us/step
6062080/6053168 [==============================] - 0s 0us/step

list(dataset_dir.iterdir())

[PosixPath('/tmp/.keras/train'),
 PosixPath('/tmp/.keras/README.md'),
 PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz'),
 PosixPath('/tmp/.keras/test')]

train_dir = dataset_dir/'train'
list(train_dir.iterdir())

[PosixPath('/tmp/.keras/train/java'),
 PosixPath('/tmp/.keras/train/csharp'),
 PosixPath('/tmp/.keras/train/javascript'),
 PosixPath('/tmp/.keras/train/python')]

The train/csharp, train/java, train/python, and train/javascript directories contain many text files, each of which is a Stack Overflow question.

Print an example file and inspect the data:

sample_file = train_dir/'python/1755.txt'

with open(sample_file) as f:
  print(f.read())

why does this blank program print true x=true.def stupid():.    x=false.stupid().print x

Load the dataset

Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the tf.keras.utils.text_dataset_from_directory utility to create a labeled tf.data.Dataset. If you are new to tf.data, it is a powerful collection of tools for building input pipelines. (Learn more in the tf.data: Build TensorFlow input pipelines guide.)

The tf.keras.utils.text_dataset_from_directory API expects a directory structure as follows:

train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt
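
To make the layout concrete, here is a plain-Python sketch of how such a tree can be read into (text, label) pairs, with one class per subdirectory and labels assigned in sorted name order. It is only an illustration, not the Keras implementation: `load_labeled_texts` is a hypothetical helper and the file contents are made up.

```python
import pathlib
import tempfile

def load_labeled_texts(root):
    """Return (class_names, examples) for a root/<class>/<file>.txt layout."""
    root = pathlib.Path(root)
    # Subdirectory names become class names; sorting them fixes the label order.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    examples = []
    for label, name in enumerate(class_names):
        for path in sorted((root / name).glob("*.txt")):
            examples.append((path.read_text(encoding="utf-8"), label))
    return class_names, examples

# Build a tiny throwaway tree (hypothetical contents) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("python", "print x"), ("java", "system.out.println(x);")]:
        class_dir = pathlib.Path(tmp) / "train" / name
        class_dir.mkdir(parents=True)
        (class_dir / "1.txt").write_text(text, encoding="utf-8")
    class_names, examples = load_labeled_texts(pathlib.Path(tmp) / "train")

print(class_names)  # ['java', 'python']
print(examples)     # [('system.out.println(x);', 0), ('print x', 1)]
```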

When running a machine learning experiment, it is a best practice to divide your dataset into three splits: training, validation, and test.

The Stack Overflow dataset has already been divided into training and test sets, but it lacks a validation set.

Create a validation set using an 80:20 split of the training data by using tf.keras.utils.text_dataset_from_directory with validation_split set to 0.2 (i.e. 20%):

batch_size = 32
seed = 42

raw_train_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.

As the previous cell output suggests, there are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training. You will learn in a moment that you can train a model by passing a tf.data.Dataset directly to Model.fit.
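
The 80:20 arithmetic, and the reason the same seed is passed for both subsets, can be sketched in plain Python. This is an assumption-laden illustration rather than the exact procedure Keras follows: `split_indices` is a hypothetical helper that shuffles the example indices with a fixed seed and holds out the tail for validation.

```python
import random

def split_indices(num_examples, validation_split, seed):
    """Shuffle indices deterministically and hold out the tail for validation."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_val = int(num_examples * validation_split)
    split_at = num_examples - num_val
    return indices[:split_at], indices[split_at:]

train_idx, val_idx = split_indices(8000, validation_split=0.2, seed=42)
print(len(train_idx), len(val_idx))  # 6400 1600

# Repeating the call with the same seed reproduces the same partition,
# so the two subsets never overlap.
assert split_indices(8000, 0.2, seed=42) == (train_idx, val_idx)
assert not set(train_idx) & set(val_idx)
```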

First, iterate over the dataset and print out a few examples to get a feel for the data.

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])

Question:  b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default constructor, which therefor ruins the whole rest of the program. can somebody help me?..for those of you who want to see more of my code: here you go..public double vertexangle().    {.        system.out.println(""the vertex angle method: "" + mynumsides);// prints out 5.        system.out.println(""the vertex angle method: "" + mysidelength); // prints out 30..        double vertexangle;.        vertexangle = ((mynumsides - 2.0) / mynumsides) * 180.0;.        return vertexangle;.    }//end method vertexangle..public void menu().{.    system.out.println(mynumsides); // prints out what the user puts in.    system.out.println(mysidelength); // prints out what the user puts in.    gotographic();.    calcr(mynumsides, mysidelength);.    calcr(mynumsides, mysidelength);.    print(); .}// end menu...this is my entire tester class:..public static void main(string[] arg).{.    int numsides;.    double sidelength;.    scanner keyboard = new scanner(system.in);..    
system.out.println(""welcome to the regular polygon program!"");.    system.out.println();..    system.out.print(""enter the number of sides of the polygon ==&gt; "");.    numsides = keyboard.nextint();.    system.out.println();..    system.out.print(""enter the side length of each side ==&gt; "");.    sidelength = keyboard.nextdouble();.    system.out.println();..    regularpolygon shape = new regularpolygon(numsides, sidelength);.    shape.menu();.}//end main...for testing it i sent it numsides 4 and sidelength 100."\n'
Label: 1
Question:  b'"blank code slow skin detection this code changes the color space to lab and using a threshold finds the skin area of an image. but it\'s ridiculously slow. i don\'t know how to make it faster ?    ..from colormath.color_objects import *..def skindetection(img, treshold=80, color=[255,20,147]):..    print img.shape.    res=img.copy().    for x in range(img.shape[0]):.        for y in range(img.shape[1]):.            rgbimg=rgbcolor(img[x,y,0],img[x,y,1],img[x,y,2]).            labimg=rgbimg.convert_to(\'lab\', debug=false).            if (labimg.lab_l &gt; treshold):.                res[x,y,:]=color.            else: .                res[x,y,:]=img[x,y,:]..    return res"\n'
Label: 3
Question:  b'"option and validation in blank i want to add a new option on my system where i want to add two text files, both rental.txt and customer.txt. inside each text are id numbers of the customer, the videotape they need and the price...i want to place it as an option on my code. right now i have:...add customer.rent return.view list.search.exit...i want to add this as my sixth option. say for example i ordered a video, it would display the price and would let me confirm the price and if i am going to buy it or not...here is my current code:..  import blank.io.*;.    import blank.util.arraylist;.    import static blank.lang.system.out;..    public class rentalsystem{.    static bufferedreader input = new bufferedreader(new inputstreamreader(system.in));.    static file file = new file(""file.txt"");.    static arraylist&lt;string&gt; list = new arraylist&lt;string&gt;();.    static int rows;..    public static void main(string[] args) throws exception{.        introduction();.        system.out.print(""nn"");.        login();.        system.out.print(""nnnnnnnnnnnnnnnnnnnnnn"");.        introduction();.        string repeat;.        do{.            loadfile();.            system.out.print(""nwhat do you want to do?nn"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                    |     1. add customer    |   2. rent return |n"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                    |     3. view list       |   4. search      |n"");.            system.out.print(""n                    - - - - - - - - - - - - - - - - - - - - - - -"");.            system.out.print(""nn                                             |   5. exit        |n"");.            system.out.print(""n                                              - - - - - - - - - -"");.            
system.out.print(""nnchoice:"");.            int choice = integer.parseint(input.readline());.            switch(choice){.                case 1:.                    writedata();.                    break;.                case 2:.                    rentdata();.                    break;.                case 3:.                    viewlist();.                    break;.                case 4:.                    search();.                    break;.                case 5:.                    system.out.println(""goodbye!"");.                    system.exit(0);.                default:.                    system.out.print(""invalid choice: "");.                    break;.            }.            system.out.print(""ndo another task? [y/n] "");.            repeat = input.readline();.        }while(repeat.equals(""y""));..        if(repeat!=""y"") system.out.println(""ngoodbye!"");..    }..    public static void writedata() throws exception{.        system.out.print(""nname: "");.        string cname = input.readline();.        system.out.print(""address: "");.        string add = input.readline();.        system.out.print(""phone no.: "");.        string pno = input.readline();.        system.out.print(""rental amount: "");.        string ramount = input.readline();.        system.out.print(""tapenumber: "");.        string tno = input.readline();.        system.out.print(""title: "");.        string title = input.readline();.        system.out.print(""date borrowed: "");.        string dborrowed = input.readline();.        system.out.print(""due date: "");.        string ddate = input.readline();.        createline(cname, add, pno, ramount,tno, title, dborrowed, ddate);.        rentdata();.    }..    public static void createline(string name, string address, string phone , string rental, string tapenumber, string title, string borrowed, string due) throws exception{.        filewriter fw = new filewriter(file, true);.        
fw.write(""nname: ""+name + ""naddress: "" + address +""nphone no.: ""+ phone+""nrentalamount: ""+rental+""ntape no.: ""+ tapenumber+""ntitle: ""+ title+""ndate borrowed: ""+borrowed +""ndue date: ""+ due+"":rn"");.        fw.close();.    }..    public static void loadfile() throws exception{.        try{.            list.clear();.            fileinputstream fstream = new fileinputstream(file);.            bufferedreader br = new bufferedreader(new inputstreamreader(fstream));.            rows = 0;.            while( br.ready()).            {.                list.add(br.readline());.                rows++;.            }.            br.close();.        } catch(exception e){.            system.out.println(""list not yet loaded."");.        }.    }..    public static void viewlist(){.        system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |list of all costumers|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        for(int i = 0; i &lt;rows; i++){.            system.out.println(list.get(i));.        }.    }.        public static void rentdata()throws exception.    {   system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |rent data list|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print(""nenter customer name: "");.        string cname = input.readline();.        system.out.print(""date borrowed: "");.        string dborrowed = input.readline();.        system.out.print(""due date: "");.        string ddate = input.readline();.        system.out.print(""return date: "");.        string rdate = input.readline();.        system.out.print(""rent amount: "");.        string ramount = input.readline();..        system.out.print(""you pay:""+ramount);...    }.    public static void search()throws exception.    {   system.out.print(""n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        system.out.print("" |search costumers|"");.        system.out.print(""~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~"");.        
system.out.print(""nenter costumer name: "");.        string cname = input.readline();.        boolean found = false;..        for(int i=0; i &lt; rows; i++){.            string temp[] = list.get(i).split("","");..            if(cname.equals(temp[0])){.            system.out.println(""search result:nyou are "" + temp[0] + "" from "" + temp[1] + "".""+ temp[2] + "".""+ temp[3] + "".""+ temp[4] + "".""+ temp[5] + "" is "" + temp[6] + "".""+ temp[7] + "" is "" + temp[8] + ""."");.                found = true;.            }.        }..        if(!found){.            system.out.print(""no results."");.        }..    }..        public static boolean evaluate(string uname, string pass){.        if (uname.equals(""admin"")&amp;&amp;pass.equals(""12345"")) return true;.        else return false;.    }..    public static string login()throws exception{.        bufferedreader input=new bufferedreader(new inputstreamreader(system.in));.        int counter=0;.        do{.            system.out.print(""username:"");.            string uname =input.readline();.            system.out.print(""password:"");.            string pass =input.readline();..            boolean accept= evaluate(uname,pass);..            if(accept){.                break;.                }else{.                    system.out.println(""incorrect username or password!"");.                    counter ++;.                    }.        }while(counter&lt;3);..            if(counter !=3) return ""login successful"";.            else return ""login failed"";.            }.        public static void introduction() throws exception{..        system.out.println(""                  - - - - - - - - - - - - - - - - - - - - - - - - -"");.        system.out.println(""                  !                  r e n t a l                  !"");.        system.out.println(""                   ! ~ ~ ~ ~ ~ !  =================  ! ~ ~ ~ ~ ~ !"");.        system.out.println(""                  !                  
s y s t e m                  !"");.        system.out.println(""                  - - - - - - - - - - - - - - - - - - - - - - - - -"");.        }..}"\n'
Label: 1
Question:  b'"exception: dynamic sql generation for the updatecommand is not supported against a selectcommand that does not return any key i dont know what is the problem this my code : ..string nomtable;..datatable listeetablissementtable = new datatable();.datatable listeinteretstable = new datatable();.dataset ds = new dataset();.sqldataadapter da;.sqlcommandbuilder cmdb;..private void listeinterets_click(object sender, eventargs e).{.    nomtable = ""listeinteretstable"";.    d.cnx.open();.    da = new sqldataadapter(""select nome from offices"", d.cnx);.    ds = new dataset();.    da.fill(ds, nomtable);.    datagridview1.datasource = ds.tables[nomtable];.}..private void sauvgarder_click(object sender, eventargs e).{.    d.cnx.open();.    cmdb = new sqlcommandbuilder(da);.    da.update(ds, nomtable);.    d.cnx.close();.}"\n'
Label: 0
Question:  b'"parameter with question mark and super in blank, i\'ve come across a method that is formatted like this:..public final subscription subscribe(final action1&lt;? super t&gt; onnext, final action1&lt;throwable&gt; onerror) {.}...in the first parameter, what does the question mark and super mean?"\n'
Label: 1
Question:  b'call two objects wsdl the first time i got a very strange wsdl. ..i would like to call the object (interface - invoicecheck_out) do you know how?....i would like to call the object (variable) do you know how?..try to call (it`s ok)....try to call (how call this?)\n'
Label: 0
Question:  b"how to correctly make the icon for systemtray in blank using icon sizes of any dimension for systemtray doesn't look good overall. .what is the correct way of making icons for windows system tray?..screenshots: http://imgur.com/zsibwn9..icon: http://imgur.com/vsh4zo8\n"
Label: 0
Question:  b'"is there a way to check a variable that exists in a different script than the original one? i\'m trying to check if a variable, which was previously set to true in 2.py in 1.py, as 1.py is only supposed to continue if the variable is true...2.py..import os..completed = false..#some stuff here..completed = true...1.py..import 2 ..if completed == true.   #do things...however i get a syntax error at ..if completed == true"\n'
Label: 3
Question:  b'"blank control flow i made a number which asks for 2 numbers with blank and responds with  the corresponding message for the case. how come it doesnt work  for the second number ? .regardless what i enter for the second number , i am getting the message ""your number is in the range 0-10""...using system;.using system.collections.generic;.using system.linq;.using system.text;..namespace consoleapplication1.{.    class program.    {.        static void main(string[] args).        {.            string myinput;  // declaring the type of the variables.            int myint;..            string number1;.            int number;...            console.writeline(""enter a number"");.            myinput = console.readline(); //muyinput is a string  which is entry input.            myint = int32.parse(myinput); // myint converts the string into an integer..            if (myint &gt; 0).                console.writeline(""your number {0} is greater than zero."", myint);.            else if (myint &lt; 0).                console.writeline(""your number {0} is  less  than zero."", myint);.            else.                console.writeline(""your number {0} is equal zero."", myint);..            console.writeline(""enter another number"");.            number1 = console.readline(); .            number = int32.parse(myinput); ..            if (number &lt; 0 || number == 0).                console.writeline(""your number {0} is  less  than zero or equal zero."", number);.            else if (number &gt; 0 &amp;&amp; number &lt;= 10).                console.writeline(""your number {0} is  in the range from 0 to 10."", number);.            else.                console.writeline(""your number {0} is greater than 10."", number);..            console.writeline(""enter another number"");..        }.    }    .}"\n'
Label: 0
Question:  b'"credentials cannot be used for ntlm authentication i am getting org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials cannot be used for ntlm authentication: exception in eclipse..whether it is possible mention eclipse to take system proxy settings directly?..public class httpgetproxy {.    private static final string proxy_host = ""proxy.****.com"";.    private static final int proxy_port = 6050;..    public static void main(string[] args) {.        httpclient client = new httpclient();.        httpmethod method = new getmethod(""https://kodeblank.org"");..        hostconfiguration config = client.gethostconfiguration();.        config.setproxy(proxy_host, proxy_port);..        string username = ""*****"";.        string password = ""*****"";.        credentials credentials = new usernamepasswordcredentials(username, password);.        authscope authscope = new authscope(proxy_host, proxy_port);..        client.getstate().setproxycredentials(authscope, credentials);..        try {.            client.executemethod(method);..            if (method.getstatuscode() == httpstatus.sc_ok) {.                string response = method.getresponsebodyasstring();.                system.out.println(""response = "" + response);.            }.        } catch (ioexception e) {.            e.printstacktrace();.        } finally {.            method.releaseconnection();.        }.    }.}...exception:...  dec 08, 2017 1:41:39 pm .          org.apache.commons.httpclient.auth.authchallengeprocessor selectauthscheme.         info: ntlm authentication scheme selected.       dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector executeconnect.         severe: credentials cannot be used for ntlm authentication: .           org.apache.commons.httpclient.usernamepasswordcredentials.           org.apache.commons.httpclient.auth.invalidcredentialsexception: credentials .         cannot be used for ntlm authentication: .        
enter code here .          org.apache.commons.httpclient.usernamepasswordcredentials.      at org.apache.commons.httpclient.auth.ntlmscheme.authenticate(ntlmscheme.blank:332).        at org.apache.commons.httpclient.httpmethoddirector.authenticateproxy(httpmethoddirector.blank:320).      at org.apache.commons.httpclient.httpmethoddirector.executeconnect(httpmethoddirector.blank:491).      at org.apache.commons.httpclient.httpmethoddirector.executewithretry(httpmethoddirector.blank:391).      at org.apache.commons.httpclient.httpmethoddirector.executemethod(httpmethoddirector.blank:171).      at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:397).      at org.apache.commons.httpclient.httpclient.executemethod(httpclient.blank:323).      at httpgetproxy.main(httpgetproxy.blank:31).  dec 08, 2017 1:41:39 pm org.apache.commons.httpclient.httpmethoddirector processproxyauthchallenge.  info: failure authenticating with ntlm @proxy.****.com:6050"\n'
Label: 1

The labels are 0, 1, 2, or 3. To check which of these corresponds to which string label, you can inspect the class_names property on the dataset:

for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)

Label 0 corresponds to csharp
Label 1 corresponds to java
Label 2 corresponds to javascript
Label 3 corresponds to python

Next, you will create a validation set and a test set using tf.keras.utils.text_dataset_from_directory. You will use the remaining 1,600 questions from the training set for validation.

# Create a validation set.
raw_val_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.

test_dir = dataset_dir/'test'

# Create a test set.
raw_test_ds = utils.text_dataset_from_directory(
    test_dir,
    batch_size=batch_size)

Found 8000 files belonging to 4 classes.

Prepare the dataset for training

Next, you will standardize, tokenize, and vectorize the data using the tf.keras.layers.TextVectorization layer.

Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.
Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).
Vectorization refers to converting tokens into numbers so they can be fed into a neural network.

All of these tasks can be accomplished with this layer. (You can learn more about each of these in the tf.keras.layers.TextVectorization API docs.)

Note that:

The default standardization converts text to lowercase and removes punctuation (standardize='lower_and_strip_punctuation').
The default tokenizer splits on whitespace (split='whitespace').
The default vectorization mode is 'int' (output_mode='int'). This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes, like 'binary', to build bag-of-words models.
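
As a rough illustration of the first two defaults, the following plain-Python sketch lowercases text, strips ASCII punctuation, and splits on whitespace. It approximates the behavior described above and is not the layer's actual code: `standardize` and `tokenize` are hypothetical helpers.

```python
import string

def standardize(text):
    """Lowercase the text and drop ASCII punctuation characters."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    """Split on runs of whitespace."""
    return text.split()

tokens = tokenize(standardize("How do I sort a dictionary by value?"))
print(tokens)
# ['how', 'do', 'i', 'sort', 'a', 'dictionary', 'by', 'value']
```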

You will build two models to learn more about standardization, tokenization, and vectorization with TextVectorization:

First, you will use the 'binary' vectorization mode to build a bag-of-words model.
Then, you will use the 'int' mode with a 1D ConvNet.

VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')

For the 'int' mode, in addition to the maximum vocabulary size, you need to set an explicit maximum sequence length (MAX_SEQUENCE_LENGTH), which will cause the layer to pad or truncate sequences to exactly output_sequence_length values:

MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
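
The effect of output_sequence_length can be sketched without TensorFlow: short sequences get trailing zeros, long ones lose their tail. `pad_or_truncate` below is a hypothetical helper written only to illustrate that rule.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Return exactly `length` ids: cut the tail or append `pad_id`."""
    clipped = token_ids[:length]
    return clipped + [pad_id] * (length - len(clipped))

print(pad_or_truncate([55, 6, 2], 5))                 # [55, 6, 2, 0, 0]
print(pad_or_truncate([55, 6, 2, 410, 211, 229], 5))  # [55, 6, 2, 410, 211]
```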

Next, call TextVectorization.adapt to fit the state of the preprocessing layer to the dataset. This will cause the layer to build an index of strings to integers.

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)
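
Conceptually, adapting in 'int' mode amounts to counting tokens and keeping the most frequent ones, with index 0 reserved for padding and index 1 for out-of-vocabulary tokens. The sketch below shows that idea in plain Python. `build_vocabulary` and `to_ids` are hypothetical helpers, the corpus is made up, and the layer's real tie-breaking order may differ.

```python
import collections

def build_vocabulary(token_lists, max_tokens):
    """Most frequent tokens first, after the two reserved entries."""
    counts = collections.Counter(tok for toks in token_lists for tok in toks)
    vocab = ["", "[UNK]"]  # 0 = padding, 1 = out-of-vocabulary
    vocab += [tok for tok, _ in counts.most_common(max_tokens - len(vocab))]
    return vocab

def to_ids(tokens, vocab):
    """Map tokens to indices, sending unknown tokens to index 1."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in tokens]

corpus = [["sort", "a", "list"], ["sort", "a"], ["sort"]]
vocab = build_vocabulary(corpus, max_tokens=4)
print(vocab)                                 # ['', '[UNK]', 'sort', 'a']
print(to_ids(["sort", "a", "list"], vocab))  # [2, 3, 1]
```

Because max_tokens is 4 here, the rarest token ("list") falls outside the vocabulary and maps to the out-of-vocabulary index.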

Print the result of using these layers to preprocess data:

def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

# Retrieve a batch (of 32 questions and labels) from the dataset.
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)

Question tf.Tensor(b'"what is the difference between these two ways to create an element? var a = document.createelement(\'div\');..a.id = ""mydiv"";...and..var a = document.createelement(\'div\').id = ""mydiv"";...what is the difference between them such that the first one works and the second one doesn\'t?"\n', shape=(), dtype=string)
Label tf.Tensor(2, shape=(), dtype=int32)

print("'binary' vectorized question:",
      binary_vectorize_text(first_question, first_label)[0])

'binary' vectorized question: tf.Tensor([[1. 1. 0. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)

print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])

'int' vectorized question: tf.Tensor(
[[ 55   6   2 410 211 229 121 895   4 124  32 245  43   5   1   1   5   1
    1   6   2 410 211 191 318  14   2  98  71 188   8   2 199  71 178   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]], shape=(1, 250), dtype=int64)

As shown above, TextVectorization's 'binary' mode returns an array denoting which tokens exist at least once in the input, while the 'int' mode replaces each token by an integer, thus preserving their order.
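
The difference between the two modes is easy to see on a toy example. In this sketch the token index is made up and both helper functions are hypothetical: the integer encoding changes when the words are reordered, while the multi-hot (bag-of-words) encoding does not.

```python
def int_encode(tokens, index):
    """One integer per token, in input order."""
    return [index.get(tok, index["[UNK]"]) for tok in tokens]

def multi_hot_encode(tokens, index):
    """One slot per vocabulary entry, set to 1.0 if the token occurs at all."""
    vector = [0.0] * len(index)
    for tok in tokens:
        vector[index.get(tok, index["[UNK]"])] = 1.0
    return vector

index = {"[UNK]": 0, "the": 1, "first": 2, "second": 3, "one": 4}
a = "the first one".split()
b = "one the first".split()

print(int_encode(a, index), int_encode(b, index))  # [1, 2, 4] [4, 1, 2]
print(multi_hot_encode(a, index))                  # [0.0, 1.0, 1.0, 0.0, 1.0]
print(multi_hot_encode(a, index) == multi_hot_encode(b, index))  # True
```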

You can look up the token (string) that each integer corresponds to by calling TextVectorization.get_vocabulary on the layer:

print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

1289 --->  roman
313 --->  source
Vocabulary size: 10000

You are almost ready to train your model.

As a final preprocessing step, you will apply the TextVectorization layers you created earlier to the training, validation, and test sets:

binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

Configure the dataset for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

Dataset.cache keeps data in memory after it is loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
Dataset.prefetch overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk, in the Prefetching section of the Better performance with the tf.data API guide.

AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)

binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)
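
To give a feel for what prefetching means, here is a small stdlib-only sketch in which a background thread prepares upcoming items while the consumer works on the current one. It illustrates the general idea only and says nothing about how tf.data implements it: `prefetch` is a hypothetical generator and the squared numbers stand in for preprocessed batches.

```python
import queue
import threading

def prefetch(iterable, buffer_size=1):
    """Yield items from `iterable`, producing them ahead of time in a thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item

batches = list(prefetch((n * n for n in range(5)), buffer_size=2))
print(batches)  # [0, 1, 4, 9, 16]
```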

Train the model

It's time to create your neural network.

For the 'binary' vectorized data, define a simple bag-of-words linear model, then configure and train it:

binary_model = tf.keras.Sequential([layers.Dense(4)])

binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)

Epoch 1/10
200/200 [==============================] - 2s 4ms/step - loss: 1.1170 - accuracy: 0.6509 - val_loss: 0.9165 - val_accuracy: 0.7844
Epoch 2/10
200/200 [==============================] - 1s 3ms/step - loss: 0.7781 - accuracy: 0.8169 - val_loss: 0.7522 - val_accuracy: 0.8050
Epoch 3/10
200/200 [==============================] - 1s 3ms/step - loss: 0.6274 - accuracy: 0.8591 - val_loss: 0.6664 - val_accuracy: 0.8163
Epoch 4/10
200/200 [==============================] - 1s 3ms/step - loss: 0.5342 - accuracy: 0.8866 - val_loss: 0.6129 - val_accuracy: 0.8188
Epoch 5/10
200/200 [==============================] - 1s 3ms/step - loss: 0.4683 - accuracy: 0.9038 - val_loss: 0.5761 - val_accuracy: 0.8281
Epoch 6/10
200/200 [==============================] - 1s 3ms/step - loss: 0.4181 - accuracy: 0.9181 - val_loss: 0.5494 - val_accuracy: 0.8331
Epoch 7/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3779 - accuracy: 0.9287 - val_loss: 0.5293 - val_accuracy: 0.8388
Epoch 8/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3446 - accuracy: 0.9361 - val_loss: 0.5137 - val_accuracy: 0.8400
Epoch 9/10
200/200 [==============================] - 1s 3ms/step - loss: 0.3164 - accuracy: 0.9430 - val_loss: 0.5014 - val_accuracy: 0.8381
Epoch 10/10
200/200 [==============================] - 1s 3ms/step - loss: 0.2920 - accuracy: 0.9495 - val_loss: 0.4916 - val_accuracy: 0.8388
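
The loss used here takes raw scores (logits) and an integer label. A minimal per-example version can be written in plain Python. This is a sketch of the formula only, with a hypothetical function name; the Keras loss additionally averages over the batch.

```python
import math

def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy of one example, given raw scores and an integer label."""
    # Subtract the max before exponentiating for numerical stability.
    shifted = [z - max(logits) for z in logits]
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return log_sum_exp - shifted[label]

uniform = sparse_categorical_crossentropy([0.0, 0.0, 0.0, 0.0], label=2)
confident = sparse_categorical_crossentropy([0.0, 0.0, 5.0, 0.0], label=2)
print(round(uniform, 4))    # 1.3863, i.e. ln(4)
print(confident < uniform)  # True
```

With four classes, a model that scores every class equally has a loss of ln(4), about 1.39, which is a useful baseline when reading the training log above.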

Next, you will use the 'int' vectorized data to build a 1D ConvNet:

def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model

# `vocab_size` is `VOCAB_SIZE + 1` since `0` is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)

Epoch 1/5
200/200 [==============================] - 9s 5ms/step - loss: 1.1471 - accuracy: 0.5016 - val_loss: 0.7856 - val_accuracy: 0.6913
Epoch 2/5
200/200 [==============================] - 1s 3ms/step - loss: 0.6378 - accuracy: 0.7550 - val_loss: 0.5494 - val_accuracy: 0.8056
Epoch 3/5
200/200 [==============================] - 1s 3ms/step - loss: 0.3900 - accuracy: 0.8764 - val_loss: 0.4845 - val_accuracy: 0.8206
Epoch 4/5
200/200 [==============================] - 1s 3ms/step - loss: 0.2234 - accuracy: 0.9447 - val_loss: 0.4819 - val_accuracy: 0.8188
Epoch 5/5
200/200 [==============================] - 1s 3ms/step - loss: 0.1146 - accuracy: 0.9809 - val_loss: 0.5038 - val_accuracy: 0.8150

Compare the two models:

print("Linear model on binary vectorized data:")
print(binary_model.summary())

Linear model on binary vectorized data:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 4)                 40004     
                                                                 
=================================================================
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________
None

print("ConvNet model on int vectorized data:")
print(int_model.summary())

ConvNet model on int vectorized data:
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, None, 64)          640064    
                                                                 
 conv1d (Conv1D)             (None, None, 64)          20544     
                                                                 
 global_max_pooling1d (Globa  (None, 64)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense_1 (Dense)             (None, 4)                 260       
                                                                 
=================================================================
Total params: 660,868
Trainable params: 660,868
Non-trainable params: 0
_________________________________________________________________
None
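
The parameter counts in the two summaries can be checked by hand. The short sketch below recomputes them, assuming `VOCAB_SIZE` is 10,000 and the binary vectors have 10,000 features (both inferred from the totals shown above). The helper names are illustrative only and are not part of Keras.

```python
# Recompute the parameter counts reported by `Model.summary()` above.
# Assumption: VOCAB_SIZE = 10000, inferred from the summaries.
VOCAB_SIZE = 10000
NUM_LABELS = 4
EMBEDDING_DIM = 64
CONV_FILTERS = 64
KERNEL_SIZE = 5


def dense_params(inputs, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return inputs * units + units


def embedding_params(vocab_size, dim):
    # One `dim`-sized vector per token ID.
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    # Each filter spans `kernel_size` steps over all input channels, plus a bias.
    return kernel_size * in_channels * filters + filters


linear_total = dense_params(VOCAB_SIZE, NUM_LABELS)
convnet_total = (
    embedding_params(VOCAB_SIZE + 1, EMBEDDING_DIM)
    + conv1d_params(KERNEL_SIZE, EMBEDDING_DIM, CONV_FILTERS)
    + dense_params(CONV_FILTERS, NUM_LABELS)
)

print("Linear model:", linear_total)    # 40004
print("ConvNet model:", convnet_total)  # 660868
```

Most of the ConvNet's parameters sit in the embedding table, which is why the vocabulary size dominates the model size.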

Evaluate both models on the test data:

binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))

250/250 [==============================] - 1s 3ms/step - loss: 0.5178 - accuracy: 0.8151
250/250 [==============================] - 1s 2ms/step - loss: 0.5262 - accuracy: 0.8073
Binary model accuracy: 81.51%
Int model accuracy: 80.73%

Export the model

In the code above, you applied tf.keras.layers.TextVectorization to the dataset before feeding text to the model. If you want your model to process raw strings (for example, to simplify deploying it), you can include the TextVectorization layer inside your model.

To do so, you can create a new model using the weights you have just trained:

export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))

250/250 [==============================] - 1s 4ms/step - loss: 0.5178 - accuracy: 0.8151
Accuracy: 81.51%
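
The exported model reports the same accuracy as `binary_model` because the added sigmoid is strictly increasing, so it never changes which label has the highest score. Here is a minimal pure-Python sketch of that property, using made-up logits rather than real model output:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def argmax(values):
    # Index of the largest value (the first one wins on ties).
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for one question over four labels.
logits = [-1.2, 0.3, 2.5, 0.9]
scores = [sigmoid(x) for x in logits]

print(argmax(logits), argmax(scores))  # 2 2
```

Note that sigmoid scores are not normalized across labels. If you need scores that sum to one, softmax is the usual choice for a multi-class output.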

Now, your model can take raw strings as input and predict a score for each label using Model.predict. Define a function to find the label with the maximum score:

def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels
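
For intuition, the same lookup can be written without TensorFlow. The sketch below assumes the class names are `['csharp', 'java', 'javascript', 'python']` (in the tutorial the real values come from `raw_train_ds.class_names`) and uses made-up scores; `get_string_labels_py` is a hypothetical helper name.

```python
# Assumed class names; in the tutorial they come from `raw_train_ds.class_names`.
CLASS_NAMES = ["csharp", "java", "javascript", "python"]


def get_string_labels_py(predicted_scores_batch):
    """Return the class name with the highest score for each row."""
    labels = []
    for scores in predicted_scores_batch:
        best_index = max(range(len(scores)), key=lambda i: scores[i])
        labels.append(CLASS_NAMES[best_index])
    return labels


# Hypothetical scores for two questions.
batch = [
    [0.10, 0.20, 0.15, 0.90],
    [0.05, 0.80, 0.30, 0.20],
]
print(get_string_labels_py(batch))  # ['python', 'java']
```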

Run inference on new data

inputs = [
    "how do I extract keys from a dict into a list?",  # 'python'
    "debug public static void main(string[] args) {...}",  # 'java'
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())

Question:  how do I extract keys from a dict into a list?
Predicted label:  b'python'
Question:  debug public static void main(string[] args) {...}
Predicted label:  b'java'

Including the text preprocessing logic inside your model enables you to export a model for production that simplifies deployment and reduces the potential for train/test skew.

There is a performance difference to keep in mind when choosing where to apply tf.keras.layers.TextVectorization. Using it outside of your model enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So, if you are training your model on the GPU, you probably want to go with this option to get the best performance while developing your model, then switch to including the TextVectorization layer inside your model when you are ready to prepare for deployment.

Visit the Save and load models tutorial to learn more about saving models.

Example 2: Predict the author of Iliad translations

The following provides an example of using tf.data.TextLineDataset to load examples from text files, and TensorFlow Text to preprocess the data. You will use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.

Download and explore the dataset

The texts of the three translations are by William Cowper; Edward, Earl of Derby; and Samuel Butler.

The text files used in this tutorial have undergone some typical preprocessing tasks, such as removing the document header and footer, line numbers, and chapter titles.

Download these lightly preprocessed files locally:

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
819200/815980 [==============================] - 0s 0us/step
827392/815980 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
811008/809730 [==============================] - 0s 0us/step
819200/809730 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt
811008/807992 [==============================] - 0s 0us/step
819200/807992 [==============================] - 0s 0us/step
[PosixPath('/home/kbuilder/.keras/datasets/derby.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/butler.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/cowper.txt'),
 PosixPath('/home/kbuilder/.keras/datasets/fashion-mnist'),
 PosixPath('/home/kbuilder/.keras/datasets/mnist.npz')]

Load the dataset

Previously, with tf.keras.utils.text_dataset_from_directory, all contents of a file were treated as a single example. Here, you will use tf.data.TextLineDataset, which is designed to create a tf.data.Dataset from a text file where each example is a line of text from the original file. TextLineDataset is useful for text data that is primarily line-based (for example, poetry or error logs).

Iterate through these files, loading each one into its own dataset. Each example needs to be individually labeled, so use Dataset.map to apply a labeler function to each one. This will iterate over every example in the dataset, returning (example, label) pairs.

def labeler(example, index):
  return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)
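
The labeling step above can be pictured without TensorFlow. In this minimal sketch, the `FILES` dictionary and its short placeholder lines are hypothetical stand-ins for the three downloaded text files; the point is only that every line from file `i` ends up paired with label `i`.

```python
# Hypothetical in-memory stand-ins for the three text files.
FILES = {
    "cowper.txt": ["line a", "line b"],
    "derby.txt": ["line c"],
    "butler.txt": ["line d", "line e"],
}


def labeler(example, index):
    return example, index


labeled_data_sets = []
for i, (file_name, lines) in enumerate(FILES.items()):
    # Every line from file `i` gets label `i`.
    labeled_data_sets.append([labeler(line, i) for line in lines])

for labeled in labeled_data_sets:
    print(labeled)
```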

Next, you will combine these labeled datasets into a single dataset using Dataset.concatenate, and shuffle it with Dataset.shuffle:

BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
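
Because the datasets are concatenated file by file, shuffling matters here: without it, all examples with the same label would be adjacent. The following is a simplified pure-Python sketch of buffer-based shuffling, intended to convey the idea behind a shuffle buffer; it is not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical name.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield `items` in shuffled order using a buffer of `buffer_size` (>= 1)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # Still filling the buffer.
            continue
        # Emit a random buffered element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: drain the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer


data = list(range(10))
shuffled = list(buffered_shuffle(data, buffer_size=4, seed=0))
print(shuffled)
print(sorted(shuffled) == data)  # True
```

In this scheme the first element emitted can only come from the first `buffer_size` inputs, which is why a buffer at least as large as the dataset is needed for a uniform shuffle.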

Print out a few examples as before. The dataset has not been batched yet, so each entry in all_labeled_data corresponds to one data point:

for text, label in all_labeled_data.take(10):
  print("Sentence: ", text.numpy())
  print("Label:", label.numpy())

Sentence:  b'Beneath the yoke the flying coursers led.'
Label: 1
Sentence:  b'Too free a range, and watchest all I do;'
Label: 1
Sentence:  b'defence of their ships. Thus would any seer who was expert in these'
Label: 2
Sentence:  b'"From morn to eve I fell, a summer\'s day,"'
Label: 0
Sentence:  b'went to the city bearing a message of peace to the Cadmeians; on his'
Label: 2
Sentence:  b'darkness of the flying night, and tell it to Agamemnon. This might'
Label: 2
Sentence:  b"To that distinction, Nestor's son, whom yet"
Label: 0
Sentence:  b'A sounder judge of honour and disgrace:'
Label: 1
Sentence:  b'He wept as he spoke, and the elders sighed in concert as each thought'
Label: 2
Sentence:  b'to gather his bones for the silt in which I shall have hidden him, and'
Label: 2

Prepare the dataset for training

Instead of using tf.keras.layers.TextVectorization to preprocess the text dataset, you will now use the TensorFlow Text APIs to standardize and tokenize the data, build a vocabulary, and use tf.lookup.StaticVocabularyTable to map tokens to integers to feed to the model. (Learn more about TensorFlow Text.)

Define a function to convert the text to lower case and tokenize it:

TensorFlow Text provides various tokenizers. In this example, you will use text.UnicodeScriptTokenizer to tokenize the dataset.
You will use Dataset.map to apply the tokenization to the dataset.

tokenizer = tf_text.UnicodeScriptTokenizer()

def tokenize(text, unused_label):
  lower_case = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)

tokenized_ds = all_labeled_data.map(tokenize)

You can iterate over the dataset and print out a few tokenized examples:

for text_batch in tokenized_ds.take(5):
  print("Tokens: ", text_batch.numpy())

Tokens:  [b'beneath' b'the' b'yoke' b'the' b'flying' b'coursers' b'led' b'.']
Tokens:  [b'too' b'free' b'a' b'range' b',' b'and' b'watchest' b'all' b'i' b'do'
 b';']
Tokens:  [b'defence' b'of' b'their' b'ships' b'.' b'thus' b'would' b'any' b'seer'
 b'who' b'was' b'expert' b'in' b'these']
Tokens:  [b'"' b'from' b'morn' b'to' b'eve' b'i' b'fell' b',' b'a' b'summer' b"'"
 b's' b'day' b',"']
Tokens:  [b'went' b'to' b'the' b'city' b'bearing' b'a' b'message' b'of' b'peace'
 b'to' b'the' b'cadmeians' b';' b'on' b'his']
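
To get a feel for the token boundaries in the output above, here is a rough pure-Python approximation. It is only a sketch: the real text.UnicodeScriptTokenizer splits on Unicode script boundaries, whereas the regular expression below (an assumption of this example) merely separates runs of word characters from runs of punctuation, which happens to give the same result for plain English lines like these.

```python
import re

# Rough stand-in: runs of word characters, or runs of punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenize_py(text):
    lower_case = text.casefold()  # Plays the role of `case_fold_utf8`.
    return TOKEN_PATTERN.findall(lower_case)


print(tokenize_py("Beneath the yoke the flying coursers led."))
# ['beneath', 'the', 'yoke', 'the', 'flying', 'coursers', 'led', '.']
print(tokenize_py('"From morn to eve I fell, a summer\'s day,"'))
```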

Next, you will build a vocabulary by sorting tokens by frequency and keeping the top VOCAB_SIZE tokens:

tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok] += 1

vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries:", vocab[:5])

Vocab size:  10000
First five vocab entries: [b',', b'the', b'and', b"'", b'of']
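
The same frequency-based selection can be expressed with `collections.Counter` from the standard library. This is an alternative sketch on a tiny, made-up token list with a deliberately small `VOCAB_SIZE`, not the code the tutorial runs.

```python
import collections

# Hypothetical tokenized lines; the tutorial iterates over `tokenized_ds`.
tokenized_lines = [
    ["the", "ships", ",", "the", "sea"],
    ["the", "sea", ","],
    ["the", "wrath", ","],
]
VOCAB_SIZE = 3  # Small value for illustration only.

counts = collections.Counter()
for tokens in tokenized_lines:
    counts.update(tokens)

# `most_common` returns (token, count) pairs sorted by descending count.
vocab = [token for token, _ in counts.most_common(VOCAB_SIZE)]
print(vocab)  # ['the', ',', 'sea']
```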

To convert the tokens into integers, use the vocab list to create a tf.lookup.StaticVocabularyTable. You will map tokens to integers in the range [2, vocab_size + 2). As with the TextVectorization layer, 0 is reserved to denote padding and 1 is reserved to denote an out-of-vocabulary (OOV) token.

keys = vocab
values = range(2, len(vocab) + 2)  # Reserve `0` for padding, `1` for OOV tokens.

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)
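
The ID scheme described above (0 for padding, 1 for OOV, vocabulary tokens starting at 2) can be illustrated with a plain dictionary. This sketch shows the intended mapping only; it is not a reimplementation of tf.lookup.StaticVocabularyTable, whose handling of OOV buckets may differ, and the three-token vocabulary is made up.

```python
PAD_ID = 0  # Reserved for padding.
OOV_ID = 1  # Reserved for out-of-vocabulary tokens.

vocab = ["the", ",", "sea"]  # Hypothetical, most frequent first.
token_to_id = {token: index + 2 for index, token in enumerate(vocab)}


def lookup(tokens):
    # Unknown tokens fall back to the OOV ID.
    return [token_to_id.get(token, OOV_ID) for token in tokens]


print(lookup(["the", "wrath", "sea"]))  # [2, 1, 4]
```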

Finally, define a function to standardize, tokenize and vectorize the dataset using the tokenizer and lookup table:

def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label

You can try this on a single example to print the output:

example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())

Sentence:  b'Beneath the yoke the flying coursers led.'
Vectorized sentence:  [234   3 811   3 446 749 248   7]

Now run the preprocess function on the dataset using Dataset.map:

all_encoded_data = all_labeled_data.map(preprocess_text)

Split the dataset into training and test sets

The Keras TextVectorization layer also batches and pads the vectorized data. Padding is required because the examples inside a batch need to have the same size and shape, but the examples in these datasets are not all the same size: each line of text has a different number of words.

tf.data.Dataset supports splitting and padded-batching datasets:

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

Now, validation_data and train_data are not collections of (example, label) pairs, but collections of batches. Each batch is a pair of (many examples, many labels) represented as arrays.

To illustrate this:

sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])

Text batch shape:  (64, 18)
Label batch shape:  (64,)
First text example:  tf.Tensor([234   3 811   3 446 749 248   7   0   0   0   0   0   0   0   0   0   0], shape=(18,), dtype=int64)
First label example:  tf.Tensor(1, shape=(), dtype=int64)
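
The zeros at the end of the first text example come from padding each batch to the length of its longest row. A minimal pure-Python sketch of that behavior follows; the function below is a hypothetical stand-in operating on lists, and the input IDs are made up.

```python
def padded_batch(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its longest row."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        batches.append([seq + [pad_value] * (width - len(seq)) for seq in batch])
    return batches


# Hypothetical vectorized lines of different lengths.
encoded = [[234, 3, 811], [5], [7, 8], [9, 10, 11, 12]]
for batch in padded_batch(encoded, batch_size=2):
    print(batch)
# [[234, 3, 811], [5, 0, 0]]
# [[7, 8, 0, 0], [9, 10, 11, 12]]
```

Because the width is chosen per batch, different batches can have different sequence lengths.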

Since you use 0 for padding and 1 for out-of-vocabulary (OOV) tokens, the vocabulary size has increased by two:

vocab_size += 2

Configure the datasets for better performance as before:

train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

Train the model

You can train a model on this dataset as before:

model = create_model(vocab_size=vocab_size, num_labels=3)

model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

history = model.fit(train_data, validation_data=validation_data, epochs=3)

Epoch 1/3
697/697 [==============================] - 27s 9ms/step - loss: 0.5238 - accuracy: 0.7658 - val_loss: 0.3814 - val_accuracy: 0.8306
Epoch 2/3
697/697 [==============================] - 3s 4ms/step - loss: 0.2852 - accuracy: 0.8847 - val_loss: 0.3697 - val_accuracy: 0.8428
Epoch 3/3
697/697 [==============================] - 3s 4ms/step - loss: 0.1924 - accuracy: 0.9279 - val_loss: 0.3917 - val_accuracy: 0.8424

loss, accuracy = model.evaluate(validation_data)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

79/79 [==============================] - 1s 2ms/step - loss: 0.3917 - accuracy: 0.8424
Loss:  0.391705721616745
Accuracy: 84.24%

Export the model

To make the model capable of taking raw strings as input, you will create a Keras TextVectorization layer that performs the same steps as your custom preprocessing function. Since you have already built a vocabulary, you can use TextVectorization.set_vocabulary instead of TextVectorization.adapt, which would train a new vocabulary.

preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

preprocess_layer.set_vocabulary(vocab)

export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Create a test dataset of raw strings.
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)

loss, accuracy = export_model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

2022-02-05 02:26:40.203675: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: sequential_4/text_vectorization_2/UnicodeScriptTokenize/Assert_1/AssertGuard/branch_executed/_185
79/79 [==============================] - 6s 8ms/step - loss: 0.4955 - accuracy: 0.7964
Loss:  0.4955357015132904
Accuracy: 79.64%

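The loss and accuracy of the exported model on the raw validation set are expected to match those of the model on the encoded validation set. In the run shown here they do not (loss 0.4955 and accuracy 79.64%, against 0.3917 and 84.24%), which suggests that the TextVectorization layer does not reproduce the custom preprocessing function exactly.

Run inference on new data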

inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]

predicted_scores = export_model.predict(inputs)
predicted_labels = tf.argmax(predicted_scores, axis=1)

for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())

2022-02-05 02:26:43.328949: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: sequential_4/text_vectorization_2/UnicodeScriptTokenize/Assert_1/AssertGuard/branch_executed/_185
Question:  Join'd to th' Ionians with their flowing robes,
Predicted label:  1
Question:  the allies, and his armour flashed about him so that he seemed to all
Predicted label:  2
Question:  And with loud clangor of his arms he fell.
Predicted label:  0

Download more datasets using TensorFlow Datasets (TFDS)

You can download many more datasets from TensorFlow Datasets.

In this example, you will use the IMDB Large Movie Review dataset to train a model for sentiment classification:

# Training set.
train_ds = tfds.load(
    'imdb_reviews',
    split='train[:80%]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

# Validation set.
val_ds = tfds.load(
    'imdb_reviews',
    split='train[80%:]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

Print a few examples:

for review_batch, label_batch in val_ds.take(1):
  for i in range(5):
    print("Review: ", review_batch[i].numpy())
    print("Label: ", label_batch[i].numpy())

Review: b"Instead, go to the zoo, buy some peanuts and feed 'em to the monkeys. Monkeys are funny. People with amnesia who don't say much, just sit there with vacant eyes are not all that funny. Black comedy? There isn't a black person in it, and there isn't one funny thing in it either. Walmart buys these things up somehow and puts them on their dollar rack. It's labeled Unrated. I think they took out the topless scene. They may have taken out other stuff too, who knows? All we know is that whatever they took out, isn't there any more. The acting seemed OK to me. There's a lot of unfathomables tho. It's supposed to be a city? It's supposed to be a big lake? If it's so hot in the church people are fanning themselves, why are they all wearing coats?"
Label: 0
Review: b'Well, was Morgan Freeman any more unusual as God than George Burns? This film sure was better than that bore, "Oh, God". I was totally engrossed and LMAO all the way through. Carrey was perfect as the out of sorts anchorman wannabe, and Aniston carried off her part as the frustrated girlfriend in her usual well played performance. I, for one, don\'t consider her to be either ugly or untalented. I think my favorite scene was when Carrey opened up the file cabinet thinking it could never hold his life history. See if you can spot the file in the cabinet that holds the events of his bathroom humor: I was rolling over this one. Well written and even better played out, this comedy will go down as one of this funnyman\'s best.'
Label: 1
Review: b'I remember stumbling upon this special while channel-surfing in 1965. I had never heard of Barbra before. When the show was over, I thought "This is probably the best thing on TV I will ever see in my life." 42 years later, that has held true. There is still nothing so amazing, so honestly astonishing as the talent that was displayed here. You can talk about all the super-stars you want to, this is the most superlative of them all! You name it, she can do it. Comedy, pathos, sultry seduction, ballads, Barbra is truly a story-teller. Her ability to pull off anything she attempts is legendary. But this special was made in the beginning, and helped to create the legend that she quickly became. In spite of rising so far in such a short time, she has fulfilled the promise, revealing more of her talents as she went along. But they are all here from the very beginning. You will not be disappointed in viewing this.'
Label: 1
Review: b"Firstly, I would like to point out that people who have criticised this film have made some glaring errors. Anything that has a rating below 6/10 is clearly utter nonsense. Creep is an absolutely fantastic film with amazing film effects. The actors are highly believable, the narrative thought provoking and the horror and graphical content extremely disturbing. There is much mystique in this film. Many questions arise as the audience are revealed to the strange and freakish creature that makes habitat in the dark rat ridden tunnels. How was 'Craig' created and what happened to him? A fantastic film with a large chill factor. A film with so many unanswered questions and a film that needs to be appreciated along with others like 28 Days Later, The Bunker, Dog Soldiers and Deathwatch. Look forward to more of these fantastic films!!"
Label: 1
Review: b"I'm sorry but I didn't like this doc very much. I can think of a million ways it could have been better. The people who made it obviously don't have much imagination. The interviews aren't very interesting and no real insight is offered. The footage isn't assembled in a very informative way, either. It's too bad because this is a movie that really deserves spellbinding special features. One thing I'll say is that Isabella Rosselini gets more beautiful the older she gets. All considered, this only gets a '4.'"
Label: 0

You can now preprocess the data and train a model as before.

Prepare the dataset for training

vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)

def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)

# Configure datasets for performance as before.
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)
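
With `output_sequence_length` set, every vectorized review has the same length: longer inputs are truncated and shorter ones are padded with zeros. The sketch below illustrates that rule in pure Python; `pad_or_truncate` and the small `MAX_LEN` value are hypothetical and chosen only for readability.

```python
def pad_or_truncate(token_ids, length, pad_value=0):
    """Return exactly `length` IDs: cut long inputs, right-pad short ones."""
    clipped = token_ids[:length]
    return clipped + [pad_value] * (length - len(clipped))


MAX_LEN = 5  # Hypothetical; the tutorial uses MAX_SEQUENCE_LENGTH.
print(pad_or_truncate([12, 7, 99], MAX_LEN))            # [12, 7, 99, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], MAX_LEN))  # [1, 2, 3, 4, 5]
```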

Create, configure and train the model

model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_2 (Embedding)     (None, None, 64)          640064    
                                                                 
 conv1d_2 (Conv1D)           (None, None, 64)          20544     
                                                                 
 global_max_pooling1d_2 (Glo  (None, 64)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 660,673
Trainable params: 660,673
Non-trainable params: 0
_________________________________________________________________

model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

history = model.fit(train_ds, validation_data=val_ds, epochs=3)

Epoch 1/3
313/313 [==============================] - 3s 7ms/step - loss: 0.5417 - accuracy: 0.6618 - val_loss: 0.3752 - val_accuracy: 0.8244
Epoch 2/3
313/313 [==============================] - 1s 4ms/step - loss: 0.2996 - accuracy: 0.8680 - val_loss: 0.3165 - val_accuracy: 0.8632
Epoch 3/3
313/313 [==============================] - 1s 4ms/step - loss: 0.1845 - accuracy: 0.9276 - val_loss: 0.3217 - val_accuracy: 0.8674

loss, accuracy = model.evaluate(val_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

79/79 [==============================] - 0s 2ms/step - loss: 0.3217 - accuracy: 0.8674
Loss:  0.32172858715057373
Accuracy: 86.74%
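
The step counts in the logs above (313 for training, 79 for validation) follow from the 80/20 split and the batch size. A quick arithmetic check, assuming the IMDB `train` split holds 25,000 reviews and `BATCH_SIZE` is 64 as defined earlier:

```python
import math

TRAIN_EXAMPLES = 25000  # Assumed size of the IMDB `train` split.
BATCH_SIZE = 64

train_count = TRAIN_EXAMPLES * 80 // 100  # 'train[:80%]'
val_count = TRAIN_EXAMPLES - train_count  # 'train[80%:]'

# The last batch may be partial, hence the ceiling.
train_steps = math.ceil(train_count / BATCH_SIZE)
val_steps = math.ceil(val_count / BATCH_SIZE)

print(train_count, train_steps)  # 20000 313
print(val_count, val_steps)      # 5000 79
```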

Export the model

export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]

predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]

for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label)

Question:  This is a fantastic movie.
Predicted label:  1
Question:  This is a bad movie.
Predicted label:  0
Question:  This movie was so bad that it was good.
Predicted label:  0
Question:  I will never say yes to watching this movie.
Predicted label:  0
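
The exported model outputs one sigmoid score per review, and the label is obtained by thresholding that score at 0.5. A minimal sketch of that decision rule, using made-up logits in place of the output of the final `Dense(1)` layer:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def to_label(logit, threshold=0.5):
    # 1 --> positive review, 0 --> negative review
    return 1 if sigmoid(logit) >= threshold else 0


# Hypothetical logits from the final Dense(1) layer.
for logit in [3.1, -2.4, -0.2]:
    print(logit, round(sigmoid(logit), 3), to_label(logit))
```

Since sigmoid(0) equals 0.5, thresholding the score at 0.5 is equivalent to checking the sign of the logit.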

Conclusion

This tutorial demonstrated several ways to load and preprocess text. As a next step, you can explore additional TensorFlow Text preprocessing tutorials.

You can also find new datasets on TensorFlow Datasets. To learn more about tf.data, check out the guide on building input pipelines.
