케라스와 텐서플로 허브를 사용한 영화 리뷰 텍스트 분류하기

TensorFlow.org에서 보기 구글 코랩(Colab)에서 실행하기 깃허브(GitHub) 소스 보기

이 노트북은 영화 리뷰(review) 텍스트를 긍정(positive) 또는 부정(negative)으로 분류합니다. 이 예제는 이진(binary)-또는 클래스(class)가 두 개인- 분류 문제입니다. 이진 분류는 머신러닝에서 중요하고 널리 사용됩니다.

이 튜토리얼에서는 텐서플로 허브(TensorFlow Hub)와 케라스(Keras)를 사용한 기초적인 전이 학습(transfer learning) 애플리케이션을 보여줍니다.

여기에서는 인터넷 영화 데이터베이스(Internet Movie Database)에서 수집한 50,000개의 영화 리뷰 텍스트를 담은 IMDB 데이터셋을 사용하겠습니다. 25,000개 리뷰는 훈련용으로, 25,000개는 테스트용으로 나뉘어져 있습니다. 훈련 세트와 테스트 세트의 클래스는 균형이 잡혀 있습니다. 즉 긍정적인 리뷰와 부정적인 리뷰의 개수가 동일합니다.

이 노트북은 텐서플로에서 모델을 만들고 훈련하기 위한 고수준 파이썬 API인 tf.keras와 전이 학습 라이브러리이자 플랫폼인 텐서플로 허브를 사용합니다. tf.keras를 사용한 고급 텍스트 분류 튜토리얼은 MLCC 텍스트 분류 가이드를 참고하세요.

from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np

import tensorflow as tf

import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("버전: ", tf.__version__)
print("즉시 실행 모드: ", tf.executing_eagerly())
print("허브 버전: ", hub.__version__)
print("GPU ", "사용 가능" if tf.config.experimental.list_physical_devices("GPU") else "사용 불가능")
버전:  2.1.0
즉시 실행 모드:  True
허브 버전:  0.7.0
GPU  사용 가능

IMDB 데이터셋 다운로드

IMDB 데이터셋은 텐서플로 데이터셋(TensorFlow datasets)에 포함되어 있습니다. 다음 코드는 IMDB 데이터셋을 컴퓨터(또는 코랩 런타임)에 다운로드합니다:

# 훈련 세트를 6대 4로 나눕니다.
# 결국 훈련에 15,000개 샘플, 검증에 10,000개 샘플, 테스트에 25,000개 샘플을 사용하게 됩니다.
train_validation_split = tfds.Split.TRAIN.subsplit([6, 4])

(train_data, validation_data), test_data = tfds.load(
    name="imdb_reviews", 
    split=(train_validation_split, tfds.Split.TEST),
    as_supervised=True)
Downloading and preparing dataset imdb_reviews (80.23 MiB) to /home/kbuilder/tensorflow_datasets/imdb_reviews/plain_text/0.1.0...

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Dl Completed...', max=1.0, style=Progre…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Dl Size...', max=1.0, style=ProgressSty…





HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))


HBox(children=(FloatProgress(value=0.0, description='Shuffling...', max=10.0, style=ProgressStyle(description_…
WARNING:tensorflow:From /home/kbuilder/.local/lib/python3.6/site-packages/tensorflow_datasets/core/file_format_adapter.py:210: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`

WARNING:tensorflow:From /home/kbuilder/.local/lib/python3.6/site-packages/tensorflow_datasets/core/file_format_adapter.py:210: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))


HBox(children=(FloatProgress(value=0.0, description='Shuffling...', max=10.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))


HBox(children=(FloatProgress(value=0.0, description='Shuffling...', max=20.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Reading...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=0.0, description='Writing...', max=2500.0, style=ProgressStyle(description_…
Dataset imdb_reviews downloaded and prepared to /home/kbuilder/tensorflow_datasets/imdb_reviews/plain_text/0.1.0. Subsequent calls will reuse this data.

데이터 탐색

잠시 데이터 형태를 알아 보죠. 이 데이터셋의 샘플은 전처리된 정수 배열입니다. 이 정수는 영화 리뷰에 나오는 단어를 나타냅니다. 레이블(label)은 정수 0 또는 1입니다. 0은 부정적인 리뷰이고 1은 긍정적인 리뷰입니다.

처음 10개의 샘플을 출력해 보겠습니다.

train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
train_examples_batch
<tf.Tensor: shape=(10,), dtype=string, numpy=
array([b"As a lifelong fan of Dickens, I have invariably been disappointed by adaptations of his novels.<br /><br />Although his works presented an extremely accurate re-telling of human life at every level in Victorian Britain, throughout them all was a pervasive thread of humour that could be both playful or sarcastic as the narrative dictated. In a way, he was a literary caricaturist and cartoonist. He could be serious and hilarious in the same sentence. He pricked pride, lampooned arrogance, celebrated modesty, and empathised with loneliness and poverty. It may be a clich\xc3\xa9, but he was a people's writer.<br /><br />And it is the comedy that is so often missing from his interpretations. At the time of writing, Oliver Twist is being dramatised in serial form on BBC television. All of the misery and cruelty is their, but non of the humour, irony, and savage lampoonery. The result is just a dark, dismal experience: the story penned by a journalist rather than a novelist. It's not really Dickens at all.<br /><br />'Oliver!', on the other hand, is much closer to the mark. The mockery of officialdom is perfectly interpreted, from the blustering beadle to the drunken magistrate. The classic stand-off between the beadle and Mr Brownlow, in which the law is described as 'a ass, a idiot' couldn't have been better done. Harry Secombe is an ideal choice.<br /><br />But the blinding cruelty is also there, the callous indifference of the state, the cold, hunger, poverty and loneliness are all presented just as surely as The Master would have wished.<br /><br />And then there is crime. Ron Moody is a treasure as the sleazy Jewish fence, whilst Oliver Reid has Bill Sykes to perfection.<br /><br />Perhaps not surprisingly, Lionel Bart - himself a Jew from London's east-end - takes a liberty with Fagin by re-interpreting him as a much more benign fellow than was Dicken's original. In the novel, he was utterly ruthless, sending some of his own boys to the gallows in order to protect himself (though he was also caught and hanged). Whereas in the movie, he is presented as something of a wayward father-figure, a sort of charitable thief rather than a corrupter of children, the latter being a long-standing anti-semitic sentiment. Otherwise, very few liberties are taken with Dickens's original. All of the most memorable elements are included. Just enough menace and violence is retained to ensure narrative fidelity whilst at the same time allowing for children' sensibilities. Nancy is still beaten to death, Bullseye narrowly escapes drowning, and Bill Sykes gets a faithfully graphic come-uppance.<br /><br />Every song is excellent, though they do incline towards schmaltz. Mark Lester mimes his wonderfully. Both his and my favourite scene is the one in which the world comes alive to 'who will buy'. It's schmaltzy, but it's Dickens through and through.<br /><br />I could go on. I could commend the wonderful set-pieces, the contrast of the rich and poor. There is top-quality acting from more British regulars than you could shake a stick at.<br /><br />I ought to give it 10 points, but I'm feeling more like Scrooge today. Soak it up with your Christmas dinner. No original has been better realised.",
       b"Oh yeah! Jenna Jameson did it again! Yeah Baby! This movie rocks. It was one of the 1st movies i saw of her. And i have to say i feel in love with her, she was great in this move.<br /><br />Her performance was outstanding and what i liked the most was the scenery and the wardrobe it was amazing you can tell that they put a lot into the movie the girls cloth were amazing.<br /><br />I hope this comment helps and u can buy the movie, the storyline is awesome is very unique and i'm sure u are going to like it. Jenna amazed us once more and no wonder the movie won so many awards. Her make-up and wardrobe is very very sexy and the girls on girls scene is amazing. specially the one where she looks like an angel. It's a must see and i hope u share my interests",
       b"I saw this film on True Movies (which automatically made me sceptical) but actually - it was good. Why? Not because of the amazing plot twists or breathtaking dialogue (of which there is little) but because actually, despite what people say I thought the film was accurate in it's depiction of teenagers dealing with pregnancy.<br /><br />It's NOT Dawson's Creek, they're not graceful, cool witty characters who breeze through sexuality with effortless knowledge. They're kids and they act like kids would. <br /><br />They're blunt, awkward and annoyingly confused about everything. Yes, this could be by accident and they could just be bad actors but I don't think so. Dermot Mulroney gives (when not trying to be cool) a very believable performance and I loved him for it. Patricia Arquette IS whiny and annoying, but she was pregnant and a teenagers? The combination of the two isn't exactly lavender on your pillow. The plot was VERY predictable and but so what? I believed them, his stress and inability to cope - her brave, yet slightly misguided attempts to bring them closer together. I think the characters, acted by anyone else, WOULD indeed have been annoying and unbelievable but they weren't. It reflects the surreality of the situation they're in, that he's sitting in class and she walks on campus with the baby. I felt angry at her for that, I felt angry at him for being such a child and for blaming her. I felt it all.<br /><br />In the end, I loved it and would recommend it.<br /><br />Watch out for the scene where Dermot Mulroney runs from the disastrous counselling session - career performance.",
       b'This was a wonderfully clever and entertaining movie that I shall never tire of watching many, many times. The casting was magnificent in matching up the young with the older characters. There are those of us out here who really do appreciate good actors and an intelligent story format. As for Judi Dench, she is beautiful and a gift to any kind of production in which she stars. I always make a point to see Judi Dench in all her performances. She is a superb actress and a pleasure to watch as each transformation of her character comes to life. I can only be grateful when I see such an outstanding picture for most of the motion pictures made more recently lack good characters, good scripts and good acting. The movie public needs heroes, not deviant manikins, who lack ingenuity and talent. How wonderful to see old favorites like Leslie Caron, Olympia Dukakis and Cleo Laine. I would like to see this movie win the awards it deserves. Thank you again for a tremendous night of entertainment. I congratulate the writer, director, producer, and all those who did such a fine job.',
       b'I have no idea what the other reviewer is talking about- this was a wonderful movie, and created a sense of the era that feels like time travel. The characters are truly young, Mary is a strong match for Byron, Claire is juvenile and a tad annoying, Polidori is a convincing beaten-down sycophant... all are beautiful, curious, and decadent... not the frightening wrecks they are in Gothic.<br /><br />Gothic works as an independent piece of shock film, and I loved it for different reasons, but this works like a Merchant and Ivory film, and was from my readings the best capture of what the summer must have felt like. Romantic, yes, but completely rekindles my interest in the lives of Shelley and Byron every time I think about the film. One of my all-time favorites.',
       b"This was soul-provoking! I am an Iranian, and living in th 21st century, I didn't know that such big tribes have been living in such conditions at the time of my grandfather!<br /><br />You see that today, or even in 1925, on one side of the world a lady or a baby could have everything served for him or her clean and on-demand, but here 80 years ago, people ventured their life to go to somewhere with more grass. It's really interesting that these Persians bear those difficulties to find pasture for their sheep, but they lose many the sheep on their way.<br /><br />I praise the Americans who accompanied this tribe, they were as tough as Bakhtiari people.",
       b'Just because someone is under the age of 10 does not mean they are stupid. If your child likes this film you\'d better have him/her tested. I am continually amazed at how so many people can be involved in something that turns out so bad. This "film" is a showcase for digital wizardry AND NOTHING ELSE. The writing is horrid. I can\'t remember when I\'ve heard such bad dialogue. The songs are beyond wretched. The acting is sub-par but then the actors were not given much. Who decided to employ Joey Fatone? He cannot sing and he is ugly as sin.<br /><br />The worst thing is the obviousness of it all. It is as if the writers went out of their way to make it all as stupid as possible. Great children\'s movies are wicked, smart and full of wit - films like Shrek and Toy Story in recent years, Willie Wonka and The Witches to mention two of the past. But in the continual dumbing-down of American more are flocking to dreck like Finding Nemo (yes, that\'s right), the recent Charlie & The Chocolate Factory and eye-crossing trash like Red Riding Hood.',
       b"I absolutely LOVED this movie when I was a kid. I cried every time I watched it. It wasn't weird to me. I totally identified with the characters. I would love to see it again (and hope I wont be disappointed!). Pufnstuf rocks!!!! I was really drawn in to the fantasy world. And to me the movie was loooong. I wonder if I ever saw the series and have confused them? The acting I thought was strong. I loved Jack Wilde. He was so dreamy to an 10 year old (when I first saw the movie, not in 1970. I can still remember the characters vividly. The flute was totally believable and I can still 'feel' the evil woods. Witchy poo was scary - I wouldn't want to cross her path.",
       b'A very close and sharp discription of the bubbling and dynamic emotional world of specialy one 18year old guy, that makes his first experiences in his gay love to an other boy, during an vacation with a part of his family.<br /><br />I liked this film because of his extremly clear and surrogated storytelling , with all this "Sound-close-ups" and quiet moments wich had been full of intensive moods.<br /><br />',
       b"This is the most depressing film I have ever seen. I first saw it as a child and even thinking about it now really upsets me. I know it was set in a time when life was hard and I know these people were poor and the crops were vital. Yes, I get all that. What I find hard to take is I can't remember one single light moment in the entire film. Maybe it was true to life, I don't know. I'm quite sure the acting was top notch and the direction and quality of filming etc etc was wonderful and I know that every film can't have a happy ending but as a family film it is dire in my opinion.<br /><br />I wouldn't recommend it to anyone who wants to be entertained by a film. I can't stress enough how this film affected me as a child. I was talking about it recently and all the sad memories came flooding back. I think it would have all but the heartless reaching for the Prozac."],
      dtype=object)>

처음 10개의 레이블도 출력해 보겠습니다.

train_labels_batch
<tf.Tensor: shape=(10,), dtype=int64, numpy=array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0])>

모델 구성

신경망은 층을 쌓아서 만듭니다. 여기에는 세 개의 중요한 구조적 결정이 필요합니다:

  • 어떻게 텍스트를 표현할 것인가?
  • 모델에서 얼마나 많은 층을 사용할 것인가?
  • 각 층에서 얼마나 많은 은닉 유닛(hidden unit)을 사용할 것인가?

이 예제의 입력 데이터는 문장으로 구성됩니다. 예측할 레이블은 0 또는 1입니다.

텍스트를 표현하는 한 가지 방법은 문장을 임베딩(embedding) 벡터로 바꾸는 것입니다. 그러면 첫 번째 층으로 사전 훈련(pre-trained)된 텍스트 임베딩을 사용할 수 있습니다. 여기에는 두 가지 장점이 있습니다.

  • 텍스트 전처리에 대해 신경 쓸 필요가 없습니다.
  • 전이 학습의 장점을 이용합니다.
  • 임베딩은 고정 크기이기 때문에 처리 과정이 단순해집니다.

이 예제는 텐서플로 허브에 있는 사전 훈련된 텍스트 임베딩 모델google/tf2-preview/gnews-swivel-20dim/1을 사용하겠습니다.

테스트해 볼 수 있는 사전 훈련된 모델이 세 개 더 있습니다:

먼저 문장을 임베딩시키기 위해 텐서플로 허브 모델을 사용하는 케라스 층을 만들어 보죠. 그다음 몇 개의 샘플을 입력하여 테스트해 보겠습니다. 입력 텍스트의 길이에 상관없이 임베딩의 출력 크기는 (num_examples, embedding_dimension)가 됩니다.

embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])
<tf.Tensor: shape=(3, 20), dtype=float32, numpy=
array([[ 3.9819887 , -4.4838037 ,  5.177359  , -2.3643482 , -3.2938678 ,
        -3.5364532 , -2.4786978 ,  2.5525482 ,  6.688532  , -2.3076782 ,
        -1.9807833 ,  1.1315885 , -3.0339816 , -0.7604128 , -5.743445  ,
         3.4242578 ,  4.790099  , -4.03061   , -5.992149  , -1.7297493 ],
       [ 3.4232912 , -4.230874  ,  4.1488533 , -0.29553518, -6.802391  ,
        -2.5163853 , -4.4002395 ,  1.905792  ,  4.7512794 , -0.40538004,
        -4.3401685 ,  1.0361497 ,  0.9744097 ,  0.71507156, -6.2657013 ,
         0.16533905,  4.560262  , -1.3106939 , -3.1121316 , -2.1338716 ],
       [ 3.8508697 , -5.003031  ,  4.8700504 , -0.04324996, -5.893603  ,
        -5.2983093 , -4.004676  ,  4.1236343 ,  6.267754  ,  0.11632943,
        -3.5934832 ,  0.8023905 ,  0.56146765,  0.9192484 , -7.3066816 ,
         2.8202746 ,  6.2000837 , -3.5709393 , -4.564525  , -2.305622  ]],
      dtype=float32)>

이제 전체 모델을 만들어 보겠습니다:

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense (Dense)                (None, 16)                336       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________

순서대로 층을 쌓아 분류기를 만듭니다:

  1. 첫 번째 층은 텐서플로 허브 층입니다. 이 층은 사전 훈련된 모델을 사용하여 하나의 문장을 임베딩 벡터로 매핑합니다. 여기서 사용하는 사전 훈련된 텍스트 임베딩 모델(google/tf2-preview/gnews-swivel-20dim/1)은 하나의 문장을 토큰(token)으로 나누고 각 토큰의 임베딩을 연결하여 반환합니다. 최종 차원은 (num_examples, embedding_dimension)입니다.
  2. 이 고정 크기의 출력 벡터는 16개의 은닉 유닛(hidden unit)을 가진 완전 연결 층(Dense)으로 주입됩니다.
  3. 마지막 층은 하나의 출력 노드를 가진 완전 연결 층입니다. sigmoid 활성화 함수를 사용하므로 확률 또는 신뢰도 수준을 표현하는 0~1 사이의 실수가 출력됩니다.

이제 모델을 컴파일합니다.

손실 함수와 옵티마이저

모델이 훈련하려면 손실 함수(loss function)과 옵티마이저(optimizer)가 필요합니다. 이 예제는 이진 분류 문제이고 모델이 확률을 출력하므로(출력층의 유닛이 하나이고 sigmoid 활성화 함수를 사용합니다), binary_crossentropy 손실 함수를 사용하겠습니다.

다른 손실 함수를 선택할 수 없는 것은 아닙니다. 예를 들어 mean_squared_error를 선택할 수 있습니다. 하지만 일반적으로 binary_crossentropy가 확률을 다루는데 적합합니다. 이 함수는 확률 분포 간의 거리를 측정합니다. 여기에서는 정답인 타깃 분포와 예측 분포 사이의 거리입니다.

나중에 회귀(regression) 문제(예를 들어 주택 가격을 예측하는 문제)에 대해 살펴 볼 때 평균 제곱 오차(mean squared error) 손실 함수를 어떻게 사용하는지 알아 보겠습니다.

이제 모델이 사용할 옵티마이저와 손실 함수를 설정해 보죠:

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

모델 훈련

이 모델을 512개의 샘플로 이루어진 미니배치(mini-batch)에서 20번의 에포크(epoch) 동안 훈련합니다. x_trainy_train 텐서에 있는 모든 샘플에 대해 20번 반복한다는 뜻입니다. 훈련하는 동안 10,000개의 검증 세트에서 모델의 손실과 정확도를 모니터링합니다:

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)
Epoch 1/20
30/30 [==============================] - 4s 130ms/step - loss: 1.5591 - accuracy: 0.4405 - val_loss: 1.0599 - val_accuracy: 0.3768
Epoch 2/20
30/30 [==============================] - 3s 111ms/step - loss: 0.9176 - accuracy: 0.4269 - val_loss: 0.7931 - val_accuracy: 0.4886
Epoch 3/20
30/30 [==============================] - 3s 99ms/step - loss: 0.7420 - accuracy: 0.5209 - val_loss: 0.7006 - val_accuracy: 0.5663
Epoch 4/20
30/30 [==============================] - 3s 106ms/step - loss: 0.6761 - accuracy: 0.5927 - val_loss: 0.6489 - val_accuracy: 0.6284
Epoch 5/20
30/30 [==============================] - 3s 98ms/step - loss: 0.6265 - accuracy: 0.6584 - val_loss: 0.6059 - val_accuracy: 0.6829
Epoch 6/20
30/30 [==============================] - 3s 98ms/step - loss: 0.5861 - accuracy: 0.7048 - val_loss: 0.5720 - val_accuracy: 0.7164
Epoch 7/20
30/30 [==============================] - 3s 97ms/step - loss: 0.5484 - accuracy: 0.7400 - val_loss: 0.5406 - val_accuracy: 0.7444
Epoch 8/20
30/30 [==============================] - 3s 107ms/step - loss: 0.5126 - accuracy: 0.7693 - val_loss: 0.5093 - val_accuracy: 0.7716
Epoch 9/20
30/30 [==============================] - 3s 100ms/step - loss: 0.4764 - accuracy: 0.7943 - val_loss: 0.4785 - val_accuracy: 0.7917
Epoch 10/20
30/30 [==============================] - 3s 101ms/step - loss: 0.4408 - accuracy: 0.8188 - val_loss: 0.4499 - val_accuracy: 0.8062
Epoch 11/20
30/30 [==============================] - 3s 99ms/step - loss: 0.4089 - accuracy: 0.8336 - val_loss: 0.4222 - val_accuracy: 0.8206
Epoch 12/20
30/30 [==============================] - 3s 100ms/step - loss: 0.3767 - accuracy: 0.8531 - val_loss: 0.3991 - val_accuracy: 0.8306
Epoch 13/20
30/30 [==============================] - 3s 98ms/step - loss: 0.3466 - accuracy: 0.8659 - val_loss: 0.3778 - val_accuracy: 0.8417
Epoch 14/20
30/30 [==============================] - 3s 100ms/step - loss: 0.3220 - accuracy: 0.8764 - val_loss: 0.3586 - val_accuracy: 0.8469
Epoch 15/20
30/30 [==============================] - 3s 107ms/step - loss: 0.2973 - accuracy: 0.8877 - val_loss: 0.3440 - val_accuracy: 0.8547
Epoch 16/20
30/30 [==============================] - 3s 108ms/step - loss: 0.2758 - accuracy: 0.8976 - val_loss: 0.3285 - val_accuracy: 0.8607
Epoch 17/20
30/30 [==============================] - 3s 98ms/step - loss: 0.2564 - accuracy: 0.9055 - val_loss: 0.3182 - val_accuracy: 0.8639
Epoch 18/20
30/30 [==============================] - 3s 98ms/step - loss: 0.2377 - accuracy: 0.9126 - val_loss: 0.3107 - val_accuracy: 0.8674
Epoch 19/20
30/30 [==============================] - 3s 99ms/step - loss: 0.2239 - accuracy: 0.9189 - val_loss: 0.3043 - val_accuracy: 0.8702
Epoch 20/20
30/30 [==============================] - 3s 98ms/step - loss: 0.2090 - accuracy: 0.9246 - val_loss: 0.2987 - val_accuracy: 0.8727

모델 평가

모델의 성능을 확인해 보죠. 두 개의 값이 반환됩니다. 손실(오차를 나타내는 숫자이므로 낮을수록 좋습니다)과 정확도입니다.

results = model.evaluate(test_data.batch(512), verbose=2)
for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))
loss: 0.318
accuracy: 0.863

이 예제는 매우 단순한 방식을 사용하므로 87% 정도의 정확도를 달성했습니다. 고급 방법을 사용한 모델은 95%에 가까운 정확도를 얻습니다.

더 읽을거리

문자열 입력을 다루는 조금 더 일반적인 방법과 훈련 과정에서 정확도나 손실 변화에 대한 자세한 분석은 여기를 참고하세요.

#@title MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.