Text classification with an RNN

This text classification tutorial trains a recurrent neural network on the IMDB large movie review dataset for sentiment analysis.

Setup

import numpy as np

import tensorflow_datasets as tfds
import tensorflow as tf

tfds.disable_progress_bar()

Import matplotlib and create a helper function to plot graphs:

import matplotlib.pyplot as plt


def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

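Set up the input pipeline

The IMDB large movie review dataset is a binary classification dataset: every review has either a positive or a negative sentiment.

Download the dataset using TFDS. See the loading text tutorial for details on how to load this sort of data manually.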

dataset, info = tfds.load('imdb_reviews', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

train_dataset.element_spec

(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

Initially, this returns a dataset of (text, label) pairs:

for example, label in train_dataset.take(1):
  print('text: ', example.numpy())
  print('label: ', label.numpy())

text:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
label:  0

Next, shuffle the data for training and create batches of these (text, label) pairs:

BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

for example, label in train_dataset.take(1):
  print('texts: ', example.numpy()[:3])
  print()
  print('labels: ', label.numpy()[:3])

texts: [b'This is arguably the worst film I have ever seen, and I have quite an appetite for awful (and good) movies. It could (just) have managed a kind of adolescent humour if it had been consistently tongue-in-cheek --\xc3\xa0 la ROCKY HORROR PICTURE SHOW, which was really very funny. Other movies, like PLAN NINE FROM OUTER SPACE, manage to be funny while (apparently) trying to be serious. As to the acting, it looks like they rounded up brain-dead teenagers and asked them to ad-lib the whole production. Compared to them, Tom Cruise looks like Alec Guinness. There was one decent interpretation -- that of the older ghoul-busting broad on the motorcycle.'
b"I saw this film in the worst possible circumstance. I'd already missed 15 minutes when I woke up to it on an international flight between Sydney and Seoul. I didn't know what I was watching, I thought maybe it was a movie of the week, but quickly became riveted by the performance of the lead actress playing a young woman who's child had been kidnapped. The premise started taking twist and turns I didn't see coming and by the end credits I was scrambling through the the in-flight guide to figure out what I had just watched. Turns out I was belatedly discovering Do-yeon Jeon who'd won Best Actress at Cannes for the role. I don't know if Secret Sunshine is typical of Korean cinema but I'm off to the DVD store to discover more."
b"Hello. I am Paul Raddick, a.k.a. Panic Attack of WTAF, Channel 29 in Philadelphia. Let me tell you about this god awful movie that powered on Adam Sandler's film career but was digitized after a short time.<br /><br />Going Overboard is about an aspiring comedian played by Sandler who gets a job on a cruise ship and fails...or so I thought. Sandler encounters babes that like History of the World Part 1 and Rebound. The babes were supposed to be engaged, but, actually, they get executed by Sawtooth, the meanest cannibal the world has ever known. Adam Sandler fared bad in Going Overboard, but fared better in Big Daddy, Billy Madison, and Jen Leone's favorite, 50 First Dates. Man, Drew Barrymore was one hot chick. Spanglish is red hot, Going Overboard ain't Dooley squat! End of file."]

labels: [0 1 0]
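
As a rough illustration of what shuffle(BUFFER_SIZE) and batch(BATCH_SIZE) do to the stream of examples, the pure-Python sketch below mimics a shuffle buffer followed by fixed-size batching on a toy range. The helper names shuffle_buffer and batch_items are hypothetical, and this is a simplification of the tf.data behavior, not its implementation.

```python
import random


def shuffle_buffer(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        # Once the buffer is full, emit one randomly chosen element.
        if len(buffer) == buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    # Drain whatever is left once the input is exhausted.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


def batch_items(items, batch_size):
    """Group items into lists of batch_size; the last batch may be shorter."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


batches = list(batch_items(shuffle_buffer(range(10), buffer_size=4), batch_size=3))
print([len(b) for b in batches])  # [3, 3, 3, 1]
```

A smaller buffer keeps elements closer to their original position, which is why the buffer size is usually chosen to be large relative to the dataset.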

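Create the text encoder

The raw text loaded by tfds needs to be processed before it can be used in a model. The simplest way to process text for training is to use the TextVectorization layer. This layer has many capabilities, but this tutorial sticks to the default behavior.

Create the layer, and pass the dataset's text to the layer's .adapt method: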

VOCAB_SIZE = 1000
encoder = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))

The .adapt method sets the layer's vocabulary. Here are the first 20 tokens. After the padding and unknown tokens, they are sorted by frequency:

vocab = np.array(encoder.get_vocabulary())
vocab[:20]

array(['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i',
       'this', 'that', 'br', 'was', 'as', 'for', 'with', 'movie', 'but'],
      dtype='<U14')

Once the vocabulary is set, the layer can encode text into indices. The tensors of indices are 0-padded to the longest sequence in the batch (unless you set a fixed output_sequence_length):

encoded_example = encoder(example)[:3].numpy()
encoded_example

array([[ 11,   7,   1, ...,   0,   0,   0],
       [ 10, 208,  11, ...,   0,   0,   0],
       [  1,  10, 237, ...,   0,   0,   0]])
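
To make the padding behavior concrete, here is a minimal NumPy sketch that pads variable-length index lists with zeros up to the longest sequence, or pads and truncates to a fixed length when one is given. The function pad_batch is a hypothetical helper written for illustration; it is not part of TensorFlow.

```python
import numpy as np


def pad_batch(sequences, output_sequence_length=None):
    """Right-pad (or truncate) integer sequences into one 2D array."""
    # Default: pad to the longest sequence in this batch.
    length = output_sequence_length or max(len(seq) for seq in sequences)
    batch = np.zeros((len(sequences), length), dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[:length]
        batch[row, :len(trimmed)] = trimmed
    return batch


print(pad_batch([[11, 7, 1], [10, 208]]))
# [[ 11   7   1]
#  [ 10 208   0]]
print(pad_batch([[11, 7, 1], [10, 208]], output_sequence_length=5).shape)  # (2, 5)
```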

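With the default settings, the process is not completely reversible. There are two main reasons for that:

- The default value for the standardize argument of TextVectorization is "lower_and_strip_punctuation".
- The limited vocabulary size and the lack of a character-based fallback result in some unknown tokens.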

for n in range(3):
  print("Original: ", example[n].numpy())
  print("Round-trip: ", " ".join(vocab[encoded_example[n]]))
  print()

Original:  b'This is arguably the worst film I have ever seen, and I have quite an appetite for awful (and good) movies. It could (just) have managed a kind of adolescent humour if it had been consistently tongue-in-cheek --\xc3\xa0 la ROCKY HORROR PICTURE SHOW, which was really very funny. Other movies, like PLAN NINE FROM OUTER SPACE, manage to be funny while (apparently) trying to be serious. As to the acting, it looks like they rounded up brain-dead teenagers and asked them to ad-lib the whole production. Compared to them, Tom Cruise looks like Alec Guinness. There was one decent interpretation -- that of the older ghoul-busting broad on the motorcycle.'
Round-trip:  this is [UNK] the worst film i have ever seen and i have quite an [UNK] for awful and good movies it could just have [UNK] a kind of [UNK] [UNK] if it had been [UNK] [UNK] [UNK] la [UNK] horror picture show which was really very funny other movies like [UNK] [UNK] from [UNK] space [UNK] to be funny while apparently trying to be serious as to the acting it looks like they [UNK] up [UNK] [UNK] and [UNK] them to [UNK] the whole production [UNK] to them tom [UNK] looks like [UNK] [UNK] there was one decent [UNK] that of the older [UNK] [UNK] on the [UNK]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               

Original:  b"I saw this film in the worst possible circumstance. I'd already missed 15 minutes when I woke up to it on an international flight between Sydney and Seoul. I didn't know what I was watching, I thought maybe it was a movie of the week, but quickly became riveted by the performance of the lead actress playing a young woman who's child had been kidnapped. The premise started taking twist and turns I didn't see coming and by the end credits I was scrambling through the the in-flight guide to figure out what I had just watched. Turns out I was belatedly discovering Do-yeon Jeon who'd won Best Actress at Cannes for the role. I don't know if Secret Sunshine is typical of Korean cinema but I'm off to the DVD store to discover more."
Round-trip:  i saw this film in the worst possible [UNK] id already [UNK] [UNK] minutes when i [UNK] up to it on an [UNK] [UNK] between [UNK] and [UNK] i didnt know what i was watching i thought maybe it was a movie of the [UNK] but quickly became [UNK] by the performance of the lead actress playing a young woman whos child had been [UNK] the premise started taking twist and turns i didnt see coming and by the end credits i was [UNK] through the the [UNK] [UNK] to figure out what i had just watched turns out i was [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] best actress at [UNK] for the role i dont know if secret [UNK] is typical of [UNK] cinema but im off to the dvd [UNK] to [UNK] more                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     

Original:  b"Hello. I am Paul Raddick, a.k.a. Panic Attack of WTAF, Channel 29 in Philadelphia. Let me tell you about this god awful movie that powered on Adam Sandler's film career but was digitized after a short time.<br /><br />Going Overboard is about an aspiring comedian played by Sandler who gets a job on a cruise ship and fails...or so I thought. Sandler encounters babes that like History of the World Part 1 and Rebound. The babes were supposed to be engaged, but, actually, they get executed by Sawtooth, the meanest cannibal the world has ever known. Adam Sandler fared bad in Going Overboard, but fared better in Big Daddy, Billy Madison, and Jen Leone's favorite, 50 First Dates. Man, Drew Barrymore was one hot chick. Spanglish is red hot, Going Overboard ain't Dooley squat! End of file."
Round-trip:  [UNK] i am paul [UNK] [UNK] [UNK] [UNK] of [UNK] [UNK] [UNK] in [UNK] let me tell you about this god awful movie that [UNK] on [UNK] [UNK] film career but was [UNK] after a short [UNK] br going [UNK] is about an [UNK] [UNK] played by [UNK] who gets a job on a [UNK] [UNK] and [UNK] so i thought [UNK] [UNK] [UNK] that like history of the world part 1 and [UNK] the [UNK] were supposed to be [UNK] but actually they get [UNK] by [UNK] the [UNK] [UNK] the world has ever known [UNK] [UNK] [UNK] bad in going [UNK] but [UNK] better in big [UNK] [UNK] [UNK] and [UNK] [UNK] favorite [UNK] first [UNK] man [UNK] [UNK] was one hot [UNK] [UNK] is red hot going [UNK] [UNK] [UNK] [UNK] end of [UNK]
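
The same two effects can be reproduced in isolation with a few lines of plain Python. This is only a sketch: standardize is a simplified stand-in for the "lower_and_strip_punctuation" mode, and toy_vocab, encode_text, and decode_ids are hypothetical names chosen for this example.

```python
import re
import string


def standardize(text):
    """Lowercase the text and delete ASCII punctuation (simplified)."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())


def encode_text(text, vocab):
    """Map whitespace-separated tokens to indices; index 1 means unknown."""
    index = {token: i for i, token in enumerate(vocab)}
    return [index.get(token, 1) for token in standardize(text).split()]


def decode_ids(ids, vocab):
    """Map indices back to tokens; the lost information stays lost."""
    return " ".join(vocab[i] for i in ids)


# Index 0 is reserved for padding and index 1 for unknown tokens.
toy_vocab = ["", "[UNK]", "this", "movie", "was"]
original = "This movie was GREAT, wasn't it?"
ids = encode_text(original, toy_vocab)
round_trip = decode_ids(ids, toy_vocab)

print(ids)         # [2, 3, 4, 1, 1, 1]
print(round_trip)  # this movie was [UNK] [UNK] [UNK]
```

Case and punctuation are removed before the lookup, and every out-of-vocabulary word collapses to the same index, so neither can be recovered from the encoded form.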

Create the model

[Figure: A drawing of the information flow in the model]

Above is a diagram of the model.

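This model can be built as a tf.keras.Sequential.
The first layer is the encoder, which converts the text to a sequence of token indices.
After the encoder is an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.
This index lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer.
A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.
The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. This propagates the input forward and backward through the RNN layer and then concatenates the final output.
- The main advantage of a bidirectional RNN is that the signal from the beginning of the input doesn't need to be processed all the way through every timestep to affect the output.
- The main disadvantage of a bidirectional RNN is that you can't efficiently stream predictions as words are being added to the end.
After the RNN has converted the sequence to a single vector, the two layers.Dense do some final processing and convert from this vector representation to a single logit as the classification output.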
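
The claim that an embedding lookup is equivalent to pushing one-hot vectors through a dense layer (without bias) can be checked with a small NumPy sketch. The sizes and the random embedding_matrix below are arbitrary values assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4

# One trainable vector per vocabulary entry (random here, for illustration).
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))
token_ids = np.array([2, 5, 1])

# Embedding layer: a plain row lookup.
looked_up = embedding_matrix[token_ids]

# Equivalent dense computation: one-hot rows times the weight matrix.
one_hot = np.eye(vocab_size)[token_ids]
via_matmul = one_hot @ embedding_matrix

print(looked_up.shape)                     # (3, 4)
print(np.allclose(looked_up, via_matmul))  # True
```

The lookup touches only the selected rows, while the matrix product multiplies by a mostly-zero one-hot matrix whose width grows with the vocabulary size.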

The code to implement this is below:

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
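
To see what the Bidirectional wrapper contributes, the NumPy sketch below runs a toy tanh RNN once over the sequence and once over its reverse, each with its own weights, and concatenates the two final states. The cell is a plain tanh recurrence rather than an LSTM, and all names and sizes are assumptions made for this example.

```python
import numpy as np


def rnn_last_state(inputs, w_in, w_rec):
    """Run a simple tanh RNN over (timesteps, features); return the last state."""
    state = np.zeros(w_rec.shape[0])
    for x_t in inputs:
        state = np.tanh(x_t @ w_in + state @ w_rec)
    return state


rng = np.random.default_rng(0)
timesteps, features, units = 5, 3, 4
x = rng.normal(size=(timesteps, features))

# Each direction gets its own set of weights.
fwd_w_in, fwd_w_rec = rng.normal(size=(features, units)), rng.normal(size=(units, units))
bwd_w_in, bwd_w_rec = rng.normal(size=(features, units)), rng.normal(size=(units, units))

forward = rnn_last_state(x, fwd_w_in, fwd_w_rec)
backward = rnn_last_state(x[::-1], bwd_w_in, bwd_w_rec)  # reversed time order
bidirectional = np.concatenate([forward, backward])

print(bidirectional.shape)  # (8,)
```

By the same reasoning, wrapping a 64-unit recurrent layer with concatenation yields 128 output features per example.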

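Note that a Keras sequential model is used here because every layer in the model has a single input and produces a single output. If you want to use a stateful RNN layer, you may want to build your model with the Keras functional API or model subclassing so that you can retrieve and reuse the RNN layer states. See the Keras RNN guide for more details.

The embedding layer uses masking to handle the varying sequence lengths. All the layers after the Embedding support masking: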

print([layer.supports_masking for layer in model.layers])

[False, True, True, True, True]

To confirm that this works as expected, evaluate a sentence twice. First, alone, so there is no padding to mask:

# predict on a sample text without padding.

sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print(predictions[0])

[-0.00012211]

Now, evaluate it again in a batch with a longer sentence. The result should be identical:

# predict on a sample text with padding

padding = "the " * 2000
predictions = model.predict(np.array([sample_text, padding]))
print(predictions[0])

[-0.00012211]
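
The reason the two predictions match can be sketched in NumPy: when padded timesteps are masked, the recurrent state is simply carried over, so trailing zeros do not change the final state. This is a toy tanh RNN with hypothetical names, meant to illustrate the idea rather than reproduce the Keras implementation.

```python
import numpy as np


def masked_rnn_last_state(inputs, mask, w_in, w_rec):
    """Simple tanh RNN that skips timesteps where mask is False."""
    state = np.zeros(w_rec.shape[0])
    for x_t, keep in zip(inputs, mask):
        if keep:
            state = np.tanh(x_t @ w_in + state @ w_rec)
        # Masked steps leave the state untouched.
    return state


rng = np.random.default_rng(0)
embed = rng.normal(size=(10, 3))  # toy embedding table; index 0 is padding
w_in = rng.normal(size=(3, 4))
w_rec = rng.normal(size=(4, 4))

short = np.array([4, 7, 2])
padded = np.array([4, 7, 2, 0, 0, 0])  # same tokens plus zero padding

out_short = masked_rnn_last_state(embed[short], short != 0, w_in, w_rec)
out_padded = masked_rnn_last_state(embed[padded], padded != 0, w_in, w_rec)

print(np.allclose(out_short, out_padded))  # True
```

Without the mask, the padded positions would be processed like real tokens and the final state would generally differ.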

Compile the Keras model to configure the training process:

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
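
Because the final Dense layer outputs a raw logit, the loss is configured with from_logits=True. As a sketch of what that computes, the NumPy function below evaluates binary cross-entropy directly from logits using a commonly used numerically stable form and compares it with the naive sigmoid-then-log version. The function name and the sample values are assumptions for illustration.

```python
import numpy as np


def bce_from_logits(logits, labels):
    """Mean binary cross-entropy computed directly from logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    # Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))
    per_example = (np.maximum(logits, 0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_example.mean()


logits = np.array([2.0, -1.0, 0.5])
labels = np.array([1.0, 0.0, 1.0])

stable = bce_from_logits(logits, labels)

# Naive reference: apply the sigmoid first, then take logs.
probs = 1.0 / (1.0 + np.exp(-logits))
naive = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(np.isclose(stable, naive))  # True
```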

Train the model

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)

Epoch 1/10
391/391 [==============================] - 39s 84ms/step - loss: 0.6454 - accuracy: 0.5630 - val_loss: 0.4888 - val_accuracy: 0.7568
Epoch 2/10
391/391 [==============================] - 30s 75ms/step - loss: 0.3925 - accuracy: 0.8200 - val_loss: 0.3663 - val_accuracy: 0.8464
Epoch 3/10
391/391 [==============================] - 30s 75ms/step - loss: 0.3319 - accuracy: 0.8525 - val_loss: 0.3402 - val_accuracy: 0.8385
Epoch 4/10
391/391 [==============================] - 30s 75ms/step - loss: 0.3183 - accuracy: 0.8616 - val_loss: 0.3289 - val_accuracy: 0.8438
Epoch 5/10
391/391 [==============================] - 30s 75ms/step - loss: 0.3088 - accuracy: 0.8656 - val_loss: 0.3254 - val_accuracy: 0.8646
Epoch 6/10
391/391 [==============================] - 32s 81ms/step - loss: 0.3043 - accuracy: 0.8686 - val_loss: 0.3242 - val_accuracy: 0.8521
Epoch 7/10
391/391 [==============================] - 30s 76ms/step - loss: 0.3019 - accuracy: 0.8696 - val_loss: 0.3315 - val_accuracy: 0.8609
Epoch 8/10
391/391 [==============================] - 32s 76ms/step - loss: 0.3007 - accuracy: 0.8688 - val_loss: 0.3245 - val_accuracy: 0.8609
Epoch 9/10
391/391 [==============================] - 31s 77ms/step - loss: 0.2981 - accuracy: 0.8707 - val_loss: 0.3294 - val_accuracy: 0.8599
Epoch 10/10
391/391 [==============================] - 31s 78ms/step - loss: 0.2969 - accuracy: 0.8742 - val_loss: 0.3218 - val_accuracy: 0.8547

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

391/391 [==============================] - 15s 38ms/step - loss: 0.3185 - accuracy: 0.8582
Test Loss: 0.3184521794319153
Test Accuracy: 0.8581600189208984

plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)

(0.0, 0.6627909764647484)

[Figure: training and validation accuracy (left) and loss (right) by epoch]

Run a prediction on a new sentence:

If the prediction is >= 0.0, it is positive; otherwise it is negative.

sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
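
Since the model outputs a logit, the 0.0 threshold corresponds to a probability of 0.5 after a sigmoid. The small sketch below shows that conversion; logit_to_probability and label_from_logit are hypothetical helpers, and the sample logit is simply an illustrative negative value.

```python
import math


def logit_to_probability(logit):
    """Sigmoid: map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def label_from_logit(logit):
    """Apply the tutorial's rule: >= 0.0 is positive, otherwise negative."""
    return "positive" if logit >= 0.0 else "negative"


print(logit_to_probability(0.0))    # 0.5
print(label_from_logit(0.0))        # positive
print(label_from_logit(-1.6796288)) # negative
```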

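Stack two or more LSTM layers

Keras recurrent layers have two available modes that are controlled by the return_sequences constructor argument:

If False, it returns only the last output for each input sequence (a 2D tensor of shape (batch_size, output_features)). This is the default, used in the previous model.
If True, the full sequence of successive outputs for each timestep is returned (a 3D tensor of shape (batch_size, timesteps, output_features)).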
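
The two output shapes can be illustrated with a toy NumPy RNN. This is a sketch under simplifying assumptions (a tanh recurrence instead of an LSTM, random weights, and a hypothetical simple_rnn function); it only demonstrates the shape difference between the two modes.

```python
import numpy as np


def simple_rnn(inputs, w_in, w_rec, return_sequences=False):
    """Toy tanh RNN over inputs of shape (batch, timesteps, features)."""
    batch, timesteps, _ = inputs.shape
    state = np.zeros((batch, w_rec.shape[0]))
    outputs = []
    for t in range(timesteps):
        state = np.tanh(inputs[:, t] @ w_in + state @ w_rec)
        outputs.append(state)
    if return_sequences:
        # One output per timestep: (batch, timesteps, units).
        return np.stack(outputs, axis=1)
    # Only the last output: (batch, units).
    return state


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 3))
w_in = rng.normal(size=(3, 4))
w_rec = rng.normal(size=(4, 4))

last = simple_rnn(x, w_in, w_rec)
full = simple_rnn(x, w_in, w_rec, return_sequences=True)

print(last.shape)  # (2, 4)
print(full.shape)  # (2, 5, 4)
```

The last timestep of the full sequence equals the single output returned in the default mode, and the 3D result has the same layout as the layer's input, which is what allows another recurrent layer to consume it.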

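Here is what the flow of information looks like with return_sequences=True:

[Figure: layered_bidirectional]

The interesting thing about using an RNN with return_sequences=True is that the output still has 3 axes, like the input, so it can be passed to another RNN layer, like this: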

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64,  return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)

Epoch 1/10
391/391 [==============================] - 71s 149ms/step - loss: 0.6502 - accuracy: 0.5625 - val_loss: 0.4923 - val_accuracy: 0.7573
Epoch 2/10
391/391 [==============================] - 55s 138ms/step - loss: 0.4067 - accuracy: 0.8198 - val_loss: 0.3727 - val_accuracy: 0.8271
Epoch 3/10
391/391 [==============================] - 54s 136ms/step - loss: 0.3417 - accuracy: 0.8543 - val_loss: 0.3343 - val_accuracy: 0.8510
Epoch 4/10
391/391 [==============================] - 53s 134ms/step - loss: 0.3242 - accuracy: 0.8607 - val_loss: 0.3268 - val_accuracy: 0.8568
Epoch 5/10
391/391 [==============================] - 53s 135ms/step - loss: 0.3174 - accuracy: 0.8652 - val_loss: 0.3213 - val_accuracy: 0.8516
Epoch 6/10
391/391 [==============================] - 52s 132ms/step - loss: 0.3098 - accuracy: 0.8671 - val_loss: 0.3294 - val_accuracy: 0.8547
Epoch 7/10
391/391 [==============================] - 53s 134ms/step - loss: 0.3063 - accuracy: 0.8697 - val_loss: 0.3158 - val_accuracy: 0.8594
Epoch 8/10
391/391 [==============================] - 52s 132ms/step - loss: 0.3043 - accuracy: 0.8692 - val_loss: 0.3184 - val_accuracy: 0.8521
Epoch 9/10
391/391 [==============================] - 53s 133ms/step - loss: 0.3016 - accuracy: 0.8704 - val_loss: 0.3208 - val_accuracy: 0.8609
Epoch 10/10
391/391 [==============================] - 54s 136ms/step - loss: 0.2975 - accuracy: 0.8740 - val_loss: 0.3301 - val_accuracy: 0.8651

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

391/391 [==============================] - 26s 65ms/step - loss: 0.3293 - accuracy: 0.8646
Test Loss: 0.329334557056427
Test Accuracy: 0.8646399974822998

# predict on a sample text without padding.

sample_text = ('The movie was not good. The animation and the graphics '
               'were terrible. I would not recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print(predictions)

[[-1.6796288]]

plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')

[Figure: training and validation accuracy (left) and loss (right) by epoch]

Check out other existing recurrent layers, such as GRU layers.

If you're interested in building custom RNNs, see the Keras RNN guide.
