此页面由 Cloud Translation API 翻译。
Switch to English

Unicode字符串

在TensorFlow.org上查看 在Google Colab中运行 在GitHub上查看源代码 下载笔记本

介绍

处理自然语言的模型通常使用不同的字符集来处理不同的语言。 Unicode是一种标准的编码系统,用于表示几乎所有语言的字符。每个字符都使用介于00x10FFFF之间的唯一整数代码点进行编码。 Unicode字符串是零个或多个代码点的序列。

本教程说明了如何在TensorFlow中表示Unicode字符串,以及如何使用标准字符串ops的Unicode等效项来操作它们。它将基于脚本检测将Unicode字符串分成令牌。

import tensorflow as tf

tf.string数据类型

基本TensorFlow tf.string dtype允许你建立字节串的张量。 Unicode字符串默认情况下是utf-8编码的。

tf.constant(u"Thanks 😊")
<tf.Tensor: shape=(), dtype=string, numpy=b'Thanks \xf0\x9f\x98\x8a'>

tf.string张量可以容纳不同长度的字节串,因为字节串被视为原子单位。张量尺寸中不包括字符串长度。

tf.constant([u"You're", u"welcome!"]).shape
TensorShape([2])

表示Unicode

在TensorFlow中有两种表示Unicode字符串的标准方法:

  • string标量—使用已知的字符编码对代码点序列进行编码
  • int32向量—其中每个位置都包含一个代码点。

例如,以下三个值均表示Unicode字符串"语言处理" (在中文中表示“语言处理”):

# Unicode string, represented as a UTF-8 encoded string scalar.
text_utf8 = tf.constant(u"语言处理")
text_utf8
<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>
# Unicode string, represented as a UTF-16-BE encoded string scalar.
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
text_utf16be
<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>
# Unicode string, represented as a vector of Unicode code points.
text_chars = tf.constant([ord(char) for char in u"语言处理"])
text_chars
<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>

在制图表达之间转换

TensorFlow提供了在这些不同表示之间进行转换的操作:

tf.strings.unicode_decode(text_utf8,
                          input_encoding='UTF-8')
<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>
tf.strings.unicode_encode(text_chars,
                          output_encoding='UTF-8')
<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>
tf.strings.unicode_transcode(text_utf8,
                             input_encoding='UTF8',
                             output_encoding='UTF-16-BE')
<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>

批次尺寸

解码多个字符串时,每个字符串中的字符数可能不相等。返回结果是tf.RaggedTensor ,其中最里面的维的长度根据每个字符串中的字符数而变化:

# A batch of Unicode strings, each represented as a UTF8-encoded string.
batch_utf8 = [s.encode('UTF-8') for s in
              [u'hÃllo',  u'What is the weather tomorrow',  u'Göödnight', u'😊']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,
                                               input_encoding='UTF-8')
for sentence_chars in batch_chars_ragged.to_list():
  print(sentence_chars)
[104, 195, 108, 108, 111]
[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]
[71, 246, 246, 100, 110, 105, 103, 104, 116]
[128522]

您可以使用此tf.RaggedTensor直接,或将其转换为密tf.Tensor与填充或tf.SparseTensor使用方法tf.RaggedTensor.to_tensortf.RaggedTensor.to_sparse

batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_padded.numpy())
[[   104    195    108    108    111     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]
 [    87    104     97    116     32    105    115     32    116    104
     101     32    119    101     97    116    104    101    114     32
     116    111    109    111    114    114    111    119]
 [    71    246    246    100    110    105    103    104    116     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]
 [128522     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]]

batch_chars_sparse = batch_chars_ragged.to_sparse()

当编码具有相同长度的多个字符串时,可以将tf.Tensor用作输入:

tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [ 99, 111, 119]],
                          output_encoding='UTF-8')
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'cat', b'dog', b'cow'], dtype=object)>

在对多个长度可变的字符串进行编码时, tf.RaggedTensor用作输入:

tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')
<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

如果您的张量包含填充或稀疏格式的多个字符串, tf.RaggedTensor在调用unicode_encode之前将其转换为unicode_encode

tf.strings.unicode_encode(
    tf.RaggedTensor.from_sparse(batch_chars_sparse),
    output_encoding='UTF-8')
<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>
tf.strings.unicode_encode(
    tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),
    output_encoding='UTF-8')
<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

Unicode操作

字符长度

tf.strings.length操作具有参数unit ,该参数指示应如何计算长度。 unit默认为"BYTE" ,但可以将其设置为其他值,例如"UTF8_CHAR""UTF16_CHAR" ,以确定每个编码string中Unicode代码点的数量。

# Note that the final character takes up 4 bytes in UTF8.
thanks = u'Thanks 😊'.encode('UTF-8')
num_bytes = tf.strings.length(thanks).numpy()
num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()
print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))
11 bytes; 8 UTF-8 characters

字符子串

类似地, tf.strings.substr操作接受“ unit ”参数,并使用它来确定“ pos ”和“ len ”参数所包含的偏移量。

# default: unit='BYTE'. With len=1, we return a single byte.
tf.strings.substr(thanks, pos=7, len=1).numpy()
b'\xf0'
# Specifying unit='UTF8_CHAR', we return a single character, which in this case
# is 4 bytes.
print(tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy())
b'\xf0\x9f\x98\x8a'

拆分Unicode字符串

tf.strings.unicode_split操作将unicode字符串拆分为各个字符的子字符串:

tf.strings.unicode_split(thanks, 'UTF-8').numpy()
array([b'T', b'h', b'a', b'n', b'k', b's', b' ', b'\xf0\x9f\x98\x8a'],
      dtype=object)

字符的字节偏移

要将tf.strings.unicode_decode生成的字符张量与原始字符串对齐,了解每个字符开始位置的偏移量很有用。该方法tf.strings.unicode_decode_with_offsets类似于unicode_decode ,不同之处在于它返回一个包含启动每个字符的偏移的第二张量。

codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')

for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
  print("At byte offset {}: codepoint {}".format(offset, codepoint))
At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882

Unicode脚本

每个Unicode代码点都属于一个称为脚本的代码点的单个集合。角色的脚本有助于确定角色可能使用的语言。例如,知道西里尔字母为“Б”时,表示包含该角色的现代文本很可能来自斯拉夫语,例如俄语或乌克兰语。

TensorFlow提供tf.strings.unicode_script操作来确定给定代码点使用哪个脚本。脚本代码是与Unicode国际组件 (ICU) UScriptCode值相对应的int32值。

uscript = tf.strings.unicode_script([33464, 1041])  # ['芸', 'Б']

print(uscript.numpy())  # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]
[17  8]

tf.strings.unicode_script操作也可以应用于代码点的多维tf.Tensortf.RaggedTensor

print(tf.strings.unicode_script(batch_chars_ragged))
<tf.RaggedTensor [[25, 25, 25, 25, 25], [25, 25, 25, 25, 0, 25, 25, 0, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 25], [25, 25, 25, 25, 25, 25, 25, 25, 25], [0]]>

示例:简单细分

分段是将文本拆分为类似单词的单元的任务。当使用空格字符分隔单词时,这通常很容易,但是某些语言(例如中文和日语)不使用空格,而某些语言(例如德语)包含必须分解以分析其含义的长化合物。在Web文本中,经常将不同的语言和脚本混合在一起,例如“ NY株価”(纽约证券交易所)。

通过使用脚本中的更改来近似单词边界,我们可以执行非常粗略的细分(无需实现任何ML模型)。这将适用于上面的“ NY株価”示例之类的字符串。它也适用于大多数使用空格的语言,因为各种脚本的空格字符都归类为USCRIPT_COMMON,这是一种特殊的脚本代码,不同于任何实际的文本。

# dtype: string; shape: [num_sentences]
#
# The sentences to process.  Edit this line to try out different inputs!
sentence_texts = [u'Hello, world.', u'世界こんにちは']

首先,我们将句子解码为字符代码点,然后为每个字符找到脚本标识。

# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_codepoint[i, j] is the codepoint for the j'th character in
# the i'th sentence.
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)

# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_scripts[i, j] is the unicode script of the j'th character in
# the i'th sentence.
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)
<tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46], [19990, 30028, 12371, 12435, 12395, 12385, 12399]]>
<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0], [17, 17, 20, 20, 20, 20, 20]]>

接下来,我们使用这些脚本标识符来确定应在何处添加单词边界。我们在每个句子的开头以及脚本与前一个字符不同的每个字符处添加单词边界:

# dtype: bool; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_starts_word[i, j] is True if the j'th character in the i'th
# sentence is the start of a word.
sentence_char_starts_word = tf.concat(
    [tf.fill([sentence_char_script.nrows(), 1], True),
     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
    axis=1)

# dtype: int64; shape: [num_words]
#
# word_starts[i] is the index of the character that starts the i'th word (in
# the flattened list of characters from all sentences).
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)
tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)

然后,我们可以使用这些起始偏移量来构建一个RaggedTensor其中包含所有批次中的单词列表:

# dtype: int32; shape: [num_words, (num_chars_per_word)]
#
# word_char_codepoint[i, j] is the codepoint for the j'th character in the
# i'th word.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts)
print(word_char_codepoint)
<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46], [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>

最后,我们可以将代码点RaggedTensor回句子:

# dtype: int64; shape: [num_sentences]
#
# sentence_num_words[i] is the number of words in the i'th sentence.
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64),
    axis=1)

# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]
#
# sentence_word_char_codepoint[i, j, k] is the codepoint for the k'th character
# in the j'th word in the i'th sentence.
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words)
print(sentence_word_char_codepoint)
<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46]], [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>

为了使最终结果更易于阅读,我们可以将其编码回UTF-8字符串:

tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()
[[b'Hello', b', ', b'world', b'.'],
 [b'\xe4\xb8\x96\xe7\x95\x8c',
  b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']]