text.FastBertNormalizer

Normalizes a tensor of UTF-8 strings.

text.FastBertNormalizer(
    lower_case_nfd_strip_accents=False, model_buffer=None
)

Args
`lower_case_nfd_strip_accents`	(optional) If true, first lowercases the text, applies NFD normalization, strips accent characters, and then replaces control characters with whitespace. If false, only replaces control characters with whitespace.
`model_buffer`	(optional) A bytes object (or a uint8 `tf.Tensor`) that contains the fast BERT normalizer model in flatbuffer format (see fast_bert_normalizer_model.fbs). If not `None`, all other arguments are ignored.
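
The two modes of `lower_case_nfd_strip_accents` described above can be approximated in plain Python with the standard `unicodedata` module. This is only an illustrative sketch, not the library's implementation: the helper name `approx_normalize` is hypothetical, and treating "accent characters" as Unicode category `Mn` and "control characters" as category `Cc` is an assumption that may differ from the real normalizer on edge cases.

```python
import unicodedata


def approx_normalize(text: str, lower_case_nfd_strip_accents: bool = False) -> str:
    """Rough, pure-Python approximation of the documented behavior (hypothetical helper)."""
    if lower_case_nfd_strip_accents:
        # Step 1: lowercase the text.
        text = text.lower()
        # Step 2: NFD normalization splits accented letters into base + combining mark.
        text = unicodedata.normalize("NFD", text)
        # Step 3: drop combining marks (assumption: category "Mn" stands in for accents).
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Both modes: replace control characters (assumption: category "Cc") with a space.
    return "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)


print(approx_normalize("\xC0bc", lower_case_nfd_strip_accents=True))  # abc
print(repr(approx_normalize("A\x07b")))  # 'A b'
```

With the flag left at `False`, only the control-character replacement runs, so `"\xC0bc"` keeps its case and accent.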

Methods

`normalize`

normalize(
    input
)

Normalizes a tensor of UTF-8 strings.

Example:

>>> import tensorflow_text as text
>>> texts = [["They're", "the", "Greatest", "\xC0bc"]]
>>> normalizer = text.FastBertNormalizer(lower_case_nfd_strip_accents=True)
>>> normalizer.normalize(texts)
<tf.RaggedTensor [[b"they're", b'the', b'greatest', b'abc']]>

Args
`input`	An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.

Returns
An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.

`normalize_with_offsets`

normalize_with_offsets(
    input
)

Normalizes a tensor of UTF-8 strings and returns an offsets map.

Example:

>>> import tensorflow_text as text
>>> texts = ["They're", "the", "Greatest", "\xC0bc"]
>>> normalizer = text.FastBertNormalizer(lower_case_nfd_strip_accents=True)
>>> normalized_text, offsets = (
...   normalizer.normalize_with_offsets(texts))
>>> normalized_text
<tf.Tensor: shape=(4,), dtype=string, numpy=array([b"they're", b'the',
b'greatest', b'abc'], dtype=object)>
>>> offsets
<tf.RaggedTensor [[0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3], [0, 1, 2, 3, 4, 5,
6, 7, 8], [0, 2, 3, 4]]>

Args
`input`	An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.

Returns
A tuple `(normalized_texts, offsets)` where:
`normalized_texts`	is a `Tensor` or `RaggedTensor`.
`offsets`	is a `RaggedTensor` of the byte offsets from the output to the input. For example, if the input is an N-dimensional tensor indexed as `input[i1...iN]`, then `offsets[i1...iN, k]` is the byte offset in `input[i1...iN]` of the `k`th byte in `normalized_texts[i1...iN]`. `offsets[i1...iN, ...]` also includes one entry for the position following the last byte in `normalized_texts[i1...iN]`, which gives the byte offset in `input[i1...iN]` that corresponds to the end of `normalized_texts[i1...iN]`.
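
To make the offsets semantics concrete, here is a small pure-Python sketch that maps a byte span in the normalized text back to the input. It reuses the values shown in the example for the string `"\xC0bc"` (normalized `b'abc'`, offsets `[0, 2, 3, 4]`) rather than calling the library; the helper name `original_span` is hypothetical.

```python
# Values copied from the documented example for the input "\xC0bc".
original = "\xC0bc".encode("utf-8")  # 4 bytes: the accented letter takes 2 bytes
normalized = b"abc"
offsets = [0, 2, 3, 4]  # len(normalized) + 1 entries; the last marks the end


def original_span(offsets, start, end):
    """Map the byte span [start, end) of the normalized text to a span in the input.

    Hypothetical helper: offsets[k] is the input byte offset of normalized byte k,
    and the extra trailing entry makes end == len(normalized) a valid index.
    """
    return offsets[start], offsets[end]


# The first normalized byte b"a" comes from the 2-byte accented letter.
start, end = original_span(offsets, 0, 1)
print(start, end, original[start:end].decode("utf-8"))

# The whole normalized string maps back to the whole input.
print(original_span(offsets, 0, len(normalized)))  # (0, 4)
```

The trailing entry is what allows a span that ends at the last normalized byte to be mapped without special-casing the end of the string.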
