TFRecord and tf.train.Example

The TFRecord format is a simple format for storing a sequence of binary records.

Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data.

Protocol messages are defined by .proto files, which are often the easiest way to understand a message type.

The tf.train.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used throughout higher-level APIs such as TFX.

This notebook demonstrates how to create, parse, and use the tf.train.Example message, and then how to serialize tf.train.Example messages and write them to and read them from .tfrecord files.

Note: In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10 MB and ideally 100 MB or more) so that you can benefit from I/O prefetching. For example, say you have X GB of data and you plan to train on up to N hosts. Ideally, you should shard the data into ~10*N files, as long as ~X/(10*N) is at least 10 MB (and ideally 100 MB or more). If it is less than that, you may need to create fewer shards to trade off the parallelism benefits against the I/O prefetching benefits.
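The rule of thumb in the note can be sketched as a small helper. This is only an illustration and not part of TensorFlow: the function name `suggest_num_shards` and its parameters are hypothetical, and the 10 MB floor is simply the lower bound mentioned in the note.

```python
def suggest_num_shards(total_bytes, num_hosts, min_shard_bytes=10 * 1024**2):
    """Suggest a shard count following the 10x-hosts rule of thumb.

    Starts from 10 * num_hosts files and lowers the count if that would
    make each shard smaller than min_shard_bytes.
    """
    target = 10 * num_hosts
    # Largest shard count that keeps every shard at or above the floor.
    max_by_size = max(1, total_bytes // min_shard_bytes)
    return max(1, min(target, max_by_size))


# 8 GiB read by 4 hosts: the 10*N target is fine, shards are well above 100 MB.
print(suggest_num_shards(8 * 1024**3, num_hosts=4))
# 100 MiB read by 4 hosts: 40 shards would be too small, so fewer are suggested.
print(suggest_num_shards(100 * 1024**2, num_hosts=4))
```

Raising `min_shard_bytes` toward 100 MB moves the result toward the "ideal" end of the range given in the note.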

Setup

import tensorflow as tf

import numpy as np
import IPython.display as display

`tf.train.Example`

Data types for `tf.train.Example`

Fundamentally, a tf.train.Example is a {"string": tf.train.Feature} mapping.

The tf.train.Feature message type can accept one of the following three types (see the .proto file for reference). Most other generic types can be coerced into one of these:

tf.train.BytesList (the following types can be coerced)
- string
- byte
tf.train.FloatList (the following types can be coerced)
- float ( float32 )
- double ( float64 )
tf.train.Int64List (the following types can be coerced)
- bool
- enum
- int32
- uint32
- int64
- uint64

To convert a standard TensorFlow type to a tf.train.Example-compatible tf.train.Feature, you can use the shortcut functions below. Note that each function takes a scalar input value and returns a tf.train.Feature containing one of the three list types above:

# The following functions can be used to convert a value to a type compatible
# with tf.train.Example.

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

Below are some examples of how these functions work. Note the varying input types and the standardized output types. If the input type for a function does not match one of the coercible types stated above, the function will raise an exception (for example, _int64_feature(1.0) will error out because 1.0 is a float, so it should be used with the _float_feature function instead):

print(_bytes_feature(b'test_string'))
print(_bytes_feature(u'test_bytes'.encode('utf-8')))

print(_float_feature(np.exp(1)))

print(_int64_feature(True))
print(_int64_feature(1))

bytes_list {
  value: "test_string"
}

bytes_list {
  value: "test_bytes"
}

float_list {
  value: 2.7182817459106445
}

int64_list {
  value: 1
}

int64_list {
  value: 1
}

All proto messages can be serialized to a binary string using the .SerializeToString method:

feature = _float_feature(np.exp(1))

feature.SerializeToString()

b'\x12\x06\n\x04T\xf8-@'
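As a complement to the serialization step, the sketch below decodes the same bytes back into a message. It relies on `FromString`, the standard protobuf classmethod that this notebook also uses for tf.train.Example; the variable names are illustrative.

```python
import numpy as np
import tensorflow as tf

feature = tf.train.Feature(
    float_list=tf.train.FloatList(value=[np.exp(1)]))
serialized = feature.SerializeToString()

# Decode the binary string back into a tf.train.Feature message.
restored = tf.train.Feature.FromString(serialized)

# WhichOneof reports which of the three list types is populated.
print(restored.WhichOneof('kind'))
# FloatList stores float32, so the value is e rounded to single precision.
print(restored.float_list.value[0])
```

Because the value is stored as float32, compare it with a tolerance rather than with exact equality against the original float64.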

Creating a `tf.train.Example` message

Suppose you want to create a tf.train.Example message from existing data. In practice, the dataset may come from anywhere, but the procedure for creating the tf.train.Example message from a single observation will be the same:

1. Within each observation, each value needs to be converted to a tf.train.Feature containing one of the 3 compatible types, using one of the functions above.
2. You create a map (dictionary) from the feature name string to the encoded feature value produced in step 1.
3. The map produced in step 2 is converted to a Features message.

In this notebook, you will create a dataset using NumPy.

This dataset will have 4 features:

- a boolean feature, False or True with equal probability
- an integer feature chosen uniformly at random from [0, 4]
- a string feature generated from a string table by using the integer feature as an index
- a float feature from a standard normal distribution

Consider a sample consisting of 10,000 independently and identically distributed observations from each of the above distributions:

# The number of observations in the dataset.
n_observations = int(1e4)

# Boolean feature, encoded as False or True.
feature0 = np.random.choice([False, True], n_observations)

# Integer feature, random from 0 to 4.
feature1 = np.random.randint(0, 5, n_observations)

# String feature.
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]

# Float feature, from a standard normal distribution.
feature3 = np.random.randn(n_observations)

Each of these features can be coerced into a tf.train.Example-compatible type using one of _bytes_feature, _float_feature, or _int64_feature. You can then create a tf.train.Example message from these encoded features:

def serialize_example(feature0, feature1, feature2, feature3):
  """
  Creates a tf.train.Example message ready to be written to a file.
  """
  # Create a dictionary mapping the feature name to the tf.train.Example-compatible
  # data type.
  feature = {
      'feature0': _int64_feature(feature0),
      'feature1': _int64_feature(feature1),
      'feature2': _bytes_feature(feature2),
      'feature3': _float_feature(feature3),
  }

  # Create a Features message using tf.train.Example.

  example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
  return example_proto.SerializeToString()

For example, suppose you have a single observation from the dataset, [False, 4, b'goat', 0.9876]. You can create and print the serialized tf.train.Example message for this observation using serialize_example(). Each single observation will be written as a Features message as described above. Note that the tf.train.Example message is just a wrapper around the Features message:

# This is an example observation from the dataset.
serialized_example = serialize_example(False, 4, b'goat', 0.9876)
serialized_example

b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?'

To decode the message, use the tf.train.Example.FromString method.

example_proto = tf.train.Example.FromString(serialized_example)
example_proto

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.9876000285148621
      }
    }
  }
}
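Once decoded, individual values can be read through the protobuf getters. The sketch below rebuilds a two-feature message on its own so that it runs independently of the cells above; the feature names mirror the ones used in this notebook and the variable names are illustrative.

```python
import tensorflow as tf

example_proto = tf.train.Example(features=tf.train.Features(feature={
    'feature1': tf.train.Feature(
        int64_list=tf.train.Int64List(value=[4])),
    'feature2': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b'goat'])),
}))
decoded = tf.train.Example.FromString(example_proto.SerializeToString())

# features.feature behaves like a dictionary keyed by feature name.
feature_map = decoded.features.feature
index = feature_map['feature1'].int64_list.value[0]
name = feature_map['feature2'].bytes_list.value[0]

print(sorted(feature_map.keys()))
print(index, name)
```

Each value is a list, even for scalar features, which is why the first element is taken explicitly.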

TFRecords format details

A TFRecord file contains a sequence of records. The file can only be read sequentially.

Each record contains a byte-string for the data payload, plus the data length, and CRC-32C (32-bit CRC using the Castagnoli polynomial) hashes for integrity checking.

Each record is stored in the following format:

uint64 length
uint32 masked_crc32_of_length
byte   data[length]
uint32 masked_crc32_of_data

The records are concatenated together to produce the file. The CRCs are cyclic redundancy checks, and the mask of a CRC is:

masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul
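The record layout and the masking formula can be sketched in plain Python. This is an illustration under stated assumptions, not TensorFlow's implementation: the helper names (`crc32c`, `masked_crc32c`, `frame_record`) are hypothetical, the CRC is a simple table-driven CRC-32C, and the integer fields are assumed to be little-endian.

```python
import struct


def _make_crc32c_table():
    # Reflected form of the Castagnoli polynomial.
    poly = 0x82F63B78
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data):
    """Table-driven CRC-32C over a bytes object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    """Apply the rotate-and-add mask, keeping the result in 32 bits."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload):
    """Lay out one record: length, masked CRC of length, data, masked CRC of data."""
    length = struct.pack('<Q', len(payload))
    return (length
            + struct.pack('<I', masked_crc32c(length))
            + payload
            + struct.pack('<I', masked_crc32c(payload)))


# The standard CRC-32C check value for b'123456789' is 0xe3069283.
print(hex(crc32c(b'123456789')))
# 8 (length) + 4 (CRC) + 3 (payload) + 4 (CRC) bytes.
print(len(frame_record(b'abc')))
```

The explicit `& 0xFFFFFFFF` is needed because Python integers do not wrap at 32 bits the way the `uint32` arithmetic in the formula does.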

Note: There is no requirement to use tf.train.Example in TFRecord files. tf.train.Example is just a method of serializing dictionaries to byte-strings. Any byte-string that can be decoded in TensorFlow can be stored in a TFRecord file. Examples include: lines of text, JSON (using tf.io.decode_json_example), encoded image data, or serialized tf.Tensors (using tf.io.serialize_tensor / tf.io.parse_tensor). See the tf.io module for more options.
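As an illustration of storing something other than tf.train.Example, the sketch below writes a serialized tensor as a record and reads it back. The file name and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

import tensorflow as tf

tensor = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float32)
path = os.path.join(tempfile.mkdtemp(), 'tensors.tfrecord')

# Each record is an arbitrary byte-string; here it is a serialized tensor.
with tf.io.TFRecordWriter(path) as writer:
    writer.write(tf.io.serialize_tensor(tensor).numpy())

# parse_tensor needs the dtype, because the record itself is just bytes.
restored = [
    tf.io.parse_tensor(record, out_type=tf.float32)
    for record in tf.data.TFRecordDataset(path)
]
print(restored[0].numpy())
```

The reader has to know what each record contains: nothing in the file says whether a payload is a tensor, an Example, or a line of text.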

TFRecord files using `tf.data`

The tf.data module also provides tools for reading and writing data in TensorFlow.

Writing a TFRecord file

The easiest way to get the data into a dataset is to use the from_tensor_slices method.

Applied to an array, it returns a dataset of scalars:

tf.data.Dataset.from_tensor_slices(feature1)

<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>

Applied to a tuple of arrays, it returns a dataset of tuples:

features_dataset = tf.data.Dataset.from_tensor_slices((feature0, feature1, feature2, feature3))
features_dataset

<TensorSliceDataset element_spec=(TensorSpec(shape=(), dtype=tf.bool, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.float64, name=None))>

# Use `take(1)` to only pull one example from the dataset.
for f0,f1,f2,f3 in features_dataset.take(1):
  print(f0)
  print(f1)
  print(f2)
  print(f3)

tf.Tensor(False, shape=(), dtype=bool)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(b'goat', shape=(), dtype=string)
tf.Tensor(0.5251196235602504, shape=(), dtype=float64)

Use the tf.data.Dataset.map method to apply a function to each element of a Dataset.

The mapped function must operate in TensorFlow graph mode: it must operate on and return tf.Tensors. A non-tensor function, like serialize_example, can be wrapped with tf.py_function to make it compatible.

Using tf.py_function requires you to specify the shape and type information that is otherwise unavailable:

def tf_serialize_example(f0,f1,f2,f3):
  tf_string = tf.py_function(
    serialize_example,
    (f0, f1, f2, f3),  # Pass these args to the above function.
    tf.string)      # The return type is `tf.string`.
  return tf.reshape(tf_string, ()) # The result is a scalar.

tf_serialize_example(f0, f1, f2, f3)

<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04=n\x06?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'>

Apply this function to each element in the dataset:

serialized_features_dataset = features_dataset.map(tf_serialize_example)
serialized_features_dataset

<MapDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

def generator():
  for features in features_dataset:
    yield serialize_example(*features)

serialized_features_dataset = tf.data.Dataset.from_generator(
    generator, output_types=tf.string, output_shapes=())

serialized_features_dataset

<FlatMapDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

And write them to a TFRecord file:

filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(serialized_features_dataset)

WARNING:tensorflow:From /tmp/ipykernel_25215/3575438268.py:2: TFRecordWriter.__init__ (from tensorflow.python.data.experimental.ops.writers) is deprecated and will be removed in a future version.
Instructions for updating:
To write TFRecords to disk, use `tf.io.TFRecordWriter`. To save and load the contents of a dataset, use `tf.data.experimental.save` and `tf.data.experimental.load`
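The deprecation warning points to `tf.io.TFRecordWriter`. A minimal sketch of that route follows, assuming a dataset of scalar byte-strings; the small stand-in dataset and the temporary path are made up so the block runs on its own.

```python
import os
import tempfile

import tensorflow as tf

# Stand-in for serialized_features_dataset: a dataset of scalar byte-strings.
serialized_dataset = tf.data.Dataset.from_tensor_slices(
    [b'record-0', b'record-1', b'record-2'])

out_path = os.path.join(tempfile.mkdtemp(), 'test.tfrecord')
with tf.io.TFRecordWriter(out_path) as writer:
    for serialized in serialized_dataset:
        # Each element is a scalar tf.string tensor; write its raw bytes.
        writer.write(serialized.numpy())

# Read the file back to confirm the records and their order.
written = [record.numpy() for record in tf.data.TFRecordDataset(out_path)]
print(written)
```

This loop runs eagerly in Python, so for very large datasets it may be slower than a writer that stays inside the tf.data pipeline.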

Reading a TFRecord file

You can also read the TFRecord file using the tf.data.TFRecordDataset class.

More information on consuming TFRecord files using tf.data can be found in the "tf.data: Build TensorFlow input pipelines" guide.

Using TFRecordDatasets can be useful for standardizing input data and optimizing performance.

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

At this point the dataset contains serialized tf.train.Example messages. When iterated over, it returns these as scalar string tensors.

Use the .take method to show only the first 10 records.

for raw_record in raw_dataset.take(10):
  print(repr(raw_record))

<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04=n\x06?'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x9d\xfa\x98\xbe\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x13\n\x08feature2\x12\x07\n\x05\n\x03dog\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04a\xc0r?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x92Q(?'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04>\xc0\xe5>\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nU\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04I!\xde\xbe\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x02\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x17\n\x08feature2\x12\x0b\n\t\n\x07chicken'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\xe0\x1a\xab\xbf\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x87\xb2\xd7?\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04n\xe19>\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x1as\xd9\xbf\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>

These tensors can be parsed using the function below. Note that the feature_description is necessary here because tf.data.Datasets use graph execution and need this description to build their shape and type signature:

# Create a description of the features.
feature_description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

def _parse_function(example_proto):
  # Parse the input `tf.train.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, feature_description)

Alternatively, use tf.io.parse_example to parse a whole batch at once. Apply the _parse_function defined above to each item in the dataset using the tf.data.Dataset.map method:

parsed_dataset = raw_dataset.map(_parse_function)
parsed_dataset

<MapDataset element_spec={'feature0': TensorSpec(shape=(), dtype=tf.int64, name=None), 'feature1': TensorSpec(shape=(), dtype=tf.int64, name=None), 'feature2': TensorSpec(shape=(), dtype=tf.string, name=None), 'feature3': TensorSpec(shape=(), dtype=tf.float32, name=None)}>

Use eager execution to display the observations in the dataset. There are 10,000 observations in this dataset, but you will only display the first 10. The data is displayed as a dictionary of features. Each item is a tf.Tensor, and the numpy element of this tensor displays the value of the feature:

for parsed_record in parsed_dataset.take(10):
  print(repr(parsed_record))

{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.5251196>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.29878703>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'dog'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.94824797>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.65749466>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.44873232>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'chicken'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.4338477>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-1.3367577>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=1.6851357>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.18152401>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-1.6988251>}

Here, the tf.io.parse_single_example function unpacks the tf.train.Example fields into standard tensors.
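For comparison, batch parsing with `tf.io.parse_example` might look like the sketch below. It builds three small serialized examples so that it runs independently; the `make_example` helper and the reduced two-feature description are assumptions made for illustration.

```python
import tensorflow as tf


def make_example(flag, name):
    """Serialize a tiny two-feature tf.train.Example (illustrative helper)."""
    feature = {
        'feature0': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[flag])),
        'feature2': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[name])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()


serialized = [
    make_example(0, b'goat'),
    make_example(1, b'cat'),
    make_example(1, b'dog'),
]

description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
}

# Batch first, then parse the whole batch of serialized strings in one call.
batched = tf.data.Dataset.from_tensor_slices(serialized).batch(3)
parsed = batched.map(lambda batch: tf.io.parse_example(batch, description))

for features in parsed:
    print(features['feature0'].numpy())
    print(features['feature2'].numpy())
```

Each parsed feature now has a leading batch dimension, whereas `tf.io.parse_single_example` returns scalars for the same description.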

TFRecord files in Python

The tf.io module also contains pure-Python functions for reading and writing TFRecord files.

Writing a TFRecord file

Next, write the 10,000 observations to the file test.tfrecord. Each observation is converted to a tf.train.Example message, then written to the file. You can then verify that the file test.tfrecord has been created:

# Write the `tf.train.Example` observations to the file.
with tf.io.TFRecordWriter(filename) as writer:
  for i in range(n_observations):
    example = serialize_example(feature0[i], feature1[i], feature2[i], feature3[i])
    writer.write(example)

!du -sh {filename}

984K    test.tfrecord

Reading a TFRecord file

These serialized tensors can be easily parsed using tf.train.Example.ParseFromString:

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

for raw_record in raw_dataset.take(1):
  example = tf.train.Example()
  example.ParseFromString(raw_record.numpy())
  print(example)

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.5251196026802063
      }
    }
  }
}

That returns a tf.train.Example proto, which is difficult to use as is, but it is fundamentally a representation of a:

Dict[str,
     Union[List[float],
           List[int],
           List[str]]]

The following code manually converts the Example to a dictionary of NumPy arrays, without using TensorFlow Ops. Refer to the .proto file for details.

result = {}
# example.features.feature is the dictionary
for key, feature in example.features.feature.items():
  # The values are the Feature objects which contain a `kind` which contains:
  # one of three fields: bytes_list, float_list, int64_list

  kind = feature.WhichOneof('kind')
  result[key] = np.array(getattr(feature, kind).value)

result

{'feature3': array([0.5251196]),
 'feature1': array([4]),
 'feature0': array([0]),
 'feature2': array([b'goat'], dtype='|S4')}

Walkthrough: Reading and writing image data

This is an end-to-end example of how to read and write image data using TFRecords. Using an image as input data, you will write the data as a TFRecord file, then read the file back and display the image.

This can be useful if, for example, you want to use several models on the same input dataset. Instead of storing the image data raw, it can be preprocessed into the TFRecords format, and that can be used in all further processing and modelling.

First, let's download this image of a cat in the snow and this photo of the Williamsburg Bridge, NYC, under construction.

Fetch the images

cat_in_snow  = tf.keras.utils.get_file(
    '320px-Felis_catus-cat_on_snow.jpg',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg')

williamsburg_bridge = tf.keras.utils.get_file(
    '194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg
24576/17858 [=========================================] - 0s 0us/step
32768/17858 [=======================================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg
16384/15477 [===============================] - 0s 0us/step
24576/15477 [===============================================] - 0s 0us/step

display.display(display.Image(filename=cat_in_snow))
display.display(display.HTML('Image cc-by: <a href="https://commons.wikimedia.org/wiki/File:Felis_catus-cat_on_snow.jpg">Von.grzanka</a>'))

[Image output: cat in the snow]

display.display(display.Image(filename=williamsburg_bridge))
display.display(display.HTML('<a href="https://commons.wikimedia.org/wiki/File:New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg">From Wikimedia</a>'))

[Image output: Williamsburg Bridge under construction]

Write the TFRecord file

As before, encode the features as types compatible with tf.train.Example. This stores the raw image string feature, as well as the height, width, depth, and an arbitrary label feature. The latter is used when you write the file to distinguish between the cat image and the bridge image. Use 0 for the cat image, and 1 for the bridge image:

image_labels = {
    cat_in_snow : 0,
    williamsburg_bridge : 1,
}

# This is an example, just using the cat image.
image_string = open(cat_in_snow, 'rb').read()

label = image_labels[cat_in_snow]

# Create a dictionary with features that may be relevant.
def image_example(image_string, label):
  image_shape = tf.io.decode_jpeg(image_string).shape

  feature = {
      'height': _int64_feature(image_shape[0]),
      'width': _int64_feature(image_shape[1]),
      'depth': _int64_feature(image_shape[2]),
      'label': _int64_feature(label),
      'image_raw': _bytes_feature(image_string),
  }

  return tf.train.Example(features=tf.train.Features(feature=feature))

for line in str(image_example(image_string, label)).split('\n')[:15]:
  print(line)
print('...')

features {
  feature {
    key: "depth"
    value {
      int64_list {
        value: 3
      }
    }
  }
  feature {
    key: "height"
    value {
      int64_list {
        value: 213
      }
...

Notice that all of the features are now stored in the tf.train.Example message. Next, functionalize the code above and write the example messages to a file named images.tfrecords:

# Write the raw image files to `images.tfrecords`.
# First, process the two images into `tf.train.Example` messages.
# Then, write to a `.tfrecords` file.
record_file = 'images.tfrecords'
with tf.io.TFRecordWriter(record_file) as writer:
  for filename, label in image_labels.items():
    image_string = open(filename, 'rb').read()
    tf_example = image_example(image_string, label)
    writer.write(tf_example.SerializeToString())

!du -sh {record_file}

36K images.tfrecords

Read the TFRecord file

You now have the file images.tfrecords, and can iterate over the records in it to read back what you wrote. Given that in this example you will only reproduce the image, the only feature you will need is the raw image string. Extract it using the getters described above, namely example.features.feature['image_raw'].bytes_list.value[0]. You can also use the labels to determine which record is the cat and which one is the bridge:

raw_image_dataset = tf.data.TFRecordDataset('images.tfrecords')

# Create a dictionary describing the features.
image_feature_description = {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'depth': tf.io.FixedLenFeature([], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
}

def _parse_image_function(example_proto):
  # Parse the input tf.train.Example proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, image_feature_description)

parsed_image_dataset = raw_image_dataset.map(_parse_image_function)
parsed_image_dataset

<MapDataset element_spec={'depth': TensorSpec(shape=(), dtype=tf.int64, name=None), 'height': TensorSpec(shape=(), dtype=tf.int64, name=None), 'image_raw': TensorSpec(shape=(), dtype=tf.string, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'width': TensorSpec(shape=(), dtype=tf.int64, name=None)}>

Recover the images from the TFRecord file:

for image_features in parsed_image_dataset:
  image_raw = image_features['image_raw'].numpy()
  display.display(display.Image(data=image_raw))

[Image output: the cat and bridge images recovered from the TFRecord file]
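Beyond displaying the raw bytes, the stored image string can be decoded back into a pixel tensor and checked against the stored dimensions. The sketch below uses a small synthetic image instead of the downloaded files so that it is self-contained; the 8x6 size and the reduced feature set are assumptions made for illustration.

```python
import tensorflow as tf

# Synthetic 8x6 RGB image standing in for a downloaded JPEG.
pixels = tf.zeros([8, 6, 3], dtype=tf.uint8)
image_string = tf.io.encode_jpeg(pixels).numpy()

example = tf.train.Example(features=tf.train.Features(feature={
    'height': tf.train.Feature(
        int64_list=tf.train.Int64List(value=[8])),
    'width': tf.train.Feature(
        int64_list=tf.train.Int64List(value=[6])),
    'image_raw': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[image_string])),
}))

parsed = tf.io.parse_single_example(example.SerializeToString(), {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
})

# Decode the JPEG bytes into a uint8 tensor of shape (height, width, channels).
decoded = tf.io.decode_jpeg(parsed['image_raw'])
print(decoded.shape, decoded.dtype)
```

Comparing `decoded.shape` with the stored `height` and `width` is a cheap consistency check when reading records written by another process.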
