本頁面由 Cloud Translation API 翻譯而成。
Switch to English

從TensorFlow IO讀取PostgreSQL數據庫

在TensorFlow.org上查看 在Google Colab中運行 在GitHub上查看源代碼 下載筆記本

總覽

本教程說明如何從PostgreSQL數據庫服務器創建tf.data.Dataset ,以便可以將創建的Dataset傳遞給tf.keras進行訓練或推理。

SQL數據庫是數據科學家的重要數據來源。 PostgreSQL是最流行的開源SQL數據庫之一,在企業中廣泛使用,用於全面存儲關鍵數據和事務數據。直接從PostgreSQL數據庫服務器創建Dataset並將Dataset傳遞給tf.keras進行訓練或推理,可以大大簡化數據管道,並幫助數據科學家專注於構建機器學習模型。

設置和使用

安裝所需的tensorflow-io軟件包並重新啟動運行時

 try:
  %tensorflow_version 2.x
except Exception:
  pass

!pip install -q tensorflow-io
 

安裝和設置PostgreSQL(可選)

為了演示在Google Colab上的用法,您將安裝PostgreSQL服務器。還需要密碼和一個空數據庫。

如果您不在Google Colab上運行此筆記本,或者希望使用現有數據庫,請跳過以下設置,然後繼續進行下一部分。

 # Install postgresql server
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql
!sudo service postgresql start

# Setup a password `postgres` for username `postgres`
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"

# Setup a database with name `tfio_demo` to be used
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS tfio_demo;'
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE tfio_demo;'
 
Preconfiguring packages ...
Selecting previously unselected package libpq5:amd64.
(Reading database ... 226072 files and directories currently installed.)
Preparing to unpack .../0-libpq5_10.12-0ubuntu0.18.04.1_amd64.deb ...
Unpacking libpq5:amd64 (10.12-0ubuntu0.18.04.1) ...
Selecting previously unselected package postgresql-client-common.
Preparing to unpack .../1-postgresql-client-common_190ubuntu0.1_all.deb ...
Unpacking postgresql-client-common (190ubuntu0.1) ...
Selecting previously unselected package postgresql-client-10.
Preparing to unpack .../2-postgresql-client-10_10.12-0ubuntu0.18.04.1_amd64.deb ...
Unpacking postgresql-client-10 (10.12-0ubuntu0.18.04.1) ...
Selecting previously unselected package ssl-cert.
Preparing to unpack .../3-ssl-cert_1.0.39_all.deb ...
Unpacking ssl-cert (1.0.39) ...
Selecting previously unselected package postgresql-common.
Preparing to unpack .../4-postgresql-common_190ubuntu0.1_all.deb ...
Adding 'diversion of /usr/bin/pg_config to /usr/bin/pg_config.libpq-dev by postgresql-common'
Unpacking postgresql-common (190ubuntu0.1) ...
Selecting previously unselected package postgresql-10.
Preparing to unpack .../5-postgresql-10_10.12-0ubuntu0.18.04.1_amd64.deb ...
Unpacking postgresql-10 (10.12-0ubuntu0.18.04.1) ...
Selecting previously unselected package postgresql.
Preparing to unpack .../6-postgresql_10+190ubuntu0.1_all.deb ...
Unpacking postgresql (10+190ubuntu0.1) ...
Selecting previously unselected package sysstat.
Preparing to unpack .../7-sysstat_11.6.1-1ubuntu0.1_amd64.deb ...
Unpacking sysstat (11.6.1-1ubuntu0.1) ...
Setting up sysstat (11.6.1-1ubuntu0.1) ...

Creating config file /etc/default/sysstat with new version
update-alternatives: using /usr/bin/sar.sysstat to provide /usr/bin/sar (sar) in auto mode
Created symlink /etc/systemd/system/multi-user.target.wants/sysstat.service → /lib/systemd/system/sysstat.service.
Setting up ssl-cert (1.0.39) ...
Setting up libpq5:amd64 (10.12-0ubuntu0.18.04.1) ...
Setting up postgresql-client-common (190ubuntu0.1) ...
Setting up postgresql-common (190ubuntu0.1) ...
Adding user postgres to group ssl-cert

Creating config file /etc/postgresql-common/createcluster.conf with new version
Building PostgreSQL dictionaries from installed myspell/hunspell packages...
Removing obsolete dictionary files:
Created symlink /etc/systemd/system/multi-user.target.wants/postgresql.service → /lib/systemd/system/postgresql.service.
Setting up postgresql-client-10 (10.12-0ubuntu0.18.04.1) ...
update-alternatives: using /usr/share/postgresql/10/man/man1/psql.1.gz to provide /usr/share/man/man1/psql.1.gz (psql.1.gz) in auto mode
Setting up postgresql-10 (10.12-0ubuntu0.18.04.1) ...
Creating new PostgreSQL cluster 10/main ...
/usr/lib/postgresql/10/bin/initdb -D /var/lib/postgresql/10/main --auth-local peer --auth-host md5
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "C.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/10/main ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start

Ver Cluster Port Status Owner    Data directory              Log file
10  main    5432 down   postgres /var/lib/postgresql/10/main /var/log/postgresql/postgresql-10-main.log
update-alternatives: using /usr/share/postgresql/10/man/man1/postmaster.1.gz to provide /usr/share/man/man1/postmaster.1.gz (postmaster.1.gz) in auto mode
Setting up postgresql (10+190ubuntu0.1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Processing triggers for ureadahead (0.100.0-21) ...
Processing triggers for libc-bin (2.27-3ubuntu1.2) ...
Processing triggers for systemd (237-3ubuntu10.38) ...
ALTER ROLE
NOTICE:  database "tfio_demo" does not exist, skipping
DROP DATABASE
CREATE DATABASE

設置必要的環境變量

以下環境變量基於上一節中的PostgreSQL設置。如果您使用不同的設置或正在使用現有數據庫,則應相應地更改它們:

 %env TFIO_DEMO_DATABASE_NAME=tfio_demo
%env TFIO_DEMO_DATABASE_HOST=localhost
%env TFIO_DEMO_DATABASE_PORT=5432
%env TFIO_DEMO_DATABASE_USER=postgres
%env TFIO_DEMO_DATABASE_PASS=postgres
 
env: TFIO_DEMO_DATABASE_NAME=tfio_demo
env: TFIO_DEMO_DATABASE_HOST=localhost
env: TFIO_DEMO_DATABASE_PORT=5432
env: TFIO_DEMO_DATABASE_USER=postgres
env: TFIO_DEMO_DATABASE_PASS=postgres

在PostgreSQL服務器中準備數據

出於演示目的,本教程將創建一個數據庫,並向其中填充一些數據。本教程中使用的數據來自UCI機器學習存儲庫中的 空氣質量數據集

以下是空氣質量數據集的子集的預覽:

日期|時間| CO(GT)| PT08.S1(CO)| NMHC(GT)| C6H6(GT)| PT08.S2(NMHC)| NOx(GT)| PT08.S3(NOx)| NO2(GT)| PT08.S4(NO2)| PT08.S5(O3)| T | RH | AH | ---- | ---- | ------ | ----------- | -------- | -------- | --- ---------- | ---- | ---------- | ------- | ------------ |- --------- |-|-|-|| 10/03/2004 | 18.00.00 | 2,6 | 1360 | 150 | 11,9 | 1046 | 166 | 1056 | 113 | 1692 | 1268 | 13,6 | 48,9 | 0,7578 | 10/03/2004 | 19.00.00 | 2 | 1292 | 112 | 9,4 | 955 | 103 | 1174 | 92 | 1559 | 972 | 13,3 | 47,7 | 0,7255 | 10/03/2004 | 20.00.00 | 2,2 | 1402 | 88 | 9,0 | 939 | 131 | 1140 | 114 | 1555 | 1074 | 11,9 | 54,0 | 0,7502 | 10/03/2004 | 21.00.00 | 2,2 | 1376 | 80 | 9,2 | 948 | 172 | 1092 | 122 | 1584 | 1203 | 11,0 | 60,0 | 0,7867 | 10/03/2004 | 22.00.00 | 1,6 | 1272 | 51 | 6,5 | 836 | 131 | 1205 | 116 | 1490 | 1110 | 11,2 | 59,6 | 0,7888 |

參考部分提供了有關空氣質量數據集和UCI機器學習存儲庫的更多信息。

為了幫助簡化數據準備,已經準備了空氣質量數據集的sql版本,並且可以通過AirQualityUCI.sql獲得

創建表的語句是:

 CREATE TABLE AirQualityUCI (
  Date DATE,
  Time TIME,
  CO REAL,
  PT08S1 INT,
  NMHC REAL,
  C6H6 REAL,
  PT08S2 INT,
  NOx REAL,
  PT08S3 INT,
  NO2 REAL,
  PT08S4 INT,
  PT08S5 INT,
  T REAL,
  RH REAL,
  AH REAL
);
 

在數據庫中創建表並填充數據的完整命令為:

 !curl -s -OL https://github.com/tensorflow/io/raw/master/docs/tutorials/postgresql/AirQualityUCI.sql

!PGPASSWORD=$TFIO_DEMO_DATABASE_PASS psql -q -h $TFIO_DEMO_DATABASE_HOST -p $TFIO_DEMO_DATABASE_PORT -U $TFIO_DEMO_DATABASE_USER -d $TFIO_DEMO_DATABASE_NAME -f AirQualityUCI.sql
 

從PostgreSQL服務器創建數據集並在TensorFlow中使用它

從PostgreSQL服務器創建數據集tfio.experimental.IODataset.from_sql使用queryendpoint參數調用tfio.experimental.IODataset.from_sql一樣容易。該query是對錶中選擇列的SQL查詢, endpoint參數是地址和數據庫名稱:

 import os
import tensorflow_io as tfio

endpoint="postgresql://{}:{}@{}?port={}&dbname={}".format(
    os.environ['TFIO_DEMO_DATABASE_USER'],
    os.environ['TFIO_DEMO_DATABASE_PASS'],
    os.environ['TFIO_DEMO_DATABASE_HOST'],
    os.environ['TFIO_DEMO_DATABASE_PORT'],
    os.environ['TFIO_DEMO_DATABASE_NAME'],
)

dataset = tfio.experimental.IODataset.from_sql(
    query="SELECT co, pt08s1 FROM AirQualityUCI;",
    endpoint=endpoint)

print(dataset.element_spec)
 
{'co': TensorSpec(shape=(), dtype=tf.float32, name=None), 'pt08s1': TensorSpec(shape=(), dtype=tf.int32, name=None)}

從上面的dataset.element_spec的輸出中可以看到,創建的Dataset的元素是一個python dict對象,其中數據庫表的列名稱為鍵:

 {
  'co': TensorSpec(shape=(), dtype=tf.float32, name=None),
  'pt08s1': TensorSpec(shape=(), dtype=tf.int32, name=None),
}
 

應用進一步的操作非常方便。例如,您可以選擇Dataset noxno2字段,然後計算差異:

 dataset = tfio.experimental.IODataset.from_sql(
    query="SELECT nox, no2 FROM AirQualityUCI;",
    endpoint=endpoint)

dataset = dataset.map(lambda e: (e['nox'] - e['no2']))

# check only the first 20 record
dataset = dataset.take(20)

print("NOx - NO2:")
for difference in dataset:
  print(difference.numpy())
 
NOx - NO2:
53.0
11.0
17.0
50.0
15.0
-7.0
-15.0
-14.0
-15.0
0.0
-13.0
-12.0
-14.0
16.0
62.0
28.0
14.0
3.0
9.0
34.0

現在可以將創建的Dataset直接傳遞給tf.keras ,以進行訓練或推理。

參考文獻

  • Dua D.和Graff C.(2019年)。 UCI機器學習存儲庫[ http://archive.ics.uci.edu/ml ]。加州爾灣市:加州大學信息與計算機科學學院。
  • S. De Vito,E。Massera,M。Piga,L。Martinotto,G。Di Francia,關於在城市污染監測場景中估算苯的電子鼻的現場校準,傳感器和執行器B:化學,第129卷,問題2,2008年2月22日,第750-757頁,ISSN 0925-4005