ETL for Oracle to PostgreSQL 2 - Pentaho Data Integrator (PDI, kettle)

Tags

PostgreSQL , Oracle , ETL , Pentaho Data Integrator , PDI , kettle


Background

Original article

https://wiki.postgresql.org/wiki/Migrating_from_one_database_to_another_with_Pentaho_ETL

Migration (schema + data) from one database to another can easily be done with Pentaho ETL. It is open-source software, and I personally recommend taking a look at it.

Steps for migration are very simple:

  1. Create a New Job

  2. Create Source Database Connection

  3. Create Destination Database Connection

  4. From the Wizard menu, choose Copy Tables Wizard...

  5. Choose Source and Destination

  6. Run the task
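
Once the wizard has generated the job, it can also be run outside the Spoon GUI with PDI's kitchen.sh launcher. A minimal sketch, assuming the job was saved to the hypothetical path /data/kettle/ora2pg_copy_tables.kjb and that you are in the data-integration install directory:

# run the saved migration job from the command line (the .kjb path is hypothetical)
./kitchen.sh -file=/data/kettle/ora2pg_copy_tables.kjb -level=Basic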

Bulk Loader

https://wiki.pentaho.com/display/EAI/PostgreSQL+Bulk+Loader

In addition to the ordinary INSERT method, Kettle can use the PostgreSQL psql command to load data via COPY, which gives a noticeable speed-up.

Description

The PostgreSQL bulk loader is an experimental step in which we stream data from inside Kettle to the psql command, using "COPY ... FROM STDIN" to load it into the database.
This way of loading data offers the best of both worlds: the performance of a bulk load and the flexibility of a Pentaho Data Integration transformation.
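
Under the hood the step launches the psql client and pipes the incoming rows into a COPY ... FROM STDIN statement. A rough hand-rolled equivalent (table name, CSV file and connection values are hypothetical, shown only to illustrate the mechanism):

# stream a CSV file into the target table via psql's stdin
cat /tmp/target_table.csv | psql -h 127.0.0.1 -p 5432 -U digoal -d digoal \
  -c "COPY public.target_table FROM STDIN WITH (FORMAT csv)"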

Make sure to check out the "Set up authentication" section below!

Note: This step does not work with a JNDI defined connection, only JDBC is supported.
Note: This step does not support timestamps at the moment (5.3). Timestamps should be converted to Date before this step. Using timestamps results in null-values in the table.

Options

  • Step name: Name of the step. Note: this name has to be unique in a single transformation.
  • Connection: Name of the database connection on which the target table resides. Note: the password of this database connection is not used; see the "Set up authentication" section below. Since PDI-1901 was fixed in 3.2.3, the username of the connection is used and added to the -U parameter; otherwise the logged-in user account would be taken.
  • Target schema: The name of the schema for the table to write data to. This is important for data sources that allow table names with dots '.' in them.
  • Target table: Name of the target table.
  • psql path: Full path to the psql utility.
  • Load action: Insert or Truncate. Insert just inserts; Truncate first truncates the table. Note: don't use 'Truncate' when you are running the transformation clustered or with multiple step copies! In that case, truncate the table before the transformation starts, for example in a job.
  • Fields to load: The list of fields to load data from. Properties include: Table field (the field to be loaded in the PostgreSQL table), Stream field (the field taken from the incoming rows) and Date mask (either "Pass through", "Date" or "DateTime", which determines how dates/timestamps are loaded into PostgreSQL).

Metadata Injection Support

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.

Set Up Authentication

"psql" doesn't allow you to specify the password. Here is a part of the connection options:

Connection options:
-h HOSTNAME database server host or socket directory (default: "/var/run/postgresql")
-p PORT database server port (default: "5432")
-U NAME database user name (default: "matt" - if you are not Matt: since PDI 3.2.3 the username of the connection is taken, see PDI-1901.)
-W prompt for password (should happen automatically)

As you can see there is no way to specify a password for the database. It will always prompt for a password on the console no matter what.

To overcome this you need to set up trusted authentication on the PostgreSQL server.

To make this happen, change the pg_hba.conf file (on my box this is /etc/postgresql/8.2/main/pg_hba.conf) and add a line like this:

host    all         all         192.168.1.0/24        trust  

This basically means that everyone from the 192.168.1.0 network (mask 255.255.255.0) can log into postgres on all databases with any username. If you are running Kettle on the same server, change it to localhost:

host    all         all         127.0.0.1/32        trust  

This is much safer, of course. Make sure you don't invite any strangers onto your PostgreSQL database!

TIP! Make sure to restart your database server after you make this change

A note on the authentication section: the original article is a bit off here; in practice there are several other ways to provide the password.

1. Modifying pg_hba.conf does not require a restart; a reload is enough.
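
For example, either of the following reloads the authentication configuration without a restart (assuming shell access to pg_ctl, or a superuser session for pg_reload_conf):

# reload from the OS shell
pg_ctl reload -D $PGDATA

# or from any superuser session
psql -c "select pg_reload_conf();"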

2. You do not have to configure trust mode; the password can be stored in a .pgpass file instead:

export PGPASSFILE="/home/digoal/.pgpass"  
  
vi /home/digoal/.pgpass  
hostname:port:database:username:password  
  
chmod 400 /home/digoal/.pgpass  
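
With PGPASSFILE exported and the file restricted to mode 400, psql (and therefore the bulk loader's psql call) no longer prompts for a password. A quick check, reusing the placeholder values above:

# should connect and return 1 without asking for a password
psql -h hostname -p port -U username -d database -c "select 1"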

3. You do not have to configure trust mode either; the password can be set through libpq environment variables:

https://www.postgresql.org/docs/10/static/libpq-envars.html

export PGHOST=xxx.xxx.xxx.xxx  
export PGPORT=5432  
export PGDATABASE=digoal  
export PGUSER=digoal  
export PGPASSWORD=pwd  
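
With these variables exported in the environment that launches Kettle/psql, connections are made without any prompt. For example:

# picks up host, port, database, user and password from the environment
psql -c "select current_user, current_database()"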

References

https://wiki.pentaho.com/display/EAI/.03+Database+Connections
