ETL for Oracle to PostgreSQL 2 - Pentaho Data Integrator (PDI, kettle)

Tags

PostgreSQL , Oracle , ETL , Pentaho Data Integrator , PDI , kettle


Background

Original article

https://wiki.postgresql.org/wiki/Migrating_from_one_database_to_another_with_Pentaho_ETL

Migration (schema + data) from one database to another can easily be done with Pentaho ETL. It is open-source software, and I personally recommend taking a look at it.

Steps for migration are very simple:

  1. Create a New Job

  2. Create Source Database Connection

  3. Create Destination Database Connection

  4. From the Wizard menu, choose Copy Tables Wizard...

  5. Choose Source and Destination

  6. Run the task
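
Once the wizard has generated the job, it can also be run outside the Spoon GUI with PDI's kitchen.sh launcher. A minimal sketch, assuming the job was saved to the hypothetical path /data/kettle/ora2pg_copy_tables.kjb and that you are in the data-integration install directory:

# run the saved migration job from the command line (the .kjb path is hypothetical)
./kitchen.sh -file=/data/kettle/ora2pg_copy_tables.kjb -level=Basic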

Bulk Loader

https://wiki.pentaho.com/display/EAI/PostgreSQL+Bulk+Loader

In addition to the ordinary INSERT method, Kettle can use the PostgreSQL psql command to load data via COPY, which gives a noticeable speed-up.

Description

The PostgreSQL bulk loader is an experimental step in which we stream data from inside Kettle to the psql command, using "COPY ... FROM STDIN" to load it into the database.
This way of loading data offers the best of both worlds: the performance of a bulk load and the flexibility of a Pentaho Data Integration transformation.
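
Under the hood the step launches the psql client and pipes the incoming rows into a COPY ... FROM STDIN statement. A rough hand-rolled equivalent (table name, CSV file and connection values are hypothetical, shown only to illustrate the mechanism):

# stream a CSV file into the target table via psql's stdin
cat /tmp/target_table.csv | psql -h 127.0.0.1 -p 5432 -U digoal -d digoal \
  -c "COPY public.target_table FROM STDIN WITH (FORMAT csv)"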

Make sure to check out the "Set up authentication" section below!

Note: This step does not work with a JNDI defined connection, only JDBC is supported.
Note: This step does not support timestamps at the moment (5.3). Timestamps should be converted to Date before this step. Using timestamps results in null-values in the table.

Options

  • Step name: Name of the step. Note: this name has to be unique in a single transformation.
  • Connection: Name of the database connection on which the target table resides. Note: the password of this database connection is not used; see the "Set up authentication" section below. Since PDI-1901 was fixed in 3.2.3, the username of the connection is used and added to the -U parameter; otherwise the logged-in user account would be taken.
  • Target schema: The name of the schema for the table to write data to. This is important for data sources that allow table names with dots '.' in them.
  • Target table: Name of the target table.
  • psql path: Full path to the psql utility.
  • Load action: Insert or Truncate. Insert just inserts; Truncate first truncates the table. Note: don't use 'Truncate' when you are running the transformation clustered or with multiple step copies! In that case, truncate the table before the transformation starts, for example in a job.
  • Fields to load: The list of fields to load data from. Properties include: Table field (the field to be loaded in the PostgreSQL table), Stream field (the field taken from the incoming rows) and Date mask (either "Pass through", "Date" or "DateTime", which determines how dates/timestamps are loaded into PostgreSQL).

Metadata Injection Support

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.

Set Up Authentication

"psql" doesn't allow you to specify the password. Here is a part of the connection options:

Connection options:
-h HOSTNAME database server host or socket directory (default: "/var/run/postgresql")
-p PORT database server port (default: "5432")
-U NAME database user name (default: "matt" - if you are not Matt: since PDI 3.2.3 the username of the connection is taken, see PDI-1901.)
-W prompt for password (should happen automatically)

As you can see there is no way to specify a password for the database. It will always prompt for a password on the console no matter what.

To overcome this you need to set up trusted authentication on the PostgreSQL server.

To make this happen, change the pg_hba.conf file (on my box this is /etc/postgresql/8.2/main/pg_hba.conf) and add a line like this:

host    all         all         192.168.1.0/24        trust  

This basically means that everyone from the 192.168.1.0 network (mask 255.255.255.0) can log into postgres on all databases with any username. If you are running Kettle on the same server, change it to localhost:

host    all         all         127.0.0.1/32        trust  

This is much safer, of course. Make sure you don't invite any strangers onto your PostgreSQL database!

TIP! Make sure to restart your database server after you make this change

A note on the authentication section: the original article is a bit off here; in practice there are several other ways to provide the password.

1. Modifying pg_hba.conf does not require a restart; a reload is enough.
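
For example, either of the following reloads the authentication configuration without a restart (assuming shell access to pg_ctl, or a superuser session for pg_reload_conf):

# reload from the OS shell
pg_ctl reload -D $PGDATA

# or from any superuser session
psql -c "select pg_reload_conf();"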

2. You do not have to configure trust mode; the password can be stored in a .pgpass file instead:

export PGPASSFILE="/home/digoal/.pgpass"  
  
vi /home/digoal/.pgpass  
hostname:port:database:username:password  
  
chmod 400 /home/digoal/.pgpass  
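
With PGPASSFILE exported and the file restricted to mode 400, psql (and therefore the bulk loader's psql call) no longer prompts for a password. A quick check, reusing the placeholder values above:

# should connect and return 1 without asking for a password
psql -h hostname -p port -U username -d database -c "select 1"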

3. You do not have to configure trust mode either; the password can be set through libpq environment variables:

https://www.postgresql.org/docs/10/static/libpq-envars.html

export PGHOST=xxx.xxx.xxx.xxx  
export PGPORT=5432  
export PGDATABASE=digoal  
export PGUSER=digoal  
export PGPASSWORD=pwd  
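
With these variables exported in the environment that launches Kettle/psql, connections are made without any prompt. For example:

# picks up host, port, database, user and password from the environment
psql -c "select current_user, current_database()"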

References

https://wiki.pentaho.com/display/EAI/.03+Database+Connections
