1. DataX Installation and Usage Guide
1.1 Prerequisites
Linux; JDK (1.8 or later, 1.8 recommended); Python (Python 2.6.x recommended)
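The JDK requirement above can be checked before installing anything else. The following is a minimal sketch, assuming Python 3.7 or later is available for running the check itself (that is separate from the interpreter recommended for DataX) and that java is on the PATH; the function names are illustrative and not part of DataX.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output.

    Legacy releases report "1.8.0_181" (major 8); newer ones report "11.0.2".
    Returns None when no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Run `java -version` and parse the result.

    The JDK prints its version banner to stderr, so both streams are read.
    Raises FileNotFoundError if no java executable is on the PATH.
    """
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=False
    )
    return java_major_version(result.stderr + result.stdout)
```

Calling installed_java_major() on the target host and comparing the result with 8 gives a quick yes/no answer for the JDK prerequisite.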
1.2 Download
https://github.com/alibaba/DataX/blob/master/userGuid.md
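On that page, choose the precompiled DataX download.
1.3 Install by extracting the archive
tar -zxvf <archive-file>
1.4 Run
Change into the bin directory and run python datax.py -r {YOUR_READER} -w {YOUR_WRITER} to generate a template file.
1.4.1 Example: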
python datax.py -r mysqlreader -w mysqlwriter
Copy the generated JSON and fill in the parameters for your environment. The file must follow JSON syntax strictly.
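Because the job file has to be strictly valid JSON, it can help to check it before handing it to DataX. This is a minimal sketch using only Python's standard json module, assuming Python 3 is available where you edit the file; check_job_json is an illustrative name, not a DataX function.

```python
import json


def check_job_json(text):
    """Return None if text is valid JSON, else a message locating the error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return "line %d, column %d: %s" % (err.lineno, err.colno, err.msg)
    return None
```

For example, check_job_json(open("../job/my_job.json").read()) (the path is hypothetical) returns None for a well-formed file and a line/column message for problems such as a trailing comma or a leftover // note.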
1.4.2 The template produced by python datax.py -r {YOUR_READER} -w {YOUR_WRITER} differs depending on the source and target data sources.
For example, MySQL to MySQL (the // notes are annotations for the reader; standard JSON has no comment syntax):
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "column": ["id"],  // column names must be filled in by hand
            "connection": [
              {
                "jdbcUrl": ["jdbc:mysql://<ip-address>:3306/amc"],  // JDBC URL of the database
                "table": ["test1"]  // table names must be in double quotes
              }
            ],
            "password": "Ly2n7",  // required
            "username": "amcc",  // required
            "where": ""
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "column": ["id"],
            "connection": [
              {
                "jdbcUrl": "jdbc:mysql://<ip-address>:3306/amc",
                "table": ["test2"]
              }
            ],
            "password": "L1jhy2f5Iwn7",
            "preSql": [],
            "session": [],
            "username": "amcc",
            "writeMode": "insert"  // required
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": "3"
      }
    }
  }
}
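Filling in the template by hand is error-prone, so the same structure can also be generated. The sketch below mirrors the keys of the MySQL-to-MySQL template above, assuming Python 3; mysql_to_mysql_job is a hypothetical helper, the IP address and password are placeholders, and sharing one username and password between reader and writer is a simplification made for the example.

```python
import json


def mysql_to_mysql_job(src_url, src_table, dst_url, dst_table,
                       columns, username, password, channel=3):
    """Build a mysqlreader -> mysqlwriter job with the template's keys."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "column": list(columns),
            # In the template the reader takes a list of JDBC URLs.
            "connection": [{"jdbcUrl": [src_url], "table": [src_table]}],
            "username": username,
            "password": password,
            "where": "",
        },
    }
    writer = {
        "name": "mysqlwriter",
        "parameter": {
            "column": list(columns),
            # In the template the writer takes a single JDBC URL string.
            "connection": [{"jdbcUrl": dst_url, "table": [dst_table]}],
            "username": username,
            "password": password,
            "preSql": [],
            "session": [],
            "writeMode": "insert",
        },
    }
    return {
        "job": {
            "content": [{"reader": reader, "writer": writer}],
            "setting": {"speed": {"channel": str(channel)}},
        }
    }


job = mysql_to_mysql_job(
    "jdbc:mysql://127.0.0.1:3306/amc", "test1",
    "jdbc:mysql://127.0.0.1:3306/amc", "test2",
    ["id"], "amcc", "change-me",  # placeholder credentials
)
job_text = json.dumps(job, indent=2)
```

Writing job_text to a file produces comment-free JSON, which sidesteps the formatting mistakes that hand-editing tends to introduce.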
- Once the JSON file is configured, place it in DataX's job directory.
- From the bin directory, run: python datax.py ../job/<your-json-file>
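When jobs are launched from a scheduler or another script, the run step above can be wrapped in a small function. This is a sketch under assumptions: Python 3 for the wrapper, a hypothetical install path passed in as datax_home, and the expectation that the launcher exits with a non-zero code on failure, which is worth confirming on your installation.

```python
import os
import subprocess


def datax_command(datax_home, job_file, python_exe="python"):
    """Return the argument list equivalent to `python datax.py <job file>`."""
    launcher = os.path.join(datax_home, "bin", "datax.py")
    return [python_exe, launcher, job_file]


def run_job(datax_home, job_file):
    """Run the job and return the launcher's exit code.

    Assumption: 0 means success; check this against your DataX version.
    """
    completed = subprocess.run(datax_command(datax_home, job_file), check=False)
    return completed.returncode
```

Passing the command as a list avoids shell quoting problems when the job path contains spaces.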
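2. Building DataX with ClickHouse support
The official release does not support ClickHouse, so DataX has to be recompiled.
2.1 Install Maven
For Maven installation, see https://www.cnblogs.com/laoayi/p/12867990.html. Skip this step if Maven is already installed.
1. Download. Official site: http://maven.apache.org/download.cgi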
curl -O https://mirror.bit.edu.cn/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
2. Extract
tar -zxvf apache-maven-3.6.3-bin.tar.gz
3. Set the environment variables
vim /etc/profile
export MAVEN_HOME=/opt/maven/apache-maven-3.6.3
export PATH=$MAVEN_HOME/bin:$PATH
Change the Maven mirror address:
vim /opt/maven/apache-maven-3.6.3/conf/settings.xml
Add the following inside the <mirrors> element:
<mirror>
  <id>alimaven</id>
  <name>aliyun maven</name>
  <url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
  <mirrorOf>central</mirrorOf>
</mirror>
source /etc/profile   # reload the profile so the environment variables take effect
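To confirm that the mirror entry actually landed in settings.xml, the file can be parsed and its mirrors listed. The following is a sketch, assuming Python 3 and the standard xml.etree.ElementTree module; list_mirrors and local_name are illustrative names, and the sample string stands in for the real file.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def list_mirrors(settings_xml):
    """Return one dict per <mirror> element found in a settings.xml document."""
    root = ET.fromstring(settings_xml)
    mirrors = []
    for element in root.iter():
        if local_name(element.tag) != "mirror":
            continue
        mirrors.append({local_name(child.tag): (child.text or "").strip()
                        for child in element})
    return mirrors


sample = """<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <mirrors>
    <mirror>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
</settings>"""
found = list_mirrors(sample)
```

For the real file, read it in binary mode and pass the bytes, for example list_mirrors(open("/opt/maven/apache-maven-3.6.3/conf/settings.xml", "rb").read()). Mirror examples that are commented out in the file are ignored by the default parser.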
4. Verify the installation
mvn -version
[root@ambari-03 maven]# mvn -version
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /opt/maven/apache-maven-3.6.3
Java version: 1.8.0_181, vendor: Oracle Corporation, runtime: /usr/local/java/jdk/jdk1.8.0_181/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-957.el7.x86_64", arch: "amd64", family: "unix"
2.2 Download the DataX source code
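Install git. The command below installs the libraries git depends on; if git itself is missing, also run yum install -y git.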
yum install curl-devel expat-devel gettext-devel openssl-devel zlib-devel
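git --version   # check that git is installed
git clone git@github.com:alibaba/DataX.git   # download the DataX source; a permission error means you need to log in (the SSH URL requires a key registered with a GitHub account)
Alternatively, download the source with curl: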
curl -O https://gitee.com/jarynpl/DataX/repository/archive/master.zip
unzip master.zip   # extract
2.3 Package with Maven
$ cd {DataX_source_code_home}
$ mvn -U clean package assembly:assembly -Dmaven.test.skip=true
2.4 Handling build failures
See https://github.com/alibaba/datax/issues/676
[ERROR] Failed to execute goal on project clickhousewriter: Could not resolve dependencies for project com.alibaba.datax:clickhousewriter:jar:0.0.1-SNAPSHOT: Could not find artifact com.alibaba.datax:simulator:jar:0.0.1-SNAPSHOT in alimaven (http://maven.aliyun.com/nexus/content/repositories/central/) -> [Help 1]
The clickhousewriter module is the cause. In the pom.xml under the clickhousewriter directory, comment out the following dependency (it is only used for testing, and the module contains no unit tests):
<dependency>
  <groupId>com.alibaba.datax</groupId>
  <artifactId>simulator</artifactId>
  <version>${datax-project-version}</version>
  <scope>test</scope>
</dependency>
In addition, remove the ClickHouseTuple import on line 15 of ClickhouseWriter.java, because the class cannot be resolved:
import ru.yandex.clickhouse.ClickHouseTuple;
After these two changes, the build and packaging succeed.
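The packaged files and the tar archive are in the target directory. Copy the .gz archive and extract it, and it is ready to use.
The ClickHouse writer configuration looks like this: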
"writer": {
  "name": "clickhousewriter",
  "parameter": {
    "username": "admin",
    "password": "admin",
    "column": [
      "cldjcs", "sjyid", "clbs", "cph", "clls",
      "gps_rqsj", "gps_jd", "gps_wd", "gps_fx", "gps_sd", "gps_wxdw",
      "lxid", "lxbb", "sxx", "zdxlh", "jczzt", "ktcbj", "yyzt", "sjjjbjzt"
    ],
    "connection": [
      {
        "jdbcUrl": "jdbc:clickhouse://10.10.101.2:8123",
        "table": ["tcps_gps_and_zdsj"]
      }
    ],
    "batchSize": 65536,
    "batchByteSize": 134217728,
    "dryRun": false,
    "writeMode": "insert"
  }
}
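With a column list this long, a mismatch between the reader and the writer is easy to introduce. The sketch below compares the two lists before a job is run. It assumes Python 3, and it rests on the assumption (not stated in this guide) that DataX's relational writers expect the reader and writer to list the same number of columns; column_mismatch is an illustrative helper.

```python
def column_mismatch(reader_parameter, writer_parameter):
    """Describe a reader/writer column-count mismatch, or return None.

    A reader column list of ["*"] cannot be counted without the table
    schema, so that case is skipped rather than flagged.
    """
    reader_columns = reader_parameter.get("column", [])
    writer_columns = writer_parameter.get("column", [])
    if reader_columns == ["*"]:
        return None
    if len(reader_columns) != len(writer_columns):
        return "reader has %d columns, writer has %d" % (
            len(reader_columns), len(writer_columns))
    return None


# Column list copied from the clickhousewriter configuration above.
clickhouse_columns = [
    "cldjcs", "sjyid", "clbs", "cph", "clls",
    "gps_rqsj", "gps_jd", "gps_wd", "gps_fx", "gps_sd", "gps_wxdw",
    "lxid", "lxbb", "sxx", "zdxlh", "jczzt", "ktcbj", "yyzt", "sjjjbjzt",
]
```

Pass the "parameter" objects of the reader and writer from a loaded job file; a None result means no mismatch was detected by this simple count.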
3. Adding a Hive JDBC reader
3.1 Download DataX 3.0 locally and extract it
# I did this on Linux; the commands are:
wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
tar -zxvf datax.tar.gz
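3.2 Download the Hive driver
If you already have the Hive JDBC driver, upload it to the datax/plugin/reader/rdbmsreader/libs directory and go straight to step 3.3.
Otherwise, find the Hive JDBC driver installation page in the official Cloudera Manager documentation: https://docs.cloudera.com/documentation/enterprise/latest/topics/hive_jdbc_odbc_driver_install.html
Open the download page, pick the version that suits your environment, right-click to copy the link, then download and extract it on Linux.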
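# Change into the libs directory of the rdbmsreader plugin in the DataX directory extracted above
cd datax/plugin/reader/rdbmsreader/libs
# Download the driver on Linux using the link just copied
wget https://downloads.cloudera.com/connectors/ClouderaHiveJDBC-2.6.10.1012.zip
# Extract into the current directory
unzip ClouderaHiveJDBC-2.6.10.1012.zip
# Locate the Hive driver jar HiveJDBC41.jar in the extracted directory and copy it into libs
cp <extracted-directory>/HiveJDBC41.jar ./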
3.3 Modify the plugin.json file
cd datax/plugin/reader/rdbmsreader
vim plugin.json
Add the Hive drivers org.apache.hive.jdbc.HiveDriver and com.cloudera.impala.jdbc41.Driver to the file:
{
  "name": "rdbmsreader",
  "class": "com.alibaba.datax.plugin.reader.rdbmsreader.RdbmsReader",
  "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.",
  "developer": "alibaba",
  "drivers": [
    "com.cloudera.impala.jdbc41.Driver",
    "dm.jdbc.driver.DmDriver",
    "com.sybase.jdbc3.jdbc.SybDriver",
    "com.edb.Driver",
    "ru.yandex.clickhouse.ClickHouseDriver",
    "org.apache.hive.jdbc.HiveDriver"
  ]
}
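The driver list can also be patched by a script instead of in vim, which makes the change repeatable across hosts. This is a minimal sketch, assuming Python 3; add_drivers is a hypothetical helper that works on the parsed content of plugin.json, and the shortened original dictionary is only sample data.

```python
def add_drivers(plugin_config, new_drivers):
    """Return a copy of the plugin.json content with new_drivers appended.

    Drivers already present are left alone, so the call is safe to repeat.
    """
    updated = dict(plugin_config)
    drivers = list(updated.get("drivers", []))
    for driver in new_drivers:
        if driver not in drivers:
            drivers.append(driver)
    updated["drivers"] = drivers
    return updated


original = {
    "name": "rdbmsreader",
    "drivers": ["dm.jdbc.driver.DmDriver", "com.sybase.jdbc3.jdbc.SybDriver"],
}
patched = add_drivers(original, [
    "org.apache.hive.jdbc.HiveDriver",
    "com.cloudera.impala.jdbc41.Driver",
])
```

In practice the dictionary would come from json.load on plugin.json and be written back with json.dump; keeping a backup copy of the file first is a sensible precaution.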
3.4 Write the job file
Create a JSON file that reads from Hive as the data source and simply prints the extracted rows.
cd datax/job
vim hive_rdbms.json
{
  "job": {
    "setting": {
      "speed": {
        "channel": 1
      }
    },
    "content": [
      {
        "reader": {
          "name": "rdbmsreader",
          "parameter": {
            "username": "default",
            "password": "default",
            "column": ["*"],
            "connection": [
              {
                "table": ["bank_data"],
                "jdbcUrl": ["jdbc:hive2://ip:10000/default"]
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "fieldDelimiter": "\t",
            "print": "true"
          }
        }
      }
    ]
  }
}