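Generally speaking, Hadoop's MapReduce framework is aimed at very large data sets, and that is where it performs best. Huge numbers of small files can still be processed with Hadoop, but processing them directly is inefficient. Under the HDFS design, the metadata of every file is kept in NameNode memory (bookkeeping), so a large number of small files consumes a large amount of NameNode memory. "Small" here means far smaller than the default HDFS block size (64 MB in the Hadoop 1.x line used here), for example 1 KB to 2 MB. With files this small, a job may not be able to take full advantage of data locality, so large amounts of data end up being transferred across the cluster at considerable cost.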
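To make the NameNode cost concrete, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (one object per file plus one per block). That figure, the class name, and the file counts are illustrative assumptions, not measurements from this article.

```java
/** Rough NameNode heap estimate; the 150-byte figure is a rule of thumb, not a measurement. */
final class NameNodeMemoryEstimate {

    // Assumed heap cost per namespace object (one per file, one per block).
    static final long BYTES_PER_OBJECT = 150L;

    /** Estimated heap for fileCount files that each occupy blocksPerFile blocks. */
    static long estimateBytes(long fileCount, long blocksPerFile) {
        long objects = fileCount + fileCount * blocksPerFile;
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // Hypothetical workload: ten million files of about 1 MB, one block each.
        long small = estimateBytes(10_000_000L, 1);
        // The same data packed into 64 MB files, one block each: 10,000,000 / 64 = 156,250 files.
        long merged = estimateBytes(156_250L, 1);
        System.out.println("small files : " + small + " bytes");
        System.out.println("merged files: " + merged + " bytes");
    }
}
```

Under these assumptions the merged layout needs about 64 times less NameNode heap, simply because there are 64 times fewer files and blocks to track.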
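In practice, however, the need to process large numbers of small files is common. When using Hadoop for such workloads, we should consider turning the small data into large data, for example by merging and compressing it, which makes the workload a much better fit for a Hadoop cluster. Hadoop has some built-in solutions for this, and its API makes them easy to implement.
Below, we implement parallel processing of a large number of small files using a custom InputFormat and RecordReader.
The basic idea is as follows:
The small files are merged in the Mapper. Each record in the output file consists of two parts: the name of a small file and the content of that file.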
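The (file name, content) record layout can be tried out locally before involving Hadoop at all. The sketch below is a Hadoop-free illustration only: it uses a hypothetical length-prefixed container built on `DataOutputStream`, not the SequenceFile format, and the class and method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hadoop-free illustration of packing small files as (name, content) records. */
final class SmallFilePacker {

    /** Writes a record count, then each entry as: UTF name, int length, raw bytes. */
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(files.size());
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue().length);
                out.write(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the records back in the order they were written. */
    static Map<String, byte[]> unpack(byte[] packed) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        files.put("b.txt", "world!".getBytes(StandardCharsets.UTF_8));
        for (Map.Entry<String, byte[]> entry : unpack(pack(files)).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().length + " bytes");
        }
    }
}
```

The MapReduce job described in this article does the same thing at cluster scale, with the file name as the record key and the file bytes as the record value.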
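Implementation
We implement a WholeFileInputFormat to control the input to the Mapper; it reads each input file through a custom WholeFileRecordReader. Once the Map tasks finish, we write the Map output to HDFS unchanged, using the simplest possible IdentityReducer.
Here is what we need to implement:
- WholeFileRecordReader, which reads the content of each small file
- WholeFileInputFormat, which describes the input format for the small files
- WholeSmallfilesMapper, the Mapper implementation that merges the small files
- IdentityReducer, the Reducer implementation that writes out the merged files
- The job configuration that runs the merge of many small files into large files
Each of these is described in detail below.
As for the input key/value types: with small files, each file corresponds to one InputSplit, and reading that InputSplit amounts to reading the entire content of a file that occupies a single block. The whole file is read into a BytesWritable, that is, a byte array.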
    package org.shirdrn.kodz.inaction.hadoop.smallfiles.whole;

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit fileSplit;
        private JobContext jobContext;
        private NullWritable currentKey = NullWritable.get();
        private BytesWritable currentValue;
        // Becomes true once the single whole-file record has been produced.
        private boolean finishConverting = false;

        @Override
        public NullWritable getCurrentKey() throws IOException, InterruptedException {
            return currentKey;
        }

        @Override
        public BytesWritable getCurrentValue() throws IOException, InterruptedException {
            return currentValue;
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
            this.fileSplit = (FileSplit) split;
            this.jobContext = context;
            // Publish the file name so the Mapper can use it as the output key.
            context.getConfiguration().set("map.input.file", fileSplit.getPath().getName());
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!finishConverting) {
                currentValue = new BytesWritable();
                // The cast is safe only because the files are small (far below 2 GB).
                int len = (int) fileSplit.getLength();
                byte[] content = new byte[len];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(jobContext.getConfiguration());
                FSDataInputStream in = null;
                try {
                    in = fs.open(file);
                    // Read the entire file into the value in one go.
                    IOUtils.readFully(in, content, 0, len);
                    currentValue.set(content, 0, len);
                } finally {
                    IOUtils.closeStream(in);
                }
                finishConverting = true;
                return true;
            }
            return false;
        }

        @Override
        public float getProgress() throws IOException {
            float progress = 0;
            if (finishConverting) {
                progress = 1;
            }
            return progress;
        }

        @Override
        public void close() throws IOException {
            // Nothing to release: the stream is closed in nextKeyValue().
        }
    }
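When extending RecordReader, the core task is getting the iteration logic right. On each iteration the framework calls nextKeyValue() to find out whether another record is available; that method sets the current key and value directly, and getCurrentKey() and getCurrentValue() return them. Here each split yields exactly one record, the whole file, so nextKeyValue() returns true once and false afterwards.
In addition, we set "map.input.file" to the file name so that the Map task can read it back and write the file name as the output key.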
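The one-record-per-split behaviour can be modelled without Hadoop. The sketch below uses a hypothetical `SimpleRecordReader` interface as a stand-in for the real `RecordReader` class, and a `drain` loop that mimics how a map task keeps calling nextKeyValue() until it returns false. All names here are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hadoop-free model of a reader that yields exactly one record per split. */
final class OneShotReaderDemo {

    /** Minimal stand-in for the nextKeyValue()/getCurrentValue() contract (hypothetical). */
    interface SimpleRecordReader<V> {
        boolean nextKeyValue();
        V getCurrentValue();
    }

    /** Returns the whole content once, then reports that no records remain. */
    static final class WholeContentReader implements SimpleRecordReader<byte[]> {
        private final byte[] content;
        private byte[] currentValue;
        private boolean finished = false;

        WholeContentReader(byte[] content) {
            this.content = content;
        }

        @Override
        public boolean nextKeyValue() {
            if (finished) {
                return false;
            }
            currentValue = content.clone();
            finished = true;
            return true;
        }

        @Override
        public byte[] getCurrentValue() {
            return currentValue;
        }
    }

    /** Collects every record the reader produces, one loop iteration per record. */
    static <V> List<V> drain(SimpleRecordReader<V> reader) {
        List<V> records = new ArrayList<>();
        while (reader.nextKeyValue()) {
            records.add(reader.getCurrentValue());
        }
        return records;
    }

    public static void main(String[] args) {
        SimpleRecordReader<byte[]> reader =
                new WholeContentReader("file body".getBytes(StandardCharsets.UTF_8));
        System.out.println("records read: " + drain(reader).size());
        System.out.println("more records? " + reader.nextKeyValue());
    }
}
```

Because the reader produces a single record, the Mapper's map() method is invoked exactly once per small file.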
    package org.shirdrn.kodz.inaction.hadoop.smallfiles.whole;

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
            // One reader per split; each split is one whole small file.
            RecordReader<NullWritable, BytesWritable> recordReader = new WholeFileRecordReader();
            recordReader.initialize(split, context);
            return recordReader;
        }
    }
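This class is straightforward. After extending FileInputFormat we need to implement createRecordReader(), which returns the RecordReader used to read records from the file. We simply create an instance of the WholeFileRecordReader implemented above and call initialize() on it.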
    package org.shirdrn.kodz.inaction.hadoop.smallfiles.whole;

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WholeSmallfilesMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {

        // Reused across calls to avoid allocating a new Text per record.
        private Text file = new Text();

        @Override
        protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
            // The file name was stored by WholeFileRecordReader.initialize().
            String fileName = context.getConfiguration().get("map.input.file");
            file.set(fileName);
            // Emit (file name, whole file content).
            context.write(file, value);
        }
    }
    package org.shirdrn.kodz.inaction.hadoop.smallfiles;

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.Reducer;

    // K and V are ordinary type parameters; the job binds them to Text and BytesWritable.
    public class IdentityReducer<K, V> extends Reducer<K, V, K, V> {

        @Override
        protected void reduce(K key, Iterable<V> values, Context context) throws IOException, InterruptedException {
            // Pass every value through unchanged.
            for (V value : values) {
                context.write(key, value);
            }
        }
    }
This is the Reduce task implementation; it simply writes the Map output to HDFS unchanged.
    package org.shirdrn.kodz.inaction.hadoop.smallfiles.whole;

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;
    import org.shirdrn.kodz.inaction.hadoop.smallfiles.IdentityReducer;

    public class WholeCombinedSmallfiles {

        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: combinesmallfiles <in> <out>");
                System.exit(2);
            }

            Job job = new Job(conf, "combine smallfiles");

            job.setJarByClass(WholeCombinedSmallfiles.class);
            job.setMapperClass(WholeSmallfilesMapper.class);
            job.setReducerClass(IdentityReducer.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(BytesWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(BytesWritable.class);

            // Read whole files in, write binary SequenceFiles out.
            job.setInputFormatClass(WholeFileInputFormat.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);

            // Five reducers produce five merged output files.
            job.setNumReduceTasks(5);

            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

            int exitFlag = job.waitForCompletion(true) ? 0 : 1;
            System.exit(exitFlag);
        }
    }
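This is the program's entry point. It mainly configures the MapReduce job, so all we need to do is set the appropriate options. We set 5 Reduce tasks, so the job produces 5 output files.
Here the output format of the Reduce tasks is defined by SequenceFileOutputFormat, so the output is a SequenceFile, which is a binary file.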
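Since the job does not set a partitioner, records are spread over the five reduce tasks by Hadoop's default HashPartitioner, whose formula is `(hash & Integer.MAX_VALUE) % numReduceTasks`. The sketch below applies that formula outside Hadoop. It uses `String.hashCode()` as a stand-in for `Text.hashCode()`, and the file names are made up, so the partition numbers in a real job would differ.

```java
import java.util.Arrays;

/** Illustrates how keys are assigned to reduce tasks by hash partitioning. */
final class PartitionSketch {

    /** Same formula as the default HashPartitioner: mask the sign bit, then take the modulus. */
    static int partitionFor(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Counts how many keys land in each partition (String.hashCode() is a stand-in hash). */
    static int[] countPerPartition(String[] keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partitionFor(key.hashCode(), numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical small-file names used only for illustration.
        String[] names = new String[117];
        for (int i = 0; i < names.length; i++) {
            names[i] = "smallfile-" + i + ".dat";
        }
        System.out.println(Arrays.toString(countPerPartition(names, 5)));
    }
}
```

Each reduce task writes one part-r-NNNNN file, so the number of merged output files is controlled directly by setNumReduceTasks().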
Running the program
    jar -cvf combine-smallfiles.jar -C ./ org/shirdrn/kodz/inaction/hadoop/smallfiles
    xiaoxiang@ubuntu3:~$ cd /opt/stone/cloud/hadoop-1.0.3
    xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -mkdir /user/xiaoxiang/datasets/smallfiles
    xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -copyFromLocal /opt/stone/cloud/dataset/smallfiles/* /user/xiaoxiang/datasets/smallfiles
    xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop jar combine-smallfiles.jar org.shirdrn.kodz.inaction.hadoop.smallfiles.whole.WholeCombinedSmallfiles /user/xiaoxiang/datasets/smallfiles /user/xiaoxiang/output/smallfiles/whole
    13/03/23 14:09:24 INFO input.FileInputFormat: Total input paths to process : 117
    13/03/23 14:09:24 INFO mapred.JobClient: Running job: job_201303111631_0016
    13/03/23 14:09:25 INFO mapred.JobClient: map 0% reduce 0%
    13/03/23 14:09:40 INFO mapred.JobClient: map 1% reduce 0%
    13/03/23 14:09:46 INFO mapred.JobClient: map 3% reduce 0%
    13/03/23 14:09:52 INFO mapred.JobClient: map 5% reduce 0%
    13/03/23 14:09:58 INFO mapred.JobClient: map 6% reduce 0%
    13/03/23 14:10:04 INFO mapred.JobClient: map 8% reduce 0%
    13/03/23 14:10:10 INFO mapred.JobClient: map 10% reduce 0%
    13/03/23 14:10:13 INFO mapred.JobClient: map 10% reduce 1%
    13/03/23 14:10:16 INFO mapred.JobClient: map 11% reduce 1%
    13/03/23 14:10:22 INFO mapred.JobClient: map 13% reduce 1%
    13/03/23 14:10:28 INFO mapred.JobClient: map 15% reduce 1%
    13/03/23 14:10:34 INFO mapred.JobClient: map 17% reduce 1%
    13/03/23 14:10:40 INFO mapred.JobClient: map 18% reduce 2%
    13/03/23 14:10:46 INFO mapred.JobClient: map 20% reduce 2%
    13/03/23 14:10:52 INFO mapred.JobClient: map 22% reduce 2%
    13/03/23 14:10:58 INFO mapred.JobClient: map 23% reduce 2%
    13/03/23 14:11:04 INFO mapred.JobClient: map 25% reduce 3%
    13/03/23 14:11:10 INFO mapred.JobClient: map 27% reduce 3%
    13/03/23 14:11:16 INFO mapred.JobClient: map 29% reduce 3%
    13/03/23 14:11:22 INFO mapred.JobClient: map 30% reduce 3%
    13/03/23 14:11:28 INFO mapred.JobClient: map 32% reduce 3%
    13/03/23 14:11:34 INFO mapred.JobClient: map 34% reduce 4%
    13/03/23 14:11:40 INFO mapred.JobClient: map 35% reduce 4%
    13/03/23 14:11:46 INFO mapred.JobClient: map 37% reduce 4%
    13/03/23 14:11:52 INFO mapred.JobClient: map 39% reduce 4%
    13/03/23 14:11:58 INFO mapred.JobClient: map 41% reduce 5%
    13/03/23 14:12:04 INFO mapred.JobClient: map 42% reduce 5%
    13/03/23 14:12:10 INFO mapred.JobClient: map 44% reduce 5%
    13/03/23 14:12:16 INFO mapred.JobClient: map 46% reduce 5%
    13/03/23 14:12:22 INFO mapred.JobClient: map 47% reduce 5%
    13/03/23 14:12:25 INFO mapred.JobClient: map 47% reduce 6%
    13/03/23 14:12:28 INFO mapred.JobClient: map 49% reduce 6%
    13/03/23 14:12:34 INFO mapred.JobClient: map 51% reduce 6%
    13/03/23 14:12:40 INFO mapred.JobClient: map 52% reduce 6%
    13/03/23 14:12:46 INFO mapred.JobClient: map 54% reduce 7%
    13/03/23 14:12:52 INFO mapred.JobClient: map 56% reduce 7%
    13/03/23 14:12:58 INFO mapred.JobClient: map 58% reduce 7%
    13/03/23 14:13:04 INFO mapred.JobClient: map 59% reduce 7%
    13/03/23 14:13:10 INFO mapred.JobClient: map 61% reduce 7%
    13/03/23 14:13:13 INFO mapred.JobClient: map 61% reduce 8%
    13/03/23 14:13:16 INFO mapred.JobClient: map 63% reduce 8%
    13/03/23 14:13:22 INFO mapred.JobClient: map 64% reduce 8%
    13/03/23 14:13:28 INFO mapred.JobClient: map 66% reduce 8%
    13/03/23 14:13:34 INFO mapred.JobClient: map 68% reduce 8%
    13/03/23 14:13:40 INFO mapred.JobClient: map 70% reduce 9%
    13/03/23 14:13:46 INFO mapred.JobClient: map 71% reduce 9%
    13/03/23 14:13:52 INFO mapred.JobClient: map 73% reduce 9%
    13/03/23 14:13:58 INFO mapred.JobClient: map 75% reduce 9%
    13/03/23 14:14:04 INFO mapred.JobClient: map 76% reduce 9%
    13/03/23 14:14:10 INFO mapred.JobClient: map 78% reduce 10%
    13/03/23 14:14:16 INFO mapred.JobClient: map 80% reduce 10%
    13/03/23 14:14:22 INFO mapred.JobClient: map 82% reduce 10%
    13/03/23 14:14:28 INFO mapred.JobClient: map 83% reduce 10%
    13/03/23 14:14:34 INFO mapred.JobClient: map 85% reduce 10%
    13/03/23 14:14:37 INFO mapred.JobClient: map 85% reduce 11%
    13/03/23 14:14:40 INFO mapred.JobClient: map 87% reduce 11%
    13/03/23 14:14:46 INFO mapred.JobClient: map 88% reduce 11%
    13/03/23 14:14:52 INFO mapred.JobClient: map 90% reduce 11%
    13/03/23 14:14:58 INFO mapred.JobClient: map 92% reduce 12%
    13/03/23 14:15:04 INFO mapred.JobClient: map 94% reduce 12%
    13/03/23 14:15:10 INFO mapred.JobClient: map 95% reduce 12%
    13/03/23 14:15:16 INFO mapred.JobClient: map 97% reduce 12%
    13/03/23 14:15:22 INFO mapred.JobClient: map 99% reduce 12%
    13/03/23 14:15:28 INFO mapred.JobClient: map 100% reduce 13%
    13/03/23 14:15:37 INFO mapred.JobClient: map 100% reduce 26%
    13/03/23 14:15:40 INFO mapred.JobClient: map 100% reduce 39%
    13/03/23 14:15:49 INFO mapred.JobClient: map 100% reduce 59%
    13/03/23 14:15:52 INFO mapred.JobClient: map 100% reduce 79%
    13/03/23 14:15:58 INFO mapred.JobClient: map 100% reduce 100%
    13/03/23 14:16:03 INFO mapred.JobClient: Job complete: job_201303111631_0016
    13/03/23 14:16:03 INFO mapred.JobClient: Counters: 29
    13/03/23 14:16:03 INFO mapred.JobClient:   Job Counters
    13/03/23 14:16:03 INFO mapred.JobClient:     Launched reduce tasks=5
    13/03/23 14:16:03 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=491322
    13/03/23 14:16:03 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
    13/03/23 14:16:03 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
    13/03/23 14:16:03 INFO mapred.JobClient:     Launched map tasks=117
    13/03/23 14:16:03 INFO mapred.JobClient:     Data-local map tasks=117
    13/03/23 14:16:03 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=719836
    13/03/23 14:16:03 INFO mapred.JobClient:   File Output Format Counters
    13/03/23 14:16:03 INFO mapred.JobClient:     Bytes Written=147035685
    13/03/23 14:16:03 INFO mapred.JobClient:   FileSystemCounters
    13/03/23 14:16:03 INFO mapred.JobClient:     FILE_BYTES_READ=147032689
    13/03/23 14:16:03 INFO mapred.JobClient:     HDFS_BYTES_READ=147045529
    13/03/23 14:16:03 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=296787727
    13/03/23 14:16:03 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=147035685
    13/03/23 14:16:03 INFO mapred.JobClient:   File Input Format Counters
    13/03/23 14:16:03 INFO mapred.JobClient:     Bytes Read=147029851
    13/03/23 14:16:03 INFO mapred.JobClient:   Map-Reduce Framework
    13/03/23 14:16:03 INFO mapred.JobClient:     Map output materialized bytes=147036169
    13/03/23 14:16:03 INFO mapred.JobClient:     Map input records=117
    13/03/23 14:16:03 INFO mapred.JobClient:     Reduce shuffle bytes=145779618
    13/03/23 14:16:03 INFO mapred.JobClient:     Spilled Records=234
    13/03/23 14:16:03 INFO mapred.JobClient:     Map output bytes=147032074
    13/03/23 14:16:03 INFO mapred.JobClient:     CPU time spent (ms)=79550
    13/03/23 14:16:03 INFO mapred.JobClient:     Total committed heap usage (bytes)=19630391296
    13/03/23 14:16:03 INFO mapred.JobClient:     Combine input records=0
    13/03/23 14:16:03 INFO mapred.JobClient:     SPLIT_RAW_BYTES=15678
    13/03/23 14:16:03 INFO mapred.JobClient:     Reduce input records=117
    13/03/23 14:16:03 INFO mapred.JobClient:     Reduce input groups=117
    13/03/23 14:16:03 INFO mapred.JobClient:     Combine output records=0
    13/03/23 14:16:03 INFO mapred.JobClient:     Physical memory (bytes) snapshot=20658409472
    13/03/23 14:16:03 INFO mapred.JobClient:     Reduce output records=117
    13/03/23 14:16:03 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=65064620032
    13/03/23 14:16:03 INFO mapred.JobClient:     Map output records=117
    xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -ls /user/xiaoxiang/output/smallfiles/whole
    -rw-r--r-- 3 xiaoxiang supergroup 0 2013-03-23 14:15 /user/xiaoxiang/output/smallfiles/whole/_SUCCESS
    drwxr-xr-x - xiaoxiang supergroup 0 2013-03-23 14:09 /user/xiaoxiang/output/smallfiles/whole/_logs
    -rw-r--r-- 3 xiaoxiang supergroup 30161482 2013-03-23 14:15 /user/xiaoxiang/output/smallfiles/whole/part-r-00000
    -rw-r--r-- 3 xiaoxiang supergroup 30160646 2013-03-23 14:15 /user/xiaoxiang/output/smallfiles/whole/part-r-00001
    -rw-r--r-- 3 xiaoxiang supergroup 27647901 2013-03-23 14:15 /user/xiaoxiang/output/smallfiles/whole/part-r-00002
    -rw-r--r-- 3 xiaoxiang supergroup 30161567 2013-03-23 14:15 /user/xiaoxiang/output/smallfiles/whole/part-r-00003
    -rw-r--r-- 3 xiaoxiang supergroup 28904089 2013-03-23 14:15 /user/xiaoxiang/output/smallfiles/whole/part-r-00004

    xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -text /user/xiaoxiang/output/smallfiles/whole/part-r-00000 | cut -d " " -f 1
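As we can see, the Reducer phase produced 5 files, each of them a large file obtained by merging small files. If these files need further processing, implement a Mapper that does the required work and set its input path to the Reducer output path above. In this way, processing a large number of small files becomes processing a handful of large files, which lets us make full use of a Hadoop MapReduce cluster.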
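The figures in the listing and the job counters can be cross-checked with a little arithmetic. The sketch below sums the five part-file sizes from the listing and estimates how many input splits a follow-up job would see. The split estimate assumes a 64 MB block size and one split per block, which is a simplification of how FileInputFormat actually computes splits; the class and method names are made up for this example.

```java
/** Cross-checks the output listing against the job counters with simple arithmetic. */
final class MergeResultCheck {

    /** Adds up file sizes given in bytes. */
    static long total(long... sizes) {
        long sum = 0;
        for (long size : sizes) {
            sum += size;
        }
        return sum;
    }

    /** Estimated split count, assuming one split per block (a simplification). */
    static long splitsFor(long fileSize, long blockSize) {
        return Math.max(1, (fileSize + blockSize - 1) / blockSize);
    }

    public static void main(String[] args) {
        // Sizes of part-r-00000 .. part-r-00004 as shown in the listing.
        long[] parts = {30161482L, 30160646L, 27647901L, 30161567L, 28904089L};
        long blockSize = 64L * 1024 * 1024; // assumed HDFS block size

        long splits = 0;
        for (long part : parts) {
            splits += splitsFor(part, blockSize);
        }
        System.out.println("total output bytes: " + total(parts));
        System.out.println("estimated splits for a follow-up job: " + splits);
        System.out.println("map tasks in the merge job: 117");
    }
}
```

The total matches the HDFS_BYTES_WRITTEN counter (147035685), and under the stated assumptions a follow-up job would start with about 5 map tasks instead of the 117 launched here.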