Methods of the File class (Java class library documentation)

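Contents
FileInputFormat implementation classes
1. TextInputFormat
2. KeyValueTextInputFormat
3. NLineInputFormat
4. Hands-on: KeyValueTextInputFormat use case (code implementation: KVTextMapper, KVTextReducer, KVTextDriver)
5. Hands-on: NLineInputFormat use case (code implementation: NLineMapper, NLineReducer, NLineDriver)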

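FileInputFormat implementation classes

When a MapReduce program runs, its input may be line-based log files, binary files, database tables, and so on. So how does MapReduce read each of these kinds of data?

Common implementations of the abstract class FileInputFormat include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, and custom InputFormat classes.

The sections below introduce several of these implementations in turn.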

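1. TextInputFormat

TextInputFormat is the default FileInputFormat implementation.

It reads records one line at a time.

The key is the byte offset of the start of the line within the whole file, of type LongWritable. The value is the content of the line, excluding any line terminators (line feed and carriage return), of type Text. For example, suppose a split contains the following four text records:

Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise

Each record is represented as the following key/value pair:

(0,Rich learning form)
(19,Intelligent learning engine)
(47,Learning more convenient)
(72,From the real demand for more close to the enterprise)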
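The keys 0, 19, 47 and 72 above can be reproduced by hand. The following is a minimal plain-Java sketch, not Hadoop code: the class name TextOffsetSketch and the method lineOffsets are hypothetical, and it assumes UTF-8 text with a single '\n' after every line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; it does not use any Hadoop classes.
class TextOffsetSketch {

    // Returns the byte offset at which each line starts, assuming UTF-8 text
    // and exactly one '\n' terminator after every line (an assumption).
    static List<Long> lineOffsets(List<String> lines) {
        List<Long> offsets = new ArrayList<>();
        long position = 0;
        for (String line : lines) {
            offsets.add(position);
            // Advance past the line's bytes plus one byte for '\n'.
            position += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Rich learning form",
                "Intelligent learning engine",
                "Learning more convenient",
                "From the real demand for more close to the enterprise");
        List<Long> offsets = lineOffsets(lines);
        // Print each record as (offset,line), mirroring the pairs listed above.
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("(" + offsets.get(i) + "," + lines.get(i) + ")");
        }
    }
}
```

If the file used "\r\n" line endings instead, each offset after the first would grow by one extra byte per preceding line.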

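2. KeyValueTextInputFormat

Each line is one record, split by a separator into a key and a value. The separator can be set in the driver class with

conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");

The default separator is the tab character (\t).

The following example input is a split containing four records, where --> stands for a (horizontal) tab character.

line1 --> Rich learning form
line2 --> Intelligent learning engine
line3 --> Learning more convenient
line4 --> From the real demand for more close to the enterprise

Each record is represented as the following key/value pair:

(line1,Rich learning form)
(line2,Intelligent learning engine)
(line3,Learning more convenient)
(line4,From the real demand for more close to the enterprise)

Here the key is the Text sequence that precedes the tab on each line.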
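To make the key/value split concrete, here is a small plain-Java sketch of the rule: cut the line at the first separator; if there is no separator, the whole line becomes the key and the value is empty. The class KeyValueSplitSketch and the method splitKeyValue are hypothetical names used for illustration, and the sketch works on Java strings rather than on the raw bytes Hadoop handles.

```java
// Plain-Java illustration only; it does not use any Hadoop classes.
class KeyValueSplitSketch {

    // Splits a line at the FIRST occurrence of the separator.
    // If the separator is absent, the whole line is the key and the value is empty.
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        // Tab-separated record, as in the example above.
        String[] tabRecord = splitKeyValue("line1\tRich learning form", '\t');
        System.out.println("(" + tabRecord[0] + "," + tabRecord[1] + ")");

        // Space-separated record: only the first space splits the line.
        String[] spaceRecord = splitKeyValue("banzhang ni hao", ' ');
        System.out.println("(" + spaceRecord[0] + "," + spaceRecord[1] + ")");
    }
}
```

Because only the first separator counts, any later separators stay inside the value.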

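3. NLineInputFormat

With NLineInputFormat, the InputSplit handled by each map task is no longer divided by block but by the number of lines N specified for NLineInputFormat. That is, number of splits = total number of lines in the input file / N; if the division leaves a remainder, number of splits = quotient + 1.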
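The split-count rule is a ceiling division. The following plain-Java sketch shows the arithmetic; the class NLineSplitCount and the method numSplits are hypothetical names, not part of Hadoop.

```java
// Plain-Java illustration only; it does not use any Hadoop classes.
class NLineSplitCount {

    // Number of splits = ceil(totalLines / linesPerSplit), computed with
    // integer arithmetic: a non-zero remainder adds one extra split.
    static int numSplits(int totalLines, int linesPerSplit) {
        if (totalLines < 0 || linesPerSplit <= 0) {
            throw new IllegalArgumentException("totalLines must be >= 0 and linesPerSplit > 0");
        }
        return (totalLines + linesPerSplit - 1) / linesPerSplit;
    }

    public static void main(String[] args) {
        System.out.println(numSplits(4, 2)); // divides evenly: 2 splits
        System.out.println(numSplits(5, 2)); // remainder 1: quotient 2 plus 1 = 3 splits
    }
}
```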

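As an example, take the same four lines as above.

Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise

If N is 2, for instance, each input split contains two lines and two map tasks are started. One mapper receives the first two lines:

(0,Rich learning form)
(19,Intelligent learning engine)

The other mapper receives the remaining two lines:

(47,Learning more convenient)
(72,From the real demand for more close to the enterprise)

The keys and values here are the same as those produced by TextInputFormat.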
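Which lines end up with which mapper can be sketched in plain Java by cutting the list of lines into consecutive groups of N. The class NLineGroupingSketch and the method groupLines are hypothetical names; this only imitates the grouping and does not use Hadoop's split machinery.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; it does not use any Hadoop classes.
class NLineGroupingSketch {

    // Cuts the lines into consecutive groups of at most n lines each.
    // The last group may be shorter when the line count is not a multiple of n.
    static List<List<String>> groupLines(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            groups.add(new ArrayList<>(lines.subList(start, end)));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Rich learning form",
                "Intelligent learning engine",
                "Learning more convenient",
                "From the real demand for more close to the enterprise");
        List<List<String>> groups = groupLines(lines, 2);
        // Each group corresponds to the input of one map task.
        for (int i = 0; i < groups.size(); i++) {
            System.out.println("mapper " + (i + 1) + ": " + groups.get(i));
        }
    }
}
```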

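4. Hands-on: KeyValueTextInputFormat use case

Requirement: count the number of lines in the input file that share the same first word.

Input data:

banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang

Expected result:

banzhang 2
xihuan 2

Requirement analysis:

From the requirement, the separator must be set to a space and the input format to KeyValueTextInputFormat.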

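Code implementation

KVTextMapper

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /**
     * @Date 2020/7/9 22:34
     * @Version 10.21
     * @Author DuanChaojie
     */
    public class KVTextMapper extends Mapper<Text, Text, Text, IntWritable> {

        // Every line contributes a count of 1 for its key (the first word)
        IntWritable v = new IntWritable(1);

        @Override
        protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
            context.write(key, v);
        }
    }

KVTextReducer

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KVTextReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        IntWritable v = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum the counts for this key
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            v.set(sum);
            context.write(key, v);
        }
    }

KVTextDriver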

conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, " "); sets the separator

job.setInputFormatClass(KeyValueTextInputFormat.class); sets the input format

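    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class KVTextDriver {

        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
            Configuration conf = new Configuration();
            // Set the separator
            conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, " ");

            Job job = Job.getInstance(conf);
            job.setJarByClass(KVTextDriver.class);
            job.setMapperClass(KVTextMapper.class);
            job.setReducerClass(KVTextReducer.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.setInputPaths(job, new Path(args[0]));
            // Set the input format
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            boolean result = job.waitForCompletion(true);
            System.exit(result ? 0 : 1);
        }
    }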

Do not forget to set the input and output paths. The result matches the expected result.
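As a sanity check on the expected result, the same counting logic can be imitated in plain Java without a cluster. This is only a sketch under assumptions: the class FirstWordCountSketch and the method countByFirstWord are hypothetical, the map step is imitated by taking the text before the first space as the key, and the reduce step by summing in a sorted map.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; it does not use any Hadoop classes.
class FirstWordCountSketch {

    // Counts how many lines share the same first word.
    // "Map" step: the key is the text before the first space (the whole line if none).
    // "Reduce" step: the counts per key are summed; TreeMap keeps keys sorted.
    static Map<String, Integer> countByFirstWord(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            int pos = line.indexOf(' ');
            String key = pos < 0 ? line : line.substring(0, pos);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "banzhang ni hao",
                "xihuan hadoop banzhang",
                "banzhang ni hao",
                "xihuan hadoop banzhang");
        // Print one "key count" pair per line.
        countByFirstWord(lines).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```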

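5. Hands-on: NLineInputFormat use case

Requirement: count the occurrences of each word, with the number of splits determined by the number of lines in each input file. This case requires every three lines to go into one split.

Input data:

banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang banzhang ni hao
xihuan hadoop banzhang

Expected output: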

Number of splits:4

Requirement analysis:

Code implementation

NLineMapper

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class NLineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        private Text k = new Text();
        private LongWritable v = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // 1 Get one line
            String line = value.toString();

            // 2 Split it into words
            String[] splited = line.split(" ");

            // 3 Write out each word with a count of 1
            for (int i = 0; i < splited.length; i++) {
                k.set(splited[i]);
                context.write(k, v);
            }
        }
    }

NLineReducer

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class NLineReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        LongWritable v = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long sum = 0;

            // 1 Sum the counts
            for (LongWritable value : values) {
                sum += value.get();
            }
            v.set(sum);

            // 2 Write the result
            context.write(key, v);
        }
    }

NLineDriver

job.setInputFormatClass(NLineInputFormat.class); uses NLineInputFormat to split the input by record count

NLineInputFormat.setNumLinesPerSplit(job, 3); puts three records in each InputSplit

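    import java.io.IOException;
    import java.net.URISyntaxException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NLineDriver {

        public static void main(String[] args) throws IOException, URISyntaxException, ClassNotFoundException, InterruptedException {
            // 1 Get the job object
            Configuration configuration = new Configuration();
            Job job = Job.getInstance(configuration);

            // Assigning args here means no program arguments need to be passed.
            // Backslashes must be escaped; "\f" and "\t" would otherwise be read
            // as form feed and tab.
            args = new String[]{"E:\\file\\test.txt", "E:\\file\\output1"};

            // 2 Set the jar location and associate the mapper and reducer
            job.setJarByClass(NLineDriver.class);
            job.setMapperClass(NLineMapper.class);
            job.setReducerClass(NLineReducer.class);

            // 3 Set the map output key/value types
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);

            // 4 Set the final output key/value types
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);

            // 5 Set the input and output paths
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // 6 Put three records in each InputSplit
            NLineInputFormat.setNumLinesPerSplit(job, 3);

            // 7 Use NLineInputFormat to split the input by record count
            job.setInputFormatClass(NLineInputFormat.class);

            // 8 Submit the job
            job.waitForCompletion(true);
        }
    }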

The output matches the expected result!
