Hadoop - IO
- Input to the Mapper as files are read from the HDFS.
- Output from the Mapper that is spilled to local disk.
- Network I/O between the Reducer and Mapper, as the Reducer’s retrieve files from the Mapper nodes.
- Merge to local disk on the Reducer node as the partitions received from the Mapper nodes are fully sorted on the Reducer node.
- Reading back from the local disk as records are made available to the reduce method on the Reducer instance.
- Output from the Reducer- this is written back to the HDFS.
串行化
传输、存储都需要
Writable接口
Avro框架:IDL,版本支持,跨语言,JSON-linke
压缩
能够减少磁盘的占用空间和网络传输的量
Compressed Size, Speed, Splittable
gzip, bzip2, LZO, LZ4, Snappy
要比较各种压缩算法的压缩比和性能
重点:压缩和拆分一般是冲突的(压缩后的文件的block是不能很好地拆分独立运行,很多时候某个文件的拆分点是被拆分到两个压缩文件中,这时Map任务就无法处理,所以对于这些压缩,Hadoop往往是直接使用一个Map任务处理整个文件的分析)
Map的输出结果也可以进行压缩,这样可以减少Map结果到Reduce的传输的数据量,加快传输速率
完整性
磁盘和网络很容易出错,保证数据传输的完整性一般是通过CRC32这种校验法
每次写数据到磁盘前都验证一下,同时保存校验码
每次读取数据时,也验证校验码,避免磁盘问题
同时每个datanode都会定时检查每一个block的完整性
当发现某个block数据有问题时,也不是立刻报错,而是先去Namenode找一块该数据的完整备份进行恢复,不能恢复才报错