
Troubleshooting an EOFException in CDH Hadoop HDFS

Posted: 2023-05-04 03:32:50 · Reads: 207218 · Author: 3984

On a CDH Hadoop HDFS cluster, I checked the log of the DataNode that was throwing the exception:

2018-09-04 23:24:38,446 WARN org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: No block pool scanner found for block pool id: BP-21853433-xxxxxxxxx-1484835379573
2018-09-05 00:45:13,777 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0xa7b72c5217b7ac, containing 1 storage report(s), of which we sent 0. The reports had 6076010 total blocks and used 0 RPC(s). This took 1636 msec to generate and 1082 msecs for RPC and NN processing. Got back no commands.
2018-09-05 00:45:13,777 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.io.EOFException: End of File Exception between local host is: "xxxx"; destination host is: "xxx":53310; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
    at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
    at org.apache.hadoop.ipc.Client.call(Client.java:1473)
    at org.apache.hadoop.ipc.Client.call(Client.java:1400)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy12.blockReport(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:177)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:524)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:750)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:889)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)

The key is the EOFException. Following the link given in the exception message, http://wiki.apache.org/hadoop/EOFException, the documented causes are:

the remote service restarted unexpectedly; the network connection broke during the RPC (for example a service restart, or an HA failover switching active and standby); or the communication protocols don't match, e.g. the DataNode runs a different Hadoop version than the NameNode.

As for restarts, only the DataNode had restarted; the NameNode was untouched, and the whole cluster runs the same Hadoop version, so neither of those explanations fit. Out of leads, I went through the NameNode log and found the key message:

Requested data length 86483783 is longer than maximum configured RPC length 67108864

Combined with the DataNode's stack trace, this points to an IPC payload exceeding the configured maximum. Once the cause is known, the fix is straightforward.
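A quick back-of-the-envelope check against the two log lines above makes the failure threshold concrete (the per-block byte cost is derived from this one incident, not an official constant):

```python
# Figures taken from the DataNode and NameNode log lines quoted above.
blocks_reported = 6_076_010         # "The reports had 6076010 total blocks"
request_bytes = 86_483_783          # "Requested data length 86483783"
rpc_limit = 64 * 1024 * 1024        # default ipc.maximum.data.length (67108864)

# Average serialized size of one block entry in the report (~14 bytes here).
bytes_per_block = request_bytes / blocks_reported

# Estimated block count at which a full block report first exceeds the limit
# (~4.7 million blocks at this per-block cost).
breaking_point = int(rpc_limit / bytes_per_block)

print(f"{bytes_per_block:.1f} bytes/block; limit hit near {breaking_point:,} blocks")
```

So with roughly 6 million blocks on the DataNode, the report was well past what the default RPC limit allows.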

Solution

Modify the NameNode's hdfs-site.xml and add the following property:

<property>
  <name>ipc.maximum.data.length</name>
  <value>134217728</value>
</property>

On a CDH cluster, make this change through the Cloudera Manager (CM) home page instead, as shown in the screenshot below.

 

This raises the maximum IPC payload to 128 MB (the default is 64 MB). Finally, do a rolling restart of the NameNode, then restart the affected DataNode, and the problem is resolved.

Summary

The cluster in question is fairly small, but it holds an unusually large number of small files. From the logs, the DataNode first went down from memory pressure and then restarted. On restart it had to send a full block report to the NameNode; with so many blocks, the report's IPC payload exceeded the 64 MB limit, the NameNode rejected the call, and the DataNode hit the EOFException. Because the report never got through, the NameNode treated those blocks as missing.

Since the root cause was the sheer block count, I also looked at how block count maps to resource consumption. For a single DataNode holding roughly 5 million blocks, the cost is approximately:

about 6 GB of DataNode memory, and a one-shot full block report to the NameNode whose IPC payload is about 64 MB.

Going forward, HDFS maintenance can use the block count to size both the memory allocation and ipc.maximum.data.length appropriately.
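That sizing rule of thumb can be sketched as a small helper. The ratios below (~6 GB of heap and ~64 MB of block-report payload per ~5 million blocks) are ballpark figures observed on this one cluster, not official guidance:

```python
def estimate_for_blocks(block_count: int) -> dict:
    """Rough per-DataNode capacity estimates, derived from this incident's
    observation: ~5M blocks -> ~6 GB DataNode heap, ~64 MB full block report."""
    BYTES_PER_BLOCK_IN_REPORT = 64 * 1024 * 1024 / 5_000_000  # ~13.4 bytes/block
    HEAP_GB_PER_MILLION_BLOCKS = 6 / 5                        # ~1.2 GB per 1M blocks

    report_bytes = int(block_count * BYTES_PER_BLOCK_IN_REPORT)
    # Suggest an ipc.maximum.data.length: round the estimated report size up to
    # the next power of two, never below the 64 MB default.
    suggested_ipc = max(64 * 1024 * 1024, 1 << (report_bytes - 1).bit_length())
    return {
        "report_mb": report_bytes / 2**20,
        "suggested_ipc_max_data_length": suggested_ipc,
        "datanode_heap_gb": block_count / 1_000_000 * HEAP_GB_PER_MILLION_BLOCKS,
    }

# For the ~6M-block DataNode in this incident, this suggests the same 128 MB
# (134217728) value that fixed the problem, and a heap north of 7 GB.
est = estimate_for_blocks(6_076_010)
```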

 
