HBase -- Troubleshooting a Service Anomaly

A Lurking Crisis

  • One afternoon, I had just taken a stroll downstairs and was ambling back to my desk when several colleagues came looking for me: the HBase service was painfully slow. I pulled up the logs right away and, sure enough, they were full of insert-request timeouts. I then opened the HBase monitoring and spotted some suspicious messages, shown in the excerpts below (a hedged sketch of the kind of write path behind the first group of messages follows the log excerpt):
    INFO org.apache.hadoop.hbase.client.AsyncProcess: #61, waiting for some tasks to finish. Expected max=0, tasksInProgress=5
    INFO org.apache.hadoop.hbase.client.AsyncProcess: #54, waiting for some tasks to finish. Expected max=0, tasksInProgress=6
    INFO org.apache.hadoop.hbase.client.AsyncProcess: #18, waiting for some tasks to finish. Expected max=0, tasksInProgress=5
    INFO org.apache.hadoop.hbase.client.AsyncProcess: #62, waiting for some tasks to finish. Expected max=0, tasksInProgress=5
    INFO org.apache.hadoop.hbase.client.AsyncProcess: #46, waiting for some tasks to finish. Expected max=0, tasksInProgress=5
    INFO org.apache.hadoop.hbase.client.AsyncProcess: #70, waiting for some tasks to finish. Expected max=0, tasksInProgress=4
    INFO org.apache.hadoop.hbase.client.AsyncProcess: #45, waiting for some tasks to finish. Expected max=0, tasksInProgress=6
    INFO org.apache.hadoop.hbase.client.AsyncProcess: #47, waiting for some tasks to finish. Expected max=0, tasksInProgress=15


    INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 204 ms, current pipeline: [DatanodeInfoWithStorage[10.xx.xx.xx:50010,DS-47fdd2fe-7722-47d9-9518-151b49a59fc7,DISK], DatanodeInfoWithStorage[10.xx.xx.xx:50010,DS-f55c19d9-5494-47d0-a97b-5d0d07fc8a95,DISK], DatanodeInfoWithStorage[10.xx.xx.xx:50010,DS-4413d571-cab8-4b5f-bc7f-9cf11d2ec2e8,DISK]]
    INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 148 ms, current pipeline: [DatanodeInfoWithStorage[10.xx.xx.xx:50010,DS-47fdd2fe-7722-47d9-9518-151b49a59fc7,DISK], DatanodeInfoWithStorage[10.xx.xx.xx:50010,DS-f55c19d9-5494-47d0-a97b-5d0d07fc8a95,DISK], DatanodeInfoWithStorage[10.xx.xx.xx:50010,DS-4413d571-cab8-4b5f-bc7f-9cf11d2ec2e8,DISK]]
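
  • For context, the "Expected max=0, tasksInProgress=N" lines come from the HBase client's AsyncProcess: on a flush it waits until every in-flight batch has been handled by the RegionServers, so these messages pile up whenever the cluster cannot absorb writes fast enough. Below is a minimal sketch of the kind of bulk-write path that runs into this backpressure, assuming the HBase 1.x Java client; the table name, column family, and rowkey format are placeholders of mine, not details from the incident.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BulkWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 // "biz_table" is a placeholder; the real table is not named in the post
                 BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("biz_table"))) {

                for (int i = 0; i < 1_000_000; i++) {
                    // digit-prefixed rowkeys, matching the layout described later in the post
                    Put put = new Put(Bytes.toBytes(String.format("%02d_%08d", i % 100, i)));
                    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
                    mutator.mutate(put);   // buffered locally, sent to RegionServers in batches
                }

                // flush() waits for all in-flight batches; when RegionServers are slow
                // (e.g. "Slow sync cost" on the WAL), the client keeps logging
                // "AsyncProcess: ... waiting for some tasks to finish. Expected max=0, ..."
                mutator.flush();
            }
        }
    }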

Nerves on Edge

  • Things looked serious. Read and write requests had spiked, and after asking around I learned that one business team, whose earlier bug had left some critical data in a bad state, was running massive HBase read/write jobs to repair it. I asked them to stop at once so we could catch our breath and dig into the root cause. After they stopped, HBase recovered briefly, but before long the error messages were back. So the bulk job was only one of the causes; still, the first kind of message above had disappeared and the second warning had dropped off, and the HBase service itself was healthy again, just noisy with warnings.

  • So what was the cause? It did not look like HBase itself. In front of the HBase service sits a load-balancing chain ("edware load balancer -> WAF -> nginx"), and between them there is also an HBase interface service. I emailed ops to ask them to check whether the load-balancing layer was involved (a long shot), and in the meantime went straight on to inspect the HBase interface service.

  • Checking the service with the top command showed CPU at well over 200% and memory exhausted. Why was it suddenly using so much memory? Large objects??

A First Taste of Progress

  1. I checked the JVM parameters and nothing looked wrong, so I turned on GC logging and hoped it would reveal something:
    nohup java -server -Xms6144m -Xmx6144m  -XX:SurvivorRatio=2 -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+UseG1GC  
    -XX:InitiatingHeapOccupancyPercent=50 -XX:-UseSplitVerifier -XX:+PrintGCDateStamps -XX:+PrintGCDetails
    -Xloggc:/data/dubbox/hservice/logs/gc.log -jar ./$JAR_NAME >/dev/null 2>&1 &
  • The GC log looked like this:
    2017-03-07T15:01:14.767+0800: 7.071: [GC pause (young), 0.0502650 secs]
    [Parallel Time: 18.1 ms, GC Workers: 28]
    [Eden: 306.0M(306.0M)->0.0B(268.0M) Survivors: 0.0B->38.0M Heap: 307.0M(6144.0M)->38.8M(6144.0M)]
    2017-03-07T15:01:28.113+0800: 20.417: [GC pause (young), 0.0183790 secs]
    [Parallel Time: 9.2 ms, GC Workers: 28]
    [Eden: 268.0M(268.0M)->0.0B(282.0M) Survivors: 38.0M->24.0M Heap: 306.8M(6144.0M)->24.8M(6144.0M)]
    2017-03-07T15:01:45.162+0800: 37.466: [GC pause (young), 0.0114490 secs]
    [Parallel Time: 6.0 ms, GC Workers: 28]
    [Eden: 282.0M(282.0M)->0.0B(334.0M) Survivors: 24.0M->22.0M Heap: 306.8M(6144.0M)->22.3M(6144.0M)]
    2017-03-07T15:02:03.528+0800: 55.832: [GC pause (young), 0.0143630 secs]
    [Parallel Time: 7.2 ms, GC Workers: 28]
    [Eden: 334.0M(334.0M)->0.0B(1986.0M) Survivors: 22.0M->28.0M Heap: 356.3M(6144.0M)->28.2M(6144.0M)]
    2017-03-07T15:02:23.174+0800: 75.478: [GC pause (young), 0.2785950 secs]
    [Parallel Time: 268.8 ms, GC Workers: 28]
    [Eden: 212.0M(212.0M)->0.0B(242.0M) Survivors: 590.0M->64.0M Heap: 802.1M(6144.0M)->556.7M(6144.0M)]
    2017-03-07T15:02:28.132+0800: 80.435: [GC pause (young), 0.0534540 secs]
    [Parallel Time: 50.9 ms, GC Workers: 28]
    [Eden: 220.0M(220.0M)->0.0B(354.0M) Survivors: 110.0M->116.0M Heap: 1038.4M(6144.0M)->887.4M(6144.0M)]
    2017-03-07T15:02:34.622+0800: 86.926: [GC pause (young), 0.0927160 secs]
    [Parallel Time: 89.4 ms, GC Workers: 28]
    [Eden: 426.0M(426.0M)->0.0B(510.0M) Survivors: 200.0M->212.0M Heap: 1610.4M(6144.0M)->1311.4M(6144.0M)]
    2017-03-07T15:02:45.490+0800: 97.793: [GC pause (young), 0.1458780 secs]
    [Parallel Time: 141.0 ms, GC Workers: 28]
    [Eden: 566.0M(566.0M)->0.0B(606.0M) Survivors: 316.0M->304.0M Heap: 2359.5M(6144.0M)->1943.5M(6144.0M)]
    ///// within a few seconds the heap climbed right back up and would not stop
    2017-03-07T15:03:01.747+0800: 114.050: [GC pause (young), 0.1641290 secs]
    [Parallel Time: 157.9 ms, GC Workers: 28]
    [Eden: 700.0M(700.0M)->0.0B(760.0M) Survivors: 412.0M->408.0M Heap: 3441.2M(6144.0M)->2946.2M(6144.0M)]
    2017-03-07T15:03:15.536+0800: 127.840: [GC pause (young) (initial-mark), 0.1768610 secs]
    [Parallel Time: 169.8 ms, GC Workers: 28]
    [Eden: 932.0M(932.0M)->0.0B(966.0M) Survivors: 432.0M->452.0M Heap: 4532.2M(6144.0M)->3837.2M(6144.0M)]
    2017-03-07T15:03:26.259+0800: 138.563: [GC pause (young) (initial-mark), 0.1677280 secs]
    [Parallel Time: 160.8 ms, GC Workers: 28]
    [Eden: 654.0M(654.0M)->0.0B(532.0M) Survivors: 508.0M->438.0M Heap: 5016.6M(6144.0M)->4554.6M(6144.0M)]
    2017-03-07T15:03:44.313+0800: 156.616: [GC pause (young) (initial-mark) (to-space exhausted), 0.2589990 secs]
    [Parallel Time: 234.0 ms, GC Workers: 28]
    [Eden: 200.0M(200.0M)->0.0B(244.0M) Survivors: 106.0M->62.0M Heap: 6007.4M(6144.0M)->5946.0M(6144.0M)]
    2017-03-07T15:03:45.374+0800: 157.678: [GC pause (young) (to-space exhausted), 1.7079930 secs]
    [Parallel Time: 1572.6 ms, GC Workers: 28]
    [Eden: 196.0M(244.0M)->0.0B(306.0M) Survivors: 62.0M->0.0B Heap: 6142.0M(6144.0M)->6142.0M(6144.0M)]
    //// memory is growing far too fast; collection cannot keep up
    2017-03-07T15:04:10.598+0800: 182.902: [GC pause (young) (to-space exhausted), 1.0087070 secs]
    [Parallel Time: 905.9 ms, GC Workers: 28]
    [Eden: 204.0M(204.0M)->0.0B(258.0M) Survivors: 102.0M->48.0M Heap: 6038.3M(6144.0M)->6112.3M(6144.0M)]
    2017-03-07T15:04:11.725+0800: 184.029: [GC pause (young) (to-space exhausted), 0.6503080 secs]
    [Parallel Time: 614.5 ms, GC Workers: 28]
    [Eden: 28.0M(258.0M)->0.0B(306.0M) Survivors: 48.0M->0.0B Heap: 6140.3M(6144.0M)->6140.3M(6144.0M)]
    //// from here on nothing can be reclaimed at all
    2017-03-07T15:04:12.376+0800: 184.680: [GC pause (young), 0.0118450 secs]
    [Parallel Time: 10.4 ms, GC Workers: 28]
    [GC Worker Start (ms): Min: 184680.1, Avg: 184680.3, Max: 184680.6, Diff: 0.5]
    [Ext Root Scanning (ms): Min: 1.1, Avg: 1.5, Max: 2.0, Diff: 0.9, Sum: 41.8]
    [SATB Filtering (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.1]
    [Update RS (ms): Min: 7.8, Avg: 8.2, Max: 8.5, Diff: 0.6, Sum: 228.5]
    [Processed Buffers: Min: 3, Avg: 15.1, Max: 31, Diff: 28, Sum: 424]
    [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Object Copy (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum: 2.2]
    [Termination (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.4, Sum: 5.3]
    [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.9]
    [GC Worker Total (ms): Min: 9.7, Avg: 10.0, Max: 10.2, Diff: 0.5, Sum: 278.8]
    [GC Worker End (ms): Min: 184690.2, Avg: 184690.3, Max: 184690.4, Diff: 0.2]
    [Code Root Fixup: 0.0 ms]
    [Code Root Migration: 0.0 ms]
    [Clear CT: 0.5 ms]
    [Other: 0.9 ms]
    [Choose CSet: 0.0 ms]
    [Ref Proc: 0.7 ms]
    [Ref Enq: 0.0 ms]
    [Free CSet: 0.0 ms]
    [Eden: 0.0B(306.0M)->0.0B(306.0M) Survivors: 0.0B->0.0B Heap: 6140.3M(6144.0M)->6140.3M(6144.0M)]
    [Times: user=0.22 sys=0.00, real=0.01 secs]
    2017-03-07T15:04:12.389+0800: 184.693: [GC pause (young), 0.0034990 secs]
    [Parallel Time: 2.2 ms, GC Workers: 28]
    [GC Worker Start (ms): Min: 184692.9, Avg: 184693.0, Max: 184693.2, Diff: 0.3]
    [Ext Root Scanning (ms): Min: 1.1, Avg: 1.4, Max: 1.6, Diff: 0.4, Sum: 38.7]
    [SATB Filtering (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.1]
    [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Processed Buffers: Min: 0, Avg: 0.0, Max: 1, Diff: 1, Sum: 1]
    [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Object Copy (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.2, Sum: 2.2]
    [Termination (ms): Min: 0.0, Avg: 0.3, Max: 0.5, Diff: 0.5, Sum: 7.4]
    [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
    [GC Worker Total (ms): Min: 1.6, Avg: 1.7, Max: 1.9, Diff: 0.2, Sum: 48.6]
    [GC Worker End (ms): Min: 184694.8, Avg: 184694.8, Max: 184694.8, Diff: 0.0]
    [Code Root Fixup: 0.0 ms]
    [Code Root Migration: 0.0 ms]
    [Clear CT: 0.4 ms]
    [Other: 0.8 ms]
    [Choose CSet: 0.0 ms]
    [Ref Proc: 0.7 ms]
    [Ref Enq: 0.0 ms]
    [Free CSet: 0.0 ms]
    [Eden: 0.0B(306.0M)->0.0B(306.0M) Survivors: 0.0B->0.0B Heap: 6140.3M(6144.0M)->6140.3M(6144.0M)]
    [Times: user=0.05 sys=0.00, real=0.00 secs]
    2017-03-07T15:04:12.404+0800: 184.708: [Full GC

The GC log above made the problem obvious: heap usage surged and, toward the end, nothing could be reclaimed at all. Meanwhile, the reply from ops and my own tests ruled out network issues. At last there was a lead: the anomaly was pinned on the HBase interface service, most likely large objects, possibly compounded by lock contention or similar issues.

  1. Use jstack to inspect thread states and check whether locks were being held or released properly:

    [root@xiaoxiaomo ~]# jps  ## look up the PID
    110132 hservice.jar
    [root@xiaoxiaomo ~]# jstack 110132 > jstack110132.log ### nothing lock-related showed up in the output
  2. jmap on its own was not very illuminating, so the next step was to take a heap dump, download it locally, and analyze it with a dump analysis tool:

    [root@xiaoxiaomo ~]# jmap 110132 > jmap110132.log
    [root@xiaoxiaomo ~]# jmap -dump:format=b,file=dump2.bin 110132
  • The first dump I pulled was far too large, nearly 6 GB, and with only 8 GB of RAM on my laptop it simply would not open. The second time I took a smaller one, opened the analysis tool, and ran it.

  • This is the dump analysis tool I used.

  • Left-click the pie chart -> Java Basics, Thread Details, which brings up a view we are all fairly familiar with.

  • What we mainly want to look at is: Thread Overview and Stacks

    At this point the problem was more than half solved: from here I could find the table involved and then track down the caller of the interface.

The Truth Comes to Light

  • I tracked down the caller and found that, on that very day, they had released a new program to monitor HBase. Its health check issued a scan and checked whether the request succeeded. That was the culprit: the scan's startKey was 00 and its endKey was 99, while the rowkeys of their HBase table happen to be prefixed with digits, so every check pulled back an enormous result set. As a temporary fix I gave them a fixed rowKey and had them switch to a get; after that the GC looked normal again (a hedged before/after sketch follows).
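
  • For illustration, here is a hedged sketch of the before/after, assuming the HBase 1.x Java client; the table name and the fixed rowkey are placeholders of mine, since the post does not give them.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HealthCheckSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("their_table"))) { // placeholder name

                // BEFORE: rowkeys are digit-prefixed, so a scan from "00" to "99" covers
                // essentially the whole table and drags a huge result set into the
                // interface service's heap on every health check.
                Scan wideScan = new Scan();
                wideScan.setStartRow(Bytes.toBytes("00"));
                wideScan.setStopRow(Bytes.toBytes("99"));
                long rows = 0;
                try (ResultScanner scanner = table.getScanner(wideScan)) {
                    for (Result ignored : scanner) {
                        rows++; // every matching row streams back to the client
                    }
                }

                // AFTER: probe one known, fixed rowkey with a point Get instead.
                Get probe = new Get(Bytes.toBytes("00_healthcheck")); // placeholder fixed rowkey
                Result result = table.get(probe);
                System.out.println("scanned rows=" + rows + ", probe ok=" + !result.isEmpty());
            }
        }
    }

  • Even if a scan were kept for monitoring, capping it with Scan.setMaxResultSize and a small setCaching value (or a PageFilter) would bound how much a single probe can drag into the client heap.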
