Building a highly available Hadoop cluster with Docker

Storing the NameNode edit log on a JournalNode quorum and coordinating failover through ZooKeeper gives Hadoop high availability and removes the NameNode as a single point of failure.

Below is the procedure from my recent Docker experiments. It largely follows the hadoop-ha-docker project, with the following changes:

  • Switched the base image to CentOS 7
  • Set up passwordless SSH login to simplify access and management across the cluster
  • Removed the automation (it did not always work reliably); starting everything by hand the first time makes observation and debugging much easier
  • Because the base environment is Docker 1.10 with a Swarm cluster, the overlay network removes the need for a separate DNS
  • Recorded the startup process in detail
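
Before walking through the startup, it helps to confirm which HA settings the image carries. A quick, non-authoritative way to check from inside any of the Hadoop containers; the key names are the standard Hadoop 2.x HA properties, the nameservice id "cluster" matches the /hadoop-ha/cluster znode seen later, and the NameNode ids nn1/nn2 are assumptions:

$HADOOP_PREFIX/bin/hdfs getconf -confKey dfs.nameservices                  # e.g. cluster
$HADOOP_PREFIX/bin/hdfs getconf -confKey dfs.ha.namenodes.cluster          # e.g. nn1,nn2
$HADOOP_PREFIX/bin/hdfs getconf -confKey dfs.namenode.shared.edits.dir     # qjournal://... URI over the three JournalNodes
$HADOOP_PREFIX/bin/hdfs getconf -confKey dfs.ha.automatic-failover.enabled
$HADOOP_PREFIX/bin/hdfs getconf -confKey ha.zookeeper.quorum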

1. Create the image and containers

1.1 Image

docker build -t registry.mudan.com:5000/peony/hadoop .
docker push registry.mudan.com:5000/peony/hadoop

1.2 Containers

Together with the 5 ZooKeeper containers created in zookeeper-and-docker, we now create 2 NameNodes, 3 JournalNodes and 3 DataNodes.

# 2 namenode
sh nn.sh hadoop-nn1 dc00.mudan.com 192.168.4.7
sh nn.sh hadoop-nn2 dc04.mudan.com 192.168.4.8
# 3 journal
sh jn.sh hadoop-jn1 dc00.mudan.com 192.168.4.9
sh jn.sh hadoop-jn2 dc04.mudan.com 192.168.4.10
sh jn.sh hadoop-jn3 dc05.mudan.com 192.168.4.11
# 3 datanode
sh dn.sh hadoop-dn1 dc00.mudan.com 192.168.4.12
sh dn.sh hadoop-dn2 dc04.mudan.com 192.168.4.13
sh dn.sh hadoop-dn3 dc05.mudan.com 192.168.4.14

Note: to stay consistent with the production deployment, we use fixed IPs here. Docker's overlay network does not actually require fixed IPs; service discovery works directly via container names.
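
The nn.sh / jn.sh / dn.sh helpers are not reproduced here (they live in the linked repo). A minimal sketch of what such a script might contain, assuming the image from section 1.1, a user-defined overlay network, and classic Swarm node constraints; the network name and constraint syntax are assumptions, not the project's actual script:

# nn.sh <container-name> <swarm-node> <fixed-ip>   (hypothetical sketch)
NAME=$1; NODE=$2; IP=$3
docker run -d --name "$NAME" --hostname "$NAME" \
    --net=net04 --ip "$IP" \
    -e constraint:node=="$NODE" \
    registry.mudan.com:5000/peony/hadoop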

2. Start the Hadoop cluster

2.1 Format the ZooKeeper failover state (can be run from either NameNode)

From the host, log into the nn1 container

sdocker exec -it hadoop-nn1 bash

Format ZooKeeper

$HADOOP_HDFS_HOME/bin/hdfs zkfc -formatZK

16/03/31 17:13:27 INFO ha.ActiveStandbyElector: Session connected.
16/03/31 17:13:27 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/cluster in ZK.
16/03/31 17:13:27 INFO zookeeper.ZooKeeper: Session: 0x30010fe11d90001 closed
16/03/31 17:13:27 INFO zookeeper.ClientCnxn: EventThread shut down

2.2 Start the JournalNodes

Run on the host

sdocker exec -it hadoop-jn1 /usr/local/hadoop/sbin/hadoop-daemon.sh start journalnode
sdocker exec -it hadoop-jn2 /usr/local/hadoop/sbin/hadoop-daemon.sh start journalnode
sdocker exec -it hadoop-jn3 /usr/local/hadoop/sbin/hadoop-daemon.sh start journalnode

Log into one of them from the host to verify

sdocker exec -it hadoop-jn3 bash

Check the log inside the container

tail /usr/local/hadoop/logs/hadoop-root-journalnode-hadoop-jn3.log
2016-03-31 17:15:58,312 INFO org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8480
2016-03-31 17:16:08,443 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2016-03-31 17:16:08,460 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8485
2016-03-31 17:16:08,495 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2016-03-31 17:16:08,495 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8485: starting
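
Besides tailing the log, jps gives a quick confirmation that the JournalNode JVM is running; this assumes jps is on the PATH inside the image:

sdocker exec -it hadoop-jn1 jps
# a JournalNode entry should be listed alongside Jps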

2.3 Format the cluster's NameNode and start the freshly formatted NameNode

Inside the nn1 container

$HADOOP_PREFIX/bin/hadoop namenode -format

Output

16/03/31 17:19:55 WARN ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded
16/03/31 17:19:55 INFO namenode.FSImage: Allocated new BlockPoolId: BP-305947057-192.168.4.7-1459415995773
16/03/31 17:19:55 INFO common.Storage: Storage directory /mnt/hadoop/dfs/name has been successfully formatted.
16/03/31 17:19:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
16/03/31 17:19:56 INFO util.ExitUtil: Exiting with status 0
16/03/31 17:19:56 INFO namenode.NameNode: SHUTDOWN_MSG:

The format succeeded; now start this NameNode.

Run inside the nn1 container

$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode

Check the resulting log

tail /usr/local/hadoop/logs/hadoop-root-namenode-hadoop-nn1.log

2016-03-31 17:22:44,491 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2016-03-31 17:22:44,491 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8020: starting
2016-03-31 17:22:44,492 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode RPC up at: hadoop-nn1/192.168.4.7:8020
2016-03-31 17:22:44,492 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state
2016-03-31 17:22:44,494 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at hadoop-nn2/192.168.4.8:8020 every 120 seconds.
2016-03-31 17:22:44,498 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...
Checkpointing active NN at http://hadoop-nn2:50070
Serving checkpoints at http://hadoop-nn1:50070

Start the ZKFC process on nn1

$HADOOP_PREFIX/sbin/hadoop-daemon.sh start zkfc

tail /usr/local/hadoop/logs/hadoop-root-zkfc-hadoop-nn1.log

2016-03-31 17:27:20,495 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2016-03-31 17:27:20,502 INFO org.apache.hadoop.ha.ActiveStandbyElector: No old node to fence
2016-03-31 17:27:20,502 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/cluster/ActiveBreadCrumb to indicate that the local node is the most recent active...
2016-03-31 17:27:20,514 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at hadoop-nn1/192.168.4.7:8020 active...
2016-03-31 17:27:21,083 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at hadoop-nn1/192.168.4.7:8020 to active state

2.4 Sync NameNode1's metadata to NameNode2

From the host, log into the nn2 container

sdocker exec -it hadoop-nn2 bash

Run bootstrapStandby on nn2

$HADOOP_PREFIX/bin/hadoop namenode -bootstrapStandby

16/03/31 17:25:09 INFO namenode.TransferFsImage: Opening connection to http://hadoop-nn1:50070/imagetransfer?getimage=1&txid=0&storageInfo=-57:2142625186:0:CID-22a35ddb-646e-4104-be99-1b7cbc578a83
16/03/31 17:25:09 INFO namenode.TransferFsImage: Image Transfer timeout configured to 60000 milliseconds
16/03/31 17:25:09 INFO namenode.TransferFsImage: Transfer took 0.02s at 0.00 KB/s
16/03/31 17:25:09 INFO namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000000000 size 351 bytes.
16/03/31 17:25:09 INFO util.ExitUtil: Exiting with status 0
16/03/31 17:25:09 INFO namenode.NameNode: SHUTDOWN_MSG:

Once that succeeds, start the NameNode just as on nn1

$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode
tail /usr/local/hadoop/logs/hadoop-root-namenode-hadoop-nn2.log

2016-03-31 17:26:25,011 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2016-03-31 17:26:25,011 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8020: starting
2016-03-31 17:26:25,013 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode RPC up at: hadoop-nn2/192.168.4.8:8020
2016-03-31 17:26:25,013 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state
2016-03-31 17:26:25,015 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at hadoop-nn1/192.168.4.7:8020 every 120 seconds.
2016-03-31 17:26:25,019 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...
Checkpointing active NN at http://hadoop-nn1:50070
Serving checkpoints at http://hadoop-nn2:50070

and the ZKFC

$HADOOP_PREFIX/sbin/hadoop-daemon.sh start zkfc

tail /usr/local/hadoop/logs/hadoop-root-zkfc-hadoop-nn2.log
2016-03-31 17:28:15,782 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at hadoop-nn2/192.168.4.8:8020 entered state: SERVICE_HEALTHY
2016-03-31 17:28:15,821 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at hadoop-nn2/192.168.4.8:8020 should become standby
2016-03-31 17:28:15,832 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at hadoop-nn2/192.168.4.8:8020 to standby state
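
At this point one NameNode should be active and the other standby. A hedged way to confirm this from either NameNode container, assuming the NameNode ids are nn1 and nn2 (they must match dfs.ha.namenodes.<nameservice> in the image's config):

$HADOOP_PREFIX/bin/hdfs haadmin -getServiceState nn1
$HADOOP_PREFIX/bin/hdfs haadmin -getServiceState nn2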

2.5 Start all DataNodes

They can be started from the host

sdocker exec -it hadoop-dn1 /usr/local/hadoop/sbin/hadoop-daemon.sh start datanode
sdocker exec -it hadoop-dn2 /usr/local/hadoop/sbin/hadoop-daemon.sh start datanode
sdocker exec -it hadoop-dn3 /usr/local/hadoop/sbin/hadoop-daemon.sh start datanode

Enter a container to check its log

sdocker exec -it hadoop-dn1 bash

tail /usr/local/hadoop/logs/hadoop-root-datanode-hadoop-dn1.log

2016-03-31 17:31:23,791 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Periodic Block Verification Scanner initialized with interval 504 hours for block pool BP-305947057-192.168.4.7-1459415995773
2016-03-31 17:31:23,795 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Added bpid=BP-305947057-192.168.4.7-1459415995773 to blockPoolScannerMap, new size=1
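
As a final smoke test, the cluster can be exercised from nn1: list the live DataNodes and write/read a small file over HDFS. The file paths below are arbitrary examples:

sdocker exec -it hadoop-nn1 bash
$HADOOP_PREFIX/bin/hdfs dfsadmin -report | grep -A 1 'Live datanodes'
echo hello > /root/hello.txt
$HADOOP_PREFIX/bin/hdfs dfs -mkdir -p /tmp
$HADOOP_PREFIX/bin/hdfs dfs -put /root/hello.txt /tmp/
$HADOOP_PREFIX/bin/hdfs dfs -cat /tmp/hello.txt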

That's it! For the full documentation, please see:

https://github.com/iofdata/Docker/tree/master/hadoop

hbase region in transition too long

1. Problem description

Around 10:15 PM on March 30, 2016 we received alerts for low download volume and low crawler traffic and realized that HBase writes might be failing. The HBase monitoring page showed two HBase nodes down, so we immediately logged into the affected servers and restarted the HBase service.

2. Post-mortem analysis

A ZooKeeper network glitch caused the HBase master to take hbase01/hbase02 offline:

  • ZooKeeper connectivity flapped.
  • hbase02 could not obtain a route.
  • The HBase master marked it as a dead node and took it offline.

3. Secondary problem

While restarting hbase01 and hbase02, a region being handed off between hbase02 and hbase05 got stuck, so hbase02 could not come back up and the cluster could not run the balancer. At 1:15 AM on 2016-03-31 the cluster collapsed and the whole system went down. I restarted hbase05 and hbase02 around 6:30 AM and the problem was resolved.

hbase02 would not come up and kept logging hfile.LruBlockCache messages at a high rate.

The master reported: not running balancer because 1 region(s) in transition.

Tracing it back: at the moment the network dropped, the region on hbase05 was probably marked pending while in transition, but the connection was reset mid-transfer; hbase02 then went offline and restarted, and the state recorded in hbase02's own WAL no longer matched the state in ZooKeeper. It could not take over the region, could not bring its service up, and gradually dragged down the whole cluster.

hbase05 logged: Connection reset by peer.
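
For future incidents like this, regions stuck in transition can also be confirmed from the command line with the consistency checker that ships with HBase of this era; a hedged example:

hbase hbck | grep -iE 'inconsisten|transition'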

Creating a ZooKeeper cluster with Docker

This is based on dynamic-zookeeper-cluster-with-docker[^1], which supports dynamically adding ZooKeeper nodes; I mainly fixed the timezone and the way the server id is added.

1. Create the Dockerfile

Install the JDK and ZooKeeper

FROM ubuntu

# update time 
RUN echo "Asia/Shanghai" > /etc/timezone
RUN dpkg-reconfigure -f noninteractive tzdata

RUN apt-get update
RUN apt-get -y install wget bash vim && apt-get clean

# install java
RUN wget http://119.254.110.32:8081/download/jdk1.7.0_60.tar.gz \
   && tar -xvzf  jdk1.7.0_60.tar.gz \
   && mv jdk1.7.0_60 /usr/share/ \
   && rm -rf /usr/lib/jvm/java-1.7-openjdk \
   && mkdir -p /usr/lib/jvm/ \
   && ln -s /usr/share/jdk1.7.0_60 /usr/lib/jvm/java-1.7-openjdk \
   && rm -rf jdk1.7.0_60.tar.gz

ENV JAVA_HOME /usr/lib/jvm/java-1.7-openjdk/

RUN apt-get -y install git ant && apt-get clean

# install zookeeper
RUN mkdir /tmp/zookeeper
WORKDIR /tmp/zookeeper
RUN git clone https://github.com/apache/zookeeper.git .
RUN git checkout release-3.5.1-rc2
RUN ant jar
RUN cp /tmp/zookeeper/conf/zoo_sample.cfg \
    /tmp/zookeeper/conf/zoo.cfg
RUN echo "standaloneEnabled=false" >> /tmp/zookeeper/conf/zoo.cfg
RUN echo "dynamicConfigFile=/tmp/zookeeper/conf/zoo.cfg.dynamic" >> /tmp/zookeeper/conf/zoo.cfg
ADD zk-init.sh /usr/local/bin/
ENTRYPOINT ["/usr/local/bin/zk-init.sh"]

ZooKeeper init script zk-init.sh

It takes the node's own id and IP, plus the IP of the first ZooKeeper node when joining an existing ensemble.

#!/bin/bash

MYID=$1
MYIP=$2
ZK=$3
IPADDRESS=$MYIP

cd /tmp/zookeeper

if [ -n "$ZK" ];then
  output=`./bin/zkCli.sh -server $ZK:2181 \
      get /zookeeper/config | grep ^server`
  #echo $output >> /tmp/zookeeper/conf/zoo.cfg.dynamic
  for i in $output; do echo $i >> /tmp/zookeeper/conf/zoo.cfg.dynamic; done
  echo "server.$MYID=$IPADDRESS:2888:3888:observer;2181" \
      >> /tmp/zookeeper/conf/zoo.cfg.dynamic
  cp /tmp/zookeeper/conf/zoo.cfg.dynamic \
      /tmp/zookeeper/conf/zoo.cfg.dynamic.org
  /tmp/zookeeper/bin/zkServer-initialize.sh \
      --force --myid=$MYID
  ZOO_LOG_DIR=/var/log
  ZOO_LOG4J_PROP='INFO,CONSOLE,ROLLINGFILE'
  /tmp/zookeeper/bin/zkServer.sh start
  /tmp/zookeeper/bin/zkCli.sh -server $ZK:2181 reconfig \
      -add "server.$MYID=$IPADDRESS:2888:3888:participant;2181"
  /tmp/zookeeper/bin/zkServer.sh stop
  ZOO_LOG_DIR=/var/log
  ZOO_LOG4J_PROP='INFO,CONSOLE,ROLLINGFILE'
  /tmp/zookeeper/bin/zkServer.sh start-foreground
else
  echo "server.$MYID=$IPADDRESS:2888:3888;2181" \
      >> /tmp/zookeeper/conf/zoo.cfg.dynamic
  /tmp/zookeeper/bin/zkServer-initialize.sh --force --myid=$MYID
  ZOO_LOG_DIR=/var/log
  ZOO_LOG4J_PROP='INFO,CONSOLE,ROLLINGFILE'
  /tmp/zookeeper/bin/zkServer.sh start-foreground
fi

2. Build the image

docker build -t peony/zk:2 .

3. Start the containers

Test script start-zk-2.sh, which starts three nodes

docker rm -f zk01 zk02 zk03
docker run -d --net=net04 --name zk01 --add-host \
    zk01:192.168.4.2 --hostname zk01.mudan.com \
    peony/zk:2 1 192.168.4.2
docker run -d --net=net04 --name zk02 --add-host \
    zk02:192.168.4.3 --hostname zk02.mudan.com \
    peony/zk:2 2 192.168.4.3 192.168.4.2
docker run -d --net=net04 --name zk03 --add-host \
    zk03:192.168.4.4 --hostname zk03.mudan.com \
    peony/zk:2 3 192.168.4.4 192.168.4.2
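
Once the three containers are up, the ensemble can be checked from the host; the container names and the address of zk01 follow the script above:

docker exec zk01 /tmp/zookeeper/bin/zkCli.sh -server 192.168.4.2:2181 get /zookeeper/config
docker exec zk03 /tmp/zookeeper/bin/zkServer.sh status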

4. References:

[^1]:dynamic zookeeper cluster with docker

Service discovery with Docker and Consul

1. DC01 192.168.9.8

Pull the image and generate the startup command

$ docker pull progrium/consul
$ mkdir ~/consul
$ docker run --rm progrium/consul cmd:run 192.168.9.8 -d -v ~/consul:/data

Run the command generated above

docker run --name consul -h $HOSTNAME \
    -p 192.168.9.8:8300:8300 \
    -p 192.168.9.8:8301:8301 \
    -p 192.168.9.8:8301:8301/udp \
    -p 192.168.9.8:8302:8302 \
    -p 192.168.9.8:8302:8302/udp \
    -p 192.168.9.8:8400:8400 \
    -p 192.168.9.8:8500:8500 \
    -p 172.17.0.1:53:53  \
    -p 172.17.0.1:53:53/udp \
    -d -v /home/ubuntu/consul:/data \
    progrium/consul -server -advertise 192.168.9.8 -bootstrap-expect 3 -ui-dir /ui

Verify this node's information

#$ curl localhost:8500/v1/catalog/nodes
$ curl dc01:8500/v1/catalog/nodes
$ dig @0.0.0.0 -p 8600 node1.node.consul

2. DC02 192.168.9.253

$ docker pull progrium/consul
$ mkdir ~/consul
$ docker run --rm progrium/consul cmd:run 192.168.9.253::192.168.9.8 -d -v ~/consul:/data
docker run --name consul -h $HOSTNAME \
    -p 192.168.9.253:8300:8300 \
    -p 192.168.9.253:8301:8301 \
    -p 192.168.9.253:8301:8301/udp \
    -p 192.168.9.253:8302:8302 \
    -p 192.168.9.253:8302:8302/udp \
    -p 192.168.9.253:8400:8400 \
    -p 192.168.9.253:8500:8500 \
    -p 172.17.0.1:53:53 \
    -p 172.17.0.1:53:53/udp \
    -d -v /home/ubuntu/consul:/data \
    progrium/consul -server -advertise 192.168.9.253 -join 192.168.9.8

3. DC03 192.168.9.252

$ docker pull progrium/consul
$ mkdir ~/consul
$ $(docker run --rm progrium/consul cmd:run 192.168.9.252::192.168.9.8 -d -v ~/consul:/data)

4. Verify from DC01 192.168.9.8

$ curl dc01:8500/v1/catalog/nodes
[{"Node":"dc01.mudan.com","Address":"192.168.9.8"},{"Node":"dc02.mudan.com","Address":"192.168.9.253"},{"Node":"dc03.mudan.com","Address":"192.168.9.252"}]
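
The same catalog is reachable from any member, and the DNS endpoint published on the Docker bridge (172.17.0.1, per the run commands above) resolves Consul services; two hedged examples, assuming dc02 resolves the same way dc01 does:

$ curl dc02:8500/v1/catalog/nodes
$ dig @172.17.0.1 consul.service.consul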

Consul references

https://hub.docker.com/r/progrium/consul/
http://jlordiales.me/2015/02/03/registrator/
http://artplustech.com/docker-consul-dns-registrator/
https://www.spirulasystems.com/blog/2015/06/25/building-an-automatic-environment-using-consul-and-docker-part-1/
https://docs.docker.com/v1.5/swarm/discovery/
http://tonybai.com/2015/07/06/implement-distributed-services-registery-and-discovery-by-consul/

Setting up a Docker registry

DC01-192.168.9.8

1. Working directory

$ mkdir -p /home/ubuntu/registry
$ cd /home/ubuntu/registry
# sudo docker run -d -p 5000:5000 -v `pwd`/data:/var/lib/registry --restart=always --name registry registry:2

2. Self-signed TLS certificate

$ mkdir certs
$ openssl req -newkey rsa:2048 -nodes -sha256 -keyout certs/registry.mudan.com.key -x509 -days 3650 -out certs/registry.mudan.com.crt
Country Name (2 letter code) [AU]:CN
State or Province Name (full name) [Some-State]:HB
Locality Name (eg, city) []:Wuhan
Organization Name (eg, company) [Internet Widgits Pty Ltd]:PEONY
Organizational Unit Name (eg, section) []:DATA
Common Name (e.g. server FQDN or YOUR name) []:registry.mudan.com
Email Address []:peony_wh@163.com

Restart the registry with TLS

$ docker stop registry
$ docker rm registry
$ docker run -d -p 5000:5000 --restart=always --name registry \
  -v `pwd`/data:/var/lib/registry \
  -v `pwd`/certs:/certs \
  -e REGISTRY_HTTP_TLS_CERTIFICATE=/certs/registry.mudan.com.crt \
  -e REGISTRY_HTTP_TLS_KEY=/certs/registry.mudan.com.key \
  registry:2
$ sudo vi /etc/hosts
192.168.9.8 registry.mudan.com registry

Copy the certificate into Docker's trust store

$ sudo mkdir -p /etc/docker/certs.d/registry.mudan.com:5000
$ sudo cp certs/registry.mudan.com.crt /etc/docker/certs.d/registry.mudan.com:5000/ca.crt
$ sudo service docker restart

Push an image

docker pull busybox:latest
docker tag busybox:latest registry.mudan.com:5000/peony/busybox:latest
docker push registry.mudan.com:5000/peony/busybox
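
The push can be verified against the registry's v2 API; the --cacert flag points at the self-signed certificate generated above:

curl --cacert certs/registry.mudan.com.crt https://registry.mudan.com:5000/v2/_catalog
# the pushed repository (peony/busybox) should appear in the "repositories" list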

3. Other nodes

DC03 192.168.9.252

$ sudo mkdir -p /etc/docker/certs.d/registry.mudan.com:5000
$ sudo scp ubuntu@192.168.9.8:/home/ubuntu/registry/certs/registry.mudan.com.crt \
    /etc/docker/certs.d/registry.mudan.com:5000/
$ docker pull registry.mudan.com:5000/peony/busybox
$ docker images

4. Account login: TODO
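
Not done yet; a hedged sketch of how basic auth could be added using the registry's built-in htpasswd support, following the registry:2 documentation of the time (the user name and password below are placeholders):

$ mkdir auth
$ docker run --rm --entrypoint htpasswd registry:2 -Bbn someuser somepassword > auth/htpasswd
$ docker stop registry && docker rm registry
$ docker run -d -p 5000:5000 --restart=always --name registry \
  -v `pwd`/data:/var/lib/registry \
  -v `pwd`/certs:/certs \
  -v `pwd`/auth:/auth \
  -e REGISTRY_HTTP_TLS_CERTIFICATE=/certs/registry.mudan.com.crt \
  -e REGISTRY_HTTP_TLS_KEY=/certs/registry.mudan.com.key \
  -e REGISTRY_AUTH=htpasswd \
  -e "REGISTRY_AUTH_HTPASSWD_REALM=Registry Realm" \
  -e REGISTRY_AUTH_HTPASSWD_PATH=/auth/htpasswd \
  registry:2
$ docker login registry.mudan.com:5000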

References

https://github.com/docker/distribution/blob/master/docs/deploying.md

https://github.com/docker/distribution/blob/master/docs/configuration.md#storage

http://seanlook.com/2014/11/13/deploy-private-docker-registry-with-nginx-ssl/

http://tonybai.com/

ElasticSearch search queue size is too high

We run two ElasticSearch clusters for document indexing and search: a large 20-node cluster holding the full data set, and a small 12-node cluster holding roughly the last week of data. ES2 refers to the small cluster.

1 Problem description

CPU load on ES2-1 was high, its search queue was backing up, and cluster queries were far too slow.

1.1 High CPU load and search queue backlog

The search queue size on ES2-1 was noticeably higher than on the other nodes; ES2-2 is shown for comparison. (Monitoring screenshots of ES2-1 and ES2-2 omitted.)

1.2 Query latency

Queries were taking more than 50 seconds.


1.3 Server logs

From February 4 onward, the volume of indexing and search slowlog entries grew.

Relevant log files (excerpts omitted):

  • es_peony_sindex2_index_indexing_slowlog.log.2016-02-14
  • es_peony_sindex2_index_search_slowlog.log.2016-02-14 (index.search.slowlog.query entries)
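
The per-node backlog can also be confirmed from the command line with the cat API available in the ES versions of that period; the host below is a placeholder for any node in ES2:

curl -s 'http://es2-1:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected'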

2 Resolution

Take ES2-1 out of service and kill the ES process on that node.

2.1 Query latency

3 Post-mortem analysis

  1. Restarting ES2-1 did not fix the problem: after the restart and rebalancing, the node became the bottleneck again, so we decided to take it offline for the time being.
  2. QingCloud reported that the physical host's CPU load was normal, so the issue was internal to this VM's processes.
  3. Our guess is that bulk creation of new indices blocked the queues; indexing_slowlog entries started appearing after February 4.

4 Fixed!

On the morning of February 17 we reformatted ES2-1's disk, brought it back as a brand-new node, restarted ES, and everything returned to normal.

Reference

http://kibana.logstash.es/content/elasticsearch/performance/cluster-state.html

Cleaning up MySQL partitions

Problem: one node of the MySQL cluster (appdb05) ran out of disk space. The partition data needs to be purged while preserving some customers' data.

Approach: stop writes first, export the data to keep, truncate the partitions, re-import the exported data, and re-enable the refresh jobs once everything checks out.

To watch: the slave (appdb12) is also out of disk space, so replication may have broken and the data may be inconsistent.
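
A hedged way to check that open replication question on the slave once space is freed (host name as labeled above; credentials omitted):

mysql -h appdb12 -uroot -p -e 'SHOW SLAVE STATUS\G' \
  | egrep 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_.*Error'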

step 1 Disable refresh for the affected subjects

@5.23

mysql -h 192.168.5.23 peony_t -uroot -p
mysql > update pe_t_subject set state = 0,update_time = now() where id in \
(4620,4849,4850,4852,4853,4854,4855,4858,4859,4860,4861,\
4862,4865,4866,4875,4876,4877,4879,4880,4881,4882,4883,\
4884,4885,4888,5034,5079,5081,5082,5083,5101,5102,5103,\
5104,5162);
mysql> update pe_t_subject set state = 0,\
update_time = now() where id in (3418,3419);

step 2 Make sure no more data is being written

@5.5

mysql> use peony_m_63;
mysql> select count(*) from pe_t_subject_page where userId=1526 AND publishDate<'2016-01-01';
+----------+
| count(*) |
+----------+
|    53226 |
+----------+
1 row in set (0.05 sec)
[root@i-cphylyv8 ~]# ll -rt /home/mysql3306/peony_m_63

step 3 Back up the data to keep

@5.5

mysqldump --host=192.168.5.5 --user=*** --password=*** \
--no-create-info --where="publishDate<'2016-01-01' AND \
userId=1526" peony_m_63 pe_t_subject_page \
>1526.2016-01-01.sql
mysqldump --host=192.168.5.5 --user=*** --password=*** \
--no-create-info --where="publishDate<'2016-01-01' AND \
userId=496" peony_m_63 pe_t_subject_page \
>496.2016-01-01.sql

step 4 Truncate the partitions and re-import the retained subject data

On @5.5: truncate the partition data, then import the retained subject data back in

mysql> alter table pe_t_subject_page truncate partition \
p24_1,p24_2,p25_1,p25_2;
mysql> select count(*) from  pe_t_subject_page where \
userId=1526 AND publishDate<'2016-01-01';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.10 sec)
mysql> source 1526.2016-01-01.sql;
mysql> source 496.2016-01-01.sql;
mysql> select count(*) from  pe_t_subject_page where \
userId=1526 AND publishDate<'2016-01-01';
+----------+
| count(*) |
+----------+
|    53226 |
+----------+
1 row in set (0.04 sec)

step 5 Re-enable subject refresh

@5.23

update pe_t_subject set state = 1,update_time = now() \
where id in
(4620,4849,4850,4852,4853,4854,4855,4858,4859,4860,\
4861,4862,4865,4866,4875,4876,4877,4879,4880,4881,\
4882,4883,4884,4885,4888,5034,5079,5081,5082,5083,\
5101,5102,5103,5104,5162);
update pe_t_subject set state = 1,update_time = now() \
where id in (3418,3419);

GitLab usage notes

These are brief configuration notes written for the team after GitLab was set up.

Step1: Use ssh-keygen to generate a new key pair gitlab_rsa / gitlab_rsa.pub

cd ~/.ssh
ssh-keygen -t rsa -C "tanhao2013@foxmail.com" -f gitlab_rsa  # your email; save the key as gitlab_rsa

step1

Step2: Add the SSH public key to GitLab

cat gitlab_rsa.pub

step2

Step3: Modify your ~/.ssh/config

step3
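
The screenshot is not reproduced here; roughly, the entry to add looks like the following (host name taken from the test in Step4, key path from Step1 — adjust to your setup):

cat >> ~/.ssh/config <<'EOF'
Host gitlab.mudan.com
    User git
    IdentityFile ~/.ssh/gitlab_rsa
EOF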

Step4: Testing

ssh-agent bash
ssh-add ~/.ssh/gitlab_rsa
# Input the password you set at step1
# Enter passphrase for /root/.ssh/gitlab_rsa:
# Identity added: /root/.ssh/gitlab_rsa (/root/.ssh/gitlab_rsa)
ssh -T git@gitlab.mudan.com

Step5: Use it

cd workdir
git config --global user.name "tanhao"
git config --global user.email "tanhao2013@foxmail.com"
mkdir test
cd test
git init
touch README.md
git add README.md
git commit -m "first commit"
git remote add origin ssh://git@*.*.*.*:1020/tanhao/test.git
git push -u origin master

Step6: Usage on Windows

http://www.showerlee.com/archives/1300

Step7: Usage with Eclipse

http://blog.csdn.net/luckarecs/article/details/7427605

Step8: Multi-user collaboration and workflow

http://herry2013git.blog.163.com/blog/static/219568011201341111240751/
http://www.liaoxuefeng.com/wiki/0013739516305929606dd18361248578c67b8067c8c017b000/0013760174128707b935b0be6fc4fc6ace66c4f15618f8d000

References

http://www.kuqin.com/shuoit/20141213/343854.html
http://www.2cto.com/os/201402/281792.html
http://www.cnblogs.com/BeginMan/p/3548139.html
http://my.oschina.net/csensix/blog/184434

Writing idiomatic python

1. if Statements

1.1. Chain comparisons to make if statements more concise

Don’t:

if x <= y and y <= z:
    return True

Do:

if x <= y <= z:
    return True

1.2. Avoid placing conditional branch code on the same line as the colon

Don’t:

name = 'Tom'
address = "NY"
if name: print(name)
print(address)

Do:

name = 'Tom'
address = "NY"
if name:
    print(name)
print(address)

1.3. Avoid repeating variable name in compound if statement

Don’t:

is_generic_name=False
name='Tom'
if name=='Tom' or name=='Dick' or name=='Harry':
    is_generic_name=True

Do:

name='Tom'
is_generic_name = name in ('Tom','Dick','Harry')

1.4. Avoid comparing directly to True, False, or None

All of the following are considered False:

  • None
  • False
  • zero for numeric types
  • empty sequences
  • empty dictionaries
  • a value of 0 or False returned when either __len__ or __nonzero__ is called

Don’t:

if foo == True:
    pass

Do:

if foo:
    pass

def insert_value(value,position=None):
    if position is not None:
        pass

1.5. Use if and else as a short ternary operator replacement (the ? : operator in other languages)

Don’t:

foo = True
value = 0
if foo:
    value = 1
print(value)

Do:

foo = True
value = 1 if foo else 0
print(value)

Read More