gpinitsystem failure caused by leftover cache files

GP cluster: master, segment1, segment2

Problem description:
segment1 and segment2 are each configured with 8 primary segments and 8 mirror segments. During cluster initialization, gpinitsystem ran into problems and the initialization failed.

During initialization, the segment instance directories were created on segment1 and segment2 as expected, but when gpstart was later run to bring the cluster up, some segments could not find their DBID and failed to start, and the output indicated that the cluster had not been fully initialized.

Read More

GP startup failure caused by a semaphore problem

[toc]

Symptom

While running gpinitsystem, the final step failed and returned the following error log:

20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-Process results...
20160910:21:51:01:017775 gpstart:lin:gpadmin-[ERROR]:-No segment started for content: 1.
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-dumping success segments: ['lin.g3.s1:/gpadmin/data/mirror/gpseg3:content=3:dbid=9:mode=s:status=u']
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-----------------------------------------------------
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-DBID:4 FAILED host:'lin.g3.s2' datadir:'/gpadmin/data/primary/gpseg2' with reason:'PG_CTL failed.'
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-DBID:5 FAILED host:'lin.g3.s2' datadir:'/gpadmin/data/primary/gpseg3' with reason:'PG_CTL failed.'
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-DBID:7 FAILED host:'lin.g3.s2' datadir:'/gpadmin/data/mirror/gpseg1' with reason:'PG_CTL failed.'
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-DBID:6 FAILED host:'lin.g3.s2' datadir:'/gpadmin/data/mirror/gpseg0' with reason:'PG_CTL failed.'
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-DBID:8 FAILED host:'lin.g3.s1' datadir:'/gpadmin/data/mirror/gpseg2' with reason:'PG_CTL failed.'
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-DBID:3 FAILED host:'lin.g3.s1' datadir:'/gpadmin/data/primary/gpseg1' with reason:'PG_CTL failed.'
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-DBID:2 FAILED host:'lin.g3.s1' datadir:'/gpadmin/data/primary/gpseg0' with reason:'Failure in segment mirroring; check segment logfile'
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-----------------------------------------------------

20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-----------------------------------------------------
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:- Successful segment starts = 1
20160910:21:51:01:017775 gpstart:lin:gpadmin-[WARNING]:-Failed segment starts, from mirroring connection between primary and mirror = 1 <<<<<<<<
20160910:21:51:01:017775 gpstart:lin:gpadmin-[WARNING]:-Other failed segment starts = 6 <<<<<<<<
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:- Skipped segment starts (segments are marked down in configuration) = 0
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-----------------------------------------------------
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-Successfully started 1 of 8 segment instances <<<<<<<<
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-----------------------------------------------------
20160910:21:51:01:017775 gpstart:lin:gpadmin-[WARNING]:-Segment instance startup failures reported
20160910:21:51:01:017775 gpstart:lin:gpadmin-[WARNING]:-Failed start 7 of 8 segment instances <<<<<<<<
20160910:21:51:01:017775 gpstart:lin:gpadmin-[WARNING]:-Review /home/gpadmin/gpAdminLogs/gpstart_20160910.log
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-----------------------------------------------------
20160910:21:51:01:017775 gpstart:lin:gpadmin-[INFO]:-Commencing parallel segment instance shutdown, please wait...
..

20160910:21:51:05:017775 gpstart:lin:gpadmin-[ERROR]:-gpstart error: Do not have enough valid segments to start the array.
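
The post title points to a semaphore problem as the cause. Below is a hedged sketch of the Linux checks one might run when semaphore exhaustion is suspected; the kernel.sem values shown are illustrative, not values taken from the original post:

# Show the kernel semaphore limits: SEMMSL SEMMNS SEMOPM SEMMNI
sysctl kernel.sem

# List the semaphore arrays currently allocated (each Greenplum postmaster holds several)
ipcs -s

# Show the semaphore limits as the kernel reports them
ipcs -ls

# If the limits are too small for the number of segment instances, raise them in
# /etc/sysctl.conf, e.g. "kernel.sem = 500 1024000 200 4096" (illustrative values), then:
sysctl -p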

Read More

Troubleshooting a tricky greenplum OOM problem

Background

The environment is a 2+3 setup: 2 machines serve as master nodes and 3 as compute nodes, running 48 primary segments and 48 mirror segments in total. The GP cluster is used for testing a business application; after running for a while, GP would suddenly go down.

Troubleshooting

The user's maximum process limit caused segments to fail

Checking the pg_log files revealed the following message: could not fork new process for connection: Resource temporarily unavailable.
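
The message means the OS refused to fork new backend processes for the gpadmin user. A minimal sketch of how one might check the per-user process limit; the file paths and the example nproc value are common defaults, not details from the original post:

# Max user processes allowed for the current shell (run as gpadmin)
ulimit -u

# How many processes gpadmin is already running on this host
ps -u gpadmin --no-headers | wc -l

# Per-user limits usually live in /etc/security/limits.conf or /etc/security/limits.d/
# (e.g. 90-nproc.conf on RHEL/CentOS 6); a line such as "gpadmin soft nproc 131072"
# (illustrative value) raises the limit after a re-login.
grep -r nproc /etc/security/limits.conf /etc/security/limits.d/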

Read More

gprecoverseg failure caused by gpstop -M immediate

Symptom

Actions

A primary segment went down, so gprecoverseg was run. Before the gprecoverseg process finished, gpstop -M immediate was executed, and the subsequent gpstart -a to bring the cluster back up ran into problems.

Observation

gpstart -a starts the cluster successfully, but with a warning:

20161031:09:56:39:023724 gpstart:mdw1:gpadmin-[INFO]:-----------------------------------------------------
20161031:09:56:39:023724 gpstart:mdw1:gpadmin-[INFO]:- Successful segment starts = 4
20161031:09:56:39:023724 gpstart:mdw1:gpadmin-[INFO]:- Failed segment starts = 0
20161031:09:56:39:023724 gpstart:mdw1:gpadmin-[INFO]:- Skipped segment starts (segments are marked down in configuration) = 0
20161031:09:56:39:023724 gpstart:mdw1:gpadmin-[INFO]:-----------------------------------------------------
20161031:09:56:39:023724 gpstart:mdw1:gpadmin-[INFO]:-
20161031:09:56:39:023724 gpstart:mdw1:gpadmin-[INFO]:-Successfully started 4 of 4 segment instances
20161031:09:56:39:023724 gpstart:mdw1:gpadmin-[INFO]:-----------------------------------------------------
20161031:09:56:39:023724 gpstart:mdw1:gpadmin-[INFO]:-Starting Master instance mdw1.com directory /home/gpadmin/data/master/gpseg-1
20161031:09:56:40:023724 gpstart:mdw1:gpadmin-[INFO]:-Command pg_ctl reports Master mdw1.com instance active
20161031:09:57:05:023724 gpstart:mdw1:gpadmin-[WARNING]:-FATAL: DTM initialization: failure during startup recovery, retry failed, check segment status (cdbtm.c:1602)
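
Typical follow-up steps in this situation, sketched here as a suggestion rather than the fix confirmed by the original post, are to check which segments the DTM recovery is complaining about and then re-run mirror recovery:

# Show segments whose primary/mirror state needs attention
gpstate -e

# Show detailed per-segment mode and status
gpstate -s

# Re-run incremental recovery for the affected segments
gprecoverseg
# gprecoverseg -F   # full recovery, only if incremental recovery cannot repair the segment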

Read More

The motion mechanism in greenplum

The query planner decides which kind of motion to use based on the estimated cost of the SQL statement. Compared with other databases, GP has an extra operation type called motion: a motion operation moves data between segments while a query is executed. Not every operation triggers a motion; an exact-match lookup, for example, involves no data exchange between segments. But when a SQL statement involves a join, aggregation, sort, or other operations on rows, data has to be moved, and motion nodes appear in the query plan that GP generates.

GP supports the following kinds of motion:

  • Broadcast Motion: each segment broadcasts the rows it needs to send to all other segments.
  • Redistribute Motion: when a SQL statement joins on columns whose hash values do not match the distribution, the filtered rows are rehashed on the join column and redistributed to the other segments.
  • Explicit Redistribute Motion: moves rows to the specific segments that own them, typically in UPDATE and DELETE plans.
  • Gather Motion: data from the segments is gathered to the master.
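
A quick way to see these motion nodes is to EXPLAIN a join; the database, tables, and columns below are made up for illustration and are not from the original post:

# The plan for this join will typically contain a Redistribute Motion (or a Broadcast
# Motion if one table is small) plus a final Gather Motion to the master, because
# orders is not distributed on the join column.
psql -d testdb -c "
CREATE TABLE orders (order_id int, customer_id int) DISTRIBUTED BY (order_id);
CREATE TABLE customers (customer_id int, name text) DISTRIBUTED BY (customer_id);
EXPLAIN SELECT c.name, count(*)
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.name;
"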

Read More

Java processes report OOM errors even though the machine has plenty of free memory

Symptom

In the current environment, Java processes such as tomcat, ambari-server, and zookeeper are running. tomcat has no heap size configured, while ambari-server and zookeeper both have heap sizes set to a few GB.
Running java -version then fails with the following error:

[root@m1 tmp]# Error occurred during initialization of VM
Could not reserve enough space for object heap
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

free -h shows that the machine has a large amount of free memory, yet the JVM still reports that there is not enough memory to initialize.

[root@m1 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:            94G        1.4G         90G        2.2G        2.5G         90G
Swap:          4.0G          0B        4.0G
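
One plausible cause on a host tuned for Greenplum, offered here as an assumption to verify rather than the root cause confirmed by the post, is strict virtual-memory overcommit accounting (vm.overcommit_memory = 2): the JVM's large heap reservation is rejected against CommitLimit even though plenty of RAM is free. A minimal sketch of the checks:

# Overcommit policy: 2 means strict accounting against CommitLimit
sysctl vm.overcommit_memory vm.overcommit_ratio

# How much commit charge is already reserved versus the limit
grep -E 'CommitLimit|Committed_AS' /proc/meminfo

# Per-process virtual address-space limit can also reject the heap reservation
ulimit -v

# Workarounds if strict overcommit is the issue: raise vm.overcommit_ratio, add swap,
# or start the JVM with an explicit, smaller heap (illustrative values):
java -Xms128m -Xmx512m -version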

Read More

A detailed look at greenplum workload management

gp_vmem_protect_limit: default 8192 (MB); the memory resource limit per primary segment. The formula is (x * physical memory of one host) / number of primary segments per host, where x is between 1 and 1.5. For example, with 126 GB of physical memory and 16 primary segments, the value is (1 * 126) / 16 = 7.875 GB, so set it to 7875.
gp_resqueue_memory_policy: default eager_free; the possible values are none, auto, and eager_free. With auto, query memory usage is bounded by statement_mem and the resource queue memory limits. With none, memory management behaves the same as in GP 4.1. With eager_free, memory is used as fully as possible while staying within max_statement_mem and the resource queue's memory limit.
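
These server parameters are typically set cluster-wide with gpconfig and take effect after a restart; the values below simply reuse the example above and are not a recommendation:

# Per-segment memory protect limit, in MB (from the 126 GB / 16 primaries example)
gpconfig -c gp_vmem_protect_limit -v 7875

# Resource-queue memory policy
gpconfig -c gp_resqueue_memory_policy -v eager_free

# Verify the setting across the cluster, then restart so it takes effect
gpconfig -s gp_vmem_protect_limit
gpstop -r -a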

Description

Resource queues mainly serve to control query queuing and to keep the system from being overloaded.
When a user submits a query, it is checked against the resource queue's limits. If the query does not exceed the limits, it runs immediately. If it does exceed them, it has to wait until queries ahead of it in the queue finish and capacity frees up. By default, queued queries are processed first-in, first-out, unless query prioritization has been enabled.
Making workload management take effect requires the following steps (a sketch follows the list):

  • Configure the workload management settings (the defaults can be used)
  • Create a resource queue and set its limit values
  • Add the users (roles) to the resource queue
  • Check / monitor the resource queue status
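
A minimal sketch of steps 2 to 4 using psql; the queue name, role name, and limit values are made up for illustration and are not from the original post:

# Create a resource queue with active-statement and memory limits, attach a role to it,
# then check queue status via the gp_toolkit view.
psql -d postgres -c "
CREATE RESOURCE QUEUE adhoc_queue WITH (ACTIVE_STATEMENTS=10, MEMORY_LIMIT='2000MB');
ALTER ROLE report_user RESOURCE QUEUE adhoc_queue;
SELECT * FROM gp_toolkit.gp_resqueue_status;
"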

Read More