
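A customer's application system uses two identically configured VNX arrays, and one of them was suspected of degrading application performance. The system maintainers ran read and write tests against each VNX from the host using the "dd" command. Comparing the results showed that one array had very poor read performance. The test results are as follows: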


Write test: the two arrays perform the same

VNX1

314572800000 bytes (315 GB) copied, 521.238 seconds, 604 MB/s

314572800000 bytes (315 GB) copied, 527.589 seconds, 596 MB/s

314572800000 bytes (315 GB) copied, 529.907 seconds, 594 MB/s

314572800000 bytes (315 GB) copied, 537.869 seconds, 585 MB/s

VNX2

314572800000 bytes (315 GB) copied, 509.022 seconds, 618 MB/s

314572800000 bytes (315 GB) copied, 521.032 seconds, 604 MB/s

314572800000 bytes (315 GB) copied, 528.544 seconds, 595 MB/s

314572800000 bytes (315 GB) copied, 535.1 seconds, 588 MB/s

Read test: VNX2 read throughput is less than half that of VNX1

VNX1

314572800000 bytes (315 GB) copied, 447.419 seconds, 703 MB/s

314572800000 bytes (315 GB) copied, 460.649 seconds, 683 MB/s

314572800000 bytes (315 GB) copied, 474.604 seconds, 663 MB/s

314572800000 bytes (315 GB) copied, 508.481 seconds, 619 MB/s

VNX2

314572800000 bytes (315 GB) copied, 1163.37 seconds, 270 MB/s

314572800000 bytes (315 GB) copied, 1167.6 seconds, 269 MB/s

314572800000 bytes (315 GB) copied, 1244.01 seconds, 253 MB/s

314572800000 bytes (315 GB) copied, 1792.76 seconds, 175 MB/s
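
To put a number on the read gap, the following is a minimal sketch that recomputes throughput from the byte count and elapsed times in the dd read output above and compares the per-array averages. The function names are illustrative only, and it assumes dd reports decimal megabytes (1 MB = 1,000,000 bytes), which matches the figures shown.

```python
# Byte count and elapsed times (seconds) copied from the dd read-test output above.
BYTES_COPIED = 314_572_800_000

READ_SECONDS = {
    "VNX1": [447.419, 460.649, 474.604, 508.481],
    "VNX2": [1163.37, 1167.6, 1244.01, 1792.76],
}


def throughput_mb_s(num_bytes, seconds):
    """Throughput in decimal megabytes per second (assumed to be dd's unit)."""
    return num_bytes / seconds / 1_000_000


def mean_throughput(seconds_list):
    """Average throughput over several runs that each copied BYTES_COPIED bytes."""
    rates = [throughput_mb_s(BYTES_COPIED, s) for s in seconds_list]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    means = {name: mean_throughput(secs) for name, secs in READ_SECONDS.items()}
    for name, rate in means.items():
        print(f"{name}: mean read throughput {rate:.0f} MB/s")
    print(f"VNX2 / VNX1 = {means['VNX2'] / means['VNX1']:.2f}")
```

Averaging across the four runs smooths out the run-to-run variation, which is largest on VNX2.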

 

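Performance data was collected on VNX2 with Unisphere Analyzer. The analysis showed that disk 1.1.18 was behaving abnormally:

1. Disk response time comparison: 1.1.18 is clearly higher than the other disks: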

Snap1.bmp

 

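2. Disk average busy queue length comparison: 1.1.18 is the worst, the other disks in the same RAID group as 1.1.18 are slightly worse than normal, and the disks in other RAID groups are the best: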

Snap2.bmp

 

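Further inspection of the VNX2 SP logs showed many "Read Command Timeout" errors for disk 1.1.18. Combined with the on-site observation that write performance was normal while read performance was poor, the conclusion was that a fault in disk 1.1.18 caused the performance problem on VNX2.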

B 12/18/13 19:41:07 Bus1 Enc1 Dsk18       801 Soft SCSI Bus Error [READ Command timeout] 0    17267bd0 10006

B 12/18/13 19:41:07 Bus1 Enc1 Dsk18       801 Soft SCSI Bus Error [READ Command timeout] 0    17267b00 10006

B 12/18/13 19:41:07 Bus1 Enc1 Dsk18       801 Soft SCSI Bus Error [READ Command timeout] 0    3a8aa5d0 10006

B 12/18/13 19:41:07 Bus1 Enc1 Dsk18       801 Soft SCSI Bus Error [READ Command timeout] 0    3155bd00 10006
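
When an SP log is long, tallying the timeouts per disk makes a single bad drive stand out. The sketch below does this for the four lines quoted above. The regular expression assumes every relevant line has the shape "Bus<n> Enc<n> Dsk<n> ... [<detail>]", as in this excerpt. Real SP logs may contain other layouts, and the function name is illustrative only.

```python
import re
from collections import Counter

# The four SP log lines quoted above.
SP_LOG = """\
B 12/18/13 19:41:07 Bus1 Enc1 Dsk18       801 Soft SCSI Bus Error [READ Command timeout] 0    17267bd0 10006
B 12/18/13 19:41:07 Bus1 Enc1 Dsk18       801 Soft SCSI Bus Error [READ Command timeout] 0    17267b00 10006
B 12/18/13 19:41:07 Bus1 Enc1 Dsk18       801 Soft SCSI Bus Error [READ Command timeout] 0    3a8aa5d0 10006
B 12/18/13 19:41:07 Bus1 Enc1 Dsk18       801 Soft SCSI Bus Error [READ Command timeout] 0    3155bd00 10006
"""

# Assumed line shape: "... Bus<b> Enc<e> Dsk<d> ... [<detail>] ..."
LINE_RE = re.compile(r"Bus(\d+) Enc(\d+) Dsk(\d+)\b.*?\[([^\]]+)\]")


def count_read_timeouts(log_text):
    """Count READ command timeouts per disk, keyed as 'bus.enclosure.disk'."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match and "read command timeout" in match.group(4).lower():
            bus, enc, disk = match.group(1, 2, 3)
            counts[f"{bus}.{enc}.{disk}"] += 1
    return counts


if __name__ == "__main__":
    for disk, hits in count_read_timeouts(SP_LOG).most_common():
        print(f"disk {disk}: {hits} READ command timeouts")
```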

 

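After disk 1.1.18 was replaced, VNX2 performance returned to normal. Because the original disk's read performance was so poor, the Proactive Copy during the replacement took nearly 10 hours.

The next problem is as follows: on IBM AIX running PowerPath 5.5, the PowerPath xcryptd process consumes large amounts of CPU and memory on the AIX host. After about 180 days, the emcp_xcrypt process consumes 100% of CPU and memory resources.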

 

PID             %CPU      ResSize Char    Command

10682438   63.6          263676  0          [emcp_xcrypt]           

 

The cause of this problem is currently unknown. PowerPath engineering is still investigating, but the following workaround resolves it.

 

  1. Kill the running emcp_xcrypt process: kill -9 <pid no. >
  2. To keep the process from coming back after a system reboot, edit the file /etc/PowerPathExtensions, which contains the following lines:

 

mpxext:cfgmpx

gpxext

dmext:cfgdm

vlumdext:cfgvlumd

xcryptext:cfgxcrypt 

 

Delete the last two lines entirely. Do not comment them out with #.

 

3. Remove the following line from /etc/inittab and save the file.

 

rcxcrypt:2:wait:/etc/rc.emcp_xcryptd xcrypt_rc >/dev/null 2>&1


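4. Customers sometimes ask whether emcp_xcr and emcp_xcrypt are the same process, and when the name appears as emcp_xcr versus emcp_xcrypt. The difference comes from the commands used to display process names on the AIX server. For example, the ps command and the topas command show different names for this same PowerPath process on the same server, as the screenshots below show: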


Topas

Snap2.bmp

PS

Snap3.bmp

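As the screenshots show, the same process ID, 204900, can be displayed under two different names. The matching process ID confirms that they are the same process. As for why the name is displayed differently in different situations, IBM gives some explanation, as follows: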

 

http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds4/ps.htm

(-f, -l, and l flags) Contains the command name. Under the -f flag, the ps command tries to determine the current command name and arguments, both of which may be changed asynchronously by the process. These are then displayed. If this fails, the command name is written as it would appear without the -f option in square brackets.

 

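As quoted above, IBM describes the -f flag of the ps command. According to that documentation, ps may sometimes show only the first 8 characters of the process name. This is enough to establish that emcp_xcr and emcp_xcrypt are the same process, because they share the same process ID.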

 

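5. On AIX there are two situations in which a process cannot be terminated with kill or kill -9: when it is in the Zombie state, and when it is in kernel mode. Only processes in user mode can be terminated with the kill command. Processes whose PPID is 1 sometimes enter kernel mode. Whatever the reason kill cannot terminate a process, a reboot resolves it. A Zombie process that cannot be killed has already released its resources and has no impact on the system. A process in kernel mode that is not consuming excessive resources also has no impact, and it can be left until the next opportunity to reboot the system. If it is consuming too many system resources, a reboot is the only way to clear it.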
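
Steps 2 and 3 of the workaround above are plain line deletions, so they can be rehearsed on a copy of the file contents before touching a host. The following is a minimal sketch that works on in-memory text only and does not read or write /etc/PowerPathExtensions or /etc/inittab. The function names are hypothetical, and the check on the last two lines assumes the file matches the listing in step 2.

```python
# Contents of /etc/PowerPathExtensions as listed in step 2.
EXTENSIONS = """\
mpxext:cfgmpx
gpxext
dmext:cfgdm
vlumdext:cfgvlumd
xcryptext:cfgxcrypt
"""

# The /etc/inittab entry named in step 3.
INITTAB_ENTRY = "rcxcrypt:2:wait:/etc/rc.emcp_xcryptd xcrypt_rc >/dev/null 2>&1"


def drop_last_two_extensions(text):
    """Remove the last two lines outright (not commented out with #)."""
    lines = text.rstrip().splitlines()
    expected = ["vlumdext:cfgvlumd", "xcryptext:cfgxcrypt"]
    if [line.strip() for line in lines[-2:]] != expected:
        # Refuse to edit a file that does not look like the listing in step 2.
        raise ValueError("unexpected last two lines; edit by hand")
    return "\n".join(lines[:-2]) + "\n"


def drop_inittab_entry(text):
    """Remove every inittab line whose identifier field is rcxcrypt."""
    kept = [line for line in text.splitlines() if not line.startswith("rcxcrypt:")]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    print(drop_last_two_extensions(EXTENSIONS), end="")
```

On a real host, back up both files first and make the edits with the administrator's usual tools. The sketch only shows which lines remain afterwards.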

 

 

KB:

 

IBM AIX PowerPath xcryptd consuming large amounts of CPU and memory on AIX host.

Article Number: 000083101 Version: 2

Key Information

Audience: Level 30 = Customers

Article Type: Break Fix

Channels: Customer, Internal App

Validation Status: Final Approved

Originally Created By: Amanda Montford

Original Create Date: Wed Jan 25 19:23:40 GMT 2012

First Published: Wed May 29 19:57:29 GMT 2013

Last Modified: Fri Dec 20 15:08:38 GMT 2013

Last Published: Fri Dec 20 15:08:38 GMT 2013

Summary:

Article Content

Issue: IBM AIX PowerPath xcryptd consuming large amounts of CPU and memory on AIX host.

After around 180 days the emcp_xcrypt process goes to 100%.

Environment: EMC SW: PowerPath 5.5

OS: IBM AIX

Cause: Unknown.

Resolution: Follow these steps as a workaround:

Halt the currently running process with kill -9 <pid no. >

Before the next reboot, prevent the emcp_xcryptd daemon from starting again at boot.

Edit the file /etc/PowerPathExtensions which contains these lines:

mpxext:cfgmpx

gpxext

dmext:cfgdm

vlumdext:cfgvlumd

xcryptext:cfgxcrypt

Remove the last two lines, save the file.

NOTE: Do NOT comment out the lines with #. The last two lines must be removed entirely. Using # to comment out the lines will prevent PowerPath from configuring devices upon reboot.

Remove the following line from /etc/inittab and save the file:

rcxcrypt:2:wait:/etc/rc.emcp_xcryptd xcrypt_rc >/dev/null 2>&1

PowerPath engineering is currently investigating.

Article Metadata

Product: PowerPath for AIX 5.5, PowerPath

Shared: Yes

RCA Status: Not Started

Bug Tracking Number: 384304

External Source: Primus

Primus/Webtop solution ID: emc286557

Originally Created By: Amanda Montford

 

 


