False reporting of container health

Environment:

ECS Appliance

ECS Appliance Hardware

ECS Appliance Software with Encryption

ECS Appliance Software without Encryption

ECS Appliance Software with Encryption 2.2

ECS Appliance Software without Encryption 2.2

 

 

Description:

The Fabric Agent falsely reports a problem with container health.

The ECS will dial home with an alert like this:

 

Connect Home

Clarify Id: APM00999007777

Site Name: UNKNOWN

Vendor: EMC

DeviceType: ElasticCloudStorageApp

Model: ElasticCloudStorage

SerialNumber: APM00999007777

Platform: platform

OS: SUSE Linux Enterprise Server 12

OS_VER: SUSE Linux Enterprise Server 12

EmbedLevel: 2

InternalMaxSize: 512800

Comment: Fabric Alert

Ucode_Ver: 2.2

ConnectType: ESRS

IP_Address: Not Available

IP_Name: Noed6.ecs1.BigCo.net

Port: 22

 

SymptomCode: 2012

Category: Status

Severity: Error

Status: Warning

Component: Fabric Agent

ComponentID: 2205c9d8-790b-4310-a964-e5360e05

SubComponent: Service

SubComponentID: object-main

CallHome: true

FirstTime: 2016-04-05T15:02:18.026Z

Description: Service Health Suspect event

In the alert above, the Fabric Agent reports a problem with the object-main container. However, the Fabric logs below show that the container was marked SUSPECT for less than 15 seconds, if it had a problem at all.

 


2016-04-05 15:02:08,365 60562787 [pool-1-thread-1] INFO com.emc.caspian.fabric.agent.driver.GoalStateDriver  -  Handling event SlotHealthUpdated[kind=SLOT_HEALTH_UPDATED, timestamp=04/05/2016 15:02:08.364 UTC, slot= object-main, health=GOOD]
2016-04-05 15:02:18,027 60572449 [pool-1-thread-1] DEBUG com.emc.caspian.fabric.agent.api.EventHelper  -  Emitting event SlotHealthUpdated[kind=SLOT_HEALTH_UPDATED, timestamp=04/05/2016 15:02:18.026 UTC, slot= object-main, health=SUSPECT]
2016-04-05 15:02:18,027 60572449 [pool-1-thread-1] INFO com.emc.caspian.fabric.agent.driver.GoalStateDriver  -  Handling event SlotHealthUpdated[kind=SLOT_HEALTH_UPDATED, timestamp=04/05/2016 15:02:18.026 UTC, slot= object-main, health=SUSPECT]
2016-04-05 15:02:32,401 60586823 [pool-1-thread-1] DEBUG com.emc.caspian.fabric.agent.api.EventHelper  -  Emitting event SlotHealthUpdated[kind=SLOT_HEALTH_UPDATED, timestamp=04/05/2016 15:02:32.401 UTC, slot= object-main, health=GOOD]

 

Please note: The timestamp at which the object-main container was marked SUSPECT in the logs above (15:02:18.026) matches the value of "FirstTime" in the alert text.
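The match between the two timestamps can also be checked mechanically. The following is a minimal Python sketch, not an EMC tool; the function names parse_alert_time and parse_event_time are hypothetical, and the two format strings are assumptions based only on the alert and log excerpts shown in this article.

```python
from datetime import datetime, timezone

def parse_alert_time(value):
    # Alert "FirstTime" format as seen above, e.g. 2016-04-05T15:02:18.026Z
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)

def parse_event_time(value):
    # Fabric log event format as seen above, e.g. 04/05/2016 15:02:18.026 UTC
    parsed = datetime.strptime(value, "%m/%d/%Y %H:%M:%S.%f UTC")
    return parsed.replace(tzinfo=timezone.utc)

# Values copied from the alert text and from the SUSPECT log entry.
first_time = parse_alert_time("2016-04-05T15:02:18.026Z")
suspect_time = parse_event_time("04/05/2016 15:02:18.026 UTC")

print(first_time == suspect_time)  # True when the alert refers to this log event
```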

 

At 15:02:08 the container was GOOD, at 15:02:18 it was marked SUSPECT, and at 15:02:32 the Fabric detected that it was GOOD again.
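The same timeline can be extracted from the Fabric logs with a short script. This is only a sketch under stated assumptions: the regular expression is derived from the four log lines quoted in this article, and the names EVENT_RE, health_events and suspect_seconds are hypothetical helpers, not part of any ECS tooling.

```python
import re
from datetime import datetime

# Log lines copied from the excerpt in this article.
LOG_LINES = [
    "2016-04-05 15:02:08,365 60562787 [pool-1-thread-1] INFO com.emc.caspian.fabric.agent.driver.GoalStateDriver  -  Handling event SlotHealthUpdated[kind=SLOT_HEALTH_UPDATED, timestamp=04/05/2016 15:02:08.364 UTC, slot= object-main, health=GOOD]",
    "2016-04-05 15:02:18,027 60572449 [pool-1-thread-1] DEBUG com.emc.caspian.fabric.agent.api.EventHelper  -  Emitting event SlotHealthUpdated[kind=SLOT_HEALTH_UPDATED, timestamp=04/05/2016 15:02:18.026 UTC, slot= object-main, health=SUSPECT]",
    "2016-04-05 15:02:18,027 60572449 [pool-1-thread-1] INFO com.emc.caspian.fabric.agent.driver.GoalStateDriver  -  Handling event SlotHealthUpdated[kind=SLOT_HEALTH_UPDATED, timestamp=04/05/2016 15:02:18.026 UTC, slot= object-main, health=SUSPECT]",
    "2016-04-05 15:02:32,401 60586823 [pool-1-thread-1] DEBUG com.emc.caspian.fabric.agent.api.EventHelper  -  Emitting event SlotHealthUpdated[kind=SLOT_HEALTH_UPDATED, timestamp=04/05/2016 15:02:32.401 UTC, slot= object-main, health=GOOD]",
]

# Pattern assumed from the sample lines only; other releases may log differently.
EVENT_RE = re.compile(
    r"SlotHealthUpdated\[kind=SLOT_HEALTH_UPDATED, "
    r"timestamp=(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) UTC, "
    r"slot=\s*(?P<slot>[\w-]+), health=(?P<health>\w+)\]"
)

def health_events(lines, slot):
    """Return sorted, de-duplicated (timestamp, health) pairs for one slot."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match and match.group("slot") == slot:
            ts = datetime.strptime(match.group("ts"), "%m/%d/%Y %H:%M:%S.%f")
            event = (ts, match.group("health"))
            # "Emitting" and "Handling" lines describe the same event once each.
            if event not in events:
                events.append(event)
    return sorted(events)

def suspect_seconds(events):
    """Seconds from the first SUSPECT to the next GOOD, or None if no recovery."""
    start = None
    for ts, health in events:
        if health == "SUSPECT" and start is None:
            start = ts
        elif health == "GOOD" and start is not None:
            return (ts - start).total_seconds()
    return None

events = health_events(LOG_LINES, "object-main")
print(suspect_seconds(events))  # 14.375 for the lines above
```

For the excerpt above the container spent roughly 14.4 seconds in the SUSPECT state before returning to GOOD.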

 

Based on that timeline, an alert should not have been sent out.

 

 

 

Resolution:

To confirm that you have this issue, check the Fabric logs. For a detailed step-by-step resolution, refer to EMC Support Solution 481068: https://support.emc.com/kb/481068