The journal check should not take 2 hours; it is normally a near-instant process, so something is wrong. Any time you see any kind of journal error, contact Isilon Support immediately and raise an S1 case.
The BMC & CMC error is a known bug and has a fix. There is more information here:
466373 : S210, X210, X410, NL410 or HD400 shows event: 'Node's Baseboard Management Controller (BMC) and/or Chassis Management Controller (CMC) are unresponsive.' https://support.emc.com/kb/466373
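Before (or while) you raise the case, it's worth confirming the event is actually logged on the cluster so you can quote it in the SR (a sketch; this is the pre-8.0 syntax, on newer OneFS it's isi event events list):

# List cluster events and note the ID of the BMC/CMC-unresponsive one
isi events list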
Ahhhhhh, that's a major problem. It's still offline now at the 18-hour mark and not communicating with the cluster - and after 3x SRs and *attempting* to stress the urgency to support staff who apparently don't normally answer calls, my ULTRA-urgent call is an S3 (that's after 2x separate chat sessions, both ending in "oh, so you need someone there now?" - and yet it's now been over 15 hours since my initial "we need urgent help on a dead node" chat).
NOT HAPPY DELL/EMC.
So, if the node works today, it will have to resync around 90% of its data with the three other HD400s (i.e. they are full).
I don't wish being on call this Christmas weekend on any EMC person - because I'd bet a million $$$$ that someone will get called in due to another failure.
Hi sjones5 - yes, we finally managed to get a little *awareness* at DELL/EMC that this SR was a real problem.
Ultimately it appears that the node failed on shutdown after my attempts at resetting the BMC & CMC - it just didn't shut down cleanly.
On reboot it then found the journal in error - worse, it couldn't really be recovered: support ran a WebEx and took control, but had no luck. I'm now spending the next week (hopefully less) *shuffling* my storage pool between the nodes and reducing the data percentage on the HD400s.
Then we are going to smartfail an HD400 and re-add it. I just hope the BMC & CMC error is a one-off - even though it's a low-risk fault/quirk, this time it was nasty!
We also managed to start deleting around 200TB - but so far, after locking out all clients and shutting down all access, two days in it has only removed around 20TB... hmmmmm...
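For anyone wondering how I'm tracking it, I'm just polling overall capacity between delete passes (a sketch; isi status also shows per-node usage and any running jobs):

# Cluster health, per-node capacity and active jobs at a glance
isi status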
FUN FUN FUN! Snigger.
Yes - but with the cluster degraded, none of them were running - so I modified the job engine settings today and then re-enabled the snapshot delete - it's almost finished now after 2 hours.
I'll then run a new SmartPools policy today to migrate large files (2TB and greater) to the other nodes - hopefully that will free up around 100TB from the failed SmartPools tier so we can fully smartfail the HD400 node and re-add it to the cluster next week.
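For reference, this is roughly what kicking those jobs off by hand looked like once the degraded-mode change was in (a sketch; these are the stock job names, and on newer OneFS releases the long form is isi job jobs start):

# Resume clearing the snapshot delete backlog
isi job start SnapshotDelete
# Apply the new file pool policy (the 2TB-and-greater tiering rule) now
isi job start SmartPools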
Hi Francisco, it's a change to the job engine settings that allows jobs to run in a degraded state - I would imagine this is very much a *shouldn't really touch* type of option.
Here it is for reference - but I would think that support would consider this a last-resort option:
isi_classic job config -p core.run_degraded=True
After this I was able to start/stop and add/change jobs with the cluster being degraded.
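And for completeness, once the cluster is healthy again the same parameter flips back (I'd confirm with support before leaving it enabled long-term):

isi_classic job config -p core.run_degraded=False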
Thanks for your answer.
I will run the command and report back with the results.
I tried to remove a disk, so I executed a smartfail and the FlexProtect started, but it failed (the exact command form is shown after the device list below).
MVSISILON-1# isi devices
Node 1, [ OK ]
Bay 1 Lnum 12 [HEALTHY] SN:JPW9K0N12EHRKL /dev/da1
Bay 2 Lnum 10 [HEALTHY] SN:JPW9J0N10YW36V /dev/da2
Bay 3 Lnum 9 [HEALTHY] SN:JPW9J0N10Z109V /dev/da3
Bay 4 Lnum 8 [HEALTHY] SN:JPW9J0N10Z4HRV /dev/da4
Bay 5 Lnum 14 [HEALTHY] SN:JPW9K0N10J962L /dev/da5
Bay 6 Lnum 6 [HEALTHY] SN:JPW9J0N10YWXXV /dev/da6
Bay 7 Lnum 5 [HEALTHY] SN:JPW9J0N10Z4HEV /dev/da7
Bay 8 Lnum 4 [HEALTHY] SN:JPW9J0N10YDDWV /dev/da8
Bay 9 Lnum 3 [HEALTHY] SN:JPW9J0N10X2KBV /dev/da9
Bay 10 Lnum 2 [HEALTHY] SN:JPW9J0N10W491V /dev/da10
Bay 11 Lnum 1 [HEALTHY] SN:JPW9J0N10TMYRV /dev/da11
Bay 12 Lnum 0 [HEALTHY] SN:JPW9J0N10YX7XV /dev/da12
Lnum 7 [SUSPENDED] Last Known Bay N/A
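For reference, the smartfail I ran on the disk was of this form (a sketch; <node>:<bay> is a placeholder - check isi devices --help for the exact device addressing on your release):

isi devices -a smartfail -d <node>:<bay>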
I put the cluster in degraded mode, but FlexProtect is still failing.
Recent finished jobs:
ID Type State Time
3251 WormQueue Succeeded 2018-01-02T02:00:32
3250 FlexProtect Failed 2018-01-02T02:16:29
3252 FlexProtect Failed 2018-01-02T02:28:47
3253 ShadowStoreProtect Succeeded 2018-01-02T04:00:19
3243 MediaScan Succeeded 2018-01-02T08:19:33
3254 FlexProtect Failed 2018-01-02T08:52:45
3255 FlexProtect System Cancelled 2018-01-02T08:57:52
3256 FlexProtect Failed 2018-01-02T09:10:08
3257 FlexProtect Failed 2018-01-02T09:22:45
3258 FlexProtect Failed 2018-01-02T16:10:12
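Next I'll pull the job-level detail to see why it keeps failing (a sketch; isi job events list shows per-job failure messages, and the node logs usually carry the underlying restripe error):

# Per-job events, including the failure reason for each FlexProtect run
isi job events list
# Search the node log for the underlying error (run on each node)
grep -i flexprotect /var/log/messages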
Additionally, the LCD on the other node showed the same message: "Test Journal exited with error - Checking Isilon Journal integrity...". The action taken was to reset that node to factory settings with the command isi_reformat_node while smartfailing the node out of the cluster with isi devices -a smartfail -d (num of node).
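In case it helps the next person, the sequence was (a recap of the commands above; (num of node) stays a placeholder for the failed node's number):

# From a healthy node: smartfail the affected node out of the cluster
isi devices -a smartfail -d (num of node)
# Meanwhile, on the affected node's console: reset it to factory settings
isi_reformat_node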