9 Replies Latest reply: Jan 2, 2018 2:23 PM by Francisco Reyes

ISILON Journal check - status/timeframe/info?

wyszynski

OK, so we had an error with the BMC & CMC, which support guided me through in an SR. It ended in a power cycle that still hadn't finished powering off after 20-plus minutes.

 

So I removed one power lead; the fans kicked into overdrive and the system shut down about 10 seconds later. I then removed the second power lead.

 

On reboot, the LCD showed "Test Journal existed with error - Checking Isilon Journal integrity..."

 

So, this is the 2nd of four HD400s, running at 89% capacity - anyone wanna guess/suggest how long it's gonna check the journal for? It's been running for about two hours now. We're on 7.2.1.1 in a 23-node cluster.

 

...and I still don't know why the BMC & CMC errored in the first place!

 

 

Has anyone seen the journal message before? What was your result?

 

Thanks!

_L_

  • 1. Re: ISILON Journal check - status/timeframe/info?
    sjones5

    Hi wyszynski,

     

    The journal check should not take 2 hours. It should be a pretty instant process, so there is something wrong. Any time you see any kind of journal error, contact Isilon Support immediately and raise an S1 case.

     

    The BMC&CMC error is a bug and has a fix. There is more information here:

     

    466373 : S210, X210, X410, NL410 or HD400 shows event: 'Node's Baseboard Management Controller (BMC) and/or Chassis Management Controller (CMC) are unresponsive.' https://support.emc.com/kb/466373

  • 2. Re: ISILON Journal check - status/timeframe/info?
    wyszynski

    G'Day sjones5,

     

    Ahhhhhh, that's a major problem. It's still offline now at the 18-hour mark and not communicating with the cluster. After 3 SRs, and *attempting* to stress the urgency to support staff who apparently don't normally answer calls, my ULTRA urgent call is an S3. That's after 2 separate chat sessions, both ending in "oh so you need someone there now?", and it's now been over 15 hours since my initial "We need urgent help on a dead node" chat.

     

    NOT HAPPY DELL/EMC.

    So, if the node works today, it will have to resync with the 3 other HD400s, which are at around 90% capacity (i.e. they are full).

    I don't wish being on call this Christmas weekend upon any EMC person - 'cause I bet a million $$$$ that someone will get called in due to another failure.

  • 3. Re: ISILON Journal check - status/timeframe/info?
    sjones5

    Do you have a case number to reference? I can make sure that Isilon Support management is aware of what is going on if this hasn't been addressed by now.

  • 4. Re: ISILON Journal check - status/timeframe/info?
    wyszynski

    Hi sjones5 - yes, finally we managed to get a little *awareness* that this was a problem with SR at DELL/EMC.

    Ultimately it appears that the node failed on shutdown after my attempts at resetting the BMC & CMC - it just didn't shut down cleanly.

    On reboot it then found the journal in error. The bad part is that it couldn't really be recovered: support ran a WebEx and took control, but no luck.

    For the next week (hopefully less) I'm now *shuffling* my storage pool between the nodes and reducing the data percentage on the HD400s.

    Then we are going to smartfail an HD400 and re-add it. I just hope the BMC & CMC errors are a one-off - even though they are a low-risk fault/quirk, this time it was nasty!

     

    *CROSSES FINGERS*

     

    We also managed to start deleting around 200TB - but so far, after locking out all clients and shutting down all access, it has only removed around 20TB in 2 days... hmmmmmm...

    FUN FUN FUN! Snigger.
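A rough back-of-the-envelope check on the deletion figures quoted above (about 20TB removed out of roughly 200TB after 2 days). This is only a sketch that assumes the deletion rate stays constant, which it may well not do on a degraded cluster; the function name and parameters are made up for illustration.

```python
def remaining_days(total_tb, done_tb, elapsed_days):
    """Estimate days left, ASSUMING the deletion rate stays constant."""
    if done_tb <= 0 or elapsed_days <= 0:
        raise ValueError("need some completed work to estimate a rate")
    rate_tb_per_day = done_tb / elapsed_days   # 20 TB / 2 days = 10 TB/day
    return (total_tb - done_tb) / rate_tb_per_day


# Figures from the post: ~200TB to delete, ~20TB gone after 2 days.
print(remaining_days(total_tb=200, done_tb=20, elapsed_days=2))  # 18.0
```

At that pace the remaining 180TB would take around 18 more days, which is why the slow progress matters.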

  • 5. Re: ISILON Journal check - status/timeframe/info?
    Francisco Reyes

    Hi Wyszynski

    Do you see the jobs paused by the system when you run isi job status on another node?

    I'll wait for your answer.

    Thanks

  • 6. Re: ISILON Journal check - status/timeframe/info?
    wyszynski

    G'Day Francisco,

     

    Yes - but with the cluster being degraded, none are working. So I modified the job engine settings today and then re-enabled the snapshot delete - it's almost finished now after 2 hours.

    I'll then run a new smartpool today to migrate large files (2TB and greater) to the other nodes - hopefully that will free up around 100TB on the failed smartpool tier so we can fully smartfail the HD400 node and re-add it to the cluster next week.

     

    *Crosses fingers*

  • 7. Re: ISILON Journal check - status/timeframe/info?
    Francisco Reyes

    Hi

     

    How do you modify the job engine settings so that the jobs will run?

     

    Thanks

     

     

    Professional Services Engineer

    Ing. Francisco Reyes Bautista

    freyes@net-brains.com

    Cel (52)1  5534664851

    Skype nb.francisco.reyes

  • 8. Re: ISILON Journal check - status/timeframe/info?
    wyszynski

    Hi Francisco, it's a change to the job engine settings that allows jobs to run in a degraded state - I would imagine that this is *highly* a "shouldn't really touch" type of option.

     

    Here it is for reference - but I would think that support would feel that this is a last resort option:

     

    isi_classic job config -p core.run_degraded=True

     

     

    After this I was able to start/stop and add/change jobs with the cluster being degraded.

  • 9. Re: ISILON Journal check - status/timeframe/info?
    Francisco Reyes

    Thanks for your answer

    I will run the command and I will tell you the results

    Thanks

     

    I tried to remove a disk, so I executed a smartfail and FlexProtect started, but it failed.

    MVSISILON-1# isi devices

    Node 1, [ OK ]

      Bay 1        Lnum 12      [HEALTHY]      SN:JPW9K0N12EHRKL      /dev/da1

      Bay 2        Lnum 10      [HEALTHY]      SN:JPW9J0N10YW36V      /dev/da2

      Bay 3        Lnum 9       [HEALTHY]      SN:JPW9J0N10Z109V      /dev/da3

      Bay 4        Lnum 8       [HEALTHY]      SN:JPW9J0N10Z4HRV      /dev/da4

      Bay 5        Lnum 14      [HEALTHY]      SN:JPW9K0N10J962L      /dev/da5

      Bay 6        Lnum 6       [HEALTHY]      SN:JPW9J0N10YWXXV      /dev/da6

      Bay 7        Lnum 5       [HEALTHY]      SN:JPW9J0N10Z4HEV      /dev/da7

      Bay 8        Lnum 4       [HEALTHY]      SN:JPW9J0N10YDDWV      /dev/da8

      Bay 9        Lnum 3       [HEALTHY]      SN:JPW9J0N10X2KBV      /dev/da9

      Bay 10       Lnum 2       [HEALTHY]      SN:JPW9J0N10W491V      /dev/da10

      Bay 11       Lnum 1       [HEALTHY]      SN:JPW9J0N10TMYRV      /dev/da11

      Bay 12       Lnum 0       [HEALTHY]      SN:JPW9J0N10YX7XV      /dev/da12

    Unavailable drives:

      Lnum 7    [SUSPENDED]     Last Known Bay N/A

    I put the cluster in degraded mode, but FlexProtect is still failing.

     

    Recent finished jobs:

    ID   Type               State            Time           

    ------------------------------------------------------------

    3251 WormQueue          Succeeded        2018-01-02T02:00:32

    3250 FlexProtect        Failed           2018-01-02T02:16:29

    3252 FlexProtect        Failed           2018-01-02T02:28:47

    3253 ShadowStoreProtect Succeeded        2018-01-02T04:00:19

    3243 MediaScan          Succeeded        2018-01-02T08:19:33

    3254 FlexProtect        Failed           2018-01-02T08:52:45

    3255 FlexProtect        System Cancelled 2018-01-02T08:57:52

    3256 FlexProtect        Failed           2018-01-02T09:10:08

    3257 FlexProtect        Failed           2018-01-02T09:22:45

    3258 FlexProtect        Failed           2018-01-02T16:10:12
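To make the pattern in the listing above easier to see, here is a small sketch that tallies jobs by type and final state. It works only on the pasted text and assumes the column layout shown (ID, Type, State, Time); it is not an Isilon API, and the names JOB_LISTING and summarize are made up for illustration.

```python
from collections import Counter

# Copied from the "Recent finished jobs" listing in this post.
JOB_LISTING = """\
3251 WormQueue          Succeeded        2018-01-02T02:00:32
3250 FlexProtect        Failed           2018-01-02T02:16:29
3252 FlexProtect        Failed           2018-01-02T02:28:47
3253 ShadowStoreProtect Succeeded        2018-01-02T04:00:19
3243 MediaScan          Succeeded        2018-01-02T08:19:33
3254 FlexProtect        Failed           2018-01-02T08:52:45
3255 FlexProtect        System Cancelled 2018-01-02T08:57:52
3256 FlexProtect        Failed           2018-01-02T09:10:08
3257 FlexProtect        Failed           2018-01-02T09:22:45
3258 FlexProtect        Failed           2018-01-02T16:10:12
"""


def summarize(listing):
    """Count (job type, state) pairs in a pasted job listing."""
    counts = Counter()
    for line in listing.splitlines():
        parts = line.split()
        # Skip headers, separators and blank lines: rows start with a job ID.
        if len(parts) < 4 or not parts[0].isdigit():
            continue
        job_type = parts[1]
        # The state may be two words ("System Cancelled"); the time is last.
        state = " ".join(parts[2:-1])
        counts[(job_type, state)] += 1
    return counts


for (job_type, state), n in sorted(summarize(JOB_LISTING).items()):
    print(f"{job_type:<20}{state:<18}{n}")
```

In this sample every FlexProtect run either failed or was cancelled by the system, while the other job types succeeded, so the problem looks specific to FlexProtect rather than to the job engine as a whole.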

     

     

    Additionally, I had the same message on the LCD of the other node: "Test Journal existed with error - Checking Isilon Journal integrity...". The fix there was to reset the node to factory settings with the command isi_reformat_node, while I ran a smartfail of that node in the cluster: isi devices -a smartfail -d (num of node)