RecoverPoint for Virtual Machine : Severe distributor error that causes journal loss

           

   Article Number:     540714                                   Article Version: 2     Article Type:    Break Fix 
   

 


Product:

 

RecoverPoint for Virtual Machines,RecoverPoint for Virtual Machines 5.2

 

Issue:

 

 

Replication throws fatal error that causes a journal loss.   
   
    Enabled snapshot consolidation Journal Error Severe distributor error that causes journal loss.   
   
    EventID_GROUP_JVOL_CORRUPTED(4040)   
   
    EVENT_WARNING,summary=Next synchronization will be a full sweep,Summary=The dirtying action took place either due to a user action (e.g. group was enabled or journal volumes changed) or due to a system fatal error.   
   
   
    Affected versions: 5.2.2.*   
   
    Replication logs:   
        

      We can see that readFromUndoMeta and "call replace in the middle" are done in two different threads (e.g., 29007 and 29008) and ending with an error :     
     
      2020/01/20 22:03:30.393 - #2 - 29008/28420 - SnapshotConsolidator::readFromUndoMeta: m_GroupGridCopyRID = (groupCopyRID=(kVolSlot=1017176256,globalCopyID=GlobalCopy(SiteUID(0x651fb51159d191b1) 0) ),gridCopyID=0) l_leftData = 47799202 *m_UndoMetaDatas = The stream ID: 12608520819811811329     
      The tail: StreamPointer(streamID=12608520819811811329 blockID=4 offset=0)     
      The head: StreamPointer(streamID=12608520819811811329 blockID=4 offset=394028970)     
      ...     
      2020/01/20 22:03:30.698 - #2 - 29007/28420 - SnapshotConsolidator::processCopyMetaDataShouldEnd: undoMetaStream reached the end of the block m_GroupGridCopyRID = (groupCopyRID=(kVolSlot=1017176256,globalCopyID=GlobalCopy(SiteUID(0x651fb51159d191b1) 0) ),gridCopyID=0) ...     
      2020/01/20 22:03:30.783 - #2 - 29008/28420 - SnapshotConsolidator::consolidatorAppendingMeta: calling replace in the middle for undo meta stream m_GroupGridCopyRID = (groupCopyRID=(kVolSlot=1017176256,globalCopyID=GlobalCopy(SiteUID(0x651fb51159d191b1) 0) ),gridCopyID=0)     
      2020/01/20 22:03:30.783 - #2 - 29008/28420 - Stream: call replace in the middle a_startPointer = StreamPointer(streamID=12608520819811811329 blockID=4 offset=4306) a_endPointer = StreamPointer(streamID=12608520819811811329 blockID=4 offset=394028970) *this = The stream ID: 12608520819811811329     
      The tail: StreamPointer(streamID=12608520819811811329 blockID=4 offset=0)     
      The head: StreamPointer(streamID=12608520819811811329 blockID=4 offset=394037418     
      ...     
      2020/01/20 22:03:30.783 - #0 - 29008/28420 - StreamPersistentObject: errno=0 Error in replaceInMiddle - start and end pointers have the same blockID which is not the last blockstreamID = 12608520819811811329,     
      ...     
      2020/01/20 22:03:30.793 - #0 - 29008/28420 - StackTrace: errno=0 6: /usr/lib/recoverpoint/libreplication_libsrelease.so(_ZN6Kashya20SnapshotConsolidator25consolidatorAppendingMetaEv+0x13d) [0x7f07c3fe656d]     
      2020/01/20 22:03:30.793 - #0 - 29008/28420 - StackTrace: errno=0 7: /usr/lib/recoverpoint/libreplication_libsrelease.so(_ZN6Kashya26AsyncJobConsolidationCycle21continueConsolidationEv+0x3d0) [0x7f07c3d6fac0]     
      2020/01/20 22:03:30.793 - #0 - 29008/28420 - StackTrace: errno=0 8: /usr/lib/recoverpoint/libreplication_libsrelease.so(_ZN6Kashya26AsyncJobConsolidationCycle15continueExecuteEv+0x5a0) [0x7f07c3d702e0]     
      ...     
      2020/01/20 22:03:30.793 - #1 - 29008/28420 - DistributorGroupHandler::handleFatalException: fatal exception error m_GroupGridCopyRID = (groupCopyRID=(kVolSlot=1017176256,globalCopyID=GlobalCopy(SiteUID(0x651fb51159d191b1) 0) ),gridCopyID=0) ex.what() = Error in replaceInMiddle m_Snapshots.guardedGetStateID() = e_distributorDistributin   

   

      5.2. Or, it might be in the same thread (18343) but a new distribution task was started in between and then we'll see that head is changed (e.g., 4773521 to 4862313):     
      2019/07/01 04:36:27.539 - #2 - 18343/18177 - SnapshotConsolidator::readFromUndoMeta: m_GroupGridCopyRID = (groupCopyRID=(kVolSlot=929714079,globalCopyID=GlobalCopy(SiteUID(0x5063f4f4a30e4db9) 0) ),gridCopyID=0) l_leftData = 2474341 *m_UndoMetaDatas = The stream ID: 11327189250572550145     
      The tail: StreamPointer(streamID=11327189250572550145 blockID=4 offset=0)     
      The head: StreamPointer(streamID=11327189250572550145 blockID=4 offset=4773521)     
      ...     
      2019/07/01 04:36:28.684 - #2 - 18344/18177 - AsyncDistPhase2::startNewTaskIfPossible: m_groupId = (groupCopyRID=(kVolSlot=614558420,globalCopyID=GlobalCopy(SiteUID(0x5063f4f4a30e4db9) 0) ),gridCopyID=0) m_i nfo->getDoSnapshotsStream().tail() = StreamPointer(streamID=17128437152493862913 blockID=1 offset=4202689)     
      2019/07/01 04:36:29.276 - #2 - 18343/18177 - SnapshotConsolidator::consolidatorAppendingMeta: calling replace in the middle for undo meta stream m_GroupGridCopyRID = (groupCopyRID=(kVolSlot=929714079,globalC opyID=GlobalCopy(SiteUID(0x5063f4f4a30e4db9) 0) ),gridCopyID=0) 2019/07/01 04:36:29.276 - #2 - 18343/18177 - Stream: call replace in the middle a_startPointer = StreamPointer(streamID=11327189250572550145 blockID=4 offset=2157346) a_endPointer = StreamPointer(streamID =11327189250572550145 blockID=4 offset=4773521) *this = The stream ID: 11327189250572550145 The tail: StreamPointer(streamID=11327189250572550145 blockID=4 offset=0) The head: StreamPointer(streamID=11327189250572550145 blockID=4 offset=4862313) m_nHeadBlockDataSum: 361 m_nTailBlockDataSum: 0 The blocks list: Block ID: 4 Start offset: 0 End offset: 4862313 Relative position: 1   

                                                             

 

 

Cause:

 

 

Snapshot consolidation meta data write was not atomic, which can cause mismatches between the calculation and the write itself.                                                           

 

 

Resolution:

 

 

Workaround:   
   
    Disabling Snapshot Consolidation    
   
    Resolution:   
   
    RP4VM version 5.2.2.2 , 5.3