Linux SDC Kernel Panic (Memory Allocation): kernel: net_sched: page allocation failure

           

   Article Number:     530089                                   Article Version: 3     Article Type:    Break Fix 
   

 


Product:

 

VxFlex OS,VxFlex Product Family,VxFlex Ready Node

 

Issue:

 

 

   

      Issue Description   

   

      The Linux SDC loses access to some or all volumes, or a kernel panic occurs that is related to memory allocation.   

   

      Scenario   

   

      The SDC is installed on a Linux VM; however, the issue might occur on a physical Linux machine or any other OS with the SDC installed.   

   

      The SDC suddenly disconnects.   

   

      Possible Linux SDC kernel panic.   

   

      SDC I/O errors.   

   

      File system I/O errors.   

   

      Symptoms   

   

      The messages file on the Linux machine reports an SDC stack trace, which includes page allocation (memory) statistics:     
     
          

   
Dec 3 10:40:50 backup7 kernel: net_sched: page allocation failure: order:4, mode:0x104020
Dec 3 10:40:50 backup7 kernel: CPU: 3 PID: 1538 Comm: net_sched Tainted: P OE ------------ 3.10.0-693.21.1.el7.x86_64 #1
Dec 3 10:40:50 backup7 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
Dec 3 10:40:50 backup7 kernel: Call Trace:
Dec 3 10:40:50 backup7 kernel: [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
Dec 3 10:40:50 backup7 kernel: [<ffffffff8118cd10>] warn_alloc_failed+0x110/0x180
Dec 3 10:40:50 backup7 kernel: [<ffffffff816aa774>] __alloc_pages_slowpath+0x6b6/0x724
Dec 3 10:40:50 backup7 kernel: [<ffffffff811912a5>] __alloc_pages_nodemask+0x405/0x420
Dec 3 10:40:50 backup7 kernel: [<ffffffff811d5a38>] alloc_pages_current+0x98/0x110
Dec 3 10:40:50 backup7 kernel: [<ffffffff8118bb0e>] __get_free_pages+0xe/0x40
Dec 3 10:40:50 backup7 kernel: [<ffffffff811e146e>] kmalloc_order_trace+0x2e/0xa0
Dec 3 10:40:50 backup7 kernel: [<ffffffff811e5011>] __kmalloc+0x211/0x230
Dec 3 10:40:50 backup7 kernel: [<ffffffffc0530e3e>] mapClass_AllocAndInitObj+0x3e/0x120 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc0531ca6>] mapClass_UpdateAll+0x306/0x760 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc055d54a>] ? mosMitSchedThrd_CurThrdOurs+0x6a/0xa0 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc053df93>] mapMdm_HandleObjUpdate_CK+0x2b3/0x540 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc053e290>] ? mapMdm_SendUpdateReq_CK+0x70/0xcd0 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc053e686>] mapMdm_SendUpdateReq_CK+0x466/0xcd0 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc0547a46>] ? netSock_DoIO+0xe6/0x630 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc05112f0>] ? netChan_SendReq_CK+0x70/0x800 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc0511432>] netChan_SendReq_CK+0x1b2/0x800 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc051a5fe>] netCon_SendReq_CK+0x17e/0x500 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc05158d7>] ? netRPC_SendDone_CK+0x47/0x6f0 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc05159ad>] netRPC_SendDone_CK+0x11d/0x6f0 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc055d7df>] mosMit_RunWithTLS+0x4f/0x60 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc055f0ba>] mosMitSchedThrd_ThrdEntry+0x1aa/0x510 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc055c490>] ? mosTicks_GetCurrentTick+0x20/0x20 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffffc055c4aa>] mosOsThrd_Entry+0x1a/0x40 [scini]
Dec 3 10:40:50 backup7 kernel: [<ffffffff810b4031>] kthread+0xd1/0xe0
Dec 3 10:40:50 backup7 kernel: [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Dec 3 10:40:50 backup7 kernel: [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
Dec 3 10:40:50 backup7 kernel: [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Dec 3 10:40:50 backup7 kernel: Mem-Info:
Dec 3 10:40:50 backup7 kernel: active_anon:540198 inactive_anon:192106 isolated_anon:0#012 active_file:526767 inactive_file:908890 isolated_file:0#012 unevictable:0 dirty:2548 writeback:0 unstable:0#012 slab_reclaimable:113189 slab_unreclaimable:12471#012 mapped:4048 shmem:21154 pagetables:2768 bounce:0#012 free:87384 free_pcp:669 free_cma:0
Dec 3 10:40:50 backup7 kernel: Node 0 DMA free:15900kB min:104kB low:128kB high:156kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Dec 3 10:40:50 backup7 kernel: lowmem_reserve[]: 0 2814 9821 9821
Dec 3 10:40:50 backup7 kernel: Node 0 DMA32 free:200976kB min:19336kB low:24168kB high:29004kB active_anon:195676kB inactive_anon:266280kB active_file:292588kB inactive_file:1429216kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129280kB managed:2884228kB mlocked:0kB dirty:1004kB writeback:0kB mapped:5056kB shmem:26680kB slab_reclaimable:405056kB slab_unreclaimable:19648kB kernel_stack:2464kB pagetables:1864kB unstable:0kB bounce:0kB free_pcp:468kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Dec 3 10:40:50 backup7 kernel: lowmem_reserve[]: 0 0 7006 7006
Dec 3 10:40:50 backup7 kernel: Node 0 Normal free:132556kB min:48136kB low:60168kB high:72204kB active_anon:1965116kB inactive_anon:502176kB active_file:1814484kB inactive_file:2206340kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:7340032kB managed:7174724kB mlocked:0kB dirty:9200kB writeback:0kB mapped:11168kB shmem:57936kB slab_reclaimable:47700kB slab_unreclaimable:30224kB kernel_stack:4960kB pagetables:9208kB unstable:0kB bounce:0kB free_pcp:2212kB local_pcp:704kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Dec 3 10:40:50 backup7 kernel: lowmem_reserve[]: 0 0 0 0
Dec 3 10:40:50 backup7 kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15900kB
Dec 3 10:40:50 backup7 kernel: Node 0 DMA32: 5802*4kB (UEM) 3223*8kB (UEM) 9329*16kB (UEM) 85*32kB (UEM) 2*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 201104kB
Dec 3 10:40:50 backup7 kernel: Node 0 Normal: 29631*4kB (UEM) 1755*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 132564kB
Dec 3 10:40:50 backup7 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 3 10:40:50 backup7 kernel: 1469304 total pagecache pages
Dec 3 10:40:50 backup7 kernel: 12478 pages in swap cache
Dec 3 10:40:50 backup7 kernel: Swap cache stats: add 927451, delete 914973, find 1499563/1552563
Dec 3 10:40:50 backup7 kernel: Free swap = 3295096kB
Dec 3 10:40:50 backup7 kernel: Total swap = 4194300kB
Dec 3 10:40:50 backup7 kernel: ScaleIO R2_5 mapClass_AllocAndInitObj:1212 :Error: Failed to allocate memory 36288.Cannot process MDM response
   

      At the same time or later (depending on the workload), "NO_RESOURCES" SDC errors, SDC I/O errors, and/or file system I/O errors appear:   

   

      Messages file:     
          

   
Dec  3 11:23:55 backup7 kernel: ScaleIO R2_5 mapClass_UpdateAll:523 :Error: Object ffff8802aa340000 failed to update in place.status NO_RESOURCES (67)
Dec  3 11:24:45 backup7 kernel: ScaleIO R2_5 mapVolIO_ReportIOErrorIfNeeded:361 :[7567770049] IO-ERROR comb: 0. offsetInComb 0. SizeInLB 0. SDS_ID 0. Comb Gen 0. Head Gen 16da.
Dec  3 11:24:45 backup7 kernel: ScaleIO R2_5 mapVolIO_ReportIOErrorIfNeeded:374 :Vol ID 0x7dfb023900000046. Last fault Status IO_FAULT_NOT_PRI(12).Last error Status NOT_FOUND(3) Reason (failed getting LB-Info) Retry count (0) chan (0)
Dec  3 11:24:45 backup7 kernel: blk_update_request: I/O error, dev scinia, sector 2166028544
Dec  3 11:24:45 backup7 kernel: ScaleIO R2_5 mapVolIO_ReportIOErrorIfNeeded:361 :[7567770056] IO-ERROR comb: 0. offsetInComb 0. SizeInLB 0. SDS_ID 0. Comb Gen 0. Head Gen 16da.
Dec  3 11:24:45 backup7 kernel: ScaleIO R2_5 mapVolIO_ReportIOErrorIfNeeded:374 :Vol ID 0x7dfb023900000046. Last fault Status IO_FAULT_NOT_PRI(12).Last error Status NOT_FOUND(3) Reason (failed getting LB-Info) Retry count (0) chan (0)
Dec  3 11:24:45 backup7 kernel: blk_update_request: I/O error, dev scinia, sector 2166028544
Dec  3 11:24:45 backup7 kernel: ScaleIO R2_5 mapVolIO_ReportIOErrorIfNeeded:361 :[7567770372] IO-ERROR comb: 0. offsetInComb 0. SizeInLB 0. SDS_ID 0. Comb Gen 0. Head Gen 16da.
Dec  3 11:24:45 backup7 kernel: ScaleIO R2_5 mapVolIO_ReportIOErrorIfNeeded:374 :Vol ID 0x7dfb023900000046. Last fault Status IO_FAULT_NOT_PRI(12).Last error Status NOT_FOUND(3) Reason (failed getting LB-Info) Retry count (0) chan (0)
Dec  3 11:24:45 backup7 kernel: blk_update_request: I/O error, dev scinia, sector 2166028552
......
Dec  3 11:27:05 backup7 kernel: XFS (dm-2): metadata I/O error: block 0x7dec700 ("xfs_trans_read_buf_map") error 19 numblks 32
Dec  3 11:27:05 backup7 kernel: XFS (dm-2): xfs_imap_to_bp: xfs_trans_read_buf() returned error -19.
Dec  3 11:27:05 backup7 kernel: ScaleIO R2_5 mapVolIO_ReportIOErrorIfNeeded:361 :[7567910448] IO-ERROR comb: 0. offsetInComb 0. SizeInLB 0. SDS_ID 0. Comb Gen 0. Head Gen 16ac.
Dec  3 11:27:05 backup7 kernel: ScaleIO R2_5 mapVolIO_ReportIOErrorIfNeeded:374 :Vol ID 0x7dfb023900000046. Last fault Status IO_FAULT_NOT_PRI(12).Last error Status NOT_FOUND(3) Reason (failed getting LB-Info) Retry count (0) chan (0)
Dec  3 11:27:05 backup7 kernel: blk_update_request: I/O error, dev scinia, sector 132042496
Dec  3 11:27:05 backup7 kernel: XFS (dm-2): metadata I/O error: block 0x7dec700 ("xfs_trans_read_buf_map") error 19 numblks 32
Dec  3 11:27:05 backup7 kernel: XFS (dm-2): xfs_imap_to_bp: xfs_trans_read_buf() returned error -19.
Dec  3 11:27:05 backup7 kernel: ScaleIO R2_5 mapVolIO_ReportIOErrorIfNeeded:361 :[7567910460] IO-ERROR comb: 0. offsetInComb 0. SizeInLB 0. SDS_ID 0. Comb Gen 0. Head Gen 16ac.
Dec  3 11:27:05 backup7 kernel: ScaleIO R2_5 mapVolIO_ReportIOErrorIfNeeded:374 :Vol ID 0x7dfb023900000046. Last fault Status IO_FAULT_NOT_PRI(12).Last error Status NOT_FOUND(3) Reason (failed getting LB-Info) Retry count (0) chan (0)
Dec  3 11:27:05 backup7 kernel: blk_update_request: I/O error, dev scinia, sector 132042496
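To confirm whether a host is hitting this issue, the system log can be searched for the signatures shown above. A minimal sketch follows; the log path `/var/log/messages` assumes a RHEL-style distribution (adjust for others), and the sample line is taken from this article for illustration:

```shell
# Search pattern covering the three signatures from this article:
# the allocation failure, the NO_RESOURCES status, and the scini error.
PATTERN='page allocation failure|NO_RESOURCES|Failed to allocate memory'

# On a live host you would run, for example:
#   grep -E "$PATTERN" /var/log/messages
#   grep 'blk_update_request: I/O error, dev scini' /var/log/messages

# Demonstration against a sample line from this article:
sample='Dec 3 10:40:50 backup7 kernel: net_sched: page allocation failure: order:4, mode:0x104020'
echo "$sample" | grep -E "$PATTERN"
```

A match on any of these patterns, together with `blk_update_request` errors on `scini*` devices, points at this memory fragmentation scenario rather than a network or SDS problem.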
   

      Impact   

   

      The SDC is not functional.   

   

      SDC disconnection.   

   

      Lost access to one or more volumes.   

                                                             

 

 

Cause:

 

 

   

      Root cause   

   

      There was not enough contiguous memory for the SDC.   

   

      Memory fragmentation and low available memory on the host.   

   

      Because the Linux machine had low available memory and the memory was fragmented, there was not enough memory for the SDC.   

   

      By design, the SDC uses large chunks for memory allocation; in this specific case the SDC requested approximately 36 kB (36288 bytes) of memory, which could not be allocated:     
          

   
Dec 3 10:40:50 backup7 kernel: ScaleIO R2_5 mapClass_AllocAndInitObj:1212 :Error: Failed to allocate memory 36288.Cannot process MDM response    
        
From the messages file: there was approximately 132 MB of free memory; however, there were not enough large chunks (32 kB, 64 kB, and so on) available for memory allocation, which resulted in the kernel panic:    
   

      There were 29631 chunks of 4 kB plus 1755 chunks of 8 kB available = 132564 kB (approximately 132 MB), all in small blocks:     
          

   
Dec 3 10:40:50 backup7 kernel: Node 0 Normal: 29631*4kB (UEM) 1755*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 132564kB    
   
Note: This issue is unlikely to occur on a machine with a few GB of available memory.     
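The per-order counts in the "Node 0 Normal" line above come from the kernel's buddy allocator; on a live host the same counts can be read from /proc/buddyinfo, where each line lists the number of free blocks of order 0 through 10 (4 kB, 8 kB, 16 kB, ... 4096 kB). The following sketch (the 32 kB cutoff is chosen only to illustrate the "large chunks" point from this article) sums the free memory held in chunks large enough for a high-order request, using a sample line mirroring the state shown above:

```shell
# Sample /proc/buddyinfo-style line matching the "Node 0 Normal" state in
# this article: 29631 free 4 kB blocks, 1755 free 8 kB blocks, nothing larger.
line='Node 0, zone Normal 29631 1755 0 0 0 0 0 0 0 0 0'

# Fields 5..15 are the free-block counts for orders 0..10; a block of
# order N is 4*2^N kB. Sum everything at order 3 (32 kB) or above:
echo "$line" | awk '{
    kb = 0
    for (i = 5; i <= NF; i++)    # walk the per-order counts
        if (i - 5 >= 3)          # keep only chunks of 32 kB and larger
            kb += $i * 4 * 2^(i - 5)
    print kb " kB free in chunks >= 32 kB"
}'
# Prints: 0 kB free in chunks >= 32 kB
```

A result of 0 here, despite >100 MB total free, is exactly the fragmentation state that made the SDC's order-4 allocation fail.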
                                                             

 

 

Resolution:

 

 

   

      Workaround   

   

      Note: A host reboot clears the memory fragmentation temporarily, until the next time the issue occurs.   

   

      From the SDC side, there is no workaround, as the behavior is by design.   

   

      From the host side:   

   

      1) Add more memory and make sure that the available memory remains high enough.   

   

      2) In this specific case the SDC Linux machines are VMs; moving the SDC to the ESXi host resolves the issue, as the ESXi hosts have a few GB of memory available.   

   

      3) Verify whether the running applications/services might have caused or contributed to the memory fragmentation.   
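The steps above can be supported by a simple watch on available memory, so that action is taken before fragmentation starves high-order allocations. This is a generic Linux sketch, not a VxFlex-documented procedure: the 2 GB threshold is an illustrative assumption, and the commented remediations use standard kernel interfaces (requires root):

```shell
# Assumed alert floor: warn when MemAvailable drops below ~2 GB.
THRESHOLD_KB=2097152

# On a live host you would read /proc/meminfo directly; a sample line in
# the same format is used here for illustration:
meminfo='MemAvailable:     945312 kB'
avail_kb=$(echo "$meminfo" | awk '/MemAvailable/ {print $2}')

if [ "$avail_kb" -lt "$THRESHOLD_KB" ]; then
    echo "WARNING: only ${avail_kb} kB available; risk of SDC allocation failure"
    # Possible remediations on a live host (standard kernel tunables,
    # run as root; values are examples, not VxFlex-mandated settings):
    #   echo 1 > /proc/sys/vm/compact_memory    # coalesce free pages now
    #   sysctl -w vm.min_free_kbytes=262144     # keep more free headroom
fi
```

Raising `vm.min_free_kbytes` makes the kernel start reclaim earlier, and on-demand compaction can temporarily rebuild large contiguous blocks, but neither replaces the primary fix of keeping enough memory available on the host.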

                                                             

 

 

Notes:

 

 

   

      Impacted Versions   

   

      Any ScaleIO (SIO)/VxFlex OS version.   

   

      Fixed in Version   

   

      N/A