
Isilon storage performance issue

prakashpatidar

Hi all,

We have a setup of four X200 storage nodes, where each node has 6 GB of RAM and 12 × 7200 RPM SATA disks (except storage node #4, which has 11 disks), for a total of 47 disks.

We access the storage cluster from 4 compute nodes via the NFS protocol. Front-end networking from the compute nodes to the storage cluster goes through a 1 Gbps switch (each compute node can read and write to the storage cluster at 1 Gbps); back-end networking uses QDR InfiniBand.

 

We ran a performance evaluation to find the maximum random-read IOPS for 128K reads (we chose 128K because our application, Elasticsearch, reads data in 128K chunks).

We used the FIO tool to check performance.
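
A FIO job along these lines matches the test described below (a sketch only: the mount point, file size and runtime are placeholders, not the exact values we used):

fio --name=randread128k --directory=/mnt/isilon/testdata --rw=randread --bs=128k --ioengine=libaio --direct=1 --numjobs=20 --size=4g --runtime=300 --time_based --group_reporting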

Our observations:

1. With 20 FIO jobs (20 parallel reader threads) running on a single compute node, we get at most 850 IOPS at 128K (maxing out the node's 1 Gbps link). Running the same FIO job on 2 compute nodes in parallel (40 reader threads in total, 20 on each), performance drops to 673 and 691 IOPS respectively, 1364 IOPS in total. Running it on 4 compute nodes in parallel (80 reader threads in total, 20 on each), performance drops further to 252, 255, 567 and 550 IOPS, 1624 IOPS in total. By contrast, 8K random reads from the 4 compute nodes (80 reader threads, 20 per node) give 4602 IOPS.


My question :

Why are we stuck at a maximum of 1624 IOPS for 128K random reads, when the 47 disks should be able to deliver much higher IOPS (with 8K random reads they give 4602 IOPS)?

I understand that when we read 128K chunks we are effectively reading more data, but in this case neither the network from compute to storage nor the CPU on the storage and compute nodes was the bottleneck (850 IOPS × 128K ≈ 106 MB/s does saturate a single 1 Gbps link, but the four-node total of 1624 IOPS × 128K ≈ 203 MB/s is well below the 4 Gbps aggregate). In all of the above cases we used an 8K disk block size on Isilon and the random access pattern to optimize IOPS; protection was +2:1.

We also tried increasing the disk block size on Isilon, but performance decreased as the block size increased.

 

It would be great if you could recommend tuning parameters to optimize IOPS for 128K reads.


Regards,

Prakash

 

 

  • 1. Re: Isilon storage performance issue
    Peter Serocka

    Have you seen this? I'd recommend having a look:

     

    Ask the Expert: Isilon Performance Analysis

     

    Maybe a few things can be checked in advance (before tracking things down to disk level):

    - double check that no background jobs are running and stealing CPU or IOPS

    - with four clients, is the network traffic well balanced across the four Isilon nodes?

    - are the actual NFS read/write sizes large enough for 128K? (server and client negotiate a match within their limits.)

    - is the random access pattern really in effect?

    - for 128K reads, one could also try the concurrency pattern...

     

    -- Peter

  • 2. Re: Isilon storage performance issue
    prakashpatidar

    Thanks Peter for your valuable response!

    - No background jobs were taking CPU or IOPS.

    - Need to check. (Is looking at client connections on the Management UI dashboard the right way? We are using SmartConnect with the basic licence; I think that means round-robin policy.)

    - We mounted NFS on the compute nodes with rsize=131072,wsize=131072 (need to check whether the negotiated value is really 128K; see the client-side check after this list).

    - We made the setting in the Management UI; can you suggest ways to check whether it is really in effect?

    - Need to check (will update you once it is done).
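
    On the NFS sizes, a quick way to see the values actually in effect on a Linux client (generic commands, nothing setup-specific assumed):

    nfsstat -m

    grep nfs /proc/mounts

    Both list the mount options negotiated with the server, including rsize and wsize.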

  • 3. Re: Isilon storage performance issue
    Peter Serocka

    > - Need to check. (Is looking at client connections on the Management UI dashboard the right way? We are using SmartConnect with the basic licence; I think that means round-robin policy.)

     

    The WebGUI is OK, but IMO too slow for live monitoring.

    On the command line interface (CLI):

    isi nfs clients ls

    isi perfstat

     

    Other useful views for live monitoring (just to start with...)

    isi statistics system --nodes --top

    isi statistics client --orderby=Ops --top

    isi statistics heat --top

    isi statistics pstat

    (These are OneFS 6.5 commands; on 7.0 the syntax might vary, or use isi_classic instead.)

     

    > - We made the setting in the Management UI; can you suggest ways to check whether it is really in effect?


    isi get "filename"

    isi get -DD "filename"


    The latter shows the actual layout of the file on the cluster disks (in very verbose form, though it is not so easy to count the number of disks used). Usually files with "streaming" access should spread onto more disks, but on small (or fragmented?) clusters the difference between streaming/random/concurrency might appear minimal.


    isi set -l {concurrency|streaming|random} -r g retune "filename"


    will actually change the layout if needed. (I trust this more than the WebUI.) Even when it has finished, it might take a few more seconds until the changes show up with isi get -DD.


    And if you have SmartPools enabled, make sure you allow settings per file instead of SmartPools ruling everything. (Or do use SmartPools, but then you would need to run a SmartPools job each time to implement changes.)


    The access pattern (as listed by isi get) also affects prefetching; try out all three choices. You are doing random IO, but on 128K chunks -- certainly larger than 4K or 8K. And you have many concurrent accesses, so it's hard to predict. It is even possible to fine-tune prefetching beyond the presets for the three patterns, but I would keep that for later...

     

    -- Peter

  • 4. Re: Isilon storage performance issue
    prakashpatidar

    Thanks Peter,

    I will check the details and update you (the setup is not with me at the moment).

    In the meantime I have some questions to understand the performance bottleneck; it would be great to get answers to the questions below:

     

    Considering I have four X200 nodes, the filesystem block size is 8K, and data protection is +2:1 (2-disk or 1-node failure):

    1. If I write a file of 768 KB (128K × 6), will it write 6 data stripe units and 2 parity stripe units (128K each, forming a protection group of 8 stripe units) across the 4 nodes, such that each node gets 2 stripe units (at least one data stripe unit per node)?

    2. When it writes a 128K stripe unit (16 blocks) on a given node, where the node has 12 disks in my case, how many disks will it use? Will it use all 12 disks, or write the 128K on one disk only and use a second disk for the next stripe unit on that node, and so on?

    3. What will happen if my file size is exactly 128K? Will it write the 128K stripe on 3 nodes (mirrored, to survive a 2-disk or 1-node failure) and create 2x overhead? (The arithmetic I have in mind for questions 1 and 3 is spelled out after this list.)

    4. If my application opens a file over NFS, writes a 128K chunk and keeps appending further 128K chunks to the same file, does OneFS stripe every 128K chunk as it arrives, or does it wait for multiple 128K chunks and stripe the file later to lay it out properly?

    5. If I change the filesystem block size from 8K to 32K, does that mean my stripe unit will now be 16 × 32K?

    6. OneFS uses 16 contiguous blocks to create one stripe unit; can we change 16 to some other value?

    7. Can we access the Isilon storage cluster from a compute node (running RHEL) using the SMB protocol? I read in a performance benchmark from the storage council that SMB performance is almost double that of NFS in terms of IOPS.
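
    To spell out the arithmetic behind questions 1 and 3 (this is my own back-of-the-envelope reasoning, which I'd like confirmed, not a statement of how OneFS actually behaves): if a 768 KB file is laid out as 6 data + 2 FEC stripe units of 128K each, that is 8 × 128K = 1024K on disk for 768K of data, roughly 1.33x. If a single 128K file is instead mirrored three ways, it occupies 3 × 128K = 384K, i.e. 3x the logical size (2x overhead).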

     

    Thanks & Regards,

    Prakash

  • 5. Re: Isilon storage performance issue
    Peter Serocka

    A couple of thoughts and suggestions:

     

    http://www.emc.com/collateral/hardware/white-papers/h10719-isilon-onefs-technical-overview-wp.pdf

     

    is really worth reading to learn more about the filesystem layout. And questions similar to yours have been discussed here recently:

     

    Isilon overhead for 174kb files

    How many sizes does Isilon consume when writing 128kb file with N+2:1 and N+3:1

     

    With all that information, it is really fun to examine a file's disk/block layout as reported by isi get -DD "file".

     

    Furthermore:

     

    > 5. If I change the filesystem block size from 8K to 32K, does that mean my stripe unit will now be 16 × 32K?

     

    I don't think you can do so - which exact setting are you referring to?

     

    > 6. OneFS uses 16 contiguous blocks to create one stripe unit; can we change 16 to some other value?

     

    I can't imagine so, but the access pattern parameter controls whether a larger or smaller number of disks per node is used (under the constraint of the chosen protection level).

     

    > 7. Can we access the Isilon storage cluster from a compute node (running RHEL) using the SMB protocol? I read in a performance benchmark from the storage council that SMB performance is almost double that of NFS in terms of IOPS.

     

    In benchmarks, SMB IOPS appear higher than NFS IOPS because the set of protocol operations is different even for identical workloads, not to mention when different workloads are used. You cannot compare the resulting values...

     

    For your original test, you might be maxing out the disk IOPS (xfers), but you could also get stuck at a certain rate of your "application IOPS" while seeing little or no disk activity at all(!), because your data is mostly or entirely in the OneFS cache. Check the "disk IOPS" or xfers, including the average size per xfer, with

     

    isi statistics drive -nall -t --long --orderby=OpsOut

     

    and cache hit rates for data (level 1 & 2) with:

     

    isi_cache_stats -v 2

     

    With very effective caching, the IOPS will NOT be limited by disk transfers (so all that filesystem block size reasoning doesn't apply). Instead the limit is imposed by CPU usage, or network bandwidth, or by protocol (network + execution) latency even if CPU or bandwidth < 100%. In the latter case, doing more requests in parallel should be possible (it seems you are on that track anyway with multiple jobs); see the sketch below.
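
    If you want to drive parallelism further than simply adding FIO jobs, each job can also keep several requests in flight; a sketch (illustrative paths and sizes only, adjust to your setup):

    fio --name=randread128k --directory=/mnt/isilon/testdata --rw=randread --bs=128k --ioengine=libaio --direct=1 --numjobs=20 --iodepth=8 --size=4g --runtime=300 --time_based --group_reporting

    With --ioengine=libaio and --iodepth greater than 1, each thread keeps multiple asynchronous reads outstanding, which helps when latency rather than bandwidth is the limit.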

     

    To check protocol latencies, use "isi statistics client" as before and add --long:

     

    isi statistics client --orderby=Ops --top --long

     

    This will show latency times as:  TimeMax    TimeMin    TimeAvg   (also useful for --orderby=... !)

     

    Good luck!

     

    Peter

  • 6. Re: Isilon storage performance issue
    Rdamal

    Peter, that is a great explanation.   

    Could you explain a little more about the commands? I mean, what do we have to look for that says something is wrong when we type in these commands?

     

    isi statistics system --nodes --top

    isi statistics client --orderby=Ops --top

    isi statistics pstat

    isi statistics drive -nall -t --long --orderby=OpsOut

    isi statistics client --orderby=Ops --top --long


    Thanks,

    Damal

  • 7. Re: Isilon storage performance issue
    Peter Serocka

    Hi Damal


    This is roughly how I use these commands to investigate performance issues in new situations:


    > isi statistics pstat


    Bottom half: an overall view to see what's going on. Compare Network -- Filesystem -- Disk throughputs to see whether they are consistent with what is expected for the workflow, as far as it is known. In a typical NAS workflow, network throughput should match filesystem throughput. Write throughput should match disk write throughput (mind the protection overhead). Read throughput matches disk read throughput for uncached workflows, or can be much higher with good caching.

     

    High disk activity without network or filesystem activity indicates some internal job is running (restripes etc.).

     

    High CPU without any network/filesystem/disk activity would be very strange (some process running wild, for example).

     

    Just illustrating how things can be learned from pstat.

     

    The top half of pstat is protocol specific; it might need to be run separately for NFS and SMB. Again, a quick check for consistency: does the observed mix of reads/writes etc. make sense in light of the assumed workflow?

     

    > isi statistics system --nodes --top

     

    A quick check whether the load is well balanced across the cluster, and how the protocols are used (SMB vs NFS etc.). It can also indicate where physical network bandwidth hits the max.

     

    > isi statistics client --orderby=Ops --top

     

    Who is causing the load?

    --orderby=In or Out or TimeAvg for throughputs or latencies resp.

     

    > isi statistics client --orderby=Ops --top --long

     

    With --long there are InAvg and OutAvg, which denote the request ("block") sizes (NOT the average of the In and Out rates!). Small request sizes often indicate suboptimal configs on the client side.

     

    > isi statistics drive -nall -t --long --orderby=OpsOut

     

    Do the disk activities max out? Also --orderby=OpsIn or TimeInQ

     

    Do the disk activities match the assumed workload? Small SizeIn and SizeOut request sizes indicate metadata ops or small random IO ops; pure streaming reads/writes are usually up to 64K.

     

    Disclaimer: these are just my personal favorites (plus isi statistics heat), and I might err in my interpretations; I am not aiming to convince anyone.

     

    Cheers

     

    -- Peter

  • 8. Re: Isilon storage performance issue
    Rdamal

    Peter, thank you very much for the explanation. I hope other people who visit this page will make corrections, if required

     

    With --long there are InAvg and OutAvg, which denote the request ("block") sizes (NOT the average of the In and Out rates!). Small request sizes often indicate suboptimal configs on the client side.

     

    When you say small, is there a particular value we should look for? Any recommendations on what configuration has to be changed on the client side?

  • 9. Re: Isilon storage performance issue
    Peter Serocka

    Hi Damal:

     

    For NFS, it's basically the max and preferred read and write sizes (512KB, 128KB, 512KB, 512KB respectively on the Isilon side). Just make sure that clients do not limit requests to smaller sizes via the NFS mount params rsize and wsize (which act as the maximum possible sizes). These might be set in the client's fstab or automount map; some systems have other places to set global defaults. Random small IOs will result in smaller request sizes, of course, as they can't get coalesced on the client side.
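
    For illustration, a client-side fstab entry that does not cap the request sizes below the Isilon limits might look like this (host name, export path and mount point are made up for the example):

    isilon-sc:/ifs/data  /mnt/isilon  nfs  vers=3,hard,rsize=524288,wsize=524288  0 0

    The point is simply that rsize/wsize set the maximum the client will ask for; the server can still negotiate them down.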

     

    SMB1 is limited to 64KB requests, which has always been pretty bad, while recent SMB versions allow much larger requests. "Secure packet signing" reduces the allowed request sizes though. For potential restrictions on the client side, please refer to the Windows (or other client OS, resp. Samba) docs.

     

    Cheers

     

    -- Peter

  • 10. Re: Isilon storage performance issue
    Rdamal

    Peter, thank you for the responses. It gives better insight into handling these issues.

     

     

    Best Regards,

    Yoga

  • 11. Re: Isilon storage performance issue
    asafayan

    Fantastic summary Peter.  Thank you very, very much.

     

    Amir