Isilon OneFS fundamentals of locks and locking

NOTE: This topic is part of the Uptime Information Hub.


1. Introduction



The term lock is used in a large number of situations in computing, and especially in the case of Network-Attached Storage (NAS). While the underlying concepts are largely similar, the overloading of the term has been seen to lead to a fair degree of confusion. This article will attempt to cover the various instances where the term “lock” appears and describe them so that the similarities and differences are apparent.

1.1 Background

One possible problem with the choice of the word lock is that the word has a meaning outside of computing, and the analogy to, for example, a lock on a door is not a good one. Locks in computing are generally used for “mutual exclusion” or concurrency control, not for security or access control. As you will learn from this article, from the lowest-level locking of the system through to protocol and application-level locking, the purpose is to synchronize multiple operations acting on a shared resource so that correctness and data integrity are maintained.

1.2 Fundamentals of locking

Locking serves the purpose of preventing simultaneous, uncoordinated access to a shared resource, for example:

  • Multiple CPUs attempting to update the same memory location in a multiprocessor system.
  • Multiple filesystem code paths attempting to update the file size in an inode.
  • Multiple client application threads attempting to update a shared file.

In all of the cases above, without some form of concurrency control, the risk of data loss or corruption is high. Locks enable implementation of a “critical section” where only one actor can be active at a time.
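As a minimal sketch of a critical section, the following Python fragment uses the standard `threading.Lock` to serialize updates to a shared counter. The names `counter`, `add_many`, and `run` are illustrative only and are unrelated to any OneFS interface.

```python
import threading

counter = 0                      # the shared resource
counter_lock = threading.Lock()  # guards every update to `counter`


def add_many(n):
    """Increment the shared counter n times, one critical section per update."""
    global counter
    for _ in range(n):
        with counter_lock:       # only one thread at a time runs this block
            counter += 1


def run(threads=4, n=10000):
    """Start several threads that all update the same counter."""
    global counter
    counter = 0
    workers = [threading.Thread(target=add_many, args=(n,)) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter
```

With the lock in place, `run(4, 10000)` returns exactly 40000. Without it, the read-modify-write of `counter` can interleave between threads and updates can be lost.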

2. Operating system mutual exclusion

Contemporary operating systems use mutual exclusion extensively. In modern hardware systems where multiple CPU sockets and cores are common, the need for concurrency control mechanisms is somewhat obvious. Less obvious is that, even in a system with a single CPU, such mechanisms are still needed due to pre-emption (interrupts). Any critical section of code that operates on data that is also accessed in an interrupt service routine must at the very least lock out interrupts while it executes.

2.1 Isilon OneFS

The OneFS operating system is based on the FreeBSD operating system. As such, it provides a rich collection of locking primitives. Details can be found in the section 9 “locking” manual (man) page on a FreeBSD system. All current EMC Isilon hardware ships with at least four CPU cores, and the operating system makes heavy use of the locking primitives to ensure safe and correct operation.


This low level of locking is not generally of interest to most users of the system.

3. OneFS Distributed Lock Manager (DLM)


OneFS is a distributed, single namespace, scale-out filesystem. All nodes in an Isilon cluster operate on the same single-namespace filesystem simultaneously. Because of this, the filesystem requires mutual exclusion mechanisms to function correctly. Unlike the operating system locks, the scope of the locks implemented by the Distributed Lock Manager (DLM) in OneFS is the entire cluster.

3.1 Cluster membership and quorum


OneFS implements a tightly-coupled, clustered system. Every node that is a member of the cluster has a list of other known member nodes, and all nodes constantly attempt to communicate with each other to determine an aggregate group membership. All active members have a view of the group, which consists of the state (UP, SOFT_FAIL, DEAD, GONE [additionally STALLED for drives]) of all nodes and their drives. This comprises the cluster and its state.


To ensure that data is protected from erroneous simultaneous access, the system also uses the concept of quorum in two variants: read quorum and write quorum. If the group that a node is currently a member of does not have the appropriate quorum (at the least, the group must consist of a strict majority of the known members of the cluster), the node is not allowed the corresponding access to the filesystem. That is, a node that is not part of the group that has write quorum cannot write to OneFS, and read access works similarly. Without this rule, corruption could occur. For example, a 10-node cluster could split into two groups of five nodes each, with each side continuing to access the filesystem without coordinating with the other. Requiring a strict majority prevents this scenario: by definition, there can be at most one majority group.
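The strict-majority condition can be sketched as a one-line check. This is only an illustration of the arithmetic; the function name `has_quorum` is hypothetical, and the actual OneFS quorum rules may impose further conditions beyond a strict majority.

```python
def has_quorum(group_size, known_members):
    """True if the group is a strict majority of the known cluster members."""
    return 2 * group_size > known_members
```

For the 10-node example, `has_quorum(5, 10)` is false for both halves of the split, so neither side may proceed, whereas a group of six nodes would satisfy the check.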

3.2 Distributed Lock Manager (DLM)


The OneFS Distributed Lock Manager (DLM) implements lock domains that provide coherent, recoverable, cluster-wide locks for use by the OneFS filesystem. These allow controlled access to filesystem objects, as well as providing the underlying basis for higher-level filesystem locks such as byte-range locks, semantic locks, etc.


The DLM distributes the lock data across all the nodes in the cluster. In a heterogeneous cluster where node memory sizes differ, the DLM will balance the memory utilization so that smaller memory nodes are not penalized.


In the case of a node leaving the group (for example, reboot, power loss, etc.), the DLM recovers the lock state and rebalances the locks across the surviving nodes in the group.


The DLM also constantly probes for, detects, and breaks deadlocks for locks in the LIN domain (see below).

3.3 Lock manager domains


As mentioned previously, the DLM implements multiple lock domains. Each lock domain implements a set of (key,value) pairs. Additionally, each lock can support a “byte range” (pair of file offsets), and a “user data” block.

3.3.1 Logical Inode Number (LIN)

Every object in the OneFS filesystem (file, directory, internal special LINs) is indexed by a logical inode number. This is similar to an inode in a traditional POSIX filesystem. A LIN provides an extra level of indirection, providing pointers to the mirrored copies of the on-disk inode. Additionally, unlike inodes in a traditional POSIX filesystem, LINs are never re-used, and they are dynamically allocated so there is no static inode limit (it is only bounded by free space).


The LIN domain is keyed by the LIN (logical inode number). This domain is used to provide mutual exclusion around classic VNODE operations. Operations that require a stable view of data, for example, read operations, take a read lock which allows other readers to operate simultaneously, but prevents modification. Operations that change data, for example, creating a file in a directory, take a write lock that prevents others from accessing that directory while the change is taking place.
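The read/write behavior described above is the usual shared/exclusive pattern. The following toy, single-process table keyed by LIN illustrates the compatibility rules only; the class `LinLockTable` and its methods are hypothetical and do not reflect how the DLM is implemented.

```python
from collections import defaultdict


class LinLockTable:
    """Toy table of shared (read) and exclusive (write) locks keyed by LIN."""

    def __init__(self):
        self._readers = defaultdict(int)  # LIN -> number of read holders
        self._writers = set()             # LINs currently write-locked

    def try_read(self, lin):
        """Grant a read lock unless a writer holds the LIN."""
        if lin in self._writers:
            return False
        self._readers[lin] += 1
        return True

    def try_write(self, lin):
        """Grant a write lock only if there are no readers and no writer."""
        if lin in self._writers or self._readers[lin] > 0:
            return False
        self._writers.add(lin)
        return True

    def unlock_read(self, lin):
        self._readers[lin] -= 1

    def unlock_write(self, lin):
        self._writers.discard(lin)
```

Two readers of the same LIN are both granted, a writer is refused until both readers unlock, and locks on different LINs never interact.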

3.3.2 Data

The datalock lock domain implements locks on regions of data within a file. By reducing the locking granularity to below the file level, this enables simultaneous writers to multiple sections of a single file.
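To illustrate why range-level granularity permits simultaneous writers, the sketch below tests whether two byte ranges overlap. The `(offset, length)` representation and the name `ranges_conflict` are assumptions made for this example.

```python
def ranges_conflict(a, b):
    """a and b are (offset, length) pairs; True if the byte ranges overlap."""
    a_start, a_len = a
    b_start, b_len = b
    return a_start < b_start + b_len and b_start < a_start + a_len
```

Writers to `(0, 4096)` and `(4096, 4096)` do not conflict and could proceed in parallel, whereas `(0, 4096)` and `(4095, 10)` share a byte and would have to be serialized.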


3.3.3 Mirrored Data Structure (MDS)

All metadata in the OneFS filesystem is mirrored for protection. Operations involving read/write of such metadata are protected using locks in the MDS domain.


3.3.4 Delete

The ref lock domain exists to enable POSIX delete-on-close semantics. In a POSIX filesystem, unlinking an open file does not remove the space associated with the file until every thread accessing that file closes the file.
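The POSIX behavior described here can be observed from any local process on a POSIX system. The following sketch uses only the Python standard library; the function name `unlink_while_open` is illustrative.

```python
import os
import tempfile


def unlink_while_open():
    """Unlink a file that is still open, then read it through the descriptor."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        os.unlink(path)                      # remove the name; the file stays open
        name_gone = not os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)              # data is still reachable via the descriptor
        return name_gone, data
    finally:
        os.close(fd)                         # last close: the space can now be reclaimed
```

The name disappears from the directory immediately, but the data remains readable until the last descriptor is closed.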


3.3.5 Advlock

The advisory lock domain implements local POSIX advisory locks and NFS NLM locks.


3.3.6 AV

The AV domain implements locks used by the OneFS Antivirus feature.


3.3.7 SMB byte-range

The cbrl lock domain implements support for Windows byte-range locking.


3.3.8 Idmap

The idmap database contains mappings between POSIX (uid, gid) and Windows (SID) identities. This lock domain provides concurrency control for this database.


3.3.9 Opportunistic locks (Oplock)

The Oplock lock domain implements the underlying support for oplocks and leases. See below.


3.3.10 Quota

The Quota domain is used to implement concurrency control for quota domain records.


3.3.11 Share mode

The share_mode_lock domain is used to implement the Windows share mode locks detailed below.

3.4 DLM from the end-user perspective


Much like the OS-level locking, in general, the DLM functionality should be invisible/irrelevant to end users. One feature that is noteworthy is the hangdump infrastructure. The OneFS DLM utilizes heuristics for the expected maximum time to wait to obtain a lock. When these timings are exceeded, OneFS triggers a diagnostic information-gathering process referred to as a hangdump. It is important to understand that the triggering of a hangdump is not necessarily indicative of an issue but should prompt further investigation.



4. Byte-range locking

Byte-range locks enable filesystem clients to coordinate access to subsections of files between multiple threads of execution. Both Windows and POSIX-based Operating Systems provide APIs that allow for the locking of specific byte-ranges within files.

4.1 POSIX versus Windows semantics


Although both Operating Systems provide the facility to perform byte-range locking, there are significant semantic differences between the two as follows:

  • POSIX range locks do not stack; multiple locks on a single range are equivalent to one, and a single unlock completely unlocks the range even if there were multiple locks taken out against it.
  • In contrast, Windows range unlocks must match the range taken.
  • Outside of NFSv4, POSIX locks are generally advisory, whereas Windows locks are mandatory.
  • NFSv4 locks may be either mandatory or advisory.  It is left up to the implementer of the NFSv4 file server to decide which to offer with the only stipulation being that the NFSv4 file server must offer only one or the other, not both.
  • POSIX and Windows range locks also differ in the recursive semantics and fairness guarantees.
  • Finally, the Windows semantics for 0-length locks are “unusual”.
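The first point (POSIX range locks do not stack) can be demonstrated on a local POSIX system with the standard `fcntl.lockf` call. Because POSIX locks are owned by a process, the sketch forks a child to probe whether the range is still held. The helper names are illustrative, and this exercises the local operating system rather than OneFS or NFS.

```python
import fcntl
import os
import tempfile


def child_can_lock(path, length):
    """Fork a child that tries a non-blocking exclusive lock on bytes [0, length)."""
    pid = os.fork()
    if pid == 0:                               # child: a separate lock owner
        status = 1
        try:
            fd = os.open(path, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, length, 0)
            status = 0                         # acquired: the range was free
        except OSError:
            status = 1                         # refused: the parent still holds it
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0


def posix_locks_do_not_stack():
    """Lock the same range twice, unlock once, and report (held, released)."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 100)
        fcntl.lockf(fd, fcntl.LOCK_EX, 10, 0)  # lock bytes 0-9
        fcntl.lockf(fd, fcntl.LOCK_EX, 10, 0)  # "second" lock on the same range
        held = not child_can_lock(path, 10)
        fcntl.lockf(fd, fcntl.LOCK_UN, 10, 0)  # a single unlock
        released = child_can_lock(path, 10)
        return held, released
    finally:
        os.close(fd)
        os.unlink(path)
```

After two lock calls and one unlock, the child is able to take the range, which shows that the two locks collapsed into one.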

4.2 Intra-protocol considerations


All versions of the NFS protocol prior to version 4 were stateless. This is a problem for file locking, because file locking is an inherently stateful operation, and it has ramifications for the implementation of byte-range locking in NFSv3. Unlike NFSv4, where locking is part of the core protocol operating over port 2049, NFSv3 locking sits outside the core NFS protocol and instead utilizes the NLM protocol, which is commonly implemented by two daemons: lockd and statd.


Aside from the requirements regarding two-way communication between the client and server lockd implementations, a concern with respect to NFSv3 locking is that there is no way for the server to determine if a client that is holding locks will ever release them. On OneFS, it is possible via the CLI or API to list NLM locks, and to explicitly release them should the need arise.



4.3 Cross-protocol interactions

4.3.1 OneFS versions 7.1.1 and earlier

In earlier releases of OneFS, there are two distinct DLM domains that implement byte-range locking for the primary protocols of NFS and SMB. The advlock domain implements POSIX-style locking and is utilized both locally and for NFS locking (both v3 and v4). The cbrl domain implements the Windows byte-range locking. Because the implementation uses two distinct DLM lock domains, there is no contention between the two, or put more simply, there is no cross-protocol locking and an NFS byte-range lock is not visible to a Windows client and vice versa.


4.3.2 OneFS versions 7.2.0 and later

While the advlock domain still exists in OneFS 7.2.0 and later releases, it is no longer utilized by the OneFS NFS stack. Local processes that utilize POSIX-style locks still operate using this lock domain, but both NFS and SMB now utilize the cbrl lock domain, and as such, locks are respected across these two protocols.
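The version-dependent behavior in sections 4.3.1 and 4.3.2 can be summarized as a small lookup. This is a sketch of the rules as stated in this article, not an actual OneFS API; the function names and the version-tuple representation are assumptions, and any release below 7.2.0 is treated here as "earlier".

```python
def byte_range_domain(onefs_version, accessor):
    """Return the lock domain used for byte-range locks.

    onefs_version: tuple such as (7, 2, 0); accessor: 'nfs', 'smb', or 'local'.
    """
    if accessor == "smb":
        return "cbrl"
    if accessor == "local":
        return "advlock"
    if accessor == "nfs":
        return "cbrl" if onefs_version >= (7, 2, 0) else "advlock"
    raise ValueError(f"unknown accessor: {accessor}")


def locks_visible_across(onefs_version, a, b):
    """Two accessors see each other's byte-range locks only in a shared domain."""
    return byte_range_domain(onefs_version, a) == byte_range_domain(onefs_version, b)
```

Under this model, NFS and SMB locks do not interact on 7.1.1 but do on 7.2.0, while local POSIX-style locks remain in the advlock domain in both cases.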



5. Share-mode locks


Share-mode locks are Windows-specific. When a file is in use (open) by one client, the share modes that it specifies determine whether a second client will be able to open that same file at the same time.

5.1 History (why do these exist?)


Since we’ve already covered byte-range locking, you might wonder what purpose these serve. The answer is rooted in history. Historically, for example under DOS, there was no way for more than one process/actor to open a file at the same time. There was no networking and no multithreading of the OS. This presented a problem when network filesystems appeared. Multiple clients could potentially open the same file at the same time, and corrupt the data. Rather than change the access API and force every application to be rewritten, the decision was made to add share modes. Newer code explicitly specifies the level of sharing (see the dwShareMode argument to the Win32 CreateFile function), and calls from the older code simply don’t allow sharing at all.

5.2 Semantics


When opening a file from Windows, the options are to not allow any sharing, or to allow any combination of read, write, and/or delete. If the modes are not compatible, the open will fail with the well-known “sharing violation” error.
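A simplified model of the compatibility check is sketched below. It assumes the commonly documented rule that a new open must request only access that every existing open shares, and must itself share the access that existing opens already hold. The constants and the function `open_allowed` are hypothetical and are not part of any Windows or OneFS API.

```python
READ = 1
WRITE = 2
DELETE = 4


def open_allowed(existing, access, share):
    """Decide whether a new open is compatible with the existing opens.

    existing: list of (access, share) bitmask pairs for current opens.
    access, share: bitmasks requested by the new open.
    """
    for held_access, held_share in existing:
        # The new open may only ask for access that the existing open shares...
        if access & ~held_share:
            return False
        # ...and must itself share whatever access the existing open holds.
        if held_access & ~share:
            return False
    return True
```

For example, if a file is open for reading with read sharing only, a second open for writing is refused, which corresponds to the "sharing violation" error. An existing open with a share mode of 0, as with legacy callers, blocks every later open.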

5.3 Cross-protocol interactions


There are no cross-protocol interactions. SMB clients contend for the share-mode locks: if a client already has a file open with share-mode flags that are not compatible with a requested open, the new open is denied with a sharing violation. SMB is the only protocol to utilize this lock domain, and access via other protocols is not constrained by the share mode of any existing SMB opens.



6. Oplocks and Leases


Opportunistic locks (oplocks for short) and leases are a performance enhancement mechanism whereby the server cooperates with a client and allows the client to aggressively cache data under specific conditions. Oplocks allow a Windows client to cache read-ahead data, writes, opens, closes, and byte-range lock acquisitions. Starting from SMB2.1, Microsoft introduced the concept of leases, which provide more fine-grained and flexible caching for the clients and allow oplock (lease) upgrades in addition to oplock breaks (see below).

6.1 Protocol specifics

  • SMB1, SMB2.0 – oplocks are defined and used in these protocol versions. These are fully supported in OneFS.
  • SMB2.1 and later – oplocks are still supported but leases are also included in the protocol. These offer a number of improvements over oplocks. These are fully supported in OneFS.
  • NFSv3 – there is nothing in the protocol that allows for anything like leases or oplocks.
  • NFSv4 – the protocol offers optional support for file and directory delegations which are very similar to SMB leases. These are not currently supported by the OneFS NFSv4 server.

6.2 Oplock/Lease functionality

6.2.1 Oplock grant/definition

When a Windows client attempts to open a file, it can request no oplock or request a batch or exclusive oplock (see below). Once the open has passed the server’s access and share mode checks, the server must do one of the following:

  • Grant the client its requested oplock on the file (exclusive or batch).
  • Grant the client a lower-level oplock on the file, called a level II oplock, defined below.
  • Grant the client no oplock at all on the file.


The various types, ranked from the lowest amount of caching to the highest, are the following:

  • Level II (shared) - Level II oplocks, also referred to as shared oplocks, grant clients the ability to cache the results of read operations. This means a client can prefetch data that an application may want to read, as well as retain old read data, allowing its reads to be more efficient. Multiple clients can hold level II oplocks at the same time, but all existing level II oplocks are broken when a client tries to write data to the file.
  • Exclusive - Exclusive oplocks grant clients the ability to retain read data, like level II oplocks, but also allow clients to cache data and metadata writes and byte-range lock acquisitions. Unlike level II oplocks, a client cannot be granted an exclusive oplock if the file is already opened. If a client is granted an exclusive oplock, it is able to cache writes to the file, cache metadata changes (such as timestamps, but not ACLs) and cache range locks of the file via byte-range locking. As soon as there is another opener, either from the same client or a different client, the server must request to break the exclusive oplock, in order to guarantee the second opener has access to the most up-to-date data.
  • Batch - Batch oplocks are identical to exclusive oplocks, except that they also allow clients to cache open/close operations. The origins of this type of oplock are from the days of DOS batch files; batch files were opened and closed for every line of the script to be executed.


6.2.2 Oplock break (loss)

There are two types of oplock breaks: level I breaks and level II breaks. An oplock break occurs when an oplock is contended, due to a conflicting file operation. To understand oplock breaks, we must first understand contention.

Contention

Contention is the condition that arises when two locks, either held or requested, conflict with one another. In our case, it is when an operation on one File ID (FID) (a FID is the “handle” that SMB uses to refer to a file) conflicts with a currently held oplock on a different FID, on the same client or on a different client. When an oplock contends with some operation, the oplock is broken. Here are our rules for oplock contention:

  • A level II oplock contends with modifying operations, such as writes and truncates, as well as byte-range lock acquisitions.
  • An exclusive oplock contends with an open operation, except for stat-only opens.
  • A batch oplock contends with an open, delete, or rename operation.
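The contention rules listed above, together with the requirement that the operation arrive on a different FID, can be written as a small decision function. This is a sketch of the rules as stated in this article only. The string labels and the function `contends` are hypothetical, and any special case for batch oplocks is not modeled.

```python
MODIFYING = {"write", "truncate", "byte_range_lock"}


def contends(oplock, oplock_fid, operation, op_fid, stat_only=False):
    """True if `operation` on `op_fid` contends with `oplock` held on `oplock_fid`."""
    if oplock_fid == op_fid:
        return False                       # a FID never contends with its own oplock
    if oplock == "level2":
        return operation in MODIFYING
    if oplock == "exclusive":
        return operation == "open" and not stat_only
    if oplock == "batch":
        return operation in {"open", "delete", "rename"}
    return False                           # no oplock held, nothing to break
```

For instance, a write on FID 2 contends with a level II oplock held on FID 1, while a stat-only open does not contend with an exclusive oplock.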


Contention can occur if the operations are from the same or a different Windows client. However, an operation on a FID does not contend against the FID’s own oplock; the FIDs must be different to contend. This normally means that opening the same file a second time will contend with the first opening of the file, since the second opening will be returned a different FID. However, there is one special case which occurs when batch oplocks are being broken, which is described in the next section.

Level I Oplock Breaks

The two level I oplocks, exclusive and batch, are broken in different ways. An exclusive oplock is broken when the file it pertains to has been requested for opening. Batch oplocks are broken when the same file is opened from a different client or when the file is deleted or renamed.


When the server needs to break a level I oplock, it must give the client a chance to perform all of the operations that it has cached. Before the server can respond to the open request from the second client, it must wait for an acknowledgment of the oplock break from the first client. The first client now has the chance to do any of the following:

  • Flush cached metadata or data.
  • Send byte-range locking requests.


Once the client has completed flushing its cached operations, it must now relinquish its oplock. It can do this in one of two ways: by closing the file or by responding to the server in acknowledgment that it has downgraded its oplock. When a client decides to downgrade its oplock, it can choose to accept a level II oplock, or it can inform the server that it wants no oplock at all.


After the client has acknowledged the oplock break, the server is free to respond to the open request from the second client. The server may also give the second client a level II oplock, allowing it to cache read data.


Since the server must wait for acknowledgment of its oplock break request, it also must be able to time out an unresponsive client. OneFS will time out an oplock break after 30 seconds.

Level II Oplock Breaks

A FID’s level II oplock is broken when a modifying operation or a byte range lock acquisition is performed on a different FID. The server must tell the first FID that its oplock has been broken and that it can no longer cache read data. Unlike the exclusive oplock break, the server does not need to wait for an acknowledgment of the oplock break from the client and can continue processing the write request right away.



7. Leases


Leases are similar to (and compatible with) oplocks, but superior in a number of areas:

  • Leases contend based on a client key, not a FID, so two different programs on a client accessing the same file can share a lease whereas they cannot share an oplock.
  • There are more lease types:
    • A Read (R) lease (shared) indicates that there are multiple readers of a stream and no writers. This supports client read caching (similar to Level II oplock).
    • A Read-Handle (RH) lease (shared) indicates that there are multiple readers of a stream, no writers, and that a client can keep a stream open on the server even though the local accessor on the client machine has closed the stream. This supports client read caching and handle caching (level II plus handle caching).
    • A Read-Write (RW) lease (exclusive) allows a client to open a stream for exclusive access and allows the client to perform arbitrary buffering. This supports client read caching and write caching (Level I Exclusive).
    • A Read-Write-Handle (RWH) lease (exclusive) allows a client to keep a stream open on the server even though the local accessor on the client machine has closed the stream. This supports client read caching, write caching, and handle caching (Level I Batch).
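The four lease types listed above are combinations of three caching capabilities, which the following sketch expresses with Python's standard `enum.Flag`. The names `Caching`, `LEASES`, `OPLOCK_EQUIVALENT`, and `is_exclusive` are illustrative and do not come from the SMB protocol or OneFS.

```python
from enum import Flag, auto


class Caching(Flag):
    READ = auto()
    WRITE = auto()
    HANDLE = auto()


# Lease type -> caching it permits, as listed above.
LEASES = {
    "R": Caching.READ,
    "RH": Caching.READ | Caching.HANDLE,
    "RW": Caching.READ | Caching.WRITE,
    "RWH": Caching.READ | Caching.WRITE | Caching.HANDLE,
}

# Lease type -> roughly equivalent oplock, as listed above.
OPLOCK_EQUIVALENT = {
    "R": "Level II",
    "RH": "Level II plus handle caching",
    "RW": "Level I Exclusive",
    "RWH": "Level I Batch",
}


def is_exclusive(lease):
    """Leases that permit write caching are the exclusive ones."""
    return Caching.WRITE in LEASES[lease]
```

Under this model, `is_exclusive("RW")` is true and `is_exclusive("RH")` is false, matching the shared and exclusive labels in the list.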

7.1 Cross-protocol interactions


In OneFS, it is of particular importance to note that file operations (for example, open, write, read, …) always contend against oplocks/leases. In other words, even for protocols with no support for oplocks/leases, these operations may be delayed to allow SMB clients to correctly flush their caches and drop their oplocks/leases.

8. Other

8.1 Snapshot “locking”


The OneFS SnapshotIQ feature allows for snapshots to be “locked”:


A snapshot lock prevents a snapshot from being deleted. If a snapshot has one or more locks applied to it, the snapshot cannot be deleted and is referred to as a locked snapshot. If the duration period of a locked snapshot expires, OneFS will not delete the snapshot until all locks on the snapshot have been deleted.


OneFS applies snapshot locks to ensure that snapshots generated by OneFS applications are not deleted prematurely. For this reason, it is recommended that you do not delete snapshot locks or modify the duration period of snapshot locks.
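The deletion rule described above can be modeled as two small predicates. This is only a sketch of the stated behavior. The function names and the numeric timestamp representation are assumptions, not SnapshotIQ interfaces.

```python
def may_delete_snapshot(lock_count):
    """A snapshot with one or more locks applied cannot be deleted."""
    return lock_count == 0


def expiry_deletion_due(now, expires_at, lock_count):
    """An expired snapshot is deleted only once all of its locks are gone."""
    return now >= expires_at and may_delete_snapshot(lock_count)
```

An expired snapshot that still carries a lock is therefore retained until the last lock is removed.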

8.2 SmartLock


The SmartLock feature prevents users from modifying and deleting files until a configurable retention period expires. It allows files to be committed to a write-once, read-many state. The file can neither be modified nor deleted until after the configured retention period expires. The SmartLock feature can aid in compliance with Securities and Exchange Commission Rule 17a-4.

