OSDI '06 Paper

Pp. 307–320 of the Proceedings

Ceph: A Scalable, High-Performance Distributed File System

Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn
University of California, Santa Cruz
{sage, scott, elm, darrell, carlosm}@cs.ucsc.edu

Abstract

We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.

1 Introduction

System designers have long sought to improve the performance of file systems, which have proved critical to the overall performance of an exceedingly broad class of applications. The scientific and high-performance computing communities in particular have driven advances in the performance and scalability of distributed storage systems, typically predicting more general purpose needs by a few years. Traditional solutions, exemplified by NFS [20], provide a straightforward model in which a server exports a file system hierarchy that clients can map into their local name space. Although widely used, the centralization inherent in the client/server model has proven a significant obstacle to scalable performance.

More recent distributed file systems have adopted architectures based on object-based storage, in which conventional hard disks are replaced with intelligent object storage devices (OSDs) which combine a CPU, network interface, and local cache with an underlying disk or RAID [4,7,8,32,35]. OSDs replace the traditional block-level interface with one in which clients can read or write byte ranges to much larger (and often variably sized) named objects, distributing low-level block allocation decisions to the devices themselves. Clients typically interact with a metadata server (MDS) to perform metadata operations (open, rename), while communicating directly with OSDs to perform file I/O (reads and writes), significantly improving overall scalability.

Systems adopting this model continue to suffer from scalability limitations due to little or no distribution of the metadata workload. Continued reliance on traditional file system principles like allocation lists and inode tables and a reluctance to delegate intelligence to the OSDs have further limited scalability and performance, and increased the cost of reliability.

We present Ceph, a distributed file system that provides excellent performance and reliability while promising unparalleled scalability. Our architecture is based on the assumption that systems at the petabyte scale are inherently dynamic: large systems are inevitably built incrementally, node failures are the norm rather than the exception, and the quality and character of workloads are constantly shifting over time.

Ceph decouples data and metadata operations by eliminating file allocation tables and replacing them with generating functions. This allows Ceph to leverage the intelligence present in OSDs to distribute the complexity surrounding data access, update serialization, replication and reliability, failure detection, and recovery. Ceph utilizes a highly adaptive distributed metadata cluster architecture that dramatically improves the scalability of metadata access, and with it, the scalability of the entire system. We discuss the goals and workload assumptions motivating our choices in the design of the architecture, analyze their impact on system scalability and performance, and relate our experiences in implementing a functional system prototype.

2 System Overview

The Ceph file system has three main components: the client, each instance of which exposes a near-POSIX file system interface to a host or process; a cluster of OSDs, which collectively stores all data and metadata; and a metadata server cluster, which manages the namespace (file names and directories) while coordinating security, consistency and coherence (see Figure 1). We say the Ceph interface is near-POSIX because we find it appropriate to extend the interface and selectively relax consistency semantics in order to better align with the needs of applications and to improve system performance.

Figure 1: System architecture. Clients perform file I/O by communicating directly with OSDs. Each process can either link directly to a client instance or interact with a mounted file system.

The primary goals of the architecture are scalability (to hundreds of petabytes and beyond), performance, and reliability. Scalability is considered in a variety of dimensions, including the overall storage capacity and throughput of the system, and performance in terms of individual clients, directories, or files. Our target workload may include such extreme cases as tens or hundreds of thousands of hosts concurrently reading from or writing to the same file or creating files in the same directory. Such scenarios, common in scientific applications running on supercomputing clusters, are increasingly indicative of tomorrow's general purpose workloads. More importantly, we recognize that distributed file system workloads are inherently dynamic, with significant variation in data and metadata access as active applications and data sets change over time. Ceph directly addresses the issue of scalability while simultaneously achieving high performance, reliability and availability through three fundamental design features: decoupled data and metadata, dynamic distributed metadata management, and reliable autonomic distributed object storage.

Decoupled Data and Metadata: Ceph maximizes the separation of file metadata management from the storage of file data. Metadata operations (open, rename, etc.) are collectively managed by a metadata server cluster, while clients interact directly with OSDs to perform file I/O (reads and writes). Object-based storage has long promised to improve the scalability of file systems by delegating low-level block allocation decisions to individual devices. However, in contrast to existing object-based file systems [4,7,8,32] which replace long per-file block lists with shorter object lists, Ceph eliminates allocation lists entirely. Instead, file data is striped onto predictably named objects, while a special-purpose data distribution function called CRUSH [29] assigns objects to storage devices. This allows any party to calculate (rather than look up) the name and location of objects comprising a file's contents, eliminating the need to maintain and distribute object lists, simplifying the design of the system, and reducing the metadata cluster workload.

Dynamic Distributed Metadata Management: Because file system metadata operations make up as much as half of typical file system workloads [22], effective metadata management is critical to overall system performance. Ceph utilizes a novel metadata cluster architecture based on Dynamic Subtree Partitioning [30] that adaptively and intelligently distributes responsibility for managing the file system directory hierarchy among tens or even hundreds of MDSs. A (dynamic) hierarchical partition preserves locality in each MDS's workload, facilitating efficient updates and aggressive prefetching to improve performance for common workloads. Significantly, the workload distribution among metadata servers is based entirely on current access patterns, allowing Ceph to effectively utilize available MDS resources under any workload and achieve near-linear scaling in the number of MDSs.

Reliable Autonomic Distributed Object Storage: Large systems composed of many thousands of devices are inherently dynamic: they are built incrementally, they grow and contract as new storage is deployed and old devices are decommissioned, device failures are frequent and expected, and large volumes of data are created, moved, and deleted. All of these factors require that the distribution of data evolve to effectively utilize available resources and maintain the desired level of data replication. Ceph delegates responsibility for data migration, replication, failure detection, and failure recovery to the cluster of OSDs that store the data, while at a high level, OSDs collectively provide a single logical object store to clients and metadata servers. This approach allows Ceph to more effectively leverage the intelligence (CPU and memory) present on each OSD to achieve reliable, highly available object storage with linear scaling.

We describe the operation of the Ceph client, metadata server cluster, and distributed object store, and how they are affected by the critical features of our architecture. We also describe the status of our prototype.

3 Client Operation

We introduce the overall operation of Ceph's components and their interaction with applications by describing Ceph's client operation. The Ceph client runs on each host executing application code and exposes a file system interface to applications. In the Ceph prototype, the client code runs entirely in user space and can be accessed either by linking to it directly or as a mounted file system via FUSE [25] (a user-space file system interface). Each client maintains its own file data cache, independent of the kernel page or buffer caches, making it accessible to applications that link to the client directly.

3.1 File I/O and Capabilities

When a process opens a file, the client sends a request to the MDS cluster. An MDS traverses the file system hierarchy to translate the file name into the file inode, which includes a unique inode number, the file owner, mode, size, and other per-file metadata. If the file exists and access is granted, the MDS returns the inode number, file size, and information about the striping strategy used to map file data into objects. The MDS may also issue the client a capability (if it does not already have one) specifying which operations are permitted. Capabilities currently include four bits controlling the client's ability to read, cache reads, write, and buffer writes. In the future, capabilities will include security keys allowing clients to prove to OSDs that they are authorized to read or write data [13,19] (the prototype currently trusts all clients). Subsequent MDS involvement in file I/O is limited to managing capabilities to preserve file consistency and achieve proper semantics.
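To make the capability mechanism concrete, here is a minimal sketch of how such a four-bit capability might be represented and checked on the client; the constant names and class layout are illustrative assumptions, not the prototype's actual data structures.

```python
# Hypothetical sketch of Ceph-style client capabilities: four bits
# controlling read, cache-reads, write, and buffer-writes permissions.
# Names and layout are illustrative, not taken from the prototype.

CAP_READ = 1 << 0     # may read synchronously from OSDs
CAP_CACHE = 1 << 1    # may cache read data locally
CAP_WRITE = 1 << 2    # may write synchronously to OSDs
CAP_BUFFER = 1 << 3   # may buffer writes before flushing

class Capability:
    def __init__(self, inode_no: int, bits: int):
        self.inode_no = inode_no
        self.bits = bits

    def allows(self, needed: int) -> bool:
        """True only if every requested capability bit has been granted."""
        return (self.bits & needed) == needed

# Example: an MDS grants read + cache-read to a client opening for read.
cap = Capability(inode_no=0x1234, bits=CAP_READ | CAP_CACHE)
assert cap.allows(CAP_READ)
assert not cap.allows(CAP_WRITE | CAP_BUFFER)
```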

Ceph generalizes a range of striping strategies to map file data onto a sequence of objects. To avoid any need for file allocation metadata, object names simply combine the file inode number and the stripe number. Object replicas are then assigned to OSDs using CRUSH, a globally known mapping function (described in Section 5.1). For example, if one or more clients open a file for read access, an MDS grants them the capability to read and cache file content. Armed with the inode number, layout, and file size, the clients can name and locate all objects containing file data and read directly from the OSD cluster. Any objects or byte ranges that don't exist are defined to be file "holes," or zeros. Similarly, if a client opens a file for writing, it is granted the capability to write with buffering, and any data it generates at any offset in the file is simply written to the appropriate object on the appropriate OSD. The client relinquishes the capability on file close and provides the MDS with the new file size (the largest offset written), which redefines the set of objects that (may) exist and contain file data.
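A small sketch of the calculation any party can perform, assuming a fixed stripe size and an illustrative object-name format; Ceph's actual striping strategies and naming are more general, and placement of the resulting objects goes through CRUSH as described in Section 5.1.

```python
# Sketch: locating file data without allocation metadata. An object name is
# derived purely from (inode number, stripe number), so any client holding
# the inode number, layout, and file size can enumerate a file's objects.
# The stripe size and name format below are illustrative assumptions.

STRIPE_SIZE = 4 * 1024 * 1024  # assumed 4 MB stripe unit

def object_for_offset(inode_no: int, offset: int) -> tuple[str, int]:
    """Return (object name, offset within that object) for a byte offset."""
    stripe_no = offset // STRIPE_SIZE
    return f"{inode_no:x}.{stripe_no:08x}", offset % STRIPE_SIZE

def objects_for_file(inode_no: int, file_size: int) -> list[str]:
    """Name every object that may hold data for a file of the given size."""
    stripes = (file_size + STRIPE_SIZE - 1) // STRIPE_SIZE
    return [f"{inode_no:x}.{s:08x}" for s in range(stripes)]

print(object_for_offset(0x1234, 9_000_000))   # ('1234.00000002', 611392)
print(objects_for_file(0x1234, 9_000_000))    # three object names
```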

3.2 Client Synchronization

POSIX semantics sensibly require that reads reflect any data previously written, and that writes are atomic (i.e., the result of overlapping, concurrent writes will reflect a particular order of occurrence). When a file is opened by multiple clients with either multiple writers or a mix of readers and writers, the MDS will revoke any previously issued read caching and write buffering capabilities, forcing client I/O for that file to be synchronous. That is, each application read or write operation will block until it is acknowledged by the OSD, effectively placing the burden of update serialization and synchronization with the OSD storing each object. When writes span object boundaries, clients acquire exclusive locks on the affected objects (granted by their respective OSDs), and immediately submit the write and unlock operations to achieve the desired serialization. Object locks are similarly used to mask latency for large writes by acquiring locks and flushing data asynchronously.

Not surprisingly, synchronous I/O can be a performance killer for applications, particularly those doing small reads or writes, due to the latency penalty of at least one round-trip to the OSD. Although read-write sharing is relatively rare in general-purpose workloads [22], it is more common in scientific computing applications [27], where performance is often critical. For this reason, it is often desirable to relax consistency at the expense of strict standards conformance in situations where applications do not rely on it. Although Ceph supports such relaxation via a global switch, and many other distributed file systems punt on this issue [20], this is an imprecise and unsatisfying solution: either performance suffers, or consistency is lost system-wide.

For precisely this reason, a set of high-performance computing extensions to the POSIX I/O interface has been proposed by the high-performance computing (HPC) community [31], a subset of which are implemented by Ceph. Most notably, these include an O_LAZY flag for open that allows applications to explicitly relax the usual coherency requirements for a shared-write file. Performance-conscious applications which manage their own consistency (e.g., by writing to different parts of the same file, a common pattern in HPC workloads [27]) are then allowed to buffer writes or cache reads when I/O would otherwise be performed synchronously. If desired, applications can then explicitly synchronize with two additional calls: lazyio_propagate will flush a given byte range to the object store, while lazyio_synchronize will ensure that the effects of previous propagations are reflected in any subsequent reads. The Ceph synchronization model thus retains its simplicity by providing correct read-write and shared-write semantics between clients via synchronous I/O, and extending the application interface to relax consistency for performance-conscious distributed applications.

3.3 Namespace Operations

Client interaction with the file system namespace is managed by the metadata server cluster. Both read operations (e.g., readdir, stat) and updates (e.g., unlink, chmod) are synchronously applied by the MDS to ensure serialization, consistency, correct security, and safety. For simplicity, no metadata locks or leases are issued to clients. For HPC workloads in particular, callbacks offer minimal upside at a high potential cost in complexity.

Instead, Ceph optimizes for the most common metadata access scenarios. A readdir followed by a stat of each file (e.g., ls -l) is an extremely common access pattern and notorious performance killer in large directories. A readdir in Ceph requires only a single MDS request, which fetches the entire directory, including inode contents. By default, if a readdir is immediately followed by one or more stats, the briefly cached information is returned; otherwise it is discarded. Although this relaxes coherence slightly in that an intervening inode modification may go unnoticed, we gladly make this trade for vastly improved performance. This behavior is explicitly captured by the readdirplus [31] extension, which returns lstat results with directory entries (as some OS-specific implementations of getdir already do).

Ceph could allow consistency to be further relaxed by caching metadata longer, much like earlier versions of NFS, which typically cache for 30 seconds. However, this approach breaks coherency in a way that is often critical to applications, such as those using stat to determine if a file has been updated: they either behave incorrectly, or end up waiting for old cached values to time out.

We opt instead to again provide correct behavior and extend the interface in instances where it adversely affects performance. This choice is most clearly illustrated by a stat operation on a file currently opened by multiple clients for writing. In order to return a correct file size and modification time, the MDS revokes any write capabilities to momentarily stop updates and collect up-to-date size and mtime values from all writers. The highest values are returned with the stat reply, and capabilities are reissued to allow further progress. Although stopping multiple writers may seem drastic, it is necessary to ensure proper serializability. (For a single writer, a correct value can be retrieved from the writing client without interrupting progress.) Applications for which coherent behavior is unnecessary (victims of a POSIX interface that doesn't align with their needs) can use statlite [31], which takes a bit mask specifying which inode fields are not required to be coherent.

4 Dynamically Distributed Metadata

Metadata operations often make up as much as half of file system workloads [22] and lie in the critical path, making the MDS cluster critical to overall performance. Metadata management also presents a critical scaling challenge in distributed file systems: although capacity and aggregate I/O rates can scale almost arbitrarily with the addition of more storage devices, metadata operations involve a greater degree of interdependence that makes scalable consistency and coherence management more difficult.

File and directory metadata in Ceph is very small, consisting almost entirely of directory entries (file names) and inodes (80 bytes). Unlike conventional file systems, no file allocation metadata is necessary: object names are constructed using the inode number, and distributed to OSDs using CRUSH. This simplifies the metadata workload and allows our MDS to efficiently manage a very large working set of files, independent of file sizes. Our design further seeks to minimize metadata related disk I/O through the use of a two-tiered storage strategy, and to maximize locality and cache efficiency with Dynamic Subtree Partitioning [30].

4.1 Metadata Storage

Although the MDS cluster aims to satisfy most requests from its in-memory cache, metadata updates must be committed to disk for safety. A set of large, bounded, lazily flushed journals allows each MDS to quickly stream its updated metadata to the OSD cluster in an efficient and distributed manner. The per-MDS journals, each many hundreds of megabytes, also absorb repetitive metadata updates (common to most workloads) such that when old journal entries are eventually flushed to long-term storage, many are already rendered obsolete. Although MDS recovery is not yet implemented by our prototype, the journals are designed such that in the event of an MDS failure, another node can quickly rescan the journal to recover the critical contents of the failed node's in-memory cache (for quick startup) and in doing so recover the file system state.

This strategy provides the best of both worlds: streaming updates to disk in an efficient (sequential) fashion, and a vastly reduced re-write workload, allowing the long-term on-disk storage layout to be optimized for future read access. In particular, inodes are embedded directly within directories, allowing the MDS to prefetch entire directories with a single OSD read request and exploit the high degree of directory locality present in most workloads [22]. Each directory's content is written to the OSD cluster using the same striping and distribution strategy as metadata journals and file data. Inode numbers are allocated in ranges to metadata servers and considered immutable in our prototype, although in the future they could be trivially reclaimed on file deletion. An auxiliary anchor table [28] keeps the rare inode with multiple hard links globally addressable by inode number, all without encumbering the overwhelmingly common case of singly-linked files with an enormous, sparsely populated and cumbersome inode table.

4.2 Dynamic Subtree Partitioning

Our primary-copy caching strategy makes a single authoritative MDS responsible for managing cache coherence and serializing updates for any given piece of metadata. While most existing distributed file systems employ some form of static subtree-based partitioning to delegate this authority (usually forcing an administrator to carve the dataset into smaller static "volumes"), some recent and experimental file systems have used hash functions to distribute directory and file metadata [4], effectively sacrificing locality for load distribution. Both approaches have critical limitations: static subtree partitioning fails to cope with dynamic workloads and data sets, while hashing destroys metadata locality and critical opportunities for efficient metadata prefetching and storage.

Figure 2: Ceph dynamically maps subtrees of the directory hierarchy to metadata servers based on the current workload. Individual directories are hashed across multiple nodes only when they become hot spots.

Ceph's MDS cluster is based on a dynamic subtree partitioning strategy [30] that adaptively distributes cached metadata hierarchically across a set of nodes, as illustrated in Figure 2. Each MDS measures the popularity of metadata within the directory hierarchy using counters with an exponential time decay. Any operation increments the counter on the affected inode and all of its ancestors up to the root directory, providing each MDS with a weighted tree describing the recent load distribution. MDS load values are periodically compared, and appropriately-sized subtrees of the directory hierarchy are migrated to keep the workload evenly distributed. The combination of shared long-term storage and carefully constructed namespace locks allows such migrations to proceed by transferring the appropriate contents of the in-memory cache to the new authority, with minimal impact on coherence locks or client capabilities. Imported metadata is written to the new MDS's journal for safety, while additional journal entries on both ends ensure that the transfer of authority is invulnerable to intervening failures (similar to a two-phase commit). The resulting subtree-based partition is kept coarse to minimize prefix replication overhead and to preserve locality.
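The popularity measurement can be sketched as follows; the half-life constant, data structures, and path handling are illustrative assumptions rather than the prototype's implementation.

```python
# Sketch of exponentially decaying popularity counters: every operation
# increments a counter on the touched inode and all of its ancestors, so the
# resulting weighted tree reflects recent load on each subtree. The decay
# half-life and the path-keyed dictionary are assumptions for illustration.
import time

class PopCounter:
    HALF_LIFE = 60.0  # seconds (assumed)

    def __init__(self):
        self.value = 0.0
        self.stamp = time.monotonic()

    def _decay(self, now):
        elapsed = now - self.stamp
        self.value *= 0.5 ** (elapsed / self.HALF_LIFE)
        self.stamp = now

    def hit(self, amount=1.0):
        now = time.monotonic()
        self._decay(now)
        self.value += amount

    def get(self):
        self._decay(time.monotonic())
        return self.value

counters = {}  # path -> PopCounter

def record_op(path: str):
    """Increment the counter on the inode and every ancestor up to the root."""
    while True:
        counters.setdefault(path, PopCounter()).hit()
        if path == "/":
            break
        path = path.rsplit("/", 1)[0] or "/"

record_op("/home/alice/data/file0")
# counters["/"] now carries the aggregate load of the whole tree,
# counters["/home/alice"] the load of that subtree, and so on.
```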

When metadata is replicated across multiple MDS nodes, inode contents are separated into three groups, each with different consistency semantics: security (owner, mode), file (size, mtime), and immutable (inode number, ctime, layout). While immutable fields never change, security and file locks are governed by independent finite state machines, each with a different set of states and transitions designed to accommodate different access and update patterns while minimizing lock contention. For example, owner and mode are required for the security check during path traversal but rarely change, requiring very few states, while the file lock reflects a wider range of client access modes as it controls an MDS's ability to issue client capabilities.

4.3 Traffic Control

Partitioning the directory hierarchy across multiple nodes can balance a broad range of workloads, but cannot always cope with hot spots or flash crowds, where many clients access the same directory or file. Ceph uses its knowledge of metadata popularity to provide a wide distribution for hot spots only when needed and without incurring the associated overhead and loss of directory locality in the general case. The contents of heavily read directories (e.g., many opens) are selectively replicated across multiple nodes to distribute load. Directories that are particularly large or experiencing a heavy write workload (e.g., many file creations) have their contents hashed by file name across the cluster, achieving a balanced distribution at the expense of directory locality. This adaptive approach allows Ceph to encompass a broad spectrum of partition granularities, capturing the benefits of both coarse and fine partitions in the specific circumstances and portions of the file system where those strategies are most effective.

Every MDS response provides the client with updated information about the authority and any replication of the relevant inode and its ancestors, allowing clients to learn the metadata partition for the parts of the file system with which they interact. Future metadata operations are directed at the authority (for updates) or a random replica (for reads) based on the deepest known prefix of a given path. Normally clients learn the locations of unpopular (unreplicated) metadata and are able to contact the appropriate MDS directly. Clients accessing popular metadata, however, are told the metadata reside either on different or multiple MDS nodes, effectively bounding the number of clients believing any particular piece of metadata resides on any particular MDS, dispersing potential hot spots and flash crowds before they occur.
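A sketch of the client-side logic, under the assumption that the learned partition is kept as a simple prefix table; the table layout and names are purely illustrative.

```python
# Sketch: a client-side table of learned metadata placement. Each MDS reply
# teaches the client the authority (and any replicas) for a path prefix;
# later operations go to the authority for updates or a random replica for
# reads, based on the deepest known prefix of the path in question.
import random

placement = {
    "/":            {"auth": 0, "replicas": [0]},
    "/home":        {"auth": 2, "replicas": [2]},
    "/home/shared": {"auth": 2, "replicas": [1, 2, 5]},  # popular, replicated
}

def target_mds(path: str, is_update: bool) -> int:
    # Walk up the path until we reach the deepest prefix we know about.
    prefix = path
    while prefix not in placement:
        prefix = prefix.rsplit("/", 1)[0] or "/"
    entry = placement[prefix]
    return entry["auth"] if is_update else random.choice(entry["replicas"])

print(target_mds("/home/shared/lib/file3", is_update=False))  # one of 1, 2, 5
print(target_mds("/home/alice/x", is_update=True))            # 2
```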

5 Distributed Object Storage

From a high level, Ceph clients and metadata servers view the object storage cluster (possibly tens or hundreds of thousands of OSDs) as a single logical object store and namespace. Ceph's Reliable Autonomic Distributed Object Store (RADOS) achieves linear scaling in both capacity and aggregate performance by delegating management of object replication, cluster expansion, failure detection and recovery to OSDs in a distributed fashion.

5.1 Data Distribution with CRUSH

Ceph must distribute petabytes of data among an evolving cluster of thousands of storage devices such that device storage and bandwidth resources are effectively utilized. In order to avoid imbalance (e.g., recently deployed devices mostly idle or empty) or load asymmetries (e.g., new, hot data on new devices only), we adopt a strategy that distributes new data randomly, migrates a random subsample of existing data to new devices, and uniformly redistributes data from removed devices. This stochastic approach is robust in that it performs equally well under any potential workload.

Figure 3: Files are striped across many objects, grouped into placement groups (PGs), and distributed to OSDs via CRUSH, a specialized replica placement function.

Ceph first maps objects into placement groups (PGs) using a simple hash function, with an adjustable bit mask to control the number of PGs. We choose a value that gives each OSD on the order of 100 PGs to balance variance in OSD utilizations with the amount of replication-related metadata maintained by each OSD. Placement groups are then assigned to OSDs using CRUSH (Controlled Replication Under Scalable Hashing) [29], a pseudo-random data distribution function that efficiently maps each PG to an ordered list of OSDs upon which to store object replicas. This differs from conventional approaches (including other object-based file systems) in that data placement does not rely on any block or object list metadata. To locate any object, CRUSH requires only the placement group and an OSD cluster map: a compact, hierarchical description of the devices comprising the storage cluster. This approach has two key advantages: first, it is completely distributed such that any party (client, OSD, or MDS) can independently calculate the location of any object; and second, the map is infrequently updated, virtually eliminating any exchange of distribution-related metadata. In doing so, CRUSH simultaneously solves both the data distribution problem ("where should I store data") and the data location problem ("where did I store data"). By design, small changes to the storage cluster have little impact on existing PG mappings, minimizing data migration due to device failures or cluster expansion.
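The two-step mapping might be sketched as follows; the hash-based placement function below is only a stand-in for CRUSH [29] (it ignores weights, failure domains, and the map hierarchy), and the map contents shown are assumptions.

```python
# Sketch of the two-step placement: object name -> placement group (PG) via
# a hash and an adjustable bit mask, then PG -> ordered list of OSDs via a
# placement function parameterized only by the compact cluster map. The
# hash-based stand-in below is NOT the real CRUSH algorithm.
import hashlib

PG_BITS = 12                       # 2^12 = 4096 PGs (adjustable mask)
PG_MASK = (1 << PG_BITS) - 1

def pg_for_object(obj_name: str) -> int:
    h = int.from_bytes(hashlib.sha1(obj_name.encode()).digest()[:8], "big")
    return h & PG_MASK

def crush_stand_in(pg: int, cluster_map: dict, replicas: int = 3) -> list[str]:
    """Deterministically map a PG to an ordered list of distinct OSDs."""
    osds = [d for d in cluster_map["osds"] if d not in cluster_map["down"]]
    chosen, i = [], 0
    while len(chosen) < replicas:
        h = hashlib.sha1(f"{pg}.{i}".encode()).digest()
        cand = osds[int.from_bytes(h[:8], "big") % len(osds)]
        if cand not in chosen:      # real CRUSH also enforces failure domains
            chosen.append(cand)
        i += 1
    return chosen

cluster_map = {"epoch": 7, "osds": [f"osd{i}" for i in range(14)], "down": set()}
obj = "1234.00000002"
print(pg_for_object(obj), crush_stand_in(pg_for_object(obj), cluster_map))
```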

The cluster map hierarchy is structured to align with the cluster's physical or logical composition and potential sources of failure. For instance, one might form a four-level hierarchy for an installation consisting of shelves full of OSDs, rack cabinets full of shelves, and rows of cabinets. Each OSD also has a weight value to control the relative amount of data it is assigned. CRUSH maps PGs onto OSDs based on placement rules, which define the level of replication and any constraints on placement. For example, one might replicate each PG on three OSDs, all situated in the same row (to limit inter-row replication traffic) but separated into different cabinets (to minimize exposure to a power circuit or edge switch failure). The cluster map also includes a list of down or inactive devices and an epoch number, which is incremented each time the map changes. All OSD requests are tagged with the client's map epoch, such that all parties can agree on the current distribution of data. Incremental map updates are shared between cooperating OSDs, and piggyback on OSD replies if the client's map is out of date.

5.2 Replication

In contrast to systems like Lustre [4], which assume one can construct sufficiently reliable OSDs using mechanisms like RAID or fail-over on a SAN, we assume that in a petabyte or exabyte system, failure will be the norm rather than the exception, and at any point in time several OSDs are likely to be inoperable. To maintain system availability and ensure data safety in a scalable fashion, RADOS manages its own replication of data using a variant of primary-copy replication [2], while taking steps to minimize the impact on performance.

Data is replicated in terms of placement groups, each of which is mapped to an ordered list of n OSDs (for n-way replication). Clients send all writes to the first non-failed OSD in an object's PG (the primary), which assigns a new version number for the object and PG and forwards the write to any additional replica OSDs. After each replica has applied the update and responded to the primary, the primary applies the update locally and the write is acknowledged to the client. Reads are directed at the primary. This approach spares the client of any of the complexity surrounding synchronization or serialization between replicas, which can be onerous in the presence of other writers or failure recovery. It also shifts the bandwidth consumed by replication from the client to the OSD cluster's internal network, where we expect greater resources to be available. Intervening replica OSD failures are ignored, as any subsequent recovery (see Section 5.5) will reliably restore replica consistency.
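A simplified, in-memory sketch of this write path from the primary's point of view; failure handling and the separate ack/commit stages of Section 5.3 are omitted, and the class names are illustrative.

```python
# Simplified sketch of primary-copy replication for one placement group:
# the client sends its write to the primary, the primary assigns a new
# version, forwards the update to the replicas, and acknowledges the client
# only after every replica has applied it and the primary has applied it
# locally. Failure handling and the ack/commit distinction are omitted.

class OSD:
    def __init__(self, name):
        self.name = name
        self.store = {}      # object name -> (version, bytes)

    def apply(self, obj, version, data):
        self.store[obj] = (version, data)
        return "applied"

class PrimaryOSD(OSD):
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas
        self.pg_version = 0

    def client_write(self, obj, data):
        self.pg_version += 1                      # new version for object and PG
        for r in self.replicas:                   # forward to replica OSDs
            assert r.apply(obj, self.pg_version, data) == "applied"
        self.apply(obj, self.pg_version, data)    # then apply locally
        return "ack"                              # write acknowledged to client

replicas = [OSD("osd5"), OSD("osd9")]
primary = PrimaryOSD("osd2", replicas)
assert primary.client_write("1234.00000002", b"hello") == "ack"
```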

5.3 Data Safety

In distributed storage systems, there are essentially two reasons why data is written to shared storage. First, clients are interested in making their updates visible to other clients. This should be quick: writes should be visible as soon as possible, particularly when multiple writers or mixed readers and writers force clients to operate synchronously. Second, clients are interested in knowing definitively that the data they've written is safely replicated, on disk, and will survive power or other failures. RADOS disassociates synchronization from safety when acknowledging updates, allowing Ceph to realize both low-latency updates for efficient application synchronization and well-defined data safety semantics.

Figure 4 illustrates the messages sent during an object write. The primary forwards the update to replicas, and replies with an ack after it is applied to all OSDs' in-memory buffer caches, allowing synchronous POSIX calls on the client to return. A final commit is sent (perhaps many seconds later) when data is safely committed to disk. We send the ack to the client only after the update is fully replicated to seamlessly tolerate the failure of any single OSD, even though this increases client latency. By default, clients also buffer writes until they commit to avoid data loss in the event of a simultaneous power loss to all OSDs in the placement group. When recovering in such cases, RADOS allows the replay of previously acknowledged (and thus ordered) updates for a fixed interval before new updates are accepted.

Figure 4: RADOS responds with an ack after the write has been applied to the buffer caches on all OSDs replicating the object. Only after it has been safely committed to disk is a final commit notification sent to the client.

5.4 Failure Detection

Timely failure detection is critical to maintaining data safety, but can become difficult as a cluster scales to many thousands of devices. For certain failures, such as disk errors or corrupted data, OSDs can self-report. Failures that make an OSD unreachable on the network, however, require active monitoring, which RADOS distributes by having each OSD monitor those peers with which it shares PGs. In most cases, existing replication traffic serves as a passive confirmation of liveness, with no additional communication overhead. If an OSD has not heard from a peer recently, an explicit ping is sent.

RADOS considers two dimensions of OSD liveness: whether the OSD is reachable, and whether it is assigned data by CRUSH. An unresponsive OSD is initially marked down, and any primary responsibilities (update serialization, replication) temporarily pass to the next OSD in each of its placement groups. If the OSD does not quickly recover, it is marked out of the data distribution, and another OSD joins each PG to re-replicate its contents. Clients which have pending operations with a failed OSD simply resubmit to the new primary.

Because a wide variety of network anomalies may cause intermittent lapses in OSD connectivity, a small cluster of monitors collects failure reports and filters out transient or systemic problems (like a network partition) centrally. Monitors (which are only partially implemented) use elections, active peer monitoring, short-term leases, and two-phase commits to collectively provide consistent and available access to the cluster map. When the map is updated to reflect any failures or recoveries, affected OSDs are provided incremental map updates, which then spread throughout the cluster by piggybacking on existing inter-OSD communication. Distributed detection allows fast detection without unduly burdening monitors, while resolving the occurrence of inconsistency with centralized arbitration. Most importantly, RADOS avoids initiating widespread data re-replication due to systemic problems by marking OSDs down but not out (e.g., after a power loss to half of all OSDs).

5.5 Recovery and Cluster Updates

The OSD cluster map will change due to OSD failures, recoveries, and explicit cluster changes such as the deployment of new storage. Ceph handles all such changes in the same way. To facilitate fast recovery, OSDs maintain a version number for each object and a log of recent changes (names and versions of updated or deleted objects) for each PG (similar to the replication logs in Harp [14]).

When an active OSD receives an updated cluster map, it iterates over all locally stored placement groups and calculates the CRUSH mapping to determine which ones it is responsible for, either as a primary or replica. If a PG's membership has changed, or if the OSD has just booted, the OSD must peer with the PG's other OSDs. For replicated PGs, the OSD provides the primary with its current PG version number. If the OSD is the primary for the PG, it collects current (and former) replicas' PG versions. If the primary lacks the most recent PG state, it retrieves the log of recent PG changes (or a complete content summary, if needed) from current or prior OSDs in the PG in order to determine the correct (most recent) PG contents. The primary then sends each replica an incremental log update (or complete content summary, if needed), such that all parties know what the PG contents should be, even if their locally stored object set may not match. Only after the primary determines the correct PG state and shares it with any replicas is I/O to objects in the PG permitted. OSDs are then independently responsible for retrieving missing or outdated objects from their peers. If an OSD receives a request for a stale or missing object, it delays processing and moves that object to the front of the recovery queue.
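A rough sketch of the peering step, with PG logs reduced to a single version number for brevity; the function and structures are assumptions for illustration only.

```python
# Rough sketch of peering after a cluster-map change: the primary for each
# PG collects the PG version known by each current (and former) replica,
# takes the log from whichever peer is most up to date, and brings the
# others forward before allowing I/O. Real PG logs carry per-object names
# and versions; they are collapsed to a single version number here.

def peer(pg_id, primary, peers, pg_versions, pg_logs):
    """Return the authoritative (version, log) for pg_id and update all peers."""
    members = [primary] + peers
    newest = max(members, key=lambda osd: pg_versions[(pg_id, osd)])
    version = pg_versions[(pg_id, newest)]
    log = pg_logs[(pg_id, newest)]          # incremental log (or full summary)
    for osd in members:                     # share the correct PG state
        pg_versions[(pg_id, osd)] = version
        pg_logs[(pg_id, osd)] = log
    return version, log                     # I/O may now proceed; objects are
                                            # recovered lazily in the background

pg_versions = {("pgA", "osd1"): 40, ("pgA", "osd2"): 47}
pg_logs = {("pgA", "osd1"): ["...entries up to v40..."],
           ("pgA", "osd2"): ["...entries up to v47..."]}
print(peer("pgA", primary="osd1", peers=["osd2"],
           pg_versions=pg_versions, pg_logs=pg_logs))
```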

For example, suppose osd1 crashes and is marked down, and osd2 takes over as primary for pgA. If osd1 recovers, it will request the latest map on boot, and a monitor will mark it as up. When osd2 receives the resulting map update, it will realize it is no longer primary for pgA and send the pgA version number to osd1. osd1 will retrieve recent pgA log entries from osd2, tell osd2 its contents are current, and then begin processing requests while any updated objects are recovered in the background.

Because failure recovery is driven entirely by individual OSDs, each PG affected by a failed OSD will recover in parallel to (very likely) different replacement OSDs. This approach, based on the Fast Recovery Mechanism (FaRM) [37], decreases recovery times and improves overall data safety.

5.6 Object Storage with EBOFS

Although a variety of distributed file systems use local file systems like ext3 to manage low-level storage [4,12], we found their interface and performance to be poorly suited for object workloads [27]. The existing kernel interface limits our ability to understand when object updates are safely committed on disk. Synchronous writes or journaling provide the desired safety, but only with a heavy latency and performance penalty. More importantly, the POSIX interface fails to support atomic data and metadata (e.g., attribute) update transactions, which are important for maintaining RADOS consistency.

Instead, each Ceph OSD manages its local object storage with EBOFS, an Extent and B-tree based Object File System. Implementing EBOFS entirely in user space and interacting directly with a raw block device allows us to define our own low-level object storage interface and update semantics, which separate update serialization (for synchronization) from on-disk commits (for safety). EBOFS supports atomic transactions (e.g., writes and attribute updates on multiple objects), and update functions return when the in-memory caches are updated, while providing asynchronous notification of commits.

A user space approach, aside from providing greater flexibility and easier implementation, also avoids cumbersome interaction with the Linux VFS and page cache, both of which were designed for a different interface and workload. While most kernel file systems lazily flush updates to disk after some time interval, EBOFS aggressively schedules disk writes, and opts instead to cancel pending I/O operations when subsequent updates render them superfluous. This provides our low-level disk scheduler with longer I/O queues and a corresponding increase in scheduling efficiency. A user-space scheduler also makes it easier to eventually prioritize workloads (e.g., client I/O versus recovery) or provide quality of service guarantees [36].

Central to the EBOFS design is a robust, flexible, and fully integrated B-tree service that is used to locate objects on disk, manage block allocation, and index collections (placement groups). Block allocation is conducted in terms of extents (start and length pairs) instead of block lists, keeping metadata compact. Free block extents on disk are binned by size and sorted by location, allowing EBOFS to quickly locate free space near the write position or related data on disk, while also limiting long-term fragmentation. With the exception of per-object block allocation information, all metadata is kept in memory for performance and simplicity (it is quite small, even for large volumes). Finally, EBOFS aggressively performs copy-on-write: with the exception of superblock updates, data is always written to unallocated regions of disk.
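A toy sketch of extent-based free-space tracking, binned by a power-of-two size class and sorted by location; EBOFS's actual allocator and policies are more sophisticated, so treat everything below as an assumption.

```python
# Toy sketch of extent-based free-space management: free space is tracked as
# (start, length) extents, binned by a power-of-two size class and kept
# sorted by disk location, so an allocation can pick an extent of sufficient
# size near a location hint (e.g., the current write position).
import bisect

class ExtentAllocator:
    def __init__(self, free_extents):
        self.bins = {}                       # size class -> sorted [(start, length)]
        for start, length in free_extents:
            self._insert(start, length)

    def _insert(self, start, length):
        cls = length.bit_length()            # power-of-two size class
        bisect.insort(self.bins.setdefault(cls, []), (start, length))

    def allocate(self, length, near=0):
        # Search size classes whose extents can possibly hold the request.
        for cls in sorted(c for c in self.bins if (1 << c) > length):
            candidates = [e for e in self.bins[cls] if e[1] >= length]
            if not candidates:
                continue
            # Pick the candidate whose start is closest to the hint 'near'.
            start, elen = min(candidates, key=lambda e: abs(e[0] - near))
            self.bins[cls].remove((start, elen))
            if elen > length:                # return the unused tail to the free list
                self._insert(start + length, elen - length)
            return start, length
        raise MemoryError("no free extent large enough")

alloc = ExtentAllocator([(0, 4096), (10_000, 65_536), (200_000, 1_048_576)])
print(alloc.allocate(32_768, near=8_192))    # (10000, 32768)
```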

6 Performance and Scalability Evaluation

We evaluate our prototype under a range of microbenchmarks to demonstrate its performance, reliability, and scalability. In all tests, clients, OSDs, and MDSs are user processes running on a dual-processor Linux cluster with SCSI disks and communicating using TCP. In general, each OSD or MDS runs on its own host, while tens or hundreds of client instances may share the same host while generating workload.

6.1 Data Performance

EBOFS provides superior performance and safety semantics, while the balanced distribution of data generated by CRUSH and the delegation of replication and failure recovery allow aggregate I/O performance to scale with the size of the OSD cluster.

6.1.1 OSD Throughput

Figure 5: Per-OSD write performance. The horizontal line indicates the upper limit imposed by the physical disk. Replication has minimal impact on OSD throughput, although if the number of OSDs is fixed, n-way replication reduces total effective throughput by a factor of n because replicated data must be written to n OSDs.

We begin by measuring the I/O performance of a 14-node cluster of OSDs. Figure 5 shows per-OSD throughput (y) with varying write sizes (x) and replication. Workload is generated by 400 clients on 20 additional nodes. Performance is ultimately limited by the raw disk bandwidth (around 58 MB/sec), shown by the horizontal line. Replication doubles or triples disk I/O, reducing client data rates accordingly when the number of OSDs is fixed.

Figure 6: Performance of EBOFS compared to general-purpose file systems. Although small writes suffer from coarse locking in our prototype, EBOFS nearly saturates the disk for writes larger than 32KB. Since EBOFS lays out data in large extents when it is written in large increments, it has significantly better read performance.

Figure 6 compares the performance of EBOFS to that of general-purpose file systems (ext3, ReiserFS, XFS) in handling a Ceph workload. Clients synchronously write out large files, striped over 16 MB objects, and read them back again. Although small read and write performance in EBOFS suffers from coarse threading and locking, EBOFS very nearly saturates the available disk bandwidth for write sizes larger than 32 KB, and significantly outperforms the others for read workloads because data is laid out in extents on disk that match the write sizes, even when they are very large. Performance was measured using a fresh file system. Experience with an earlier EBOFS design suggests it will experience significantly lower fragmentation than ext3, but we have not yet evaluated the current implementation on an aged file system. In any case, we expect the performance of EBOFS after aging to be no worse than the others.

6.1.2 Write Latency

Figure 7: Write latency for varying write sizes and replication. More than two replicas incurs minimal additional cost for small writes because replicated updates occur concurrently. For large synchronous writes, transmission times dominate. Clients partially mask that latency for writes over 128KB by acquiring exclusive locks and asynchronously flushing the data.

Figure 7 shows the synchronous write latency (y) for a single writer with varying write sizes (x) and replication. Because the primary OSD simultaneously retransmits updates to all replicas, small writes incur a minimal latency increase for more than two replicas. For larger writes, the cost of retransmission dominates; 1 MB writes (not shown) take 13 ms for one replica, and 2.5 times longer (33 ms) for three. Ceph clients partially mask this latency for synchronous writes over 128 KB by acquiring exclusive locks and then asynchronously flushing the data to disk. Alternatively, write-sharing applications can opt to use O_LAZY. With consistency thus relaxed, clients can buffer small writes and submit only large, asynchronous writes to OSDs; the only latency seen by applications will be due to clients which fill their caches waiting for data to flush to disk.

6.1.3 Data Distribution and Scalability

Figure 8: OSD write performance scales linearly with the size of the OSD cluster until the switch is saturated at 24 OSDs. CRUSH and hash performance improves when more PGs lower variance in OSD utilization.

Ceph's data performance scales nearly linearly in the number of OSDs. CRUSH distributes data pseudo-randomly such that OSD utilizations can be accurately modeled by a binomial or normal distribution, which is what one expects from a perfectly random process [29]. Variance in utilizations decreases as the number of groups increases: for 100 placement groups per OSD the standard deviation is 10%; for 1000 groups it is 3%. Figure 8 shows per-OSD write throughput as the cluster scales using CRUSH, a simple hash function, and a linear striping strategy to distribute data in 4096 or 32768 PGs among available OSDs. Linear striping balances load perfectly for maximum throughput to provide a benchmark for comparison, but like a simple hash function, it fails to cope with device failures or other OSD cluster changes. Because data placement with CRUSH or a hash is stochastic, throughputs are lower with fewer PGs: greater variance in OSD utilizations causes request queue lengths to drift apart under our entangled client workload. Because devices can become overfilled or overutilized with small probability, dragging down performance, CRUSH can correct such situations by offloading any fraction of the allocation for OSDs specially marked in the cluster map. Unlike the hash and linear strategies, CRUSH also minimizes data migration under cluster expansion while maintaining a balanced distribution. CRUSH calculations are O(log n) (for a cluster of n OSDs) and take only tens of microseconds, allowing clusters to grow to hundreds of thousands of OSDs.
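A quick sanity check of the quoted variance figures, under the simplifying assumption that PGs land on OSDs uniformly at random (so per-OSD PG counts are approximately binomial):

```python
# Rough check of the utilization-variance claim: if PGs land on OSDs
# (approximately) uniformly at random, the PG count per OSD is roughly
# binomial, so the relative standard deviation scales as 1/sqrt(PGs per OSD).
import math

for pgs_per_osd in (100, 1000):
    rel_std = 1 / math.sqrt(pgs_per_osd)
    print(f"{pgs_per_osd} PGs/OSD -> ~{rel_std:.1%} relative std dev")
# 100 PGs/OSD -> ~10.0%;  1000 PGs/OSD -> ~3.2%  (paper: 10% and 3%)
```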

6.2 Metadata Performance

Ceph's MDS cluster offers enhanced POSIX semantics with excellent scalability. We measure performance via a partial workload lacking any data I/O; OSDs in these experiments are used solely for metadata storage.

6.2.1 Metadata Update Latency

We first consider the latency associated with metadata updates (e.g., mknod or mkdir). A single client creates a series of files and directories which the MDS must synchronously journal to a cluster of OSDs for safety. We consider both a diskless MDS, where all metadata is stored in a shared OSD cluster, and one which also has a local disk serving as the primary OSD for its journal. Figure 9(a) shows the latency (y) associated with metadata updates in both cases with varying metadata replication (x) (where zero corresponds to no journaling at all). Journal entries are first written to the primary OSD and then replicated to any additional OSDs. With a local disk, the initial hop from the MDS to the (local) primary OSD takes minimal time, allowing update latencies for 2× replication similar to 1× in the diskless model. In both cases, more than two replicas incurs little additional latency because replicas update in parallel.

(a) Metadata update latency for an MDS with and without a local disk. Zero corresponds to no journaling.
(b) Cumulative time consumed during a file system walk.

Figure 9: Using a local disk lowers the write latency by avoiding the initial network round-trip. Reads benefit from caching, while readdirplus or relaxed consistency eliminate MDS interaction for stats following readdir.

6.2.2 Metadata Read Latency

The behavior of metadata reads (e.g., readdir, stat, open) is more complex. Figure 9(b) shows cumulative time (y) consumed by a client walking 10,000 nested directories with a readdir in each directory and a stat on each file. A primed MDS cache reduces readdir times. Subsequent stats are not affected, because inode contents are embedded in directories, allowing the full directory contents to be fetched into the MDS cache with a single OSD access. Ordinarily, cumulative stat times would dominate for larger directories. Subsequent MDS interaction can be eliminated by using readdirplus, which explicitly bundles stat and readdir results in a single operation, or by relaxing POSIX to allow stats immediately following a readdir to be served from client caches (the default).

6.2.3 Metadata Scaling

Figure 10: Per-MDS throughput under a variety of workloads and cluster sizes. As the cluster grows to 128 nodes, efficiency drops no more than 50% below perfect linear (horizontal) scaling for most workloads, allowing vastly improved performance over existing systems.

We evaluate metadata scalability using a 430-node partition of the alc Linux cluster at Lawrence Livermore National Laboratory (LLNL). Figure 10 shows per-MDS throughput (y) as a function of MDS cluster size (x), such that a horizontal line represents perfect linear scaling. In the makedirs workload, each client creates a tree of nested directories four levels deep, with ten files and subdirectories in each directory. Average MDS throughput drops from 2000 ops per MDS per second with a small cluster, to about 1000 ops per MDS per second (50% efficiency) with 128 MDSs (over 100,000 ops/sec total). In the makefiles workload, each client creates thousands of files in the same directory. When the high write levels are detected, Ceph hashes the shared directory and relaxes the directory's mtime coherence to distribute the workload across all MDS nodes. The openshared workload demonstrates read sharing by having each client repeatedly open and close ten shared files. In the openssh workloads, each client replays a captured file system trace of a compilation in a private directory. One variant uses a shared /lib for moderate sharing, while the other shares /usr/include, which is very heavily read. The openshared and openssh+include workloads have the heaviest read sharing and show the worst scaling behavior, which we believe is due to poor replica selection by clients. openssh+lib scales better than the trivially separable makedirs because it contains relatively few metadata modifications and little sharing. Although we believe that contention in the network or threading in our messaging layer further lowered performance for larger MDS clusters, our limited time with dedicated access to the large cluster prevented a more thorough investigation.

Figure 11 plots latency (y) versus per-MDS throughput (x) for a 4-, 16-, and 64-node MDS cluster under the makedirs workload. Larger clusters have imperfect load distributions, resulting in lower average per-MDS throughput (but, of course, much higher total throughput) and slightly higher latencies.

Figure 11: Average latency versus per-MDS throughput for different cluster sizes (makedirs workload).

Despite imperfect linear scaling, a 128-node MDS cluster running our prototype can service more than a quarter million metadata operations per second (128 nodes at 2000 ops/sec). Because metadata transactions are independent of data I/O and metadata size is independent of file size, this corresponds to installations with potentially many hundreds of petabytes of storage or more, depending on average file size. For example, scientific applications creating checkpoints on LLNL's Bluegene/L might involve 64 thousand nodes with two processors each writing to separate files in the same directory (as in the makefiles workload). While the current storage system peaks at 6,000 metadata ops/sec and would take minutes to complete each checkpoint, a 128-node Ceph MDS cluster could finish in two seconds. If each file were only 10 MB (quite small by HPC standards) and OSDs sustain 50 MB/sec, such a cluster could write 1.25 TB/sec, saturating at least 25,000 OSDs (50,000 with replication). 250 GB OSDs would put such a system at more than six petabytes. More importantly, Ceph's dynamic metadata distribution allows an MDS cluster (of any size) to reallocate resources based on the current workload, even when all clients access metadata previously assigned to a single MDS, making it significantly more versatile and adaptable than any static partitioning strategy.
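The back-of-envelope figures quoted above can be reproduced directly (the 1.25 TB/sec aggregate rate is taken from the text as an assumption):

```python
# Reproducing the back-of-envelope numbers quoted above.
print(128 * 2000)            # 256,000 metadata ops/sec from a 128-node MDS cluster
print(64_000 * 2)            # 128,000 checkpoint files (one per BlueGene/L processor)
print(1.25e6 / 50)           # 25,000 OSDs needed to absorb 1.25 TB/sec at 50 MB/sec each
print(25_000 * 250 / 1e6)    # 25,000 OSDs x 250 GB ~= 6.25 PB ("more than six petabytes")
```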

7 Experiences

We were pleasantly surprised by the extent to which replacing file allocation metadata with a distribution function became a simplifying force in our design. Although this placed greater demands on the function itself, once we realized exactly what those requirements were, CRUSH was able to deliver the necessary scalability, flexibility, and reliability. This vastly simplified our metadata workload while providing both clients and OSDs with complete and independent knowledge of the data distribution. The latter enabled us to delegate responsibility for data replication, migration, failure detection, and recovery to OSDs, distributing these mechanisms in a way that effectively leveraged their bundled CPU and memory. RADOS has also opened the door to a range of future enhancements that elegantly map onto our OSD model, such as bit error detection (as in the Google File System [7]) and dynamic replication of data based on workload (similar to AutoRAID [34]).

Although it was tempting to use existing kernel file systems for local object storage (as many other systems have done [4,7,9]), we recognized early on that a file system tailored for object workloads could offer better performance [27]. What we did not anticipate was the disparity between the existing file system interface and our requirements, which became evident while developing the RADOS replication and reliability mechanisms. EBOFS was surprisingly quick to develop in user space, offered very satisfying performance, and exposed an interface perfectly suited to our requirements.

One of the largest lessons in Ceph was the importance of the MDS load balancer to overall scalability, and the complexity of choosing what metadata to migrate where and when. Although in principle our design and goals seem quite simple, the reality of distributing an evolving workload over a hundred MDSs highlighted additional subtleties. Most notably, MDS performance has a wide range of performance bounds, including CPU, memory (and cache efficiency), and network or I/O limitations, any of which may limit performance at any point in time. Furthermore, it is difficult to quantitatively capture the balance between total throughput and fairness; under certain circumstances unbalanced metadata distributions can increase overall throughput [30].

Implementation of the client interface posed a greater challenge than anticipated. Although the use of FUSE vastly simplified implementation by avoiding the kernel, it introduced its own set of idiosyncrasies. DIRECT_IO bypassed the kernel page cache but didn't support mmap, forcing us to modify FUSE to invalidate clean pages as a workaround. FUSE's insistence on performing its own security checks results in copious getattrs (stats) for even simple application calls. Finally, page-based I/O between the kernel and user space limits overall I/O rates. Although linking directly to the client avoids FUSE issues, overloading system calls in user space introduces a new set of issues (most of which we have yet to fully examine), making an in-kernel client module inevitable.

8 Related Work

High-performance scalable file systems have long been a goal of the HPC community, which tends to place a heavy load on the file system [18,27]. Although many file systems attempt to meet this need, they do not provide the same level of scalability that Ceph does. Large-scale systems like OceanStore [11] and Farsite [1] are designed to provide petabytes of highly reliable storage, and can provide simultaneous access to thousands of separate files to thousands of clients, but cannot provide high-performance access to a small set of files by tens of thousands of cooperating clients due to bottlenecks in subsystems such as name lookup. Conversely, parallel file and storage systems such as Vesta [6], Galley [17], PVFS [12], and Swift [5] have extensive support for striping data across multiple disks to achieve very high transfer rates, but lack strong support for scalable metadata access or robust data distribution for high reliability. For example, Vesta permits applications to lay their data out on disk, and allows independent access to file data on each disk without reference to shared metadata. However, like many other parallel file systems, Vesta does not provide scalable support for metadata lookup. As a result, these file systems typically provide poor performance on workloads that access many small files or require many metadata operations. They also typically suffer from block allocation issues: blocks are either allocated centrally or via a lock-based mechanism, preventing them from scaling well for write requests from thousands of clients to thousands of disks. GPFS [24] and StorageTank [16] partially decouple metadata and data management, but are limited by their use of block-based disks and their metadata distribution architecture.

Grid-based file systems such as LegionFS[33] are designed to coordinate wide-area access and are not optimized for high performance in the local file system. Similarly, the Google File System[7] is optimized for very large files and a workload consisting largely of reads and file appends. Like Sorrento[26], it targets a narrow class of applications with non-POSIX semantics.

Recently, many file systems and platforms, including Federated Array of Bricks (FAB)[23] and pNFS[9], have been designed around network-attached storage[8]. Lustre[4], the Panasas file system[32], zFS[21], Sorrento, and Kybos[35] are based on the object-based storage paradigm[3] and most closely resemble Ceph. However, none of these systems has the combination of scalable and adaptable metadata management, reliability, and fault tolerance that Ceph provides. Lustre and Panasas in particular fail to delegate responsibility to OSDs, and have limited support for efficient distributed metadata management, limiting their scalability and performance. Further, with the exception of Sorrento's use of consistent hashing[10], all of these systems use explicit allocation maps to specify where objects are stored, and have limited support for rebalancing when new storage is deployed. This can lead to load asymmetries and poor resource utilization, while Sorrento's hashed distribution lacks CRUSH's support for efficient data migration, device weighting, and failure domains.

9 Future Work

Some core Ceph elements have not yet been implemented, including MDS failure recovery and several POSIX calls. Two security architecture and protocol variants are under consideration, but neither has yet been implemented[13,19]. We also plan to investigate the practicality of client callbacks on namespace-to-inode translation metadata. For static regions of the file system, this could allow opens (for read) to occur without MDS interaction. Several other MDS enhancements are planned, including the ability to create snapshots of arbitrary subtrees of the directory hierarchy[28].

Although Ceph dynamically replicates metadata when flash crowds access single directories or files, the same is not yet true of file data. We plan to allow OSDs to dynamically adjust the level of replication for individual objects based on workload, and to distribute read traffic across multiple OSDs in the placement group. This will allow scalable access to small amounts of data, and may facilitate fine-grained OSD load balancing using a mechanism similar to D-SPTF[15].

Finally, we are working on developing a quality of service architecture to allow both aggregate class-based traffic prioritization and OSD-managed, reservation-based bandwidth and latency guarantees. In addition to supporting applications with QoS requirements, this will help balance RADOS replication and recovery operations with the regular workload. A number of other EBOFS enhancements are planned, including improved allocation logic, data scouring, and checksums or other bit-error detection mechanisms to improve data safety.

10 Conclusions

Ceph addresses three critical challenges of storage systems (scalability, performance, and reliability) by occupying a unique point in the design space. By shedding design assumptions like allocation lists found in nearly all existing systems, we maximally separate data from metadata management, allowing them to scale independently. This separation relies on CRUSH, a data distribution function that generates a pseudo-random distribution, allowing clients to calculate object locations instead of looking them up. CRUSH enforces data replica separation across failure domains for improved data safety while efficiently coping with the inherently dynamic nature of large storage clusters, where device failures, expansion, and cluster restructuring are the norm.
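The sketch below is not CRUSH; it is a minimal rendezvous-hashing stand-in intended only to illustrate the calculate-rather-than-look-up idea: any client holding the same list of in-service OSDs computes the same placement without consulting an allocation table. Unlike CRUSH, it ignores device weights and failure domains, and the object name and OSD ids are hypothetical.

    # Illustrative placement-by-calculation (rendezvous hashing); not CRUSH itself.
    import hashlib

    def place(object_name, osds, replicas=3):
        """Deterministically map an object to `replicas` OSDs given only the
        list of in-service OSD ids (a stand-in for the cluster map)."""
        ranked = sorted(
            osds,
            key=lambda osd: hashlib.sha1(f"{object_name}:{osd}".encode()).hexdigest())
        return ranked[:replicas]

    # Two clients with the same OSD list agree on placement without any lookup:
    print(place("1000000.0001", ["osd%d" % i for i in range(8)]))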

RADOS leverages intelligent OSDs to manage data replication, failure detection and recovery, low-level disk allocation, scheduling, and data migration without encumbering any central server(s). Although objects can be considered files and stored in a general-purpose file system, EBOFS provides more appropriate semantics and superior performance by addressing the specific workloads and interface requirements present in Ceph.

Finally, Ceph's metadata management architecture addresses one of the most vexing problems in highly scalable storage: how to efficiently provide a single uniform directory hierarchy obeying POSIX semantics with performance that scales with the number of metadata servers. Ceph's dynamic subtree partitioning is a uniquely scalable approach, offering both efficiency and the ability to adapt to varying workloads.

Ceph is licensed under the LGPL and is available at https://ceph.sourceforge.net/.

Acknowledgments

This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract W-7405-Eng-48. Research was funded in part by the Lawrence Livermore, Los Alamos, and Sandia National Laboratories. We would like to thank Bill Loewe, Tyce McLarty, Terry Heidelberg, and everyone else at LLNL who talked to us about their storage trials and tribulations, and who helped facilitate our two days of dedicated access time on alc. We would also like to thank IBM for donating the 32-node cluster that aided in much of the OSD performance testing, and the National Science Foundation, which paid for the switch upgrade. Chandu Thekkath (our shepherd), the anonymous reviewers, and Theodore Wong all provided valuable feedback, and we would also like to thank the students, faculty, and sponsors of the Storage Systems Research Center for their input and support.

References

[1]
A. Adya, W. J. Bolosky, M. Castro, R. Chaiken, G. Cermak, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, and R. Wattenhofer. FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), Boston, MA, Dec. 2002. USENIX.
[2]
P. A. Alsberg and J. D. Day. A principle for resilient sharing of distributed resources. In Proceedings of the 2nd International Conference on Software Engineering, pages 562-570. IEEE Computer Society Press, 1976.
[3]
A. Azagury, V. Dreizin, M. Factor, E. Henis, D. Naor, N. Rinetzky, O. Rodeh, J. Satran, A. Tavory, and L. Yerushalmi. Towards an object store. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 165-176, Apr. 2003.
[4]
P. J. Braam. The Lustre storage architecture. https://www.lustre.org/documentation.html, Cluster File Systems, Inc., Aug. 2004.
[5]
L.-F. Cabrera and D. D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. Computing Systems, 4(4):405-436, 1991.
[6]
P. F. Corbett and D. G. Feitelson. The Vesta parallel file system. ACM Transactions on Computer Systems, 14(3):225-264, 1996.
[7]
S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), Bolton Landing, NY, Oct. 2003. ACM.
[8]
G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka. A cost-effective, high-bandwidth storage architecture. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 92-103, San Jose, CA, Oct. 1998.
[9]
D. Hildebrand and P. Honeyman. Exporting storage systems in a scalable manner with pNFS. Technical Report CITI-05-1, CITI, University of Michigan, Feb. 2005.
[10]
D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In ACM Symposium on Theory of Computing, pages 654-663, May 1997.
[11]
J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. OceanStore: An architecture for global-scale persistent storage. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cambridge, MA, Nov. 2000. ACM.
[12]
R. Latham, N. Miller, R. Ross, and P. Carns. A next-generation parallel file system for Linux clusters. LinuxWorld, pages 56-59, Jan. 2004.
[13]
A. Leung and E. L. Miller. Scalable security for large, high performance storage systems. In Proceedings of the 2006 ACM Workshop on Storage Security and Survivability. ACM, Oct. 2006.
[14]
B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira, and M. Williams. Replication in the Harp file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP '91), pages 226-238. ACM, 1991.
[15]
C. R. Lumb, G. R. Ganger, and R. Golding. D-SPTF: Decentralized request distribution in brick-based storage systems. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 37-47, Boston, MA, 2004.
[16]
J. Menon, D. A. Pease, R. Rees, L. Duyanovich, and B. Hillsberg. IBM Storage Tank: a heterogeneous scalable SAN file system. IBM Systems Journal, 42(2):250-267, 2003.
[17]
N. Nieuwejaar and D. Kotz. The Galley parallel file system. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 374-381, Philadelphia, PA, 1996. ACM Press.
[18]
N. Nieuwejaar, D. Kotz, A. Purakayastha, C. S. Ellis, and M. Best. File-access characteristics of parallel scientific workloads. IEEE Transactions on Parallel and Distributed Systems, 7(10):1075-1089, Oct. 1996.
[19]
C. A. Olson and E. L. Miller. Secure capabilities for a petabyte-scale object-based distributed file system. In Proceedings of the 2005 ACM Workshop on Storage Security and Survivability, Fairfax, VA, Nov. 2005.
[20]
B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz. NFS version 3: Design and implementation. In Proceedings of the Summer 1994 USENIX Technical Conference, pages 137-151, 1994.
[21]
O. Rodeh and A. Teperman. zFS: a scalable distributed file system using object disks. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 207-218, Apr. 2003.
[22]
D. Roselli, J. Lorch, and T. Anderson. A comparison of file system workloads. In Proceedings of the 2000 USENIX Annual Technical Conference, pages 41-54, San Diego, CA, June 2000. USENIX Association.
[23]
Y. Saito, S. Frølund, A. Veitch, A. Merchant, and S. Spence. FAB: Building distributed enterprise disk arrays from commodity components. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 48-58, 2004.
[24]
F. Schmuck and R. Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the 2002 Conference on File and Storage Technologies (FAST), pages 231-244. USENIX, Jan. 2002.
[25]
M. Szeredi. File System in User Space. https://fuse.sourceforge.net, 2006.
[26]
H. Tang, A. Gulbeden, J. Zhou, W. Strathearn, T. Yang, and L. Chu. A self-organizing storage cluster for parallel data-intensive applications. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04), Pittsburgh, PA, Nov. 2004.
[27]
F. Wang, Q. Xin, B. Hong, S. A. Brandt, E. L. Miller, D. D. E. Long, and T. T. McLarty. File system workload analysis for large scale scientific computing applications. In Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 139-152, College Park, MD, Apr. 2004.
[28]
S. A. Weil. Scalable archival data and metadata management in object-based file systems. Technical Report SSRC-04-01, University of California, Santa Cruz, May 2004.
[29]
S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn. CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), Tampa, FL, Nov. 2006. ACM.
[30]
S. A. Weil, K. T. Pollack, S. A. Brandt, and E. L. Miller. Dynamic metadata management for petabyte-scale file systems. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04). ACM, Nov. 2004.
[31]
B. Welch. POSIX IO extensions for HPC. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST), Dec. 2005.
[32]
B. Welch and G. Gibson. Managing scalability in object storage systems for HPC Linux clusters. In Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 433-445, Apr. 2004.
[33]
B. S. White, M. Walker, M. Humphrey, and A. S. Grimshaw. LegionFS: A secure and scalable file system supporting cross-domain high-performance applications. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (SC '01), Denver, CO, 2001.
[34]
J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID hierarchical storage system. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP '95), pages 96-108, Copper Mountain, CO, 1995. ACM Press.
[35]
T. M. Wong, R. A. Golding, J. S. Glider, E. Borowsky, R. A. Becker-Szendy, C. Fleiner, D. R. Kenchammana-Hosekote, and O. A. Zaki. Kybos: self-management for distributed brick-based storage. Research Report RJ 10356, IBM Almaden Research Center, Aug. 2005.
[36]
J. C. Wu and S. A. Brandt. The design and implementation of AQuA: an adaptive quality of service aware object-based storage device. In Proceedings of the 23rd IEEE / 14th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 209-218, College Park, MD, May 2006.
[37]
Q. Xin, E. L. Miller, and T. J. E. Schwarz. Evaluation of distributed recovery in large-scale storage systems. In Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing (HPDC), pages 172-181, Honolulu, HI, June 2004.