AB07 – Ghemawat, S., Gobioff, H., & Leung, S. T. (2003). The Google File System.

When they published their paper in 2003, engineers at Google had already designed, developed, and implemented the Google File System (GFS) in an effort to sustain performance and control costs while providing the infrastructure, platform, and applications required to deliver Google’s services to users (Ghemawat, Gobioff, & Leung, 2003, p. 29). Although the authors acknowledge GFS shares aims with existing distributed file systems, aims such as “performance, scalability, reliability, and availability,” they state GFS rests on different “design assumptions” arising from their “observations” of Google’s “application workloads and technological environment” (Ghemawat et al., 2003, p. 29). In general, the authors describe GFS as “the storage platform for the generation and processing of data used by our service” and by Google’s “research and development efforts that require large data sets” (Ghemawat et al., 2003, p. 29). In addition, they state that GFS is suitable for “large distributed data-intensive applications,” that it is capable of providing “high aggregate performance to a large number of clients,” and that it “is an important tool” that allows Google “to innovate and attack problems on the scale of the entire web” (Ghemawat et al., 2003, pp. 29, 43).

In the introduction to their paper, the authors state the four primary characteristics of their “workloads and technological environment” as 1) “component failures are the norm rather than the exception,” 2) “files are huge by traditional standards,” 3) “most files are mutated by appending new data rather than overwriting existing data,” and 4) “co-designing the applications and the file system API benefits the overall system by increasing flexibility” (Ghemawat et al., 2003, p. 29). Each of these observations leads to what the authors call their “radically different points in the design space” (Ghemawat et al., 2003, p. 29), which they elaborate in some detail both in the numbered list in the paper’s introduction and in the bulleted list of the “Assumptions” subsection of the paper’s second section, “Design Overview” (Ghemawat et al., 2003, p. 30). Considering the authors’ first observation, for example, that the “quantity and quality of the components virtually guarantee” parts of the system will fail and “will not recover,” it is reasonable to adopt the design assumptions that the system is built from “inexpensive commodity components” and that it “must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis” (Ghemawat et al., 2003, pp. 29-30). Considering the authors’ second observation, that “files are huge by traditional standards,” meaning “multi-GB files are common” and “the system stores a modest number of large files,” it is reasonable to adopt the design assumptions that system parameters “such as I/O operation and block sizes” need “to be revisited” and redefined to optimize the system for managing large files while maintaining support for small files (Ghemawat et al., 2003, pp. 29-30). These two examples demonstrate the type of arguments and evidence the authors provide to support their claim that GFS responds to fundamental differences between the data, workloads (software), and infrastructure (hardware) of traditional information technology and the data, workloads, and infrastructure Google needs to sustain its operations now and in the future (Ghemawat et al., 2003, pp. 29-33, 42-43). In the remaining passages of the introduction and in the first section of their design overview, the authors continue discussing Google’s technological environment by describing the third and fourth characteristics they have observed and by explaining the corollary design assumptions they applied to designing and developing GFS (Ghemawat et al., 2003, pp. 29-30).

With the rationale for their work thus established, the authors move on in the remaining sections of their design overview to discuss the overall architecture of GFS. First, they introduce some features the authors imply are shared with other distributed file systems, for example an API supporting “the usual operations to create, delete, open, close, read, and write files,” and some features the authors imply are unique to GFS, for example “snapshot and record append operations” (Ghemawat et al., 2003, p. 30). Next, they describe the main software components (functions or roles) included in a GFS implementation on a given “cluster” (set) of machines, namely the “GFS clients,” the “GFS master,” and the “GFS chunkservers.” The GFS clients mediate communication between the applications requiring data and the GFS master and GFS chunkservers providing that data. The GFS master “maintains all file system metadata” and “controls system-wide activities.” The GFS chunkservers store the actual data (Ghemawat et al., 2003, p. 31).
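
To make the shape of this interface concrete, the following short Python sketch (my own illustration, not code from the paper, with hypothetical class and method names) contrasts an ordinary write, where the caller picks the offset, with the record append operation the authors single out, where the system picks the offset and returns it to the caller so that many clients can append to the same file concurrently.

```python
# A minimal sketch, assuming a toy in-memory file; this illustrates the API's
# shape only and is not code from the paper. The point of interest is
# record_append: the system, not the caller, chooses the offset and returns it.

class GFSFile:
    """Hypothetical stand-in for one GFS file."""

    def __init__(self) -> None:
        self.data = bytearray()

    def write(self, offset: int, payload: bytes) -> None:
        """Ordinary write: the caller picks the offset."""
        end = offset + len(payload)
        if end > len(self.data):
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = payload

    def record_append(self, payload: bytes) -> int:
        """GFS-style append: the system picks the offset and returns it,
        so many clients can append concurrently without coordinating."""
        offset = len(self.data)
        self.data.extend(payload)
        return offset

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self.data[offset:offset + length])


f = GFSFile()
first = f.record_append(b"log entry 1\n")   # offset chosen by the system
second = f.record_append(b"log entry 2\n")
print(first, second, f.read(second, 12))    # -> 0 12 b'log entry 2\n'
```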

At this point in their paper, the authors begin providing fairly detailed technical explanations of how these various GFS components interact; I will mention only a few points the authors emphasize as crucial to the success of GFS. First of all, in contrast with some other distributed file systems, GFS is a “single master” architecture, which has both advantages and disadvantages (Ghemawat et al., 2003, pp. 30-31). According to the authors, one advantage of “having a single master” is that it “vastly simplifies” the “design” of GFS and “enables the master to make sophisticated chunk placement and replication decisions using global knowledge” (Ghemawat et al., 2003, pp. 30-31). A disadvantage of having only one master, however, is that its resources could be overwhelmed and it could become a “bottleneck” (Ghemawat et al., 2003, p. 31). To overcome this potential disadvantage, the authors explain how communication and data flow through the GFS architecture: GFS clients “interact with the master for metadata operations” but interact with the chunkservers for actual data operations (i.e., operations requiring alteration or movement of data), thereby relieving the GFS master from performing “common operations” that could overwhelm it (Ghemawat et al., 2003, pp. 31, 43). Other important points include GFS’s relatively large data “chunk size” of 64 MB, its “relaxed consistency model,” its elimination of the need for substantial client caching, and its use of replication instead of RAID to address fault tolerance (Ghemawat et al., 2003, pp. 31-32, 42).
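
As a rough illustration of this division of labor, the following sketch (again my own, with hypothetical names and toy in-memory stand-ins for the master’s metadata and the chunkservers) walks through a client read: the master answers only the metadata question of which chunk handle and which replicas correspond to the 64 MB chunk containing the requested offset, and the bytes themselves come from whichever replica is reachable.

```python
# A hedged sketch, assuming toy in-memory stand-ins for the master's metadata
# and for three chunkserver replicas (one of which is "down"); the names and
# data structures are mine, not the paper's. The client asks the master only
# where a chunk lives, then fetches the bytes from a live replica.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS's fixed 64 MB chunk size

# Master metadata: (file name, chunk index) -> (chunk handle, replica ids).
chunk_table = {("/logs/web.0", 0): ("handle-17", ["cs-b", "cs-a", "cs-c"])}

# Chunkserver state: replica id -> {chunk handle: chunk bytes}; cs-b is down.
chunkservers = {
    "cs-a": {"handle-17": b"x" * 128},
    "cs-b": None,                       # simulated failed machine
    "cs-c": {"handle-17": b"x" * 128},
}

def read(path: str, offset: int, length: int) -> bytes:
    chunk_index, chunk_offset = divmod(offset, CHUNK_SIZE)
    handle, replicas = chunk_table[(path, chunk_index)]    # metadata from master
    for replica in replicas:                               # data from chunkservers
        store = chunkservers.get(replica)
        if store is not None and handle in store:
            return store[handle][chunk_offset:chunk_offset + length]
    raise IOError("no live replica holds " + handle)

print(len(read("/logs/web.0", 0, 64)))  # -> 64, despite cs-b being unavailable
```

Keeping the master on the metadata path only, and tolerating a dead replica by simply reading from another, is the combination of single-master simplicity, off-loaded data traffic, and replication-based fault tolerance the authors describe.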

AB04 – Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters.

As early as 2003, engineers at Google encountered challenges in their efforts to deploy, operate, and sustain systems capable of ingesting, storing, and processing the large volumes of data required to produce and deliver Google’s services to its users, services such as the “Google Web search service,” for which Google must create and maintain a “large-scale indexing” system, or the “Google Zeitgeist and Google Trends” services, for which it must extract and analyze “data to produce reports of popular queries” (Dean & Ghemawat, 2008, pp. 107, 112).

As Dean and Ghemawat explain in the introduction to their article, even though many of the required “computations are conceptually straightforward,” the data volume is massive (terabytes or petabytes in 2003) and the “computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time” (Dean & Ghemawat, 2008, p. 107). At the time, even though Google had already “implemented hundreds of special-purpose computations” to “process large amounts of raw data” and the system worked, the authors describe how they sought to reduce the “complexity” introduced by a systems infrastructure requiring “parallelization, fault tolerance, data distribution and load balancing” (Dean & Ghemawat, 2008, p. 107).

Their solution involved creating “a new abstraction” that not only preserved their “simple computations,” but also “hides the messy details” of systems infrastructure administration “in a library,” providing cost-effective, performance-optimized use of a large cluster of machines while enabling “programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily” (Dean & Ghemawat, 2008, pp. 107, 112). Dean and Ghemawat acknowledge their “abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages” and acknowledge others “have provided restricted programming models and used the restrictions to parallelize the computation automatically.” They assert, however, that their “MapReduce” implementation is a “simplification and distillation of some of these models” resulting from their “experience with large real-world computations,” and that their unique contribution may be their provision of “a fault-tolerant implementation that scales to thousands of processors,” whereas other “parallel processing systems” were “implemented on smaller scales” and left the programmer to address machine failures (2008, pp. 107, 113).
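
Because the programming model itself is compact, it can be sketched in a few lines. The following sequential Python word-count example, my own rendering in the spirit of the pseudocode the paper uses for the same task, shows the two user-supplied functions: a map that emits intermediate key/value pairs and a reduce that merges all values sharing a key. The small grouping loop stands in for everything the MapReduce library otherwise does across machines.

```python
# A minimal sequential sketch of the programming model; the distributed
# machinery MapReduce hides (task scheduling, fault tolerance, shuffling
# across machines) is deliberately not shown.

from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def map_fn(doc_name: str, contents: str) -> Iterator[Tuple[str, int]]:
    """User-supplied map: emit an intermediate (word, 1) pair per occurrence."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """User-supplied reduce: merge all counts emitted for the same word."""
    return word, sum(counts)

def run_mapreduce(inputs):
    """The library's job in miniature: run maps, group by key, run reduces."""
    intermediate = defaultdict(list)
    for name, contents in inputs:
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

print(run_mapreduce([("doc1", "to be or not to be")]))
# -> {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```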

In sections 2 and 3 of their paper, the authors describe in greater detail their “programming model” and their specific “implementation of the MapReduce interface,” including the Google File System (GFS), a “distributed file system” that “uses replication to provide availability and reliability on top of unreliable hardware,” and an “execution overview” with a diagram showing the logical relationships and progression of their MapReduce implementation’s components and data flow (Dean & Ghemawat, 2008, pp. 107-110). In section 4, the authors mention some “extensions” that are at times useful for augmenting “map and reduce functions” (Dean & Ghemawat, 2008, p. 110).
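
To trace that execution overview in miniature, the sketch below (my own, sequential, with loop iterations standing in for map and reduce workers) splits the input into M map tasks, partitions each map task’s output into R buckets using hash(key) mod R, which the paper gives as the default partitioning function, and then has each of the R reduce tasks sort and reduce its own bucket.

```python
# A sketch of the execution flow, assuming loop iterations stand in for
# workers: split the input into M map tasks, partition map output into R
# buckets with hash(key) mod R, then sort and reduce each bucket. The record
# data and task counts here are made up for illustration.

from collections import defaultdict

M, R = 3, 2                                  # number of map and reduce tasks

def map_fn(record: str):
    for word in record.split():              # word count again, for brevity
        yield word, 1

def reduce_fn(key, values):
    return key, sum(values)

records = ["a b a", "b c", "a c c", "b b", "c a"]
splits = [records[i::M] for i in range(M)]   # crude split of the input into M pieces

# Map phase: each map task writes its output into R partitions.
partitions = [defaultdict(list) for _ in range(R)]
for split in splits:                         # one iteration ~ one map worker
    for record in split:
        for key, value in map_fn(record):
            partitions[hash(key) % R][key].append(value)

# Reduce phase: each reduce task sorts its keys, then reduces each key's values.
output = {}
for r in range(R):                           # one iteration ~ one reduce worker
    for key in sorted(partitions[r]):
        k, total = reduce_fn(key, partitions[r][key])
        output[k] = total

print(sorted(output.items()))                # -> [('a', 4), ('b', 4), ('c', 4)]
```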

In section 5, the authors discuss their experience measuring “the performance of MapReduce on two computations running on a large cluster of machines” and describe the two computations, or “programs,” they run as “representative of a large subset of the real programs written by users of MapReduce,” that is, computations for searching (grep) and for sorting (Dean & Ghemawat, 2008, p. 111). In other words, the authors describe the search function as a “class” of “program” that “extracts a small amount of interesting data from a large dataset” and the sort function as a “class” of “program” that “shuffles data from one representation to another” (Dean & Ghemawat, 2008, p. 111). Also in section 5, the authors mention “locality optimization,” a feature they describe further over the next few sections of their paper as one that “draws its inspiration from techniques such as active disks” and one that preserves “scarce” network bandwidth by reducing the distance between processors and disks, thereby limiting “the amount of data sent across I/O subsystems or the network” (Dean & Ghemawat, 2008, pp. 112-113).
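
The searching computation is easy to express in this model, which is part of the authors’ point about how little user code it demands. In the sketch below (my own, sequential, with hypothetical log lines and pattern), the map function emits a line only if it matches the supplied pattern and the reduce function is an identity that copies the matches to the output, which is how the paper characterizes its grep program.

```python
# A brief sequential sketch of the grep-style computation: map emits a line
# only if it matches a pattern, reduce is the identity that copies matches to
# the output. The log lines and the pattern are hypothetical.

import re
from typing import Iterator, List, Tuple

PATTERN = re.compile(r"error")               # hypothetical search pattern

def map_fn(line_no: int, line: str) -> Iterator[Tuple[int, str]]:
    if PATTERN.search(line):                 # emit only matching lines
        yield line_no, line

def reduce_fn(line_no: int, lines: List[str]) -> List[str]:
    return lines                             # identity: pass matches through

log = ["ok: started", "error: disk failed", "ok: retried", "error: timeout"]
intermediate = [(k, v) for i, line in enumerate(log) for k, v in map_fn(i, line)]
matches = [line for k, v in intermediate for line in reduce_fn(k, [v])]
print(matches)  # -> ['error: disk failed', 'error: timeout']
```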

In section 6, as mentioned previously, Dean and Ghemawat discuss some of the advantages of the “MapReduce programming model,” such as enabling programmers for the most part to avoid the infrastructure management normally involved in leveraging “large amounts of resources” and to write relatively simple programs that “run efficiently on a thousand machines in a half hour” (Dean & Ghemawat, 2008, p. 112).

Overall, the account of MapReduce and GFS that Dean and Ghemawat give in this paper, written a few years after their original paper on the same topic, is a story of discovering more efficient ways to utilize resources.