sstable « My pensieve

Tag: sstable

March 13, 2015

Bigtable: snippets/notes from the original Google’s paper

by viggy — Categories: Uncategorized — Tags: bigtable, chubby, google, gsoc, mapreduce, sstable — Leave a comment

Following are the lines that I made marked in the original Google’s paper on BigTable.

Introduction:

Bigtable has achieved several goals: wide applicability, scalability, high performance and high availability.
Bigtable does not suport a full relational data model; instead, it provides clients with a simple data that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and coloumn names that can be arbitary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings.

Data Model:
A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
(row:string, column: string, time:int64) -> string

Row:
The row keys in a table are arbitary strings(currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Every read or write of data under a single row key is atomic(regardless of the number of different columns being read or written in the row).
Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines.

Columns:
Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type( we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family: after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small(in the hundreds at most), and that families rarely change during operation. In contrast a table may have an unbounded number of columns.
A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitary strings. Access Control and both disk and memory accounting are performed at the column-family level.

Timestamps:
Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp, Bigtable timestamps are 64-bit integers.
The client can specify either that only the last n versions of a cell be kept, or that only new enough versions be kept.

API:

Bigtable suports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key. Bigtable can eb used with MapReduce , a framework for running large scale parallel computations developed at Google.

Building Blocks:

Bigtable uses the distributed Google File System(GFS) to store log and data files. Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.

The Google SSTable file format is used internally to store Bigtable. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk. Optionally, an SSTable can be com-
pletely mapped into memory, which allows us to perform lookups and scans without touching disk.

Bigtable relies on a highly-available and persistent distributed lock service called Chubby. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serve requests. Chubby uses the Paxos algorithm to keep its replicas consistent in the face of failure. Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. Each Chubby client maintains a session with a Chubby service. A client’s session expires if it is unable to renew its session lease within the lease expiration time. When a client’s session expires, it loses any locks and open handles. Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration.

Implementation:

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be dynamically added (or removed) from a cluster to accomodate changes in workloads.
Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master. As a result, the master is lightly loaded in practice.

Tablet Location:
We use a three-level hierarchy analogous to that of a B+ tree to store tablet location information.
With a modest limit of 128 MB METADATA tablets, our three-level location scheme is sufficient to address 2^34 tablets.
The client library caches tablet locations. If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, then it recursively moves up the tablet location hierarchy. If the client’s cache is empty, the location algorithm requires three network round-trips, including one read from Chubby. If the client’s cache is stale, the location algorithm could take up to six round-trips, because stale cache entries are only discovered upon misses.

Compactions:
When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies.
Every minor compaction creates a new SSTable.
We bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. A major compaction, produces an SSTable that contains no deletion information or deleted data.

Refinements:
Locality groups:

Clients can group multiple column families together into a locality group. A separate SSTable is generated for each locality group in each tablet.

Compression:

The user-specified compression format is applied to each SSTable block. Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy’s scheme, which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data.

Caching for read performance:

To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is a lower-level cache that caches SSTables blocks that were read from GFS. The Scan Cache is most useful for applications that tend to read the same data repeatedly. The Block Cache is useful for applications that tend to read data that is close to the data they recently read.

Bloom Filters:

A read operation has to read from all SSTables that make up the state of a tablet. If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters should be created for SSTables in a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair.

Exploiting immutability:

Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable.
The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.
Since SSTables are immutable, the problem of permanently removing deleted data is transformed to garbage collecting obsolete SSTables. Each tablet’s SSTables are registered in the METADATA table. The master removes obsolete SSTables as a mark-and-sweep garbage collection over the set of SSTables, where the METADATA table contains the set of roots.

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31