Get e-book Data Deduplication for Data Optimization for Storage and Network Systems


The locality indicator in metadata b for second data chunk b has the same value as the locality indicators generated for first and second data chunks a and b, because second data chunk b was originally stored in the chunk container for first stream map a. In an embodiment, the data stream parser may be included in the data deduplication module. Further structural and operational embodiments will be apparent to persons skilled in the relevant art(s) based on the discussion regarding the flowchart. The flowchart and system are described as follows.

The flowchart begins with a step in which a data stream is parsed into data chunks. The data stream parser is configured to parse the data stream into a sequence of data chunks, referred to as the data chunk sequence. For instance, in an embodiment, the data chunk sequence may include the data chunks in the order in which they are located in the data stream. The data chunks of the sequence may all have the same size or may have different sizes.
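The parsing step can be illustrated with a minimal sketch. It assumes fixed-size chunking, which is only one of the options the text allows (chunks may also vary in size), and the function name `parse_into_chunks` and the tiny chunk size are hypothetical choices made for illustration.

```python
from typing import Iterator, Tuple


def parse_into_chunks(stream: bytes, chunk_size: int = 4) -> Iterator[Tuple[int, bytes]]:
    """Yield (data_stream_offset, chunk) pairs in stream order.

    Fixed-size chunking is an assumption of this sketch; the last chunk
    may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(stream), chunk_size):
        yield offset, stream[offset:offset + chunk_size]


chunks = list(parse_into_chunks(b"AAAABBBBCC", chunk_size=4))
```

Each pair carries the data stream offset that is later recorded in the chunk's metadata.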

In a next step, it is determined whether any of the data chunks are duplicates of data chunks stored in a chunk container. The data chunk storage manager is configured to determine whether any of the data chunks of the data chunk sequence are already stored in the chunk container, and therefore are duplicates.

In another embodiment, the data chunk storage manager may receive hash values for the data chunks that are already stored. The data chunk storage manager may generate a hash value for each data chunk of the data chunk sequence, and may compare the generated hash values to the hash values received in data chunk information or from the stream container to determine which data chunks of the sequence are already stored in the chunk container. In further embodiments, the data chunk storage manager may determine which data chunks are already stored in the chunk container in other ways, as would be known to persons skilled in the relevant art(s).
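A hedged sketch of the hash comparison described above follows. The text does not name a hash function, so SHA-256 is an assumption here, and `chunk_hash` and `find_duplicates` are hypothetical helper names.

```python
import hashlib
from typing import Iterable, List, Set


def chunk_hash(chunk: bytes) -> str:
    # SHA-256 is an assumption; the source only says "hash value".
    return hashlib.sha256(chunk).hexdigest()


def find_duplicates(chunks: Iterable[bytes], stored_hashes: Set[str]) -> List[bool]:
    """Return one flag per chunk: True if its hash is already known."""
    return [chunk_hash(chunk) in stored_hashes for chunk in chunks]


# Hashes received for chunks already in the chunk container.
stored_hashes = {chunk_hash(b"BBBB"), chunk_hash(b"CCCC")}
flags = find_duplicates([b"BBBB", b"EEEE"], stored_hashes)
```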

In an embodiment, the data chunk storage manager may be configured to store the data chunks of the data chunk sequence that were determined not to be already stored in the chunk container. For example, the data chunk storage manager may generate a chunk header for each new data chunk. Furthermore, in an embodiment, the data chunk storage manager is configured to store the new data chunks in a contiguous arrangement in the chunk container, in the same order as in the source data stream.

In step , metadata is generated for each of the data chunks determined not to be a duplicate, the metadata for a data chunk including a data stream offset, a pointer to a location in the chunk container, and a locality indicator.



In an embodiment, metadata generator may be configured to generate metadata e. Metadata generator may generate metadata for each data chunk of data chunk sequence , including data stream offset , data chunk identifier , and locality indicator For data chunks determined to already be stored in chunk container in step , data chunk identifier is configured to point at the already stored data chunk.

For data chunks newly stored in chunk container in step , data chunk identifier is configured to point at the newly stored data chunk. Stream map generator generates a stream map associated with the data stream that includes data chunk metadata for each received data chunk. Furthermore, stream map generator may generate a stream header for stream map , and may include hash values for each received data chunk in stream map In step , the stream map is stored in a stream container.

Note that in an alternative embodiment, rather than generating and storing a stream map for a data stream, an entry may be made in a database or global table for the data stream that includes metadata pointing to or indicating a location of the data chunks referenced by the data stream. Each stream link links a data stream to the corresponding data e.

A stream map a may be generated for first data stream a , and the four data chunks a - d may be stored in a chunk container , as described above. Stream map a includes pointers represented by arrows in FIG. Data chunks a - d may be categorized in a single set of all new, unique data chunks to chunk container As such, data chunks a - d may be stored in chunk container in a contiguous arrangement, in a same order as in data stream a. For example, data chunks a - d may be the first four data chunks stored in chunk container , or if one or more data chunks are already stored in chunk container , data chunks a - d may be stored in chunk container immediately after the already stored data chunks.

Each of data chunks a through d is assigned the same locality indicator value in stream map a: the locality indicator value selected for first data stream a. Second data stream b includes four data chunks b, c, e, and f. A stream map b may be generated for second data stream b. Data chunks b, c, e, and f may be categorized into two sets according to the duplicate-determination step of the flowchart: a first set that includes chunks b and c, which already have copies residing in the chunk container due to the chunk sequence of first data stream a, and a second set that includes chunks e and f, which are new, unique data chunks that do not have copies already stored in the chunk container. Because data chunks b and c are already stored in the chunk container, stream map b includes pointers (values for the data chunk identifier) to data chunks b and c already stored in the chunk container. Thus, data chunks b and c may be stored as pointers to existing data chunks in the chunk container, without storing their chunk data again.

Because data chunks e and f are not already stored in the chunk container, they may be stored in the chunk container as described above. For instance, because data chunks e and f are new, unique data chunks to the chunk container, they may be stored in a contiguous arrangement, in the same order as in data stream b, after the last data chunk currently stored in the chunk container. Stream map b includes first through fourth data chunk identifiers, which point to data chunks b, c, e, and f stored in the chunk container, respectively.

In stream map b, data chunks b and c are assigned the locality indicator value associated with first data stream a, and data chunks e and f are assigned the locality indicator value selected for second data stream b. Note that any number of additional data streams may be stored in a similar manner following data streams a and b. In embodiments, data chunks of a particular stream map may be assigned one of any number of locality indicator values, depending on the number of different locality indicators associated with data chunks of the stream map that are already present in the chunk container.

For instance, as described above, new data chunks to a chunk container may be assigned the new locality indicator value selected for the particular data stream associated with the stream map. Furthermore, any number of data chunks referenced by the stream map that are already present in the chunk container are assigned the corresponding locality indicator values of the data chunks already present in the chunk container. This may mean that any number of one or more sets of data chunks of the data stream may be assigned corresponding locality indicator values, such that data chunks of the data stream may be assigned locality indicators selected from two, three, or even more different locality indicator values.
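The two-stream example above (stream a with chunks a through d, stream b with chunks b, c, e, and f) can be worked through in a small sketch. Everything here is illustrative: the `ChunkStore` and `ChunkMeta` names, the use of a list position as the "pointer", integer locality values, and the SHA-256 lookup used to detect duplicates are all assumptions, not the described implementation.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ChunkMeta:
    stream_offset: int  # offset of the chunk within its data stream
    position: int       # pointer: position of the chunk in the container
    locality: int       # locality indicator


@dataclass
class ChunkStore:
    container: List[bytes] = field(default_factory=list)
    # hash -> (position in container, locality assigned when first stored)
    known: Dict[str, Tuple[int, int]] = field(default_factory=dict)
    next_locality: int = 0

    def add_stream(self, chunks: List[bytes]) -> List[ChunkMeta]:
        """Store new chunks contiguously and return the stream map."""
        locality = self.next_locality  # value selected for this stream
        self.next_locality += 1
        stream_map, offset = [], 0
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.known:
                # New chunk: append it and give it this stream's locality.
                self.container.append(chunk)
                self.known[digest] = (len(self.container) - 1, locality)
            # Duplicates keep the locality of the already stored chunk.
            position, chunk_locality = self.known[digest]
            stream_map.append(ChunkMeta(offset, position, chunk_locality))
            offset += len(chunk)
        return stream_map


store = ChunkStore()
map_a = store.add_stream([b"A", b"B", b"C", b"D"])
map_b = store.add_stream([b"B", b"C", b"E", b"F"])
```

In this sketch, all four entries of `map_a` carry locality 0, while `map_b` carries localities 0, 0, 1, 1 and points at container positions 1, 2, 4, 5, so stream b is read as two contiguous runs.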

As such, locality indicators of stream map metadata enable the locality of data chunks in data streams to be ascertained. This is because duplicate data chunks tend to occur in groups. When a new data stream contains an already known data chunk already stored in the chunk container , there is a reasonable probability that the next data chunk in the new data stream is also a duplicate data chunk already stored in the chunk container.

Because new, original data chunks are stored in the chunk container adjacent to one another according to the locality indicator, the already present data chunks that the new data stream references are more likely to also be contiguously stored in the chunk container. This aids in improving the performance of reading optimized data streams from a chunk store. For instance, a rehydration module configured to re-assemble a data stream based on the corresponding stream map and data chunks can perform a read-ahead on the data chunks stored in the chunk container, expecting to find the next data chunk it needs in the read-ahead buffer.

Furthermore, chunk store maintenance tasks like defragmentation and compaction can attempt to maintain the original locality by keeping existing adjacent chunks together as they are moved around the chunk container. For instance, after data streams are optimized and stored in the chunk store in the form of stream maps and data chunks, the data streams may be read from the chunk store. The rehydration module is configured to re-assemble a requested data stream in response to a data stream request.

For instance, the rehydration module may provide the stream map identifier of the request to the chunk store. The chunk store retrieves the corresponding stream map based on the stream map identifier. The retrieved stream map includes pointers (data chunk identifiers) to each of the data chunks of the requested data stream. The rehydration module uses the pointers to retrieve each of the data chunks, and may use the data stream offsets included in the retrieved stream map to arrange the retrieved data chunks in the proper order.
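Rehydration as described above can be sketched as follows. The sketch assumes a stream map is a list of `(data_stream_offset, position)` pairs and that the chunk container is a plain list; both representations and the `rehydrate` name are hypothetical.

```python
from typing import List, Tuple


def rehydrate(stream_map: List[Tuple[int, int]], container: List[bytes]) -> bytes:
    """Re-assemble a data stream from its stream map.

    stream_map entries are (data_stream_offset, position_in_container);
    this representation is an assumption of the sketch.
    """
    out = bytearray()
    # Data stream offsets determine the order of the chunks in the stream.
    for stream_offset, position in sorted(stream_map):
        if stream_offset != len(out):
            raise ValueError("gap or overlap in stream map")
        out += container[position]
    return bytes(out)


container = [b"AA", b"BB", b"CC", b"DD", b"EE", b"FF"]
# Stream b references chunks B, C, E, F; entries are deliberately unordered.
stream_b = rehydrate([(4, 4), (0, 1), (2, 2), (6, 5)], container)
```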

Through the use of locality indicators, sequential reads of data chunks from the chunk container may be performed. This is because, at the time that the chunk store creates stream maps, new data chunks are stored in the chunk container contiguously in stream-map order. Seeks during sequential data access are limited to the case where a data chunk or a series of chunks of a data stream is found to already exist in the chunk store. A stream map provides an efficient metadata container for optimized file metadata. Stream maps are concise and can be cached in memory for fast access.

Chunk store can cache frequently-accessed stream maps for optimized data streams frequently requested and rehydrated by rehydration module based on an LRU least recently used algorithm or other type of cache algorithm. As described above, data chunks may be moved within a chunk container for various reasons, such as due to a defragmentation technique, due to a compaction technique that performs garbage collection, etc.
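A minimal sketch of an LRU cache for stream maps is shown below. The `StreamMapCache` class, its `loader` callback, and the capacity of two are hypothetical; the text only states that an LRU or other cache algorithm may be used.

```python
from collections import OrderedDict
from typing import Callable


class StreamMapCache:
    """Keep the most recently used stream maps in memory (LRU eviction)."""

    def __init__(self, capacity: int, loader: Callable[[str], object]):
        self.capacity = capacity
        self.loader = loader          # hypothetical: reads a map from storage
        self.entries: OrderedDict = OrderedDict()
        self.misses = 0

    def get(self, stream_map_id: str):
        if stream_map_id in self.entries:
            # Cache hit: mark the entry as most recently used.
            self.entries.move_to_end(stream_map_id)
            return self.entries[stream_map_id]
        self.misses += 1
        value = self.loader(stream_map_id)
        self.entries[stream_map_id] = value
        if len(self.entries) > self.capacity:
            # Evict the least recently used entry.
            self.entries.popitem(last=False)
        return value


cache = StreamMapCache(capacity=2, loader=lambda sid: f"map-{sid}")
for sid in ["a", "b", "a", "c", "b"]:
    cache.get(sid)
```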

Embodiments are described in this subsection for keeping track of the movement of data chunks within a chunk container. The chunk container identifier is a unique identifier for the chunk container. The chunk container generation indication indicates a revision or generation of the chunk container. For instance, each time that one or more data chunks are moved within the chunk container, the generation indication may be modified.

In an embodiment, a chunk container may be identified by a combination of the chunk container identifier and the chunk container generation indication. In an embodiment, both the chunk container identifier and the chunk container generation indication may be integers. A chunk container may have a fixed size (or a fixed number of entries), or may have a variable size. For instance, in one example embodiment, each chunk container file that defines a chunk container may be sized to store about 16,000 chunks, with an average data chunk size of 64 KB, where the size of the chunk container file is set to 1 GB.
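As a quick check of the example sizing, and assuming binary units (1 GB = 2^30 bytes, 1 KB = 2^10 bytes), the number of average-sized chunks that fit in one container file is:

```latex
\frac{1\ \text{GB}}{64\ \text{KB}} = \frac{2^{30}\ \text{bytes}}{2^{16}\ \text{bytes}} = 2^{14} = 16{,}384\ \text{chunks}
```

This ignores any space taken by the container header, chunk headers, and the redirection table, so the usable count would be somewhat lower.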

In other embodiments, a chunk container file may have an alternative size. Data chunks stored in chunk container may be referenced according to data chunk identifier of metadata FIG. In embodiments, data chunk identifier may include various types of information to enable such referencing. For instance, in an embodiment, data chunk identifier may include one or more of a data chunk container identifier, a local identifier, a chunk container generation value, and a chunk offset value.

The chunk container identifier has a value of chunk container identifier for the chunk container in which the data chunk is stored. The local identifier is an identifier e. The chunk container generation value has the value of chunk container generation indication for the chunk container in which the data chunk is stored, at the time the data chunk is stored in the chunk container The chunk offset value is an offset of the data chunk in chunk container at the time that the data chunk is added to chunk container In an embodiment, a chunk store may implement a reliable chunk locator that may be used to track data chunks that have moved.
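The four fields of the data chunk identifier can be pictured as a small immutable record. The `DataChunkId` name, the field names, and the integer types are assumptions made for illustration; the text only lists what the identifier may include.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataChunkId:
    container_id: int  # chunk container in which the chunk is stored
    local_id: int      # identifier unique within that chunk container
    generation: int    # container generation when the chunk was stored
    offset: int        # chunk offset when the chunk was added


chunk_id = DataChunkId(container_id=7, local_id=42, generation=0, offset=65536)
```

Making the record immutable reflects that the identifier captures the state at the time the chunk was stored; later moves are tracked separately rather than by rewriting the identifier.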

In contrast to conventional techniques, the reliable chunk locator does not use an index for mapping data chunk identifiers to a physical chunk location. Conventional techniques use an index that maps chunk identifiers to the physical location of the chunk data. At the scale of large storage systems, such an index can become very large. If it is fully loaded in memory, it consumes a large amount of the available memory and processor resources.

If the index is not loaded in memory, data accesses become slow because the index needs to be paged into memory. In an embodiment, the reliable chunk locator is implemented in the form of a redirection table, such as redirection table of chunk container in FIG. The redirection table may store one or more entries for data chunks that have been moved in chunk container Each entry identifies a moved data chunk , and has a data chunk offset value indicating the location of the data chunk in chunk container at its new location.

The redirection table may be referenced during rehydration of a data stream to locate any data chunks of the data stream that have moved. The redirection table is used to locate data chunks (including stream maps stored as data chunks) if the data chunks are moved within the chunk container. For instance, the redirection table enables data chunks to be moved within the chunk container for space reclamation as part of a garbage collection and compaction process, and to still be reliably locatable based on their original chunk identifiers. Any number of entries may be included in the redirection table, including hundreds, thousands, and even greater numbers of entries. Each entry includes a local identifier and a changed chunk offset value. For instance, first entry a includes a first local identifier a and a first changed chunk offset value a, and second entry b includes a second local identifier b and a second changed chunk offset value b.

The local identifier is the unique local identifier assigned to a data chunk when it was originally stored in the chunk container. The changed chunk offset value is the new chunk offset value for the moved data chunk that has the corresponding local identifier. As such, the redirection table may be accessed using the local identifier for a data chunk to determine a changed chunk offset value for the data chunk. For example, entry a of the redirection table may be accessed using the local identifier assigned to data chunk b to determine changed chunk offset value a, which indicates a new location for data chunk b in the chunk container. Note that the redirection table may have any size.

In some cases, relocations of data chunks may be infrequent. In an embodiment, after determining a changed chunk offset value for a data chunk, any pointers to the data chunk from stream maps can be modified in the stream maps to the changed chunk offset value, and the entry may be removed from redirection table In some situations, redirection table may be emptied of entries in this manner over time.

Entries may be added to a redirection table in various ways. The flowchart is described as follows. In a first step, the contents of the chunk container are modified. For example, in an embodiment, one or more data chunks in the chunk container may be moved. Such data chunks may be moved by a maintenance task, such as defragmentation or compaction.

In a next step, one or more entries are added to the redirection table that indicate changed chunk offset values for one or more data chunks of the chunk container as a result of the modification. For example, one or more entries may be added to the redirection table that correspond to the one or more moved data chunks. For each moved data chunk, an entry may be generated that indicates the local identifier value of the moved data chunk as the local identifier, and the new offset value of the moved data chunk as the changed chunk offset value. In a further step, the generation indication in the chunk container header is increased as a result of the modification. For example, in an embodiment, the chunk container generation indication may have an initial value of 0, and each time data chunks are moved in the chunk container, the generation indication may be incremented to indicate a higher generation value.
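The three steps above (move chunks, record redirection entries, bump the generation) can be sketched as one method. The `ChunkContainer` class and its dictionary-based fields are hypothetical simplifications; a real container would hold chunk data and a header on disk.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ChunkContainer:
    generation: int = 0  # generation indication, initially 0
    offsets: Dict[int, int] = field(default_factory=dict)      # local id -> current offset
    redirection: Dict[int, int] = field(default_factory=dict)  # local id -> changed offset

    def move_chunks(self, moves: Dict[int, int]) -> None:
        """Move chunks to new offsets and record each move."""
        if not moves:
            return  # nothing moved, so the generation stays unchanged
        for local_id, new_offset in moves.items():
            self.offsets[local_id] = new_offset
            # One redirection entry per moved chunk.
            self.redirection[local_id] = new_offset
        # A single increment covers the whole batch of moves.
        self.generation += 1


# Chunks 2 and 3 slide down after an earlier chunk was reclaimed.
container = ChunkContainer(offsets={2: 100, 3: 200})
container.move_chunks({2: 0, 3: 100})
```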

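In other embodiments, the chunk container generation indication may be modified in other ways. As such, when a data chunk of the chunk container is looked up, the chunk container generation value in its data chunk identifier is compared with the current generation indication of the chunk container. If they are the same, the data chunk can be located at the offset indicated by the chunk offset value in the data chunk identifier.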


If not, the redirection table is read to determine the changed offset value of the data chunk in the chunk container. The data stream assembler processes the stream map, generating a data chunk request for each data chunk referenced by the stream map. In an embodiment, a data chunk request generated by the data stream assembler may include a data chunk identifier to identify the requested data chunk. The located chunk container may be accessed as follows to retrieve requested data chunks.

The generation checker accesses the chunk container identified above as having a chunk container identifier that matches the chunk container identifier of the requested data chunk. The generation checker is configured to compare the chunk container generation indication for the chunk container to the chunk container generation value for the requested data chunk, and to output a generation match indication. If the generation match indication indicates that a match was not found (the values differ), the data chunk retriever accesses the redirection table for a changed chunk offset value.
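The generation check and redirection lookup can be sketched as a single function. `ChunkRef` and `locate` are hypothetical names, and falling back to the original offset when no redirection entry exists is an assumption of this sketch (it treats a missing entry as "this chunk was not among those moved").

```python
from typing import Dict, NamedTuple


class ChunkRef(NamedTuple):
    local_id: int    # local identifier from the data chunk identifier
    generation: int  # container generation recorded when the chunk was stored
    offset: int      # chunk offset recorded when the chunk was stored


def locate(ref: ChunkRef, container_generation: int, redirection: Dict[int, int]) -> int:
    """Return the offset at which the referenced chunk can currently be read."""
    if ref.generation == container_generation:
        # Generations match: nothing has moved since the chunk was stored.
        return ref.offset
    # Mismatch: consult the redirection table by local identifier.
    # Assumption: no entry means this particular chunk did not move.
    return redirection.get(ref.local_id, ref.offset)


redirection = {2: 0}
unmoved = locate(ChunkRef(local_id=2, generation=0, offset=100), 0, {})
moved = locate(ChunkRef(local_id=2, generation=0, offset=100), 1, redirection)
other = locate(ChunkRef(local_id=5, generation=0, offset=300), 1, redirection)
```

The common case (matching generations) costs one integer comparison and no table access, which is the point of avoiding a global identifier-to-location index.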

Data chunk z is the requested data chunk , having been moved in chunk container to second chunk offset value Data chunk is received by data stream assembler In this manner, data stream assembler receives all data chunks referenced by stream map from data chunk retriever , retrieved either directly from chunk container according to the corresponding chunk offset value, or from chunk container as redirected by redirection table Data stream assembler assembles together all of the received data chunks as described elsewhere herein to form data stream It is noted that the stream map reference identifier that resides in the reparse point of a data stream e.

As described above, a stream map may have the form of a data chunk that contains stream map metadata rather than end-user file data. As such, the procedure for addressing a stream map may be the same as for addressing a data chunk; both techniques may use the data chunk identifier structure. The stream map identifier contains the [container identifier, local identifier, generation value, offset value] information that may be used to locate (either directly or through a redirection table) the stream map data chunk inside the stream container. As such, in an embodiment, the format and layout of a stream container may be essentially the same as that of a chunk container.

Example Data Optimization Backup Embodiments

Backing up and restoring a data system deploying a data optimization technique is difficult because data is shared between multiple data streams in a chunk store. As such, the data is separated from the file namespace. However, data backup and restore capabilities are useful. Entities typically would not be amenable to deploying a data optimization solution without effective data backup integration.

In embodiments, various backup techniques are provided for data optimization environments, including optimized backup, un-optimized backup, item-level optimized backup, and a hybrid backup technique. Furthermore, in embodiments, heuristics may be used to select between different backup techniques. For instance, heuristics may be used to select between optimized and un-optimized backup. Embodiments provide data optimization systems with optimized backup techniques, such that data may be backed up in its optimized e.

Embodiments may enable data backup to use less backup media space, and may be used to reduce the backup time window, which is significant considering year-to-year data growth. Furthermore, embodiments may enable faster data restore from backup e. In embodiments, backup in data optimization systems may be performed in various ways. Flowchart may be performed by chunk store interface of FIG. Flowchart is described with respect to FIG. Flowchart may be performed by data backup system In step of flowchart , a plurality of optimized data streams stored in a chunk store is identified for backup.

For example, with reference to FIG. Request may identify the one or more optimized data streams by corresponding file names for optimized stream structures e. Each optimized stream structure references a stream map chunk of stream map chunks in stream container that contains metadata describing a mapping of the optimized data stream to data chunks stored in one or more of chunk containers , as described above.

In a next step, at least a portion of the chunk store is stored in a backup storage to back up the plurality of optimized data streams. In response to the request, the data backup module may store at least a portion of the chunk store in backup storage so that the optimized data streams identified in the request are stored in backup storage. The data backup module may do this in various ways. For instance, in one embodiment, for each optimized data stream, the data backup module may determine the corresponding stream map chunk and the data chunks referenced by the stream map chunk, and may store the determined stream map chunk and data chunks in backup storage. In further embodiments, the data backup module may store larger portions of the chunk store in backup storage that include the determined chunks of the optimized data streams.

As such, data backup module may be configured to store optimized data streams in various ways according to step Data backup module may include any one or more of modules , , , and , in embodiments. Modules , , , and enable optimized data streams to be stored in backup storage in various ways. Modules , , , and are described as follows. For instance, optimized file backup module may be configured to perform optimized backup by storing optimized data streams in backup storage in optimized form.

According to optimized backup, one or more entire chunk containers of the chunk store may be stored to back up optimized data streams.

For instance, optimized file backup module may perform a flowchart shown in FIG. Flowchart and optimized file backup module are described as follows. In step , the chunk store is stored in its entirety in the backup storage. In an embodiment, optimized file backup module may store chunk store in its entirety in backup storage so that the optimized data streams indicated in request are backed up. Alternatively, optimized file backup module may store one or more entire chunk containers e.

The chunk containers to be backed up may be identified by chunk container identifiers of the chunk identifiers for the chunks referenced by the optimized stream structures of the optimized data streams. In step , a plurality of stream metadata stubs are stored in the backup storage for the plurality of optimized data streams that link to corresponding data in the chunk store. The stream metadata stubs may be retrieved by optimized file backup module from other storage e.

In an embodiment, stream metadata stubs corresponding to the optimized data streams identified in the request are stored by the optimized file backup module in backup storage using a store operation. The combination of stored chunk containers and stored stream metadata stubs in backup storage provides complete storage of the optimized data streams, in optimized (non-reassembled) form. No additional metadata needs to be stored. Optimized backup may be desirable if most of the backed up volume (an accessible storage area) is in scope.

As such, more data overall may be backed up than is needed. If relatively little of the backed up volume is in scope e. Restore of optimized data streams that have been backed up according to optimized backup involves restoring the chunk store container files and stream metadata stubs such that the restore is an optimized restore. Techniques for selective restore of optimized data streams from optimized backup are described further below in the next subsection.

As such, embodiments for optimized backup may provide various benefits. For instance, optimized backup may result in smaller storage size backups, which save backup media space e. This may be useful for disk-based backup solutions. In some cases, the storage space savings in the backup storage may be more significant and cost-effective than the savings in the primary storage. Optimized backup may result in a faster time used to perform backup. The backup execution time may be reduced, which is significant considering the growth in the amount of data being stored from year-to-year.

Due to the growth in quantity of data, storage users may struggle with performing frequent backups e. Optimized backup techniques may assist with reducing the backup execution time. Furthermore, optimized backup may shorten RTO, leading to faster restores. According to non-optimized backup, optimized data streams designated for backup storage may be rehydrated prior to being stored. For instance, in an embodiment, rehydrating backup module may be configured to perform non-optimized backup by rehydrating optimized data streams, and storing the rehydrated data streams in backup storage Rehydrating backup module may perform a flowchart shown in FIG.

Flowchart and rehydrating backup module are described as follows. In step of flowchart , each optimized data stream is rehydrated into a corresponding un-optimized data stream that includes any data chunks referenced by the corresponding optimized stream metadata. Rehydrating backup module may rehydrate optimized data streams in any manner, including as described above for rehydration module FIG. For example, referring to FIG. Stream container accesses may identify desired stream map chunks by corresponding stream map identifiers.

In response to stream container accesses , rehydrating backup module receives stream map chunks from stream container corresponding to the optimized data streams identified in request Rehydrating backup module uses the pointers to retrieve each of the referenced data chunks Rehydrating backup module may use data stream offsets included in the retrieved stream maps e. In step , each un-optimized data stream is stored in the backup storage. As such, when a data stream is desired to be restored from backup storage , the un-optimized data stream stored in backup storage may be retrieved.

Thus, un-optimized backup may be desirable if the data chunks of the optimized data streams fill a relatively small portion of the storage space of the chunk store. Rather than storing entire chunk containers in backup storage, which may include storing a large number of data chunks unrelated to the optimized data streams identified in the request, the specific optimized data streams may be rehydrated and stored in backup storage. As such, the backup media stores the data streams in their un-optimized, original form, and backup storage space may be saved by avoiding storing unrelated data chunks.
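The trade-off between optimized and un-optimized backup can be sketched as a simple heuristic on how much of the chunk store is in scope. The function name, the byte-count inputs, and especially the 50% threshold are assumptions for illustration; the text does not specify how the heuristic is computed.

```python
def choose_backup_mode(in_scope_bytes: int, chunk_store_bytes: int,
                       threshold: float = 0.5) -> str:
    """Pick a backup technique from the share of the chunk store in scope.

    threshold is a hypothetical tuning value, not one given by the source.
    """
    if chunk_store_bytes <= 0:
        raise ValueError("chunk_store_bytes must be positive")
    fraction = in_scope_bytes / chunk_store_bytes
    # Mostly in scope: copying whole containers wastes little space.
    # Mostly out of scope: rehydrate only the selected streams.
    return "optimized" if fraction >= threshold else "un-optimized"


mostly_in_scope = choose_backup_mode(in_scope_bytes=900, chunk_store_bytes=1000)
mostly_out_of_scope = choose_backup_mode(in_scope_bytes=50, chunk_store_bytes=1000)
```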

Embodiments for un-optimized backup may provide various benefits. For instance, un-optimized backup, which uses more selective backup and enables selective restore , is relatively easy to implement. Restoring of the backed up data streams does not depend on the data-optimization technique used by a storage system, because the data streams are backed up in un-optimized form. As such, the data streams can be restored anywhere and accessed without dependency on an installed and functional data-optimization solution.

Due to the rehydration process, un-optimized backup can be relatively slow, which affects the performance of the data backup module. Rehydrating optimized data is slower than a regular data read because the optimized data must be decompressed, and potentially also because of data fragmentation. Furthermore, because data streams are backed up in un-optimized form, the total amount of data that is backed up may be large, because the advantages of deduplication are not present.

The total amount of data is potentially larger than the volume being backed up, because the volume is optimized while the backup data is not. In some cases, it may not be possible to back up all of the data due to backup storage size limitations. Thus, un-optimized backup may be selected for use in particular backup situations. Examples of selecting un-optimized backup versus other backup techniques are described further below.

In another embodiment, item level backup may be performed to store optimized data streams in an item level optimized form. According to item level backup, optimized data streams designated for backup storage are prepared for storage in optimized form. For instance, for a particular optimized data stream, the data chunks referenced by the stream map chunk of the optimized data streams are determined. Any of the referenced data chunks that have already been stored in backup storage are not retrieved from the chunk store.

Any of the referenced data chunks that are not already stored in backup storage are retrieved from the chunk store, and are stored in backup storage. For instance, in an embodiment, item level backup module may be configured to perform item level backup by storing in backup storage the stream map chunks and referenced data chunks that are not already stored in backup storage For instance, item level backup module may perform a flowchart shown in FIG. Flowchart may be performed for each optimized data stream identified in request Flowchart and item level backup module are described as follows.

In step of flowchart , a first optimized data stream identified for backup is received. For example, request may identify an optimized data stream. For the optimized data stream, item level backup module may retrieve the optimized stream metadata e. In an embodiment, item level backup module may compare the data chunks referenced by the metadata of the stream map chunk with any data chunks already stored in backup storage to determine whether any of the referenced data chunks are already stored in backup storage Item level backup module may perform the comparison in any way, including by comparing hashes of the referenced data chunks with hashes of the data chunks stored in backup storage e.

Item level backup module may keep track of the referenced data chunks determined to not already be stored in backup storage , such as by maintaining a list or other data structure e. In step , the optimized stream metadata of the first optimized data stream is stored in the backup storage. For instance, item level backup module may store the stream map chunk retrieved from stream container for the optimized data stream in backup storage e.

As such, when the data stream is desired to be restored from backup storage , the stream map chunk and data chunks stored in backup storage for the optimized data stream may be retrieved from backup storage , and rehydrated to form the data stream. In an embodiment, the optimized stream structure e.

Thus, item level backup e.



According to item level backup, deduplicated data chunks are associated with every data stream that is backed up while maintaining the backup in optimized form e. Whole chunk store containers are not backed up. In an embodiment, a backup (and optionally restore) API (application programming interface) may be implemented by the item level backup module. For instance, a backup session may be defined for backing up a first file and a second file stored in the chunk store in optimized form.

In this example, it is assumed that the first file includes data chunks a and b of chunk container a and data chunk d of chunk container b, and the second file includes data chunks a-c of chunk container a and data chunks d and e of chunk container b. The backup API may be called to back up the first file. In response, the API returns the stream map chunk for the first file and data chunks a, b, and d.

The returned stream map chunk and data chunks a, b, and d are stored in backup storage for the first file. The backup API may subsequently be called to back up the second file. In response, the API returns the stream map chunk for the second file and data chunks c and e. This is because the API determines that data chunks a, b, and d of the second file are already stored in backup storage due to the first file having been backed up. As such, the returned stream map chunk and data chunks c and e are stored in backup storage for the second file.
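The two-file session described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `BackupStorage` and `backup_item` are hypothetical names, not the API of the embodiment, and chunk identifiers are reduced to single letters.

```python
from dataclasses import dataclass, field


@dataclass
class BackupStorage:
    """Toy backup target: one stream map per file plus a shared pool of chunks."""
    stream_maps: dict = field(default_factory=dict)  # file name -> list of chunk ids
    chunks: dict = field(default_factory=dict)       # chunk id -> chunk bytes


def backup_item(name, stream_map, chunk_store, backup):
    """Back up one optimized file and return the chunk ids that had to be copied."""
    # Only chunks missing from backup storage are read from the chunk store.
    # dict.fromkeys removes repeats while keeping the stream-map order.
    new_ids = list(dict.fromkeys(cid for cid in stream_map if cid not in backup.chunks))
    for cid in new_ids:
        backup.chunks[cid] = chunk_store[cid]
    backup.stream_maps[name] = list(stream_map)
    return new_ids


# Hypothetical chunk store holding chunks a-e.
chunk_store = {cid: f"data-{cid}".encode() for cid in "abcde"}
backup = BackupStorage()

first = backup_item("file1", ["a", "b", "d"], chunk_store, backup)
second = backup_item("file2", ["a", "b", "c", "d", "e"], chunk_store, backup)
print(first)   # ['a', 'b', 'd']
print(second)  # ['c', 'e']
```

The second call copies only chunks c and e because a, b, and d were already placed in backup storage by the first call.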

A restore module to restore data streams backed up according to item level backup may use a similar API, where the optimized stream metadata aids the restore module API to point to the referenced data chunks. However, item level backup may be relatively slow and complex, and may not work well with block level backup techniques.

In still another embodiment, data chunk identifier backup may be performed to store optimized data streams in another type of optimized form. According to data chunk identifier backup, the data chunk identifiers are determined for optimized data streams to be backed up. The data chunk identifiers are stored in backup storage, and the chunk containers that store the referenced data chunks are stored in backup storage.

For instance, in an embodiment, data chunk identifier backup module may be configured to perform data chunk identifier backup by storing in backup storage the data chunk identifiers referenced by the optimized stream metadata of the data streams, and the chunk containers that store the chunks identified by the associated data chunk identifiers.

For instance, data chunk identifier backup module may perform a flowchart shown in FIG. Flowchart and data chunk identifier backup module are described as follows. In step of flowchart , the optimized stream metadata of each optimized data stream is analyzed to determine a corresponding at least one data chunk identifier for the at least one data chunk referenced by the optimized stream metadata.

For example, in an embodiment, for each optimized data stream identified in request , data chunk identifier backup module may retrieve the corresponding optimized stream metadata in the form of a stream map chunk from stream container Furthermore, data chunk identifier backup module may analyze the metadata of the retrieved stream map chunk to determine the data chunk identifiers for any data chunks referenced by the metadata of stream map chunk from chunk containers The data chunk identifiers are included in the metadata e.

Data chunk identifier backup module may reference a redirection table e.


In step , an optimized stream structure for each optimized data stream is stored in the backup storage with the corresponding at least one data chunk identifier. For instance, data chunk identifier backup module may store the optimized stream structures of each of the optimized data streams in backup storage with the corresponding data chunk identifiers determined in step e. The chunk identifiers may be stored in association with the optimized stream structures in any manner, including being stored external to or within the corresponding optimized stream structures in backup storage In an embodiment, data chunk identifier backup module may store stream container in backup storage so that the stream map chunks of the optimized data streams are backed up e.

Furthermore, data chunk identifier backup module may store the one or more of chunk containers that store any data chunks referenced by the stream map chunks of the optimized data streams so that the referenced data chunks are backed up e. In an embodiment, data chunk identifier backup module may store chunk store in its entirety in backup storage so that all chunks of the optimized data streams are backed up.

A benefit of data chunk identifier backup is that a backup application e. Similarly to un-optimized backup described above, data chunk identifier backup can be relatively slow and complex, and may use a backup API implemented by data chunk identifier backup module Furthermore, data chunk identifier backup breaks the modularity of the chunk store and exposes the internal chunk store implementation.

As such, optimized backup, un-optimized backup, item level backup, and data chunk identifier backup are each backup embodiments that may be implemented by data backup module of FIG. Furthermore, data backup module may implement backup embodiments that are combinations of one or more of optimized backup, un-optimized backup, item level backup, and data chunk identifier backup. For instance, in one embodiment, data backup module may implement a combination of optimized backup and un-optimized backup.

In an embodiment, optimized backup may be selected to be performed when a full volume backup is performed. If a single optimized data stream is to be backed up, un-optimized backup may be performed.



For numbers of optimized data streams between one and all optimized data streams, either optimized backup or un-optimized backup may be selected. For instance, the choice between optimized backup and un-optimized backup may be made based on heuristics. For instance, in an embodiment, the data backup module may be configured to perform backup of optimized data streams according to the process shown in FIG. In one step of the flowchart, a backup technique is selected based on heuristics. The heuristics module is configured to determine heuristics used to select a backup technique to be used to back up optimized data streams.

In step , the backup technique is performed to backup the plurality of optimized data streams in the backup storage. Based on the heuristics determined by heuristics module , one of optimized file backup module or rehydrating backup module may be selected to backup the optimized data streams e. For instance, in an embodiment, heuristics module may provide an enable signal to the one of optimized file backup module or rehydrating backup module selected to backup the optimized data streams.

The enable signal enables the one of optimized file backup module or rehydrating backup module to backup the optimized data streams according to their corresponding backup technique. As such, heuristics module may determine that optimized data streams were selected for backup according to an exclude mode.

In such case, the optimized data stream backup technique is selected by heuristics module due to the determined exclude mode. As such, it is a better tradeoff to backup the selected data streams in their optimized form plus all chunk store container files, even though the container files may include some chunks referenced by optimized data streams other than those that were selected by the user. As such, heuristics module may determine that optimized data streams were selected for backup according to an include mode.

In such case, the un-optimized data stream backup technique is selected by heuristics module due to the determined include mode. As such, it is a better tradeoff to backup just the selected files in their un-optimized form. In such case, there is no need to backup the chunk store container files. In another embodiment, heuristics module may implement relatively advanced heuristics. A tradeoff in backup technique may be based on a delta between the optimized-size and logical-size of files within a backup scope.

A storage space waste may be determined as the storage space in chunk store container files consumed by chunks that are referenced only by optimized data streams that are not included in the backup scope.


As such, heuristics module may determine e. In an embodiment, some assumptions may be made unless the chunk store is configured to accurately report how many chunks are referenced by a given group of data streams. If the chunk store is capable of reporting a number of chunks referenced by a designated group of optimized data streams, heuristics module may determine the amount of wasted storage space as the space filled by all chunks minus the space filled by chunks referenced by optimized data streams designated for backup.

Based on the amount of wasted storage space e. For instance, referring to FIG. Heuristics module may select the backup technique to be the optimized data stream backup technique if the determined amount of space in chunk store consumed by the non-referenced data chunks is less than a predetermined threshold e.
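A minimal sketch of this wasted-space heuristic follows. The function names, the chunk sizes, and the 50% threshold are assumptions made for illustration; the text does not give a concrete threshold, and the behaviour when the waste exactly equals the threshold is also a choice made here.

```python
def wasted_space(chunk_sizes, stream_maps, backup_scope):
    """Space held by chunks that no stream in the backup scope references."""
    referenced = set()
    for name in backup_scope:
        referenced.update(stream_maps[name])
    total = sum(chunk_sizes.values())
    used = sum(chunk_sizes[cid] for cid in referenced)
    return total - used


def select_backup_technique(waste, total, threshold_fraction=0.5):
    """Pick optimized backup when the waste is below the (assumed) threshold."""
    if waste < threshold_fraction * total:
        return "optimized"
    return "un-optimized"


# Hypothetical chunk store: five chunks of 100 bytes each.
chunk_sizes = {cid: 100 for cid in "abcde"}
stream_maps = {"file1": ["a", "b", "d"], "file3": ["a"]}
total = sum(chunk_sizes.values())

waste_1 = wasted_space(chunk_sizes, stream_maps, ["file1"])
waste_3 = wasted_space(chunk_sizes, stream_maps, ["file3"])
print(waste_1, select_backup_technique(waste_1, total))  # 200 optimized
print(waste_3, select_backup_technique(waste_3, total))  # 400 un-optimized
```

Backing up only file3 would drag along 400 bytes of unreferenced chunks if whole containers were copied, so the sketch falls back to un-optimized backup in that case.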

Heuristics module may select the backup technique to be the un-optimized data stream backup technique if the determined amount of space in the chunk store consumed by the non-referenced data chunks is greater than the predetermined threshold. In one example, heuristics module may perform the following analysis to set a backup technique. Given a backup scope which includes one or more namespace roots, where a namespace root is a folder including sub-folders recursively, the following heuristics parameters may be defined:

If the backup application uses unoptimized backup, then the total backup size on the backup media is A.

Example Data Optimization Restore Embodiments

With reference to FIG. For instance, one or more chunks e. In such case, the backed up version of the optimized data stream may be desired to be restored from backup storage in non-optimized form.

Currently used backup techniques are likely to back up entire file system namespaces, which includes optimized files and chunk store container files. Furthermore, many currently used backup techniques still use sequential backup media for backup storage , such as tape media. As such, it is desirable for techniques for restoring data streams to be capable of taking into account different backup media types and backup formats.

Embodiments are described in this subsection for restoring optimized data streams from backup storage. Embodiments enable optimized data streams e. According to an embodiment, the chunks for a particular data stream are enabled to be restored from backup storage rather than restoring further unneeded portions of the backed up chunk containers. Embodiments enable selective restore of data streams out of a full optimized backup, without requiring an understanding of the data optimization metadata, and without making assumptions about the backup media type or backup format.

In embodiments, restore of data streams in data optimization systems may be performed in various ways. Note that in an embodiment, data backup system of FIG. Backup storage stores a stream container and one or more chunk containers e. Furthermore, data restore module includes a file reconstructor and a restore application Flowchart may be performed by file reconstructor , in an embodiment.

Flowchart and data restore module are described as follows. Restore application may include restore functionality only, or may also include backup functionality e. In the case of a data optimization system, the file written to storage by restore application in response to a file restore request is the optimized stream structure, which contains a reference to the stream map chunk.

For instance, restore application may receive a request for a data stream to be retrieved from backup storage Request may identify the data stream by file name. However, because optimized stream structure is a reparse point for the data stream requested in request e. For instance, file reconstructor may be configured to reconstruct the data stream according to flowchart in FIG. In step of flowchart , a request for a data stream to be retrieved from a chunk store in storage is received, the request including an identifier for optimized stream metadata corresponding to the data stream.

Request is a request for file reconstructor to reconstruct the data stream corresponding to optimized stream structure Request includes optimized stream structure Optimized stream structure includes an optimized stream metadata indicator, such as a stream map indicator, that may be used to locate optimized stream metadata e. As described elsewhere herein, the optimized stream metadata includes references to data chunks that may be used by file reconstructor to reassemble the data stream.

File reconstructor may be implemented in hardware, software, firmware, or any combination thereof. For instance, in an embodiment, file reconstructor may be an application programming interface API configured to rehydrate data streams based on an optimized data structure restored from backup storage. In an embodiment, the API of file reconstructor may be configured to receive request in the following form:. In step , a first call to a restore application is generated based on the optimized stream metadata identifier, the first call specifying a file name for a first chunk container in storage that stores optimized stream metadata identified by the optimized stream metadata identifier, and specifying an offset for the optimized stream metadata in the first chunk container.

Callback module is configured to generate calls, and is used by file reconstructor e. Callback module may make calls back to restore application e. File reconstructor receives the restored chunks, and uses them to reassemble the identified data stream. For instance, in an embodiment, callback module may implement calls according to the following callback structure:. In an embodiment where the optimized stream metadata is stored in the form of a stream map chunk, the first call generated by callback module according to step is a request to restore application for the stream map chunk identified by the stream map chunk identifier in optimized stream structure In an embodiment, callback module may generate the first call according to the callback structure described above.

With regard to the example of FIG. In step , the optimized stream metadata is received in response to the first call. In an embodiment, restore application receives first call , and accesses backup storage to obtain the stream map chunk identified in first call Restore application generates an access to stream container in backup storage , and retrieves a stream map chunk at the offset in stream container indicated in first call Restore application transmits stream map chunk to file reconstructor in a response In step , at least one data chunk identifier referenced in the optimized stream metadata is determined.
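The call sequence between the file reconstructor and the restore application can be pictured with a short Python sketch. Everything named here is hypothetical: the actual callback structure is not reproduced in this text, so the sketch only assumes that each call carries a container file name and an offset, as the steps above describe.

```python
def reconstruct(stream_map_ref, read_from_backup):
    """Rehydrate a data stream by calling back into the restore application.

    stream_map_ref: (container file name, offset) taken from the optimized
        stream structure; it locates the stream map chunk.
    read_from_backup: hypothetical callback supplied by the restore
        application; it returns whatever is stored at (file name, offset).
    """
    container, offset = stream_map_ref
    # First call: fetch the stream map chunk (the optimized stream metadata).
    stream_map = read_from_backup(container, offset)
    # Further calls: one per data chunk referenced by the stream map.
    parts = [read_from_backup(name, off) for name, off in stream_map]
    return b"".join(parts)


# Toy backup storage keyed by (container file name, offset).
backup_storage = {
    ("stream.ccc", 0): [("chunks_a.ccc", 0), ("chunks_b.ccc", 64)],
    ("chunks_a.ccc", 0): b"hello ",
    ("chunks_b.ccc", 64): b"world",
}


def restore_application(name, offset):
    # Stands in for the restore application reading from the backup media.
    return backup_storage[(name, offset)]


restored = reconstruct(("stream.ccc", 0), restore_application)
print(restored)  # b'hello world'
```

Because the reconstructor only asks for named files at given offsets, it needs no knowledge of the backup media type or backup format, which is the point the surrounding text makes.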

In an embodiment, file reconstructor analyzes metadata of the stream map e. The metadata of the stream map includes pointers in the form of data chunk identifiers e.

NewSQL is an emerging database system and is now being used more and more widely. NewSQL systems need to improve data reliability by periodically backing up in-memory data, resulting in a lot of duplicated data. The traditional deduplication method is not optimized for the NewSQL server system and cannot take full advantage of hardware resources to optimize deduplication performance.

Therefore, how to utilize these hardware resources to optimize the performance of data deduplication is an important issue. To take advantage of the large number of CPU cores in the NewSQL server to optimize deduplication performance, DOMe parallelizes the deduplication method based on the fork-join framework. The fingerprint index, which is the key data structure in the deduplication process, is implemented as a pure in-memory hash table, which makes full use of the large DRAM in the NewSQL system and eliminates the fingerprint index performance bottleneck of the traditional deduplication method.


DOMe is experimentally analyzed using two representative sets of backup data. In the case of the theoretical speedup ratio of the server is

Deduplication is an efficient data reduction technology, and it is used to mitigate the problem of huge data volume in storage systems. At present, deduplication is widely used in storage systems [1, 2], especially in backup systems [3-5]. Database systems are an important part of IT infrastructure and are ubiquitous nowadays. Therefore, how to efficiently carry out deduplication on database backup data has been an important area of study. Previous studies have investigated the effect of data deduplication on these data [6, 7].

With the continuous development of semiconductor technology, database server hardware is about to change drastically. As a result, the database architecture will also change. As the architecture of the NewSQL database and its server hardware changes drastically, the deduplication method for NewSQL backup data needs to be redesigned to optimize deduplication performance. The challenge of optimizing deduplication performance lies mainly in the following aspects:

First, in recent years, in-memory computing technology has attracted the attention of researchers [12] because of the rising capacity and falling prices of DRAM. A traditional relational database management system (RDBMS) keeps data on disk and maintains a buffer cache in memory to improve performance. Consequently, the use of in-memory computing technology to improve NewSQL database performance has become a common method. However, backups of NewSQL and traditional databases have different characteristics.

As a result, how to redesign the deduplication method so as not to affect backup performance is a challenge. Second, an efficient parallel deduplication method is needed. Content-defined chunking (CDC) [7] can achieve a high duplicate elimination ratio (DER), and is therefore the most widely used data chunking algorithm in deduplication. However, the CDC algorithm needs to slide one byte at a time and calculate the Rabin fingerprint of the backup data stream at each position, which leads to very low performance.

Thus, how to design an efficient parallel deduplication method that uses a large number of CPU cores to improve deduplication performance is a challenge. To address these challenges, we design a deduplication framework called DOMe that takes advantage of next-generation NewSQL server features to optimize performance. When the theoretical speedup ratio of the server system is 24, the speedup ratio of DOMe can reach up to 18; 3 DOMe improved the deduplication throughput by 1. Chunking is the first step in the data deduplication process, in which a file or data stream is divided into small chunks of data so that each can be fingerprinted.

In FSC, if a part of a file or data stream, no matter how small, is modified by the operation of insertion or deletion, not only is the data chunk containing the modified part changed but also all subsequent data chunks will change, because the boundaries of all these chunks are shifted. This can cause otherwise identical chunks before modification to be completely different, resulting in a significantly reduced duplicate identification ratio of FSC-based data deduplication.

To address this boundary-shift problem [13], the content-defined chunking (CDC) algorithm was proposed in LBFS [14] to chunk files or data streams for duplicate identification. By picking the same relative points in the object to be chunk boundaries, CDC localizes the new chunks created in every version to regions where changes have been made, keeping all other chunks the same.
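The boundary-shift effect can be reproduced with a small Python experiment. This is a sketch under assumptions: the window size, the mask, and the byte-sum hash are illustrative stand-ins (a real CDC implementation would use Rabin fingerprints and usually enforces minimum and maximum chunk sizes, which are omitted here).

```python
import random

WINDOW = 16   # bytes hashed at each position (assumed value)
MASK = 0x3F   # cut when hash & MASK == 0, roughly one cut per 64 bytes


def fixed_size_chunks(data, size=64):
    """Fixed-size chunking: boundaries depend only on byte offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data):
    """Content-defined chunking: boundaries depend only on local content."""
    chunks, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        window_hash = sum(data[i - WINDOW + 1:i + 1])  # toy hash, not Rabin
        if window_hash & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(a, b):
    """Number of distinct chunks that appear in both chunk lists."""
    return len(set(a) & set(b))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
modified = b"X" + original  # insert a single byte at the front

fixed_shared = shared(fixed_size_chunks(original), fixed_size_chunks(modified))
cdc_shared = shared(content_defined_chunks(original), content_defined_chunks(modified))
print("fixed-size chunks still shared:", fixed_shared)
print("content-defined chunks still shared:", cdc_shared)
```

With fixed-size chunking the one-byte insertion shifts every later boundary, so no chunk of the original reappears. With the content-defined variant, every chunk after the first is found again in the modified stream because each boundary is decided by the bytes around it rather than by its offset.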

Based on the CDC algorithm, many researchers have proposed improved chunking algorithms [7, 16, 17]. However, most of these methods aim at eliminating more duplicated data rather than improving performance. To improve performance, researchers have proposed using multi-threading to speed up the deduplication process [18]. However, these methods are based on a traditional parallel programming model and can be further improved to achieve better performance. Therefore, we propose using the fork-join model to speed up deduplication.

Fork-join is a state-of-the-art parallel programming model that can make better use of the multi-core resources of modern CPUs. Although our proposed method is exemplified with CDC, it can also be used for other state-of-the-art CDC-based chunking algorithms. The framework of DOMe is shown in Fig 1. After being loaded into DRAM, the data stream goes through a parallelized deduplication process. When a task is smaller than a predefined threshold, the current subtask is assigned to a CPU core and the deduplication process is performed. The advantage of in-line deduplication over post-process deduplication is that it requires less storage, since duplicate data is never stored.

On the negative side, because hash calculations and lookups take so long, data ingestion can be slower, thereby reducing the backup throughput of the device. Because the NewSQL database requires high backup performance, we choose post-process deduplication. The deduplication process requires a lot of CPU resources. So as not to affect normal NewSQL transaction processing, deduplication should run only when the system is relatively idle, so that it does not take CPU resources needed for transaction processing.

On a Linux system, the uptime command can be used to monitor the system load. When the monitored system load stays below a preset threshold for 1 minute, the system is determined to be idle. DOMe then reads the backup data and processes it. Data chunking, the central part of deduplication, divides the data stream into small chunks, which are used as the basic unit for duplicated data detection.
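The idle check can be sketched in Python with `os.getloadavg()`, which returns the same load averages that `uptime` prints on Unix-like systems. The threshold, the sampling interval, and the function names are assumptions; the text does not specify them.

```python
import os
import time


def stays_idle(load_samples, threshold):
    """True when every sampled load value is below the threshold."""
    return all(load < threshold for load in load_samples)


def sample_load(duration_s=60, interval_s=5):
    """Collect the 1-minute load average at regular intervals (Unix only)."""
    samples = []
    for _ in range(max(1, duration_s // interval_s)):
        one_minute, _, _ = os.getloadavg()
        samples.append(one_minute)
        time.sleep(interval_s)
    return samples


def wait_until_idle(threshold=1.0):
    """Block until the load has stayed below the (assumed) threshold."""
    while not stays_idle(sample_load(), threshold):
        pass  # keep sampling; deduplication starts only once the system is idle
```

A post-process deduplication job would call `wait_until_idle()` before reading the backup data, so that it does not compete with transaction processing for CPU time.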

FSP divides the data stream into fixed-length chunks of data, requiring only a small amount of CPU computation. It therefore has the advantage of high performance, but it suffers from data update sensitivity: after an insertion or deletion in the data stream, all the data after the change point slides forward or backward, and this data is wrongly detected as new. CDC avoids this situation, but because CDC slides the detection window one byte at a time and calculates the Rabin fingerprint of the current window, it requires a large amount of CPU computation.

Therefore, its performance is significantly lower than that of FSP. To address this problem, we propose a fork-join based parallel deduplication method to improve performance. How to effectively parallelize the deduplication process is a complex issue.

The fork-join framework can significantly reduce the complexity of parallel deduplication. Parallel sections may fork recursively until a certain task granularity is reached. The difficulty of parallel deduplication with fork-join is how to split tasks. If the task is divided at the midpoint of the data stream, some data chunks are erroneously divided into smaller chunks, which will be identified as new data chunks, reducing the duplicate elimination ratio, as shown in Fig 2.

To solve this problem, we place the division point of the fork-join method exactly at a CDC chunk boundary, as shown in Fig 3. The method is detailed in algorithm 1. The purpose of duplicated data detection is to identify the unique data among the current and all stored backup data. Duplicated data detection is based on the following property of hash functions such as MD5 or SHA-1.
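The idea of splitting at a chunk boundary instead of at the raw midpoint can be sketched as follows. This is not algorithm 1 from the paper, which is not reproduced here. It is a simplified Python illustration in which the toy byte-sum hash, the window size, the threshold, and all function names are assumptions, and a standard-library thread pool stands in for a fork-join framework.

```python
import random
from concurrent.futures import ThreadPoolExecutor

WINDOW = 16   # bytes hashed at each position (assumed value)
MASK = 0x3F   # a chunk ends where hash & MASK == 0


def is_boundary(data, i):
    """True if a chunk ends at byte i (toy hash over the window ending at i)."""
    return i >= WINDOW - 1 and sum(data[i - WINDOW + 1:i + 1]) & MASK == 0


def chunk_range(data, lo, hi):
    """Sequentially chunk data[lo:hi]; lo is assumed to be a chunk start."""
    chunks, start = [], lo
    for i in range(lo, hi):
        if is_boundary(data, i):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < hi:
        chunks.append(data[start:hi])
    return chunks


def split_tasks(data, lo, hi, threshold):
    """Fork step: split [lo, hi) at the first chunk boundary past the midpoint."""
    if hi - lo <= threshold:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    cut = next((i + 1 for i in range(mid, hi - 1) if is_boundary(data, i)), None)
    if cut is None:          # no boundary found, keep the range as one task
        return [(lo, hi)]
    return split_tasks(data, lo, cut, threshold) + split_tasks(data, cut, hi, threshold)


def parallel_chunks(data, threshold=1024, workers=4):
    """Chunk each subtask on a worker, then join the results in order."""
    tasks = split_tasks(data, 0, len(data), threshold)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda task: chunk_range(data, *task), tasks))
    return [chunk for part in parts for chunk in part]


rng = random.Random(1)
data = bytes(rng.getrandbits(8) for _ in range(8192))
sequential = chunk_range(data, 0, len(data))
parallel = parallel_chunks(data)
print(parallel == sequential)
```

Because every split point coincides with a chunk boundary, the parallel run produces the same chunk list as the sequential one, so no chunk is cut artificially and misdetected as new. Python threads do not give real CPU parallelism for this workload; the thread pool only illustrates the task structure.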

Property 1 [19]: For two different data chunks, the probability of their corresponding MD5 hash value collision i. Because this probability is very low in practice, each unique data chunk can be treated as having a unique hash value. Based on the above property, the MD5 hash value can be regarded as a fingerprint of the chunk. Therefore, the duplicated data detection of each chunk can be achieved by checking the fingerprint, which can be formally defined as follows:

Definition 1 (Duplicated data detection). If true, the data chunk c is considered duplicated; otherwise, it is considered a unique chunk. The detection process is detailed in algorithm 2. When DOMe is started, a purely in-memory hash table is initialized as the fingerprint index (line 1). Then, the buffer is constantly checked for any available chunk (line 3). When a data chunk is read (line 4), its MD5 value is calculated as the fingerprint fp (line 5). The fingerprint index is searched to find whether fp exists (line 6).

If fp is not found (line 7), the current chunk is considered unique. The unique chunk is appended to the chunk store file. Afterward, the offset of the start position of the chunk is returned (line 8). If fp is found in the fingerprint index (line 10), the current chunk is duplicated and need not be written. The chunk information (chunkstoreFileName, offset) can be obtained from the fingerprint index according to fp. Finally, regardless of whether the current chunk is duplicated, the chunk information (chunkstoreFileName, offset) needs to be written to the metadata file of the current backup for data recovery. The fingerprint index holds the fingerprint and location information of all written chunks.
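The detection loop described above can be sketched in Python. This is an illustration rather than the paper's algorithm 2: the function name is hypothetical, an in-memory `io.BytesIO` stands in for the chunk store file, and a Python dict plays the role of the in-memory fingerprint index.

```python
import hashlib
import io


def deduplicate(chunks, chunk_store, store_name, index=None):
    """Detect duplicates by fingerprint and return (index, metadata).

    index maps an MD5 fingerprint to (chunk store file name, offset);
    metadata lists one such location per input chunk, duplicated or not.
    """
    index = {} if index is None else index
    metadata = []
    for chunk in chunks:
        fp = hashlib.md5(chunk).digest()          # fingerprint of the chunk
        location = index.get(fp)
        if location is None:                      # unique: append to the store
            offset = chunk_store.seek(0, io.SEEK_END)
            chunk_store.write(chunk)
            location = (store_name, offset)
            index[fp] = location
        metadata.append(location)                 # always recorded for recovery
    return index, metadata


store = io.BytesIO()
index, metadata = deduplicate([b"alpha", b"beta", b"alpha"], store, "chunks.bin")
print(metadata)          # [('chunks.bin', 0), ('chunks.bin', 5), ('chunks.bin', 0)]
print(store.getvalue())  # b'alphabeta'
```

The third chunk repeats the first, so nothing is written for it and its metadata entry points back to offset 0. A practical metadata record would likely also carry the chunk length so that recovery knows how many bytes to read; the text only mentions the file name and the offset.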

Due to the limited memory capacity of the server, the traditional method saves the fingerprint index table on disk and maintains a cache in memory. However, the NewSQL server system has a large amount of memory. To optimize deduplication performance, DOMe maintains the fingerprint index in memory and only periodically writes it to disk while the NewSQL database system is idle.

Three kinds of data need to be persisted in DOMe: unique data, metadata, and the fingerprint index. Data detected as unique needs to be written into the chunk store file on the file system. Regardless of whether the chunk is duplicated, the corresponding information m of the chunk must be written to the metadata file for database recovery.

As described above, the fingerprint index is maintained purely in memory and only periodically written to disk while the database system is idle. If a failure occurs, the fingerprint index may lose the entries written after the last persistence, but this only causes some duplicated data written in the future to be misjudged as unique, leading to a small waste of storage space.

Therefore, it is not a serious problem. H-store is a next-generation database management system developed by Brown University, Massachusetts Institute of Technology, and Yale University for on-line transaction processing applications. H-store provides high-throughput, low-latency SQL operations and takes full advantage of the large memory capacity of modern server systems. H-store has been commercialized, and its commercial version is VoltDB. VoltDB has excellent transaction performance, and claims its performance is times higher than that of traditional database systems.

H-store represents the future development trend of database technology. The DOMe implementation architecture is shown in Fig 4. H-store is a typical NewSQL database architecture, using non-page data management. An H-store cluster consists of a number of shared-nothing nodes, each of which contains several partition execution engines as the actual execution units. When performing a backup, each execution engine serializes its tuple data and writes it to the hard disk. Since the data is horizontally distributed across different partition execution engines, the backup data has significant duplication only within the same partition, so the DOMe method is implemented in each individual partition execution engine.

All the experiments in this section are run on a single server, the configuration of which is given below. To verify the effectiveness of DOMe, we compared this method with MUCH [18], which is based on a traditional parallel programming model. Benchmark 1 is TPC-C. The TPC-C benchmark specification [23] simulates a complex and representative OLTP application, which assumes that a large commodity wholesaler exists. Several warehouses are distributed in different regions. Each warehouse is responsible for supplying ten sales points, and each sales point is responsible for providing service to customers.