Data deduplication
Regardless of the method, deduplication ensures that each unique piece of information is stored on the media only once. One of its most important characteristics is therefore the granularity at which duplicates are detected.
There are several levels of data deduplication:
- blocks;
- files;
- bytes.
Each of these methods has its own advantages and disadvantages. Let’s take a closer look at them.
Data deduplication methods
Block level
This is considered the most popular deduplication method. It involves analyzing a piece of data (a file), splitting it into blocks, and saving only the unique blocks; repeated blocks are replaced with references to the copy already stored.
A block is considered to be a single logical unit of information with a characteristic size that can vary. All data during block-level deduplication is processed using hashing (e.g., SHA-1 or MD5).
Hash algorithms allow you to create and store a specific signature (identifier) in the deduplication database that corresponds to each individual unique data block.
Thus, if a file is changed over a certain period of time, only the changed blocks will be stored in the data storage, not the entire file.
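The idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the block size, the use of SHA-1, and the in-memory dictionary standing in for the deduplication database are all assumptions chosen for clarity.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block length


def dedup_blocks(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks and keep each unique block once.

    `store` maps hash signature -> block bytes (the "deduplication
    database"); the returned list of signatures is the recipe needed
    to reconstruct the original data.
    """
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha1(block).hexdigest()  # signature per block
        store.setdefault(digest, block)           # store only unique blocks
        recipe.append(digest)
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data from its recipe of signatures."""
    return b"".join(store[d] for d in recipe)
```

If a file changes, re-running `dedup_blocks` on the new version adds only the blocks whose signatures are not yet in `store`, which is exactly the behavior described above.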
There are two types of block deduplication: variable and fixed block length. The first option involves dividing files into blocks, each of which can be of different sizes.
This option is more effective in terms of reducing the amount of stored data than using deduplication with fixed block lengths.
File level
This method of deduplication involves comparing a new file with those that have already been saved. If the file is unique, it is saved in full; if it is not new, only a link (a pointer to the existing file) is stored.
In other words, with this type of deduplication, only one version of the file is recorded, and all future copies will receive a pointer to the original file. The main advantage of this method is its simplicity of implementation without a significant reduction in performance.
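The pointer mechanism can be sketched as follows. The class name, the choice of SHA-256, and the in-memory dictionaries are assumptions made for the example; a real system would keep the blob store and pointer table on disk.

```python
import hashlib


class FileStore:
    """File-level dedup sketch: one stored copy per unique content,
    with every file name resolving to its content through a pointer."""

    def __init__(self):
        self.blobs = {}     # content hash -> file bytes (stored once)
        self.pointers = {}  # file name -> content hash

    def save(self, name: str, content: bytes) -> bool:
        """Save a file; return True if its content was new."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        self.pointers[name] = digest  # a duplicate costs only a pointer
        return is_new

    def read(self, name: str) -> bytes:
        """Follow the pointer back to the single stored copy."""
        return self.blobs[self.pointers[name]]
```

Note the trade-off this makes visible: a one-byte change to a file produces a new hash, so the whole file is stored again, which is why file-level deduplication is simple but less space-efficient than block-level.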
Byte level
This method is similar in principle to block-level deduplication, but instead of comparing blocks, it compares the old and new versions of the data byte by byte. It is the only method that can guarantee complete elimination of duplicates.
However, byte-level deduplication has a significant drawback: it is extremely demanding, so the machine running the process needs substantial CPU power and memory, since every byte of the data must be read and compared.
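A byte-by-byte comparison is exact but expensive, which is the trade-off described above. A minimal sketch (the function name is my own):

```python
def first_difference(old: bytes, new: bytes):
    """Compare two byte sequences byte by byte.

    Returns the offset of the first differing byte, or None if the
    contents are identical. Unlike hashing, this can never produce a
    false match, but every byte must be read."""
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    # No mismatch in the overlap: identical only if lengths match too.
    return None if len(old) == len(new) else min(len(old), len(new))
```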
Data deduplication and backup
In addition to the above, during the backup process deduplication can be performed in different places, depending on where it is executed:
- at the data source (the client);
- on the storage side (the server);
- on both (client-server).
Client-server deduplication
A combined data deduplication method in which the necessary processes can be run on both the server and the client. Before sending data from the client to the server, the software first attempts to “understand” which data has already been recorded.
For this type of deduplication, the client first calculates a hash for each data block and sends the server a list of these hash keys. The server compares them against the keys it already holds and reports which blocks are missing; the client then transfers only those blocks.
This method significantly reduces the load on the network, as only unique data is transferred.
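The exchange can be sketched as a single function standing in for both sides of the protocol. This is an assumption-laden illustration: real systems run the two roles on separate machines, and the function and variable names here are invented for the example.

```python
import hashlib


def client_server_dedup(client_blocks: list, server_store: dict) -> list:
    """Sketch of client-server deduplication over one backup round.

    `server_store` maps hash key -> block and plays the server's index.
    Returns the list of keys whose blocks actually crossed the wire."""
    # Client side: compute a hash key per block (duplicates collapse here).
    keyed = {hashlib.sha1(b).hexdigest(): b for b in client_blocks}
    # Server side: compare the key list and report what is missing.
    missing = [k for k in keyed if k not in server_store]
    # Client side: upload only the unique, previously unseen blocks.
    for k in missing:
        server_store[k] = keyed[k]
    return missing
```

The return value makes the bandwidth saving concrete: repeated blocks within a backup, and blocks already stored from earlier backups, never leave the client.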
Client-side deduplication
This involves performing the operation directly at the data source. Therefore, with this type of deduplication, the client’s computing power will be used. After the process is complete, the data will be sent to storage devices.
This type of deduplication is always implemented using software. The main disadvantage of this method is the high load on the client’s RAM and processor. The key advantage is the ability to transfer data over a low-bandwidth network.
Server-side deduplication
Used when data arrives at the server in a completely unprocessed (raw) form, without encoding or compression. This type of deduplication is divided into software and hardware variants.
Hardware type
Implemented as a dedicated deduplication appliance: a hardware solution that combines deduplication logic and data-recovery procedures.
The advantage of this method is the ability to offload the work from the server to a dedicated hardware unit, which makes the deduplication process itself as transparent as possible.
Software type
This involves the use of special software that actually performs all the necessary deduplication processes. However, with this approach, it is always necessary to take into account the load on the server that will arise during the deduplication process.
Pros and cons
The following points can be attributed to the positive aspects of deduplication as a process:
- High efficiency. According to research by EMC, data deduplication reduces storage capacity requirements by a factor of 10-30.
- Cost-effective when network bandwidth is low. This is due to the transfer of exclusively unique data.
- The ability to create backups more often and store backup copies of data for longer.
The disadvantages of deduplication include:
- The possibility of data conflicts if two different blocks happen to generate the same hash key (a hash collision). This can corrupt the deduplication database and cause a failure when restoring from a backup.
- The larger the database, the higher the risk of such a collision. The solution to this problem is to increase the hash space, i.e. use a hash function with a longer output.
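The collision risk can be estimated with the standard birthday-bound approximation; the function below is an illustrative sketch of that formula, not a vendor's published calculation.

```python
import math


def collision_probability(n_blocks: float, hash_bits: int) -> float:
    """Birthday-bound approximation: probability that at least two of
    n_blocks distinct blocks share a key in a 2**hash_bits hash space.

    Uses expm1 so that astronomically small probabilities do not
    round to zero in floating point."""
    space = 2.0 ** hash_bits
    return -math.expm1(-n_blocks * (n_blocks - 1) / (2.0 * space))
```

Plugging in a billion stored blocks with a 160-bit hash (the output size of SHA-1) gives a probability on the order of 10^-31, and moving to a 256-bit hash shrinks it by dozens more orders of magnitude, which is why widening the hash space is the standard answer to the growing-database risk.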