Data Compression
Data Compression refers to transforming information in order to reduce its size. It is used to make efficient use of the hardware resources that store, process, transmit, and otherwise handle that information.
The Data Compression process is based on eliminating the redundancy inherent in raw (uncompressed) data. The simplest example of such redundancy is a text in which the same word is repeated many times.
To remove this type of redundancy, a frequently occurring word is replaced with a reference to an earlier occurrence of the same fragment; the reference itself is encoded in a fixed, strictly specified amount of space.
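As an illustration of this idea, here is a minimal Python sketch. The word-level dictionary scheme and the function names are purely illustrative, not a real codec: each distinct word is stored once, and the text becomes a sequence of small integer references into that dictionary.

```python
def compress_words(text: str):
    dictionary = []          # each distinct word is stored once
    index = {}               # word -> its position in the dictionary
    references = []          # the text as a sequence of references
    for word in text.split():
        if word not in index:
            index[word] = len(dictionary)
            dictionary.append(word)
        references.append(index[word])
    return dictionary, references

def decompress_words(dictionary, references) -> str:
    # Restoring the text is simply following each reference back.
    return " ".join(dictionary[i] for i in references)

text = "the cat sat on the mat and the cat slept"
dictionary, references = compress_words(text)
assert decompress_words(dictionary, references) == text
```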
The “weight” of data can also be reduced by replacing frequently occurring values with short codewords and rare values with longer codes (entropy coding). If the data contains no redundancy (encrypted information, “white noise”, a short signal, etc.), it cannot be compressed without losing information.
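The intuition behind entropy coding can be made concrete with a small sketch. An ideal entropy coder assigns a symbol of probability p a codeword of roughly -log2(p) bits, so frequent symbols receive short codes and rare ones long codes; when every symbol is equally likely, there is nothing to save. The snippet below (illustrative only) computes these ideal code lengths.

```python
import math
from collections import Counter

def ideal_code_lengths(data: bytes):
    """Shannon information content -log2(p) per symbol: the code length an
    ideal entropy coder would assign. Frequent symbols come out short,
    rare ones long."""
    counts = Counter(data)
    total = len(data)
    return {sym: -math.log2(count / total) for sym, count in counts.items()}

print(ideal_code_lengths(b"aaaaaaab"))   # 'a' ~0.19 bits, 'b' 3 bits
print(ideal_code_lengths(b"abcdefgh"))   # every symbol 3 bits: no redundancy to exploit
```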
Lossless Data Compression is a process that allows the original information to be fully restored when needed: although the space the data occupies is reduced, no information is discarded.
Such compression is possible when the messages are not all equally probable, for example when some messages that are possible in theory never actually occur in the data being encoded.
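As a concrete demonstration of the lossless property, the sketch below round-trips a redundant byte string through Python's standard zlib module (a DEFLATE-based lossless codec). The example data is arbitrary and chosen only to show that the original is recovered exactly.

```python
import zlib

original = b"AAAA BBBB AAAA BBBB AAAA BBBB"   # highly redundant, so it compresses well
packed = zlib.compress(original)
restored = zlib.decompress(packed)

assert restored == original                   # lossless: the original is fully recovered
print(len(original), "->", len(packed), "bytes")
```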
Data Compression algorithms for unknown data types
Two main methods can be distinguished for compressing data of unknown format:
- Each successive character to be compressed is either placed in the output buffer in its original (literal) form, or a group of characters is replaced with a reference to an identical group that has already been encoded (see the first sketch after this list). This method is most often used when creating self-extracting software.
- For each sequence of characters to be compressed, statistics (the frequencies with which values occur) are collected, either once in advance or continuously as encoding proceeds. From these statistics, the probability of the next character (or sequence of characters) is estimated, and a form of entropy coding then replaces frequently occurring values with short codewords and rare values with longer ones (see the second sketch after this list).
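A minimal sketch of the first method, assuming a toy LZ77-style scheme: each position either emits a literal byte or a (distance, length) reference to an identical group of bytes seen earlier in a small sliding window. The token format and parameters here are illustrative, not those of any particular archiver.

```python
def lz_compress(data: bytes, window: int = 255, min_match: int = 3):
    """Toy back-reference encoder: emit either a literal byte or a
    (distance, length) reference to an identical group seen earlier."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):     # search the recent window
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]
                   and length < 255):
                length += 1
            if length >= min_match and length > best_len:
                best_len, best_dist = length, i - j
        if best_len:
            out.append(("ref", best_dist, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def lz_decompress(tokens) -> bytes:
    buf = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            buf.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):                # copy byte by byte (handles overlaps)
                buf.append(buf[-dist])
    return bytes(buf)

data = b"abcabcabcabcxyz"
assert lz_decompress(lz_compress(data)) == data
```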
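And a minimal sketch of the second method, assuming static one-pass statistics (an adaptive coder would instead update the counts as it encodes): symbol frequencies are collected, and a Huffman tree, one common form of entropy coding, assigns short codeword lengths to frequent symbols and long ones to rare symbols.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes):
    """Collect symbol frequencies, then build a Huffman tree so that
    frequent symbols receive short codewords and rare symbols long ones."""
    counts = Counter(data)
    # heap items: (weight, tie_breaker, {symbol: code_length_so_far})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    if len(heap) == 1:                        # degenerate case: only one distinct symbol
        (_, _, lengths), = heap
        return {sym: 1 for sym in lengths}
    while len(heap) > 1:
        w1, _, l1 = heapq.heappop(heap)       # merge the two least frequent subtrees
        w2, _, l2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**l1, **l2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]                         # symbol -> codeword length in bits

lengths = huffman_code_lengths(b"aaaaaaaabbbc")
print(sorted(lengths.items(), key=lambda kv: kv[1]))  # 'a' gets the shortest code
```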