Data Deduplication in Data Science
Data Science | Data Deduplication: In this tutorial, we are going to learn about data deduplication in data science, the benefits of data deduplication, deduplication vs. compression, implementing data deduplication, etc.
Submitted by Kartiki Malik, on March 23, 2020
Data Deduplication
Data deduplication is a technique used to reduce the amount of storage space an organization needs to save its data. In most organizations, the storage systems contain duplicate copies of many pieces of data. For example, the same file may be saved in several different places by different users, or two or more files that aren't identical may still include much of the same data.
Deduplication eliminates these extra copies by saving just one copy of the data and replacing the other copies with pointers that lead back to the original copy. Companies frequently use deduplication in backup and disaster recovery applications, but it can also be used to free up space in primary storage.
Deduplication at the File or Block Level
In its simplest form, deduplication takes place at the file level: it eliminates duplicate copies of the same file. This kind of deduplication is sometimes called file-level deduplication or single instance storage (SIS).
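As a rough illustration of file-level deduplication, the sketch below keeps one copy of each distinct file and records a pointer (here, a content hash) for every path. The class name `SingleInstanceStore`, its methods, and the example paths are hypothetical, and using SHA-256 to identify identical files is an assumption of this sketch, not a description of any particular product.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level deduplicating store (illustrative names only)."""

    def __init__(self):
        self.blobs = {}     # content hash -> file bytes, stored once
        self.pointers = {}  # file path -> content hash

    def save(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Keep the bytes only if this content has not been seen before.
        self.blobs.setdefault(digest, data)
        self.pointers[path] = digest

    def load(self, path):
        # Follow the pointer back to the single stored copy.
        return self.blobs[self.pointers[path]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = SingleInstanceStore()
report = b"quarterly report contents " * 100
store.save("/home/alice/report.doc", report)
store.save("/home/bob/report_copy.doc", report)  # duplicate: only a pointer is added
store.save("/home/bob/notes.txt", b"unrelated notes")

print(len(store.pointers), "paths,", len(store.blobs), "stored copies")
```

Both report paths resolve to the same stored bytes, so the duplicate costs only one extra pointer entry.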
Deduplication can also take place at the block level, eliminating duplicate blocks of data that occur in non-identical files.
Block-level deduplication frees up more space than SIS, and a particular type known as variable block or variable-length deduplication has become very popular. Often, the phrase "data deduplication" is used as a synonym for block-level or variable-length deduplication.
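The following sketch shows the block-level idea with fixed-size blocks: two files that are not identical still share most of their blocks, so the shared blocks are stored once. The 4096-byte block size and the names `dedup_blocks` and `rebuild` are assumptions made for illustration. Variable-length deduplication, which chooses block boundaries from the content itself, is not shown here.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def dedup_blocks(data, block_store, block_size=BLOCK_SIZE):
    """Store each distinct block once; return the file's list of block hashes."""
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)  # new blocks only
        recipe.append(digest)
    return recipe


def rebuild(recipe, block_store):
    """Reassemble a file from its list of block hashes."""
    return b"".join(block_store[digest] for digest in recipe)


block_store = {}
shared = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
file_one = shared + b"C" * BLOCK_SIZE  # differs from file_two in the last block
file_two = shared + b"D" * BLOCK_SIZE

recipe_one = dedup_blocks(file_one, block_store)
recipe_two = dedup_blocks(file_two, block_store)

print(len(recipe_one) + len(recipe_two), "blocks referenced,",
      len(block_store), "blocks stored")
```

File-level deduplication would store both files in full because they differ; at the block level only the differing final blocks add to storage.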
Benefits of Data Deduplication
The primary advantage of data deduplication is that it reduces the amount of disk or tape that organizations need to buy, which in turn reduces costs. NetApp reports that in some cases, deduplication can reduce storage requirements by up to 95%, but the type of data you're trying to deduplicate and the amount of file sharing in your organization will influence your deduplication ratio. While deduplication can be applied to data stored on tape, the relatively high cost of disk storage makes deduplication a popular option for disk-based systems. Eliminating extra copies of data saves money not only on direct disk hardware costs, but also on related costs, like electricity, cooling, maintenance, floor space, etc.
Deduplication can also reduce the amount of network bandwidth required for backup processes, and in some cases, it can speed up the backup and recovery process.
Deduplication vs. Compression
Deduplication is sometimes confused with compression, another technique for reducing storage requirements. While deduplication eliminates redundant data, compression uses algorithms to store data more concisely. Some compression is lossless, meaning that no data is lost in the process. However, "lossy" compression, which is frequently used with audio and video files, actually deletes some of the less-important data included in a file to save space. By contrast, deduplication only eliminates extra copies of data; none of the original data is lost. Also, compression doesn't get rid of duplicated data: the storage system may still contain multiple copies of compressed files.
Deduplication often has a larger impact on backup file size than compression. In a typical enterprise backup situation, compression may reduce backup size by a ratio of 2:1 or 3:1, while deduplication can reduce backup size by up to 25:1, depending on how much duplicate data is in the systems. Often, enterprises use deduplication and compression together to maximize their savings.
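To make the distinction concrete, the sketch below compares four ways of storing five identical backups: raw, compressed only, deduplicated only, and deduplicated then compressed. The backup data is synthetic (hex digests strung together) and the count of five copies is an assumption, so the ratios it prints are not representative of real enterprise workloads. `zlib` stands in for whatever lossless compressor a real system would use.

```python
import hashlib
import zlib

# Synthetic, moderately compressible data standing in for one backup image.
document = "".join(
    hashlib.sha256(str(i).encode()).hexdigest() for i in range(500)
).encode()
backups = [document] * 5  # five identical backups (hypothetical scenario)

raw_size = sum(len(b) for b in backups)

# Compression alone: every copy is compressed, but all five copies remain.
compressed_only = sum(len(zlib.compress(b)) for b in backups)

# Deduplication alone: one uncompressed copy survives.
unique = {hashlib.sha256(b).hexdigest(): b for b in backups}
dedup_only = sum(len(b) for b in unique.values())

# Both together: the single surviving copy is also compressed.
both = sum(len(zlib.compress(b)) for b in unique.values())

for label, size in [("raw", raw_size), ("compressed only", compressed_only),
                    ("dedup only", dedup_only), ("dedup + compression", both)]:
    print(f"{label:>20}: {size:>7} bytes  ({raw_size / size:.1f}:1)")
```

Compression shrinks each copy but leaves five of them in place, whereas deduplication removes the copies without shrinking what remains, which is why the two are commonly combined.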
Implementing Data Deduplication
The process for implementing data deduplication technology varies widely depending on the type of product and the vendor. For example, if deduplication technology is included in a backup appliance or storage solution, the implementation process will be much different than for standalone deduplication software.
In general, deduplication technology is deployed in one of two basic ways: at the source or at the target. In source deduplication, data copies are eliminated in primary storage before the data is sent to the backup system. The advantage of source deduplication is that it reduces the bandwidth requirements and time necessary for backing up data. On the downside, source deduplication consumes more processor resources, and it can be difficult to integrate with existing systems and applications.
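One way to picture the source-side trade-off is the sketch below, in which the source hashes every block (spending its own CPU) and transmits only blocks the backup target does not already hold. The function `source_side_backup`, the in-memory `target_index` dictionary, and the block contents are all hypothetical; real products exchange hashes over a network protocol that is not modeled here.

```python
import hashlib


def source_side_backup(blocks, target_index):
    """Send only unseen blocks; return the number of bytes transmitted.

    target_index is a stand-in for the backup target's hash -> block index.
    """
    bytes_sent = 0
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()  # CPU cost paid at the source
        if digest not in target_index:
            target_index[digest] = block  # "transmit" the new block
            bytes_sent += len(block)
    return bytes_sent


target_index = {}
monday = [b"os image " * 100, b"mail store " * 100, b"os image " * 100]
tuesday = [b"os image " * 100, b"mail store " * 100, b"new project " * 100]

sent_monday = source_side_backup(monday, target_index)
sent_tuesday = source_side_backup(tuesday, target_index)

print("Monday sent", sent_monday, "of", sum(len(b) for b in monday), "bytes")
print("Tuesday sent", sent_tuesday, "of", sum(len(b) for b in tuesday), "bytes")
```

On the second run only the new block crosses the network, which is where the bandwidth and backup-time savings come from.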
By contrast, target deduplication takes place within the backup system and is often much easier to deploy.
Target deduplication comes in two types: in-line or post-process.
- In-line deduplication takes place before the backup copy is written to disk or tape. The advantage of in-line deduplication is that it requires less storage space than post-process deduplication, but it can slow down the backup process.
- Post-process deduplication takes place after the backup has been written, so it requires that organizations have a great deal of storage space available for the original backup. However, post-process deduplication is generally faster than in-line deduplication.
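The storage difference between the two target-side approaches can be sketched as follows. Both functions end up with the same deduplicated store, but they differ in how much space is occupied along the way. The function names and the simple peak-space accounting are assumptions for illustration; timing behavior is not modeled.

```python
import hashlib


def _digest(block):
    return hashlib.sha256(block).hexdigest()


def inline_backup(blocks):
    """Deduplicate each block before it is written; track peak space used."""
    store, used, peak = {}, 0, 0
    for block in blocks:
        digest = _digest(block)  # hashing happens in the write path
        if digest not in store:
            store[digest] = block
            used += len(block)
        peak = max(peak, used)
    return store, peak


def post_process_backup(blocks):
    """Write the whole backup first, then deduplicate it afterwards."""
    landing_area = list(blocks)  # full, undeduplicated copy lands on disk
    # Peak counts only the landing area, so it is a lower bound on space needed.
    peak = sum(len(block) for block in landing_area)
    store = {}
    for block in landing_area:
        store.setdefault(_digest(block), block)
    return store, peak


blocks = [b"x" * 1024, b"y" * 1024, b"x" * 1024, b"x" * 1024]
inline_store, inline_peak = inline_backup(blocks)
post_store, post_peak = post_process_backup(blocks)

print("in-line peak:", inline_peak, "bytes; post-process peak:", post_peak, "bytes")
```

The in-line version never holds more than the unique blocks, while the post-process version needs room for the entire original backup before any space is reclaimed.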
Deduplication Technology
Data deduplication is a highly proprietary technology. Deduplication methods vary widely from vendor to vendor, and many of those methods are patented. For example, Microsoft has a patent on single instance storage. In addition, Quantum owns a patent on variable-length deduplication. Several other vendors also own patents related to deduplication technology.