Tuesday, 23 December 2014

NOISE AND DATA REDUNDANCY

Redundancy
Redundancy in information theory is the number of bits used to transmit or store a message minus the number of bits of actual information in the message. Informally, it is the amount of wasted "space" used to transmit or store certain data. Data compression is a way to reduce or eliminate unwanted redundancy, while checksums are a way of adding desired redundancy for purposes of error detection when communicating over a noisy channel of limited capacity.
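The checksum idea above can be sketched in a few lines. This is a minimal illustration, not a real protocol: the checksum here is a simple sum-of-bytes modulo 256 (real channels use CRCs or stronger codes), and the message and bit-flip position are made up for the example.

```python
def checksum(data: bytes) -> int:
    """One byte of deliberate redundancy: the sum of all bytes modulo 256."""
    return sum(data) % 256

message = b"hello, channel"
sent = message + bytes([checksum(message)])  # transmit message + 1 redundant byte

# Receiver side: recompute the checksum and compare it to the transmitted one.
received = bytearray(sent)
body, check = bytes(received[:-1]), received[-1]
assert checksum(body) == check  # intact transmission passes the check

# Simulate a noisy channel flipping one bit of the message.
received[3] ^= 0x01
body, check = bytes(received[:-1]), received[-1]
print(checksum(body) == check)  # False: the single-bit error is detected
```

The extra byte is exactly the "desired redundancy" the text describes: it carries no new information, but it lets the receiver detect that the channel corrupted the message.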
Data redundancy also occurs in database systems that have a field repeated in two or more tables. For instance, when customer data are duplicated and attached to each product bought, the redundancy is a known source of inconsistency, since the same customer may appear with different values for a given attribute.
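The customer example can be made concrete with a small sketch. The table contents and field names below are illustrative, not taken from any real schema; the point is only to show how a duplicated attribute drifts, and how a normalized layout avoids it.

```python
# Denormalized: the customer's city is copied onto every order.
orders = [
    {"order_id": 1, "customer": "Alice", "city": "Berlin", "product": "Lamp"},
    {"order_id": 2, "customer": "Alice", "city": "Munich", "product": "Desk"},  # stale copy
]

# The redundant "city" field has drifted: one customer, two conflicting values.
cities = {o["city"] for o in orders if o["customer"] == "Alice"}
print(len(cities))  # 2 conflicting values for a single attribute

# Normalized alternative: store the customer attribute once and reference it.
customers = {"Alice": {"city": "Berlin"}}
normalized_orders = [
    {"order_id": 1, "customer": "Alice", "product": "Lamp"},
    {"order_id": 2, "customer": "Alice", "product": "Desk"},
]
# Every order now resolves the city through a single authoritative record.
print(customers[normalized_orders[1]["customer"]]["city"])  # Berlin
```

Because the normalized layout keeps one copy of each customer attribute, an update in one place is automatically seen by every order that references it.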

Definition - What does Data Redundancy mean?
Data redundancy is a condition created within a database or data storage technology in which the same piece of data is held in two separate places.
This can mean two different fields within a single database, or the same field held in multiple software environments or platforms. Whenever data is repeated, this constitutes data redundancy. Such duplication can occur by accident, but it is also done deliberately for backup and recovery purposes.
Within the general definition of data redundancy, there are different classifications based on what is considered appropriate in database management, and what is considered excessive or wasteful. Wasteful data redundancy generally occurs when a given piece of data does not have to be repeated, but ends up being duplicated due to inefficient coding or process complexity.
A positive type of data redundancy works to safeguard data and promote consistency. Many developers consider it acceptable for data to be stored in multiple places. The key is to have a central, master field or space for this data, so that all of the places where the data is redundant can be updated through one central access point. Otherwise, data redundancy can lead to serious problems with data inconsistency, where an update in one place does not automatically propagate to another field. As a result, pieces of data that are supposed to be identical end up having different values.
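The "central master field" idea can be sketched as follows. The record and function names are hypothetical; the design point is that redundant uses of a value always read through one master record, so a single update is visible everywhere and the copies cannot drift.

```python
# One authoritative record (the "master field" for this piece of data).
master = {"email": "alice@example.com"}

def billing_email() -> str:
    """A redundant use of the address, resolved through the master record."""
    return master["email"]

def shipping_email() -> str:
    """Another redundant use, resolved through the same master record."""
    return master["email"]

master["email"] = "alice@new.example.com"  # one update at the central access point
print(billing_email() == shipping_email())  # True: both uses see the new value
```

Storing copies by value instead (assigning the address into separate billing and shipping records) is exactly what allows the inconsistency the paragraph warns about.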
Noisy Data
Noisy data is meaningless data. The term has often been used as a synonym for corrupt data. However, its meaning has expanded to include any data that cannot be understood and interpreted correctly by machines, such as unstructured text. Any data that has been received, stored, or changed in such a manner that it cannot be read or used by the program that originally created it can be described as noisy.
Noisy data unnecessarily increases the amount of storage space required and can also adversely affect the results of any data mining analysis. Statistical analysis can use information gleaned from historical data to weed out noisy data and facilitate data mining.
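A minimal sketch of the statistical idea above: use the mean and standard deviation of historical data to flag readings that fall far outside the usual range. The data, the 3-standard-deviation threshold, and the function name are illustrative assumptions, not a prescribed method.

```python
import statistics

# Historical readings used to estimate the normal range of the signal.
history = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7]
mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_noisy(value: float, k: float = 3.0) -> bool:
    """Flag values farther than k standard deviations from the historical mean."""
    return abs(value - mean) > k * stdev

incoming = [10.0, 57.3, 9.9]  # 57.3 simulates a corrupt reading
clean = [v for v in incoming if not is_noisy(v)]
print(clean)  # [10.0, 9.9] -- the corrupt value is weeded out
```

Real pipelines use more robust statistics (medians, interquartile ranges) since the mean and standard deviation are themselves skewed by extreme noise, but the principle of screening new data against history is the same.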

Noisy data can be caused by hardware failures, programming errors and gibberish input from speech or optical character recognition (OCR) programs. Spelling errors, industry abbreviations and slang can also impede machine reading.
