Data transformation is one of the basic steps of data preprocessing. When first learning feature scaling techniques, you will encounter the terms scale, standardization, and normalization a lot.
Normalization is one of the most frequently used data preparation techniques. In machine learning and data mining, it helps us convert the values of numeric fields in a dataset to a common scale.
If you have ever dealt with databases, you are probably familiar with the term “data normalization.” One of the challenges in databases is having attributes with different units, ranges, and scales.
Applying data mining or machine learning algorithms to data with drastically different ranges can produce inaccurate results. Because of that, a data normalization process is needed.
Normalization of Data
Normalization is a logical database design technique that groups attributes from various entities. It aims to form a good relational structure, free of data redundancy (repetition), in which most ambiguity is eliminated.
In a nutshell, database normalization is the process of grouping data attributes into entities that are simple, non-redundant, flexible, and adaptable, so that the resulting database can be guaranteed to be of good quality.
Database normalization comes in many forms. In database theory, at least nine normal forms exist, namely 1NF, 2NF, 3NF, EKNF, BCNF, 4NF, 5NF, DKNF, and 6NF.
When designing an optimal database, you will most often encounter 1NF, 2NF, and 3NF. To become a Database Administrator (DBA), you must know how to apply database normalization properly.
For example, if a website you built one day suffers a drop in performance, you may be asked whether its database has been normalized correctly.
Stages of Database Normalization
Database normalization is carried out in several stages so that the result is correct and well-structured:
1. Unnormalized Form (UNF)
UNF is an unnormalized form of data that contains repeating groups, which makes the data problematic to manipulate.
2. First Normal Form (1NF)
1NF is a form of normalization that groups similar data to overcome anomaly problems. A data model is said to be in first normal form if each of its attributes has one and only one value.
If an attribute has more than one value, that attribute is a candidate to become a separate entity, as sketched below.
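Here is a minimal sketch in plain Python of moving a repeating group into its own relation; the students table and all of its values are hypothetical, invented purely for illustration.

```python
# Unnormalized: the "phones" attribute holds more than one value,
# which violates first normal form.
unnormalized = [
    {"student_id": 1, "name": "Ana", "phones": ["555-0101", "555-0102"]},
    {"student_id": 2, "name": "Budi", "phones": ["555-0103"]},
]

# 1NF: every attribute holds exactly one value. The repeating group
# becomes its own relation, linked back by student_id.
students = [{"student_id": r["student_id"], "name": r["name"]} for r in unnormalized]
student_phones = [
    {"student_id": r["student_id"], "phone": p}
    for r in unnormalized
    for p in r["phones"]
]

print(students)        # one row per student
print(student_phones)  # one row per (student_id, phone) pair
```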
3. Second Normal Form (2NF)
2NF is the second stage; it decomposes tables to establish the primary key of each table. A data model is said to meet second normal form if it satisfies first normal form and every non-identifier attribute of an entity depends entirely on all of the entity's identifiers, as sketched below.
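As a sketch (with hypothetical table and column names), consider order lines keyed by the composite key (order_id, product_id); product_name depends only on product_id, a partial dependency that 2NF removes by decomposition.

```python
# Composite key: (order_id, product_id). "product_name" depends only on
# product_id, not on the whole key -- a violation of second normal form.
order_lines = [
    {"order_id": 1, "product_id": "P1", "product_name": "Pen", "qty": 3},
    {"order_id": 1, "product_id": "P2", "product_name": "Notebook", "qty": 1},
    {"order_id": 2, "product_id": "P1", "product_name": "Pen", "qty": 5},
]

# 2NF: move the partially dependent attribute into its own table.
products = {r["product_id"]: r["product_name"] for r in order_lines}
order_lines_2nf = [
    {"order_id": r["order_id"], "product_id": r["product_id"], "qty": r["qty"]}
    for r in order_lines
]
```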
4. Third Normal Form (3NF)
3NF is a form of data normalization in which no attribute may depend on anything other than the primary key; any attribute that does must be separated into a new table.
A data model is said to be in third normal form if it is in second normal form and none of its non-identifying attributes (attributes that are not unique identifiers) depend on other non-identifying attributes. If such a dependency exists, split the attribute into a new entity, and the attributes that depend on it become attributes of that new entity, as sketched below.
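A minimal sketch of a transitive dependency, again with hypothetical names: dept_name depends on dept_id, which in turn depends on the key emp_id, so 3NF moves departments into their own entity.

```python
# emp_id -> dept_id -> dept_name: "dept_name" depends on a non-key
# attribute (dept_id), which violates third normal form.
employees = [
    {"emp_id": 1, "name": "Ana", "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 2, "name": "Budi", "dept_id": 20, "dept_name": "IT"},
    {"emp_id": 3, "name": "Cita", "dept_id": 10, "dept_name": "Sales"},
]

# 3NF: the transitively dependent attribute moves to a new entity.
departments = {r["dept_id"]: r["dept_name"] for r in employees}
employees_3nf = [
    {"emp_id": r["emp_id"], "name": r["name"], "dept_id": r["dept_id"]}
    for r in employees
]
```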
5. Boyce-Codd Normal Form (BCNF)
BCNF is a form of normalization that aims to overcome anomalies and overlapping dependencies that cannot be resolved in 3NF.
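The textbook illustration of such an anomaly, sketched here with hypothetical data, is a relation in which every instructor teaches exactly one course: instructor determines course, but instructor is not a key, so the relation can satisfy 3NF and still violate BCNF.

```python
# Candidate keys: (student, course) and (student, instructor).
# The dependency instructor -> course violates BCNF, because
# "instructor" is not a superkey; 3NF is still satisfied, since
# "course" is part of a candidate key.
enrollments = [
    {"student": "Ana", "course": "Math", "instructor": "Dr. X"},
    {"student": "Budi", "course": "Math", "instructor": "Dr. X"},
]

# BCNF decomposition: record each instructor's course only once.
teaches = {r["instructor"]: r["course"] for r in enrollments}
takes = [{"student": r["student"], "instructor": r["instructor"]} for r in enrollments]
```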
6. Fifth Normal Form (5NF)
5NF is the stage that deals with join dependencies, which are resolved by decomposing the relation into smaller relations.
Why Is Data Normalization Needed in Data Mining?
Data normalization is usually important when dealing with large datasets to ensure data consistency and quality.
Normalization is generally needed when the dataset's attributes have different scales or ranges; for example, when some values are very large and others are very small.
If normalization is not carried out, attributes on a smaller scale are diluted, because other attributes have values on a larger scale, even though all of these attributes may have the same degree of importance.
In short, when a dataset has many attributes measured on different scales, data mining operations can produce a poor model. The dataset therefore needs to be normalized to bring all attributes onto the same scale. In addition, data normalization techniques help keep the data consistent.
Methods of Data Normalization
There are several data normalization methods, but in this article, we will discuss the three most frequently used techniques: Z-score normalization, min-max normalization, and decimal scaling normalization.
1. Z-score normalization
Z-score normalization, also known as standardization, is a technique in which an attribute's values are normalized based on their mean and standard deviation.
The essence of this technique is to transform the data onto a common scale with a mean of zero and a standard deviation of one: each value v becomes v' = (v − μ) / σ, where μ is the mean and σ is the standard deviation.
Z-score normalization is useful in data mining for analyses that require comparing values against the average.
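A minimal sketch of z-score standardization in plain Python; the attribute values are invented for illustration.

```python
import statistics

# Hypothetical attribute values on an arbitrary scale.
values = [50.0, 60.0, 70.0, 80.0, 90.0]

mu = statistics.mean(values)        # mean
sigma = statistics.pstdev(values)   # population standard deviation

# v' = (v - mu) / sigma: the result has mean 0 and std dev 1.
standardized = [(v - mu) / sigma for v in values]
print(standardized)  # [-1.414..., -0.707..., 0.0, 0.707..., 1.414...]
```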
2. Min-max normalization
Which is easier to grasp: the difference between 500 and 1,000,000, or the difference between 0.5 and 1? Data is easier to understand when the range between the minimum and maximum values is smaller.
The min-max normalization method transforms a dataset onto a scale ranging from 0 (the minimum) to 1 (the maximum).
It is a linear modification of the original data: each value v becomes v' = (v − min) / (max − min).
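A minimal sketch with invented values, showing how wide-ranging numbers map into [0, 1]:

```python
# Hypothetical attribute values with a wide range.
values = [500.0, 250_000.0, 1_000_000.0]

lo, hi = min(values), max(values)

# v' = (v - min) / (max - min): rescales every value into [0, 1].
scaled = [(v - lo) / (hi - lo) for v in values]
print(scaled)  # [0.0, 0.2496..., 1.0]
```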
3. Decimal Scaling Normalization
In data mining, decimal scaling is another way to normalize. This method normalizes the data by shifting the decimal point of the values, i.e., dividing each value by a power of ten.
A data value v is normalized to v' using the formula v' = v / 10^j, where j is the smallest integer such that the largest |v'| is less than 1.
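A minimal sketch with invented values; the largest magnitude is 986, so j = 3 and every value is divided by 10^3.

```python
import math

# Hypothetical attribute values; the largest absolute value is 986.
values = [-10.0, 205.0, 471.0, -986.0]

# j is the smallest integer such that max(|v / 10**j|) < 1.
j = math.ceil(math.log10(max(abs(v) for v in values)))
normalized = [v / 10**j for v in values]

print(j)           # 3, because 986 / 10**3 = 0.986 < 1
print(normalized)  # [-0.01, 0.205, 0.471, -0.986]
```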
Importance of Normalization
A database design is bad if:
- The same data is stored in several places (files or records).
- Certain information cannot be generated.
- Information loss has occurred.
- There is redundancy (repetition) or duplication of data, which wastes storage space and makes it difficult to update data.
- NULL values appear.
- Information can be lost when designing the database (by performing the decomposition process incorrectly).

The forms of normalization most often used are 1NF, 2NF, 3NF, and BCNF.