4  Data Preparation and Transformation in R

Data preparation and transformation are crucial steps in data analysis, ensuring that raw data is cleaned, structured, and formatted for further analysis. In R, several functions and packages help perform these tasks efficiently. This chapter covers essential data preparation techniques, including data import, cleaning, transformation, and manipulation.

4.1 Data manipulations and transformations

Data Handling in R

Data Handling in R refers to the importing, managing, transforming, and cleaning of data before analysis. R provides powerful functions and packages such as readr, dplyr, tidyr, and data.table to efficiently handle large datasets.

4.1.1 Importing Data into R

Importing data is the first step in most analyses, and R offers functions for loading data from many file types.

  • R supports importing data from various sources, including Excel, CSV, databases (SQL), JSON, and APIs.

Import a CSV file from a website

  • Paste the URL of the CSV file inside the read.csv() function.
  • Sample csv file website link
Code
# Read csv from website
Diabetes <- read.csv("https://raw.githubusercontent.com/yuvijen/Datasets/refs/heads/main/Diabetes.csv")

# Display the first few rows
head(Diabetes)
  Age Gender  BMI Family_History Physical_Activity      Diet_Type
1  48   Male 35.5             No              High Non-Vegetarian
2  18  Other 28.7            Yes            Medium Non-Vegetarian
3  21  Other 30.0            Yes              High Non-Vegetarian
4  25 Female 25.6             No            Medium     Vegetarian
5  78   Male 38.8             No              High Non-Vegetarian
6  60   Male 19.2             No              High     Vegetarian
  Smoking_Status Alcohol_Intake Stress_Level Hypertension Cholesterol_Level
1          Never           None       Medium          Yes             111.7
2        Current       Moderate         High           No             130.6
3        Current       Moderate         High          Yes             294.8
4         Former       Moderate         High          Yes             159.1
5        Current           High         High           No             215.0
6        Current           None          Low           No             160.2
  Fasting_Blood_Sugar Postprandial_Blood_Sugar HBA1C Heart_Rate Waist_Hip_Ratio
1               141.0                    165.6   8.9         94            0.91
2                83.1                    142.6   5.9         68            0.96
3               159.9                    212.4   4.8         70            0.88
4               133.3                    225.4  11.9         78            0.98
5               164.9                    218.1  11.6         65            0.85
6                77.8                    238.2   4.7         69            0.88
  Urban_Rural Health_Insurance Regular_Checkups
1       Urban               No               No
2       Rural              Yes              Yes
3       Rural               No               No
4       Rural               No               No
5       Urban               No               No
6       Urban               No              Yes
  Medication_For_Chronic_Conditions Pregnancies Polycystic_Ovary_Syndrome
1                                No           0                         0
2                                No           0                         0
3                               Yes           0                         0
4                               Yes           1                        No
5                               Yes           0                         0
6                               Yes           0                         0
  Glucose_Tolerance_Test_Result Vitamin_D_Level C_Protein_Level
1                         124.3            31.5            7.46
2                         151.4            12.5            5.64
3                         106.1            35.8            7.20
4                          85.6            15.4            6.53
5                          77.0            28.6            0.58
6                         180.2            49.0            1.83
  Thyroid_Condition Diabetes_Status
1               Yes             Yes
2               Yes              No
3                No             Yes
4               Yes              No
5                No             Yes
6                No              No
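
read.csv() also accepts arguments that control how a file is parsed. The following is a minimal sketch that reads a small inline CSV string (made up for illustration, not taken from the Diabetes dataset) so that it runs without a network connection. The text, na.strings, and stringsAsFactors arguments are standard read.csv() arguments.

```r
# Hypothetical inline CSV used only for illustration
csv_text <- "Age,Gender,BMI
48,Male,35.5
18,Other,NA
21,,30.0"

# text             : read from a string instead of a file or URL
# na.strings       : treat both "NA" and empty fields as missing
# stringsAsFactors : FALSE keeps text columns as character
sample_df <- read.csv(text = csv_text,
                      na.strings = c("NA", ""),
                      stringsAsFactors = FALSE)

str(sample_df)
colSums(is.na(sample_df))
```

Declaring the missing-value codes at import time means that later steps, such as counting NA values, see them consistently.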

Import csv file from local folder

Sample csv data file - Download the csv dataset from this link.

Copy the file's path (or just its file name, if the working directory is set to the folder that contains it) and paste it inside read.csv("file name").

  • To copy the path of a file, right-click the file and select the copy path option, or use the keyboard shortcut:
  • Windows: Ctrl+Shift+C; Mac: Cmd+Opt+C
Code
# Read CSV file
Indiaagriculture <- read.csv("Agri production india.csv")

# Display the first few rows
head(Indiaagriculture)
    Crop          State Cost.of.Cultivation....Hectare..A2.FL
1  ARHAR  Uttar Pradesh                               9794.05
2  ARHAR      Karnataka                              10593.15
3  ARHAR        Gujarat                              13468.82
4  ARHAR Andhra Pradesh                              17051.66
5  ARHAR    Maharashtra                              17130.55
6 COTTON    Maharashtra                              23711.44
  Cost.of.Cultivation....Hectare..C2 Cost.of.Production....Quintal..C2
1                           23076.74                           1941.55
2                           16528.68                           2172.46
3                           19551.90                           1898.30
4                           24171.65                           3670.54
5                           25270.26                           2775.80
6                           33116.82                           2539.47
  Yield..Quintal..Hectare..
1                      9.83
2                      7.47
3                      9.59
4                      6.42
5                      8.72
6                     12.69
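
If read.csv() cannot find a file, the usual cause is a mismatch between the working directory and the file's location. The sketch below shows one way to check this; it first writes a tiny CSV to R's temporary folder so the example is self-contained. The file name and values are hypothetical.

```r
# Create a small CSV in a temporary folder so the sketch is self-contained
# (the file name and values are hypothetical)
data_dir <- tempdir()
csv_path <- file.path(data_dir, "example_crops.csv")
write.csv(data.frame(Crop = c("ARHAR", "COTTON"), Yield = c(9.83, 12.69)),
          csv_path, row.names = FALSE)

# Check the current working directory and whether the file exists
getwd()
file.exists(csv_path)

# Read the file using its full path
crops <- read.csv(csv_path)
head(crops)
```

file.path() builds the path with the correct separator for the operating system, which avoids problems with backslashes on Windows.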

Import excel file from local folder

readxl: To read Excel files in R, you need to install the readxl package.

  • The readxl package imports Excel spreadsheets into R, supporting both .xls (Excel 97-2003) and .xlsx (Excel 2007+) formats.
  • It can read specific sheets and cell ranges.
Install readxl package
Code
# Install the readxl package (needed only once)
install.packages("readxl")

Sample data file

Download the excel dataset from this link.

Copy the file name (or the full file path) of the dataset and paste it inside the function read_excel("file_name").

  • To copy the path of a file, right-click the file and select the copy path option, or use the keyboard shortcut:
  • Windows: Ctrl+Shift+C; Mac: Cmd+Opt+C
Code
library(readxl)
Europeanagriculture <- read_excel("Europeanagriculture.xlsx", sheet = 1)
head(Europeanagriculture)
# A tibble: 6 × 24
  Country  farms_number used_agricultural_area_ha standard_output_EUR
  <chr>           <dbl>                     <dbl> <chr>              
1 Belgium         36890                   1354250 8037986420         
2 Bulgaria       202720                   4468500 3842891030         
3 Czechia         26530                   3455410 NA                 
4 Denmark         35050                   2614600 10062442040        
5 Germany        276120                  16715320 49249020560        
6 Estonia         16700                    995100 801547060          
# ℹ 20 more variables: subsistence_semisubsistence_farms <chr>,
#   total_labour_persons <dbl>, total_labour_AWU <dbl>,
#   nonfamily_labour_persons <chr>, nonfamily_labour_AWU <chr>,
#   managers_basic_training <dbl>, managers_only_practical <dbl>,
#   managers_full_training <dbl>, managers_training_NA <chr>,
#   farms_SO_zero <chr>, farms_SO_less2000 <dbl>, `farms_SO_2000-3999` <dbl>,
#   `farms_SO_4000-7999` <dbl>, `farms_SO_8000-14999` <dbl>, …
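
To illustrate reading a specific sheet and cell range, the sketch below uses the example workbook that ships with readxl rather than the Europeanagriculture file, so it should run wherever readxl is installed. The sheet name and range are chosen only for illustration.

```r
library(readxl)

# readxl ships with example workbooks; readxl_example() returns the path
path <- readxl_example("datasets.xlsx")

# List the sheet names in the workbook
excel_sheets(path)

# Read one sheet by name, restricted to a cell range
mtcars_part <- read_excel(path, sheet = "mtcars", range = "A1:C5")
mtcars_part
```

Checking excel_sheets() first is a convenient way to avoid guessing sheet names or positions.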

Import Data from SQL database

  • First, download the SQLite database file from this link.
  • Save the file in your working directory.
  • Install the required packages shown below.
    • install.packages("DBI") - Database interface package
    • install.packages("RSQLite") - SQLite database driver (for MySQL, use RMariaDB)
    • install.packages("dplyr") - For data manipulation

Code
Importing Data from Databases (SQL)
# Load required libraries
library(DBI)
library(RSQLite)
library(dplyr)

# Connect to the database
con <- dbConnect(RSQLite::SQLite(), "my_database.sqlite")

# List tables in the database
dbListTables(con)
[1] "employees"
Code
# Import an entire table into R
empdata <- dbReadTable(con, "employees") 

# Display the first few rows
head(empdata)
  id    name age salary
1  1   Alice  25  50000
2  2     Bob  30  60000
3  3 Charlie  35  70000
Code
# Close the database connection
dbDisconnect(con)
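
dbReadTable() imports a whole table. When only part of a table is needed, a query can be sent with dbGetQuery(). The following is a sketch under stated assumptions: it builds a temporary in-memory SQLite database with a hypothetical employees table that mirrors the one shown above, so it does not depend on my_database.sqlite.

```r
library(DBI)
library(RSQLite)

# Temporary in-memory database; nothing is written to disk
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table that mirrors the structure shown above
dbWriteTable(con, "employees",
             data.frame(id = 1:3,
                        name = c("Alice", "Bob", "Charlie"),
                        age = c(25, 30, 35),
                        salary = c(50000, 60000, 70000)))

# Import only the rows and columns that match a query;
# the ? placeholder is filled from params
older <- dbGetQuery(con,
                    "SELECT name, salary FROM employees WHERE age > ?",
                    params = list(28))
older

# Close the connection when finished
dbDisconnect(con)
```

Passing values through params instead of pasting them into the SQL string helps avoid quoting mistakes.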

4.1.2 Data file creation in R

Use the data.frame() function to create a dataset directly in R.

Code
# Create a simple data frame
students <- data.frame(
  Name = c("Arun", "Arun", "Charan", "Divya", "Eswar", 
           "Fathima", "Gopal", "Harini", "Ilango", "Jayanthi",
           "Kiran", "Lavanya", "Mohan", "Nandini", "Omkar", 
           "Pavithra", "Qasim", "Raji", "Sanjay", "Tulsi"),
  Age = c(25, 25, 35, 28, 22, 
          40, 33, 27, 31, 29,
          24, 26, NA, 28, 30,
          27, 34, 29, NA, 26),
  Height = c(5.6, 5.6, NA, 5.4, 6.0, 
             5.3, 5.9, 5.5, 5.7, 5.8,
             5.9, NA, 6.1, 5.4, 5.6,
             5.5, 5.8, 5.3, 5.7, NA),
  Gender = c("M", "M", NA, "F", "F", 
             "F", "M", "F", "M", "F",
             "M", "F", "M", NA, "M",
             "F", "M", "F", "F", "M"),

  Marks = c(58, 58, 65, 89, 89, 
            80, 81, 82, 83, 83,
            84, 84, 85, 85, 86, 
            86, 87, 88, 90, 92),
  Attendance = c(84, 84, 85, 85, 86, 
                 86, 86, 87, 89, 89,
                 89, 90, 90, 91, 81,
                 92, 83, 95, 96, 94),
  Residence = c("Day Scholar", "Day Scholar", "Hosteller", "Hosteller", "Day Scholar",
                "Hosteller", "Hosteller", "Day Scholar", "Hosteller", "Day Scholar",
                "Hosteller", "Day Scholar", "Hosteller", "Day Scholar", "Hosteller",
                "Hosteller", "Day Scholar", "Hosteller", "Day Scholar", "Day Scholar")
)

students
       Name Age Height Gender Marks Attendance   Residence
1      Arun  25    5.6      M    58         84 Day Scholar
2      Arun  25    5.6      M    58         84 Day Scholar
3    Charan  35     NA   <NA>    65         85   Hosteller
4     Divya  28    5.4      F    89         85   Hosteller
5     Eswar  22    6.0      F    89         86 Day Scholar
6   Fathima  40    5.3      F    80         86   Hosteller
7     Gopal  33    5.9      M    81         86   Hosteller
8    Harini  27    5.5      F    82         87 Day Scholar
9    Ilango  31    5.7      M    83         89   Hosteller
10 Jayanthi  29    5.8      F    83         89 Day Scholar
11    Kiran  24    5.9      M    84         89   Hosteller
12  Lavanya  26     NA      F    84         90 Day Scholar
13    Mohan  NA    6.1      M    85         90   Hosteller
14  Nandini  28    5.4   <NA>    85         91 Day Scholar
15    Omkar  30    5.6      M    86         81   Hosteller
16 Pavithra  27    5.5      F    86         92   Hosteller
17    Qasim  34    5.8      M    87         83 Day Scholar
18     Raji  29    5.3      F    88         95   Hosteller
19   Sanjay  NA    5.7      F    90         96 Day Scholar
20    Tulsi  26     NA      M    92         94 Day Scholar

4.1.3 Viewing and Exploring Data

Explore the Data

Checking Data Structure
  • View() - To view the entire table in a window
  • head() - View the first few rows
  • str() - Check structure of dataset
  • summary() - Summary statistics
Code
head(students)
     Name Age Height Gender Marks Attendance   Residence
1    Arun  25    5.6      M    58         84 Day Scholar
2    Arun  25    5.6      M    58         84 Day Scholar
3  Charan  35     NA   <NA>    65         85   Hosteller
4   Divya  28    5.4      F    89         85   Hosteller
5   Eswar  22    6.0      F    89         86 Day Scholar
6 Fathima  40    5.3      F    80         86   Hosteller
Code
str(students)
'data.frame':   20 obs. of  7 variables:
 $ Name      : chr  "Arun" "Arun" "Charan" "Divya" ...
 $ Age       : num  25 25 35 28 22 40 33 27 31 29 ...
 $ Height    : num  5.6 5.6 NA 5.4 6 5.3 5.9 5.5 5.7 5.8 ...
 $ Gender    : chr  "M" "M" NA "F" ...
 $ Marks     : num  58 58 65 89 89 80 81 82 83 83 ...
 $ Attendance: num  84 84 85 85 86 86 86 87 89 89 ...
 $ Residence : chr  "Day Scholar" "Day Scholar" "Hosteller" "Hosteller" ...
Code
summary(students)
     Name                Age            Height         Gender         
 Length:20          Min.   :22.00   Min.   :5.300   Length:20         
 Class :character   1st Qu.:26.00   1st Qu.:5.500   Class :character  
 Mode  :character   Median :28.00   Median :5.600   Mode  :character  
                    Mean   :28.83   Mean   :5.653                     
                    3rd Qu.:30.75   3rd Qu.:5.800                     
                    Max.   :40.00   Max.   :6.100                     
                    NA's   :2       NA's   :3                         
     Marks         Attendance     Residence        
 Min.   :58.00   Min.   :81.00   Length:20         
 1st Qu.:81.75   1st Qu.:85.00   Class :character  
 Median :84.50   Median :88.00   Mode  :character  
 Mean   :81.75   Mean   :88.10                     
 3rd Qu.:87.25   3rd Qu.:90.25                     
 Max.   :92.00   Max.   :96.00                     
                                                   
Checking Missing Values
Code
# Count missing values per column
colSums(is.na(students))
      Name        Age     Height     Gender      Marks Attendance  Residence 
         0          2          3          2          0          0          0 
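
Counts of missing values are often easier to judge as percentages, and it is also useful to know how many rows are fully complete. A minimal sketch on a small hypothetical data frame (not the students data):

```r
# Small hypothetical data frame with a few missing values
df_na <- data.frame(Age = c(25, NA, 35, 28),
                    Height = c(5.6, 5.4, NA, NA),
                    Gender = c("M", "F", NA, "F"))

# Percentage of missing values in each column
round(colMeans(is.na(df_na)) * 100, 1)

# Number of rows with no missing values at all
sum(complete.cases(df_na))

# Rows that contain at least one missing value
df_na[!complete.cases(df_na), ]
```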
Checking Duplicates
Code
# Identify duplicate rows
duplicates <- students[duplicated(students), ]
print(duplicates)
  Name Age Height Gender Marks Attendance   Residence
2 Arun  25    5.6      M    58         84 Day Scholar

4.1.4 Handling missing values and imputations

Replace missing values with mean
Code
# Create a copy of the original data for cleaning
clean <- students

# Calculate the mean height (excluding NA)
mean_height <- mean(clean$Height, na.rm = TRUE)

# Replace NA values in Height with the calculated mean
clean$Height[is.na(clean$Height)] <- mean_height

clean
       Name Age   Height Gender Marks Attendance   Residence
1      Arun  25 5.600000      M    58         84 Day Scholar
2      Arun  25 5.600000      M    58         84 Day Scholar
3    Charan  35 5.652941   <NA>    65         85   Hosteller
4     Divya  28 5.400000      F    89         85   Hosteller
5     Eswar  22 6.000000      F    89         86 Day Scholar
6   Fathima  40 5.300000      F    80         86   Hosteller
7     Gopal  33 5.900000      M    81         86   Hosteller
8    Harini  27 5.500000      F    82         87 Day Scholar
9    Ilango  31 5.700000      M    83         89   Hosteller
10 Jayanthi  29 5.800000      F    83         89 Day Scholar
11    Kiran  24 5.900000      M    84         89   Hosteller
12  Lavanya  26 5.652941      F    84         90 Day Scholar
13    Mohan  NA 6.100000      M    85         90   Hosteller
14  Nandini  28 5.400000   <NA>    85         91 Day Scholar
15    Omkar  30 5.600000      M    86         81   Hosteller
16 Pavithra  27 5.500000      F    86         92   Hosteller
17    Qasim  34 5.800000      M    87         83 Day Scholar
18     Raji  29 5.300000      F    88         95   Hosteller
19   Sanjay  NA 5.700000      F    90         96 Day Scholar
20    Tulsi  26 5.652941      M    92         94 Day Scholar
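
The mean is only one possible replacement value. As a sketch of two common alternatives, the example below fills a numeric column with its median and a text column with its most frequent value. The data frame is hypothetical, and whether either choice is appropriate depends on the data.

```r
# Hypothetical data with missing values in a numeric and a text column
df_imp <- data.frame(Age = c(25, 35, NA, 28, 40),
                     Gender = c("M", NA, "F", "F", "F"),
                     stringsAsFactors = FALSE)

# Numeric column: replace NA with the median
df_imp$Age[is.na(df_imp$Age)] <- median(df_imp$Age, na.rm = TRUE)

# Text column: replace NA with the most frequent value (the mode)
gender_counts <- table(df_imp$Gender)   # table() ignores NA by default
mode_gender <- names(gender_counts)[which.max(gender_counts)]
df_imp$Gender[is.na(df_imp$Gender)] <- mode_gender

df_imp
```

The median is less affected by extreme values than the mean, which can matter for skewed variables.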
Remove missing values
Code
clean <- na.omit(clean)
clean
       Name Age   Height Gender Marks Attendance   Residence
1      Arun  25 5.600000      M    58         84 Day Scholar
2      Arun  25 5.600000      M    58         84 Day Scholar
4     Divya  28 5.400000      F    89         85   Hosteller
5     Eswar  22 6.000000      F    89         86 Day Scholar
6   Fathima  40 5.300000      F    80         86   Hosteller
7     Gopal  33 5.900000      M    81         86   Hosteller
8    Harini  27 5.500000      F    82         87 Day Scholar
9    Ilango  31 5.700000      M    83         89   Hosteller
10 Jayanthi  29 5.800000      F    83         89 Day Scholar
11    Kiran  24 5.900000      M    84         89   Hosteller
12  Lavanya  26 5.652941      F    84         90 Day Scholar
15    Omkar  30 5.600000      M    86         81   Hosteller
16 Pavithra  27 5.500000      F    86         92   Hosteller
17    Qasim  34 5.800000      M    87         83 Day Scholar
18     Raji  29 5.300000      F    88         95   Hosteller
20    Tulsi  26 5.652941      M    92         94 Day Scholar

Remove Duplicates

Code
clean <- clean[!duplicated(clean), ]
clean
       Name Age   Height Gender Marks Attendance   Residence
1      Arun  25 5.600000      M    58         84 Day Scholar
4     Divya  28 5.400000      F    89         85   Hosteller
5     Eswar  22 6.000000      F    89         86 Day Scholar
6   Fathima  40 5.300000      F    80         86   Hosteller
7     Gopal  33 5.900000      M    81         86   Hosteller
8    Harini  27 5.500000      F    82         87 Day Scholar
9    Ilango  31 5.700000      M    83         89   Hosteller
10 Jayanthi  29 5.800000      F    83         89 Day Scholar
11    Kiran  24 5.900000      M    84         89   Hosteller
12  Lavanya  26 5.652941      F    84         90 Day Scholar
15    Omkar  30 5.600000      M    86         81   Hosteller
16 Pavithra  27 5.500000      F    86         92   Hosteller
17    Qasim  34 5.800000      M    87         83 Day Scholar
18     Raji  29 5.300000      F    88         95   Hosteller
20    Tulsi  26 5.652941      M    92         94 Day Scholar

Changing Data Types

Code
# Convert column to numeric
clean$Height <- as.numeric(clean$Height)

# Convert column to character
clean$Name <- as.character(clean$Name)
str(clean)
'data.frame':   15 obs. of  7 variables:
 $ Name      : chr  "Arun" "Divya" "Eswar" "Fathima" ...
 $ Age       : num  25 28 22 40 33 27 31 29 24 26 ...
 $ Height    : num  5.6 5.4 6 5.3 5.9 ...
 $ Gender    : chr  "M" "F" "F" "F" ...
 $ Marks     : num  58 89 89 80 81 82 83 83 84 84 ...
 $ Attendance: num  84 85 86 86 86 87 89 89 89 90 ...
 $ Residence : chr  "Day Scholar" "Hosteller" "Day Scholar" "Hosteller" ...
 - attr(*, "na.action")= 'omit' Named int [1:4] 3 13 14 19
  ..- attr(*, "names")= chr [1:4] "3" "13" "14" "19"
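
Categorical columns are often converted to factors, and numbers that were imported as text need to be converted before calculations. A minimal sketch with hypothetical columns:

```r
# Hypothetical columns for illustration
df_types <- data.frame(Gender = c("M", "F", "F", "M"),
                       Age = c("25", "28", "22", "40"),  # numbers stored as text
                       stringsAsFactors = FALSE)

# Convert a text column to a factor (categorical variable)
df_types$Gender <- as.factor(df_types$Gender)

# Convert numbers stored as text to integers
df_types$Age <- as.integer(df_types$Age)

str(df_types)
levels(df_types$Gender)
```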

4.1.5 Data Manipulation with dplyr

The dplyr package provides functions for filtering, selecting, modifying, and restructuring data.

Selecting Specific Columns

  • Install the dplyr package if it is not already installed, then load it with library(dplyr).
Code
library(dplyr)

# Select specific columns (every column except Height)
clean <- clean %>% select(Name, Age, Residence, Gender, Marks, Attendance)
clean
       Name Age   Residence Gender Marks Attendance
1      Arun  25 Day Scholar      M    58         84
4     Divya  28   Hosteller      F    89         85
5     Eswar  22 Day Scholar      F    89         86
6   Fathima  40   Hosteller      F    80         86
7     Gopal  33   Hosteller      M    81         86
8    Harini  27 Day Scholar      F    82         87
9    Ilango  31   Hosteller      M    83         89
10 Jayanthi  29 Day Scholar      F    83         89
11    Kiran  24   Hosteller      M    84         89
12  Lavanya  26 Day Scholar      F    84         90
15    Omkar  30   Hosteller      M    86         81
16 Pavithra  27   Hosteller      F    86         92
17    Qasim  34 Day Scholar      M    87         83
18     Raji  29   Hosteller      F    88         95
20    Tulsi  26 Day Scholar      M    92         94

Removing Columns

Code
# Remove specific columns
clean <- clean %>% select(-Residence)
print(clean)
       Name Age Gender Marks Attendance
1      Arun  25      M    58         84
4     Divya  28      F    89         85
5     Eswar  22      F    89         86
6   Fathima  40      F    80         86
7     Gopal  33      M    81         86
8    Harini  27      F    82         87
9    Ilango  31      M    83         89
10 Jayanthi  29      F    83         89
11    Kiran  24      M    84         89
12  Lavanya  26      F    84         90
15    Omkar  30      M    86         81
16 Pavithra  27      F    86         92
17    Qasim  34      M    87         83
18     Raji  29      F    88         95
20    Tulsi  26      M    92         94

Filtering Data

Code
# Keep only rows where Attendance is at least 85
clean <- clean %>% filter(Attendance >= 85)
clean
       Name Age Gender Marks Attendance
1     Divya  28      F    89         85
2     Eswar  22      F    89         86
3   Fathima  40      F    80         86
4     Gopal  33      M    81         86
5    Harini  27      F    82         87
6    Ilango  31      M    83         89
7  Jayanthi  29      F    83         89
8     Kiran  24      M    84         89
9   Lavanya  26      F    84         90
10 Pavithra  27      F    86         92
11     Raji  29      F    88         95
12    Tulsi  26      M    92         94

Sorting Data

Code
# Arrange rows in descending order of Marks
clean <- clean %>% arrange(desc(Marks))
clean
       Name Age Gender Marks Attendance
1     Tulsi  26      M    92         94
2     Divya  28      F    89         85
3     Eswar  22      F    89         86
4      Raji  29      F    88         95
5  Pavithra  27      F    86         92
6     Kiran  24      M    84         89
7   Lavanya  26      F    84         90
8    Ilango  31      M    83         89
9  Jayanthi  29      F    83         89
10   Harini  27      F    82         87
11    Gopal  33      M    81         86
12  Fathima  40      F    80         86
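
dplyr can also modify and summarise data. The sketch below, on a small hypothetical data frame in the same shape as the students example, adds a column with mutate() and computes one summary row per group with group_by() and summarise(). The Result column and the cut-off of 85 are illustrative assumptions.

```r
library(dplyr)

# Hypothetical data in the same shape as the students example
marks_df <- data.frame(Name = c("Divya", "Gopal", "Kiran", "Raji"),
                       Gender = c("F", "M", "M", "F"),
                       Marks = c(89, 81, 84, 88),
                       Attendance = c(85, 86, 89, 95))

# mutate(): add a new column computed from existing ones
marks_df <- marks_df %>%
  mutate(Result = ifelse(Marks >= 85, "Distinction", "Pass"))

# group_by() + summarise(): one summary row per group
gender_summary <- marks_df %>%
  group_by(Gender) %>%
  summarise(Mean_Marks = mean(Marks),
            Students = n())

marks_df
gender_summary
```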

4.1.6 Reshaping Data with tidyr

Converting Wide Data to Long Format

Code
library(tidyr)

clean <- clean %>%
  pivot_longer(cols = c("Marks", "Attendance"),
               names_to = "Measure",
               values_to = "Value")
clean
# A tibble: 24 × 5
   Name       Age Gender Measure    Value
   <chr>    <dbl> <chr>  <chr>      <dbl>
 1 Tulsi       26 M      Marks         92
 2 Tulsi       26 M      Attendance    94
 3 Divya       28 F      Marks         89
 4 Divya       28 F      Attendance    85
 5 Eswar       22 F      Marks         89
 6 Eswar       22 F      Attendance    86
 7 Raji        29 F      Marks         88
 8 Raji        29 F      Attendance    95
 9 Pavithra    27 F      Marks         86
10 Pavithra    27 F      Attendance    92
# ℹ 14 more rows

Converting long data to wide format

Code
# Assuming your current dataframe is named 'clean' in long format:
clean <- clean %>% pivot_wider(names_from = Measure, values_from = Value)

clean
# A tibble: 12 × 5
   Name       Age Gender Marks Attendance
   <chr>    <dbl> <chr>  <dbl>      <dbl>
 1 Tulsi       26 M         92         94
 2 Divya       28 F         89         85
 3 Eswar       22 F         89         86
 4 Raji        29 F         88         95
 5 Pavithra    27 F         86         92
 6 Kiran       24 M         84         89
 7 Lavanya     26 F         84         90
 8 Ilango      31 M         83         89
 9 Jayanthi    29 F         83         89
10 Harini      27 F         82         87
11 Gopal       33 M         81         86
12 Fathima     40 F         80         86

4.1.7 Exporting Data from R

Export the data file as a csv file to local folder

The write.csv() function in R is used to export a dataset into a Comma-Separated Values (CSV) file, making it easy to save, share, and analyze data in other tools like Excel, Python, or SQL.

To save the CSV in a specific folder, include the folder path in the file name, for example write.csv(data, "folder path/filename.csv").

Code
# Export the data frame as csv
write.csv(clean, "cleanstudentsdata.csv")
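
By default, write.csv() also writes the row names as an extra first column. The following sketch shows the row.names = FALSE option and reads the file back to check the result. The data frame and the temporary file are hypothetical and used only for illustration.

```r
# Hypothetical data frame and temporary file, used only for illustration
scores <- data.frame(Name = c("Divya", "Raji"), Marks = c(89, 88))
out_path <- file.path(tempdir(), "scores.csv")

# row.names = FALSE omits the extra column of row numbers
write.csv(scores, out_path, row.names = FALSE)

# Read the file back to confirm that it round-trips
scores_back <- read.csv(out_path)
scores_back
```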

Export the data file as an Excel file to local folder

writexl: Writing Data to Excel Files

The writexl package provides an easy way to export data from R into an Excel file without requiring external dependencies.

Key Features:

  • Writes .xlsx files quickly.
  • Preserves column types and formats.
Install writexl package
Code
install.packages("writexl")
  • Export the clean data file as an .xlsx file.
Code
library(writexl)
# Write data to an Excel file
write_xlsx(clean, "cleanstudentsdata.xlsx")

To check the list of files present in the directory (or project)
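
A minimal sketch of this check, assuming the base R function list.files() is what is intended here:

```r
# List every file in the current working directory
list.files()

# List only CSV files, using a regular-expression pattern
list.files(pattern = "\\.csv$")

# List the files in another folder (here R's temporary folder, as an example)
list.files(tempdir())
```

If the exported files do not appear, getwd() shows which folder R is currently using.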


4.2 Normalization and Standardization in R

Normalization and standardization are essential techniques in data preprocessing to scale numeric variables, ensuring that they contribute equally to statistical analysis and machine learning models. Since different variables may have varying ranges, applying these transformations helps prevent bias towards larger values and enhances the efficiency of algorithms.

4.2.1 Normalization

Normalization (also called min-max scaling) is a technique that rescales numeric variables into a fixed range, usually [0,1] or [-1,1]. This method ensures that all features have the same scale, preventing variables with large ranges from dominating those with smaller ranges.

Formula for Min-Max Normalization

\[ X' = \frac{X - X_{min}}{X_{max} - X_{min}} \]

where:

  • \(X'\) is the normalized value.
  • \(X\) is the original value.
  • \(X_{min}\) and \(X_{max}\) are the minimum and maximum values in the dataset.

When to Use Normalization?

  • When data follows a non-Gaussian (skewed) distribution.
  • When features need to be scaled between 0 and 1 (e.g., for machine learning).
  • When working with neural networks and distance-based algorithms (e.g., k-NN, SVM).
Implementation of Normalization in R
Code
# Sample dataset
df <- data.frame(Height = c(150, 160, 170, 180, 190, 200),
                 Weight = c(50, 60, 65, 75, 80, 90))

# Min-Max Normalization function
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

# Apply normalization to dataset
df_normalized <- as.data.frame(lapply(df, normalize))

# Print results
print(df_normalized)
  Height Weight
1    0.0  0.000
2    0.2  0.250
3    0.4  0.375
4    0.6  0.625
5    0.8  0.750
6    1.0  1.000

🔹 Output: The values are now between 0 and 1, ensuring equal contribution to models.
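
The same idea extends to other target ranges such as [-1, 1]. The sketch below is one possible generalisation; rescale_range is a hypothetical helper name, not a base R function, and mapping a constant column to the lower bound is an assumption made here to avoid division by zero.

```r
# Rescale x into the interval [new_min, new_max]
# (rescale_range is a hypothetical helper name, not a base R function)
rescale_range <- function(x, new_min = 0, new_max = 1) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # Constant column: avoid dividing by zero, map everything to new_min
    return(rep(new_min, length(x)))
  }
  new_min + (x - rng[1]) / (rng[2] - rng[1]) * (new_max - new_min)
}

heights <- c(150, 160, 170, 180, 190, 200)
rescale_range(heights, -1, 1)

# A constant column is returned as all new_min instead of NaN
rescale_range(c(5, 5, 5))
```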


4.2.2 Standardization

Standardization (also known as z-score normalization) transforms data by centering it around the mean (0) and scaling it based on standard deviation. Unlike min-max normalization, it does not bound values to a fixed range.

Formula for Z-score Standardization

\[ X' = \frac{X - \mu}{\sigma} \]

where:

  • \(X'\) is the standardized value.
  • \(X\) is the original value.
  • \(\mu\) is the mean of the dataset.
  • \(\sigma\) is the standard deviation of the dataset.

When to Use Standardization?

  • When data follows a normal (Gaussian) distribution.
  • When using linear regression, logistic regression, PCA, and clustering.
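  • When the data contains extreme values: outliers still influence the mean and standard deviation, but the remaining values are not squeezed into a narrow part of a fixed range as they are with min-max normalization.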
Implementation of Standardization in R
Code
# Standardization function (Z-score scaling)
standardize <- function(x) {
  return ((x - mean(x)) / sd(x))
}

# Apply standardization
df_standardized <- as.data.frame(lapply(df, standardize))

# Print results
print(df_standardized)
      Height     Weight
1 -1.3363062 -1.3801311
2 -0.8017837 -0.6900656
3 -0.2672612 -0.3450328
4  0.2672612  0.3450328
5  0.8017837  0.6900656
6  1.3363062  1.3801311

🔹 Output: The transformed data now has zero mean (0) and a standard deviation of 1, making it suitable for statistical modeling.
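
Base R also provides scale(), which performs the same centering and scaling column by column. A brief sketch, recreating the sample data so that the block runs on its own:

```r
# Same sample data as in the normalization example
df <- data.frame(Height = c(150, 160, 170, 180, 190, 200),
                 Weight = c(50, 60, 65, 75, 80, 90))

# scale() centers each column on its mean and divides by its standard deviation
df_scaled <- as.data.frame(scale(df))

# Check: column means are (numerically) 0 and standard deviations are 1
round(colMeans(df_scaled), 10)
sapply(df_scaled, sd)
```

scale() returns a matrix, so it is wrapped in as.data.frame() here to keep the result in the same form as the input.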


4.2.3 Key Differences Between Normalization and Standardization

| Feature | Normalization | Standardization |
|---|---|---|
| Definition | Rescales values to a fixed range (typically 0 to 1). | Transforms values to have a mean of 0 and a standard deviation of 1. |
| Formula | \( X' = \frac{X - X_{min}}{X_{max} - X_{min}} \) | \( X' = \frac{X - \mu}{\sigma} \) |
| Effect on Data | Retains the shape of the original distribution but compresses values into a fixed range. | Shifts and rescales the data to mean 0 and standard deviation 1; the shape of the distribution is unchanged. |
| Range | Typically [0,1] (Min-Max Scaling). | Centered around 0, without fixed bounds. |
| Best Used When | Data does not follow a normal distribution. | Data follows (or approximately follows) a normal distribution. |
| Sensitive to Outliers? | Yes: a single extreme value can distort scaling. | Less so: outliers still affect the mean and standard deviation, but values are not forced into a fixed range. |
| Example Dataset | Heights of individuals in cm (e.g., 150 cm to 200 cm). | Exam scores that vary in scale (e.g., 0-100 vs. 0-10). |
| Common Applications | Neural networks, KNN, and other distance-based models (SVM, clustering). | Linear Regression, Logistic Regression, PCA. |

When to Use Normalization vs. Standardization?
  • Use Normalization when:
    • Your data has varying scales and does not follow a normal distribution.
    • You are using distance-based models such as KNN and K-Means, or neural networks.
    • Features have different units (e.g., age in years, salary in dollars).
  • Use Standardization when:
    • Your data follows a Gaussian (Normal) distribution.
    • You are using models that work best with centered, comparably scaled features, like Linear Regression, Logistic Regression, and PCA.
    • Your data contains outliers that would compress most values into a narrow band under min-max scaling.
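
To see how a single extreme value affects each method, the following sketch applies both transformations to a small hypothetical vector. It is an illustration only, not a general rule about which method to prefer.

```r
# Hypothetical values with one extreme observation
x <- c(1, 2, 3, 4, 100)

# Min-max normalization: the outlier becomes 1 and the rest crowd near 0
x_minmax <- (x - min(x)) / (max(x) - min(x))
round(x_minmax, 3)

# Z-score standardization: no fixed bounds, but the outlier still
# pulls the mean upward and inflates the standard deviation
x_z <- (x - mean(x)) / sd(x)
round(x_z, 3)
```

In both cases the outlier influences the result; the difference lies in how much room the remaining values keep relative to each other.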

4.2.4 Practical Applications of Normalization and Standardization

| Domain | Use Case | Method | Reason for Preference |
|---|---|---|---|
| Machine Learning | Neural Networks | Normalization | Keeps feature values within the [0,1] range, improving model convergence. |
| Machine Learning | PCA | Standardization | PCA is driven by variance; z-score scaling prevents features with large variances from dominating the components. |
| Machine Learning | k-NN Algorithm | Normalization | KNN is distance-based; scaling prevents one feature from dominating others. |
| Healthcare | Medical Imaging | Normalization | Pixel intensities are scaled (0 to 1) for better visualization and deep learning models. |
| Finance | Stock Prediction | Standardization | Stock prices vary widely; standardization makes comparisons meaningful. |
| E-commerce | Purchases Data | Normalization | Rescales transaction amounts and frequency to a common range for better customer segmentation. |
| Social Media Analytics | Sentiment Analysis | Standardization | Engagement and word frequency metrics often follow a normal distribution. |
| Supply Chain | Demand Forecast | Standardization | Ensures seasonal fluctuations are balanced for forecasting accuracy. |
| Image Processing | Object Detection | Normalization | Image pixel values are scaled, improving computational efficiency. |
| NLP | Word Embeddings | Standardization | Standardizing word-frequency features enhances embedding quality and training efficiency. |

4.3 Creating and Managing Dummy Variables

Dummy variables are binary (0/1) indicators used in regression models, classification tasks, and machine learning to represent categorical data numerically. Many statistical and machine learning algorithms require numerical inputs, making dummy variables essential for handling categorical variables.

4.3.1 Why Are Dummy Variables Needed?

  • Some models, like linear regression, do not work directly with categorical data.
  • Variables such as Gender (Male/Female) or Customer Type (New/Returning) need to be converted into numbers.
  • Dummy variables help in representing categorical effects in statistical and predictive models.

4.3.2 Creating Dummy Variables in R

Using ifelse() for Binary Categories

If a variable has two categories (e.g., Male/Female), we can create a dummy variable using ifelse().

Code
# Sample dataset
data <- data.frame(Gender = c("Male", "Female", "Male", "Female", "Male"))

# Create a dummy variable for "Male" (1 = Male, 0 = Female)
data$Male_Dummy <- ifelse(data$Gender == "Male", 1, 0)

print(data)
  Gender Male_Dummy
1   Male          1
2 Female          0
3   Male          1
4 Female          0
5   Male          1
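
An equivalent and slightly shorter approach, shown here as a sketch, converts the logical comparison directly to 0/1 with as.integer():

```r
data <- data.frame(Gender = c("Male", "Female", "Male", "Female", "Male"))

# A comparison returns TRUE/FALSE; as.integer() turns it into 1/0
data$Male_Dummy <- as.integer(data$Gender == "Male")

data$Male_Dummy
```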

Using model.matrix() for Multiple Categories

For categorical variables with more than two categories, one-hot encoding is used to create multiple dummy variables.

Code
# Sample dataset
data <- data.frame(Department = c("HR", "IT", "Finance", "HR", "IT"))

# Create dummy variables
dummy_vars <- model.matrix(~ Department - 1, data = data)

# Convert to dataframe
dummy_vars_df <- as.data.frame(dummy_vars)

# Print dataset
print(dummy_vars_df)
  DepartmentFinance DepartmentHR DepartmentIT
1                 0            1            0
2                 0            0            1
3                 1            0            0
4                 0            1            0
5                 0            0            1

Note: The -1 in model.matrix(~ Department - 1, data) removes the intercept, ensuring that each category is represented separately.


4.3.3 Managing Dummy Variables

When working with dummy variables, consider the following:

  • Avoid the Dummy Variable Trap: In regression models, one dummy variable should be dropped to avoid multicollinearity.
  • Check for Redundant Variables: If one category can be predicted from the others, remove it.
  • Scaling Dummy Variables: If using models like KNN or clustering, scaling may be required.

To remove one dummy variable in regression analysis:

Code
# Remove the first dummy variable (Finance) to avoid multicollinearity;
# Finance becomes the reference category
dummy_vars_df <- dummy_vars_df[, -1]
print(dummy_vars_df)
  DepartmentHR DepartmentIT
1            1            0
2            0            1
3            0            0
4            1            0
5            0            1
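
The same k - 1 dummy variables can be obtained in one step by keeping the intercept in the formula, since model.matrix() then treats the first factor level as the reference category. A minimal sketch using the same Department values:

```r
data <- data.frame(Department = c("HR", "IT", "Finance", "HR", "IT"))

# With the intercept kept, model.matrix() uses the first factor level
# (alphabetically, "Finance") as the reference category
mm <- model.matrix(~ Department, data = data)
colnames(mm)

# Drop the intercept column to keep only the k - 1 dummy variables
dummies <- as.data.frame(mm[, -1])
dummies
```

Functions such as lm() build this kind of design matrix internally, so manual dummy creation is mainly needed for models that require a purely numeric input.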

4.3.4 Applications of Dummy Variables

| Application Domain | Example | Use Case |
|---|---|---|
| Regression Analysis | Car Type (Sedan/SUV/Truck) | Convert car types into dummy variables for predicting car prices. |
| Customer Segmentation | Customer Type (New/Returning) | Categorize customers into segments for personalized marketing strategies. |
| Fraud Detection | Payment Method (Credit/Debit) | Detect potential fraud by analyzing payment method patterns. |
| Healthcare Analytics | Disease Presence (Yes/No) | Transform categorical disease labels into binary format for machine learning models. |

  • Dummy variables convert categorical data into numerical format for better processing.

  • Avoid multicollinearity by removing one category in regression models.

  • For large datasets, consider using the fastDummies package for efficient dummy variable creation.

  • Using dummy variables effectively helps improve statistical modeling and machine learning performance, ensuring categorical data can be used in predictive models.
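
As a brief sketch of the fastDummies approach mentioned above, assuming the package is installed, dummy_cols() creates the dummy columns and can drop one category per variable. The data frame is the same hypothetical Department example used earlier.

```r
# Assumes the fastDummies package is installed:
# install.packages("fastDummies")
library(fastDummies)

data <- data.frame(Department = c("HR", "IT", "Finance", "HR", "IT"))

# Create dummy columns for Department, dropping one category
# so that the dummy variable trap is avoided
data_dummies <- dummy_cols(data,
                           select_columns = "Department",
                           remove_first_dummy = TRUE)
data_dummies
```

The original Department column is kept alongside the new dummy columns, which makes it easy to check the encoding.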