A Beginner’s Guide to Principal Components Analysis (PCA) with R

Name: A Beginner’s Guide to Principal Components Analysis (PCA) with R
Rating: 5 (34 reviews)
Author: Mohammed looti

Mohammed looti

A Beginner’s Guide to Principal Components Analysis (PCA) with R

Data Science, Data Visualization, dimensionality reduction, machine learning, PCA, PCA in R, principal component analysis, Principal Components Analysis, R, R programming, statistical analysis, Unsupervised Learning

Principal Components Analysis (PCA) stands as a foundational and powerful unsupervised machine learning technique widely utilized across data science and statistical modeling. At its core, PCA addresses the fundamental challenge of handling high-dimensional data through dimensionality reduction. Its primary objective is to transform a large set of correlated variables into a smaller, more manageable set of synthetic variables—known as principal components—that still manage to capture the maximum possible variance within the original, complex dataset.

The mathematical elegance of PCA lies in its ability to construct these new variables as linear combinations of the initial predictors. By focusing analysis solely on these crucial components, data scientists can drastically simplify the inherent data structure, leading to enhanced interpretability, easier visualization, and often, more stable performance in subsequent predictive modeling tasks. This transformation is invaluable when working with massive datasets where numerous factors interact simultaneously.

The motivation for employing PCA becomes immediately apparent when confronted with datasets featuring a large number of predictor variables, often denoted by p. Attempting to analyze relationships across all possible pairs of variables quickly becomes overwhelming. For example, a dataset containing just p = 15 predictors would necessitate the inspection of 105 distinct pairwise scatterplots, rendering traditional visual inspection and exploratory data analysis highly impractical. PCA offers a robust, systematic solution to this analytical bottleneck, providing a low-dimensional summary that faithfully reflects the crucial underlying variation. If we can summarize the majority of the data’s variability in just two dimensions (PC1 and PC2), we can effectively project all observations onto a simple, easy-to-interpret scatterplot.

The Mathematical Foundation of Principal Components Analysis

The formal goal of PCA is to distill the original set of p predictors—X₁, X₂, … , X_p—into a new, condensed set of M principal components: Z₁, … , Z_M, where M is selected to be significantly smaller than p. This process ensures that the vast majority of the information encoded in the variance is preserved in the reduced space, maximizing information retention while minimizing complexity.

Each principal component, Z_m, is mathematically defined as a linear combination of the original variables: Z_m = ΣΦ_jmX_j. The set of constants, Φ_1m, Φ_2m, … , Φ_pm, are critical; they are referred to as the loadings, and they dictate the exact direction and magnitude of the component in the data space. These loadings reveal how much each original variable contributes to the construction of that specific principal component.

Principal components are constructed sequentially based on the strict criterion of variance maximization and independence, ensuring they capture unique facets of the data structure:

The first principal component, Z₁, is derived as the linear combination of predictors that captures the absolute largest possible variance present in the entire dataset.
The second principal component, Z₂, is subsequently defined as the next linear combination that maximizes the remaining residual variance, with the crucial constraint that it must be mathematically orthogonal (i.e., completely uncorrelated) to Z₁.
The third principal component, Z₃, follows the same logic, maximizing the remaining variance while being orthogonal to both Z₁ and Z₂.
This iterative construction continues until all p principal components have been calculated, although in practical applications, we typically only retain the first few components that explain the dominant portion of the variance.

Implementing PCA: Key Steps and Linear Algebra

In applied settings, calculating these optimal linear combinations relies heavily on established methods from linear algebra, specifically the decomposition of the data’s covariance structure. The standard procedure involves three essential, sequential steps that ensure robust and meaningful component extraction:

Standardizing the Data (Scaling): Prior to any calculation, every variable must be standardized to possess a mean of 0 and a standard deviation of 1. This step is absolutely critical because PCA is inherently sensitive to the scale of the variables; without standardization, variables measured in larger units would unfairly dominate the resulting principal components, skewing the analysis.
Calculating the Covariance Matrix: Following scaling, we compute the covariance matrix for the standardized variables. This matrix serves as a comprehensive summary, quantifying the variance of each individual variable and the relationships (covariances) between every pair of variables in the dataset.
Determining Eigenvalues and Eigenvectors: The final step involves calculating the eigenvalues and their corresponding eigenvectors of the covariance matrix.

The outcome of this linear algebra procedure provides an elegant solution to the PCA problem: the eigenvector associated with the largest eigenvalue precisely defines the first principal component (PC1). This vector identifies the direction within the data space that accounts for the maximum possible variance. Subsequent principal components (PC2, PC3, etc.) are similarly defined by eigenvectors corresponding to the second, third, and subsequent largest eigenvalues.

The remainder of this tutorial transitions from theory to application, providing a detailed, practical guide on how to successfully perform and interpret Principal Components Analysis using the statistical programming language R.

Step 1: Preparing the R Environment and Loading the Data

To begin our practical demonstration, we first establish the necessary environment in R. We load the powerful tidyverse package, which provides a cohesive collection of tools highly optimized for efficient data manipulation, exploration, and visualization within the R ecosystem.

library(tidyverse)

For our analysis, we will utilize the popular USArrests dataset, which is conveniently included within R’s base installation. This dataset compiles statistics relating to arrests per 100,000 residents across each of the U.S. states in 1973. The variables detail rates for three primary crime categories: Murder, Assault, and Rape. Additionally, it includes UrbanPop, which represents the percentage of the population residing in urban areas within that state.

The following R commands load the dataset into the active session and display the initial observations, allowing us to quickly confirm the dataset’s structure and variable types:

# Load the built-in USArrests data
data("USArrests")

# View the first six rows to inspect the structure
head(USArrests)

           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado    7.9     204       78 38.7

Step 2: Calculating and Interpreting Principal Components (Loadings)

With the data successfully prepared, we proceed to compute the principal components using R’s native function, prcomp(). Critically, we must specify the argument scale = TRUE within the function call. This command instructs R to automatically standardize all variables—ensuring they have zero mean and unit variance—before the components are calculated, adhering strictly to the required preprocessing step in PCA methodology.

A common convention in R’s PCA output is that the direction (sign) of the eigenvectors may sometimes be inverted compared to standard statistical interpretation. To ensure consistency and ease of interpretation, we multiply both the resulting rotation matrix and the component scores by -1. This adjustment does not alter the underlying variance explained, but merely flips the direction of the component arrows in the visualized space.

# Calculate principal components, ensuring data scaling
results <- prcomp(USArrests, scale = TRUE)

# Reverse the signs for conventional interpretation (optional but recommended)
results$rotation <- -1*results$rotation

# Display the principal component loadings
results$rotation

               PC1        PC2        PC3         PC4
Murder   0.5358995 -0.4181809  0.3412327 -0.64922780
Assault  0.5831836 -0.1879856  0.2681484  0.74340748
UrbanPop 0.2781909  0.8728062  0.3780158 -0.13387773
Rape     0.5434321  0.1673186 -0.8177779 -0.08902432

By carefully analyzing the resulting rotation matrix, we can assign meaningful interpretations to the new components. The first principal component (PC1) exhibits strong positive loadings for Murder (0.536), Assault (0.583), and Rape (0.543). Given that all these measures relate to violent criminal activity, PC1 can be robustly interpreted as a generalized measure of overall violent crime rate across the states.

In contrast, the second principal component (PC2) is dominated by a very high positive loading for UrbanPop (0.873), while displaying relatively low loadings for the crime variables. Therefore, PC2 largely captures the underlying degree of urbanization within the various states. Crucially, because PC2 is orthogonal to PC1, it captures variance in urbanization that is statistically independent of the overall crime rate captured by PC1.

The principal components scores—the coordinates of each state observation in the new, reduced feature space—are stored in results$x. We also apply the sign reversal to these scores for consistency with the loadings:

# Reverse the signs of the scores
results$x <- -1*results$x

# Display the transformed coordinates for the first six states
head(results$x)

                  PC1        PC2         PC3          PC4
Alabama     0.9756604 -1.1220012  0.43980366 -0.154696581
Alaska      1.9305379 -1.0624269 -2.01950027  0.434175454
Arizona     1.7454429  0.7384595 -0.05423025  0.826264240
Arkansas   -0.1399989 -1.1085423 -0.11342217  0.180973554
California  2.4986128  1.5274267 -0.59254100  0.338559240
Colorado    1.4993407  0.9776297 -1.08400162 -0.001450164

Step 3: Visualizing the Data Structure with a Biplot

One of the most powerful advantages of PCA is its capacity to collapse complex, high-dimensional datasets into an easily interpretable two-dimensional visualization. This is achieved through the use of a biplot, which simultaneously maps two distinct elements onto the plane defined by PC1 and PC2: the observations (individual states) and the variable loadings (the original crime categories).

When generating the visualization in R using the biplot() function, the argument scale = 0 is crucial. This setting ensures that the arrows representing the original variables are scaled correctly according to their calculated loadings, thereby preserving the proportional relationship between the variables and the principal components for accurate interpretation.

biplot(results, scale = 0)

Biplot for principal components analysis in R

The resulting biplot immediately provides visual insight into the data’s underlying structure. States positioned close to one another exhibit similar profiles across the four original variables. Furthermore, the direction and length of the variable arrows indicate their influence: states plotted far along the direction of a specific arrow (e.g., “Murder”) possess characteristically high values for that variable.

For instance, observation of the plot reveals that the state of “Georgia” is situated very close to the direction vector of the Murder variable. This visual proximity strongly suggests that Georgia is highly associated with elevated murder rates compared to the average state and relative to the other variables. We can quickly validate this interpretation by inspecting the original dataset, sorted by murder rate:

# Display states with the highest murder rates in the original dataset
head(USArrests[order(-USArrests$Murder),])

               Murder Assault UrbanPop Rape
Georgia          17.4     211       60 25.8
Mississippi      16.1     259       44 17.1
Florida          15.4     335       80 31.9
Louisiana        15.4     249       66 22.2
South Carolina   14.4     279       48 22.5
Alabama          13.2     236       58 21.2

Step 4: Assessing Variance Explained and Selecting Components

A pivotal step in utilizing PCA is determining how much of the total variance present in the original dataset is successfully explained by each successive component. This measure is essential for deciding the optimal number of components (M) to retain for subsequent modeling or visualization purposes. This ratio is calculated by squaring the standard deviation (sdev) of each component and dividing by the sum of all variances (total variance).

# Calculate the proportion of total variance explained by each principal component
results$sdev^2 / sum(results$sdev^2)

[1] 0.62006039 0.24744129 0.08914080 0.04335752

These numerical results clearly demonstrate the dominant explanatory power concentrated in the first two components:

The first principal component (PC1) alone accounts for approximately 62.0% of the total variance.
The second principal component (PC2) explains an additional 24.7% of the total variance.
Collectively, the first two components capture a substantial 86.7% of the total variability in the four original crime and urbanization measures.
The remaining components (PC3 and PC4) explain only minor residual variance: 8.9% and 4.3%, respectively.

Given that PC1 and PC2 account for such a significant majority of the total variance, projecting the data onto the two-dimensional biplot (as demonstrated in Step 3) is a statistically robust and highly informative simplification. The patterns observed in this reduced space reliably reflect the core underlying structure of the original four-dimensional data.

To finalize the selection process, we often rely on a visual aid known as a scree plot. This plot graphically displays the eigenvalues (variance explained) against the index number of the principal component. The goal is to identify the “elbow”—the point on the graph where the marginal gain in variance explained begins to sharply diminish.

# Calculate total variance explained for plotting
var_explained = results$sdev^2 / sum(results$sdev^2)

# Create the scree plot using ggplot2 (part of tidyverse)
qplot(c(1:4), var_explained) + 
  geom_line() + 
  xlab("Principal Component") + 
  ylab("Variance Explained") +
  ggtitle("Scree Plot") +
  ylim(0, 1)

Scree plot in R

Practical Applications of Principal Components Analysis

Beyond its theoretical appeal, PCA serves two critical, practical functions that are indispensable in modern data analysis and predictive modeling workflows.

1. Exploratory Data Analysis (EDA) and Visualization: PCA is frequently one of the initial steps undertaken when analyzing a novel dataset. By generating visualizations like the biplot, analysts can rapidly identify natural clusters or groupings among observations (e.g., states exhibiting similar crime profiles). Furthermore, PCA clarifies which original variables are the primary drivers of the largest differences or variations observed within the data, providing immediate, actionable insights.

2. Principal Components Regression (PCR) and Model Stabilization: The derived principal components are often utilized as new, uncorrelated predictors in regression models, a robust technique known as Principal Components Regression. This approach is highly effective for addressing situations where the original predictors exhibit severe multicollinearity—a scenario where strong correlations among predictors destabilize standard linear regression models, leading to inflated variance in coefficient estimates. Since principal components are inherently constructed to be orthogonal (uncorrelated), substituting them for the original variables eliminates the multicollinearity issue, resulting in more stable and reliable model estimates.

This comprehensive guide has demonstrated the theory, calculation steps, and interpretation of Principal Components Analysis using the statistical programming language R. The complete R code utilized throughout this tutorial, including all setup and calculation commands, is available for replication and further study via the external link provided below.

The complete R code used in this tutorial, including the setup and calculations, can be accessed here for replication.

Cite this article

APAMLACHICAGOHARVARDIEEEAMA

Mohammed looti (2025). A Beginner’s Guide to Principal Components Analysis (PCA) with R. PSYCHOLOGICAL STATISTICS. Retrieved from https://statistics.arabpsychology.com/principal-components-analysis-in-r-step-by-step-example/

Mohammed looti. "A Beginner’s Guide to Principal Components Analysis (PCA) with R." PSYCHOLOGICAL STATISTICS, 6 Nov. 2025, https://statistics.arabpsychology.com/principal-components-analysis-in-r-step-by-step-example/.

Mohammed looti. "A Beginner’s Guide to Principal Components Analysis (PCA) with R." PSYCHOLOGICAL STATISTICS, 2025. https://statistics.arabpsychology.com/principal-components-analysis-in-r-step-by-step-example/.

Mohammed looti (2025) 'A Beginner’s Guide to Principal Components Analysis (PCA) with R', PSYCHOLOGICAL STATISTICS. Available at: https://statistics.arabpsychology.com/principal-components-analysis-in-r-step-by-step-example/.

[1] Mohammed looti, "A Beginner’s Guide to Principal Components Analysis (PCA) with R," PSYCHOLOGICAL STATISTICS, vol. X, no. Y, ص Z-Z, November, 2025.

Mohammed looti. A Beginner’s Guide to Principal Components Analysis (PCA) with R. PSYCHOLOGICAL STATISTICS. 2025;vol(issue):pages.

Download Post (.PDF)

Table of Contents