Clement Airiohuodion


Welcome to My Page

View the Project on GitHub clembrain/Advanced_Statistics_Using_R_Studio

📺 Concrete Compressive Strength: Advanced Statistical & Machine Learning Analysis

๐Ÿ—•๏ธ Date: June 2024 ๐Ÿงช Domain: Civil Engineering | Statistical Modeling | Machine Learning ๐Ÿ“Š Tools Used: R, RStudio, ggplot2, caret, xgboost, randomForest ๐Ÿ“ Dataset: Yeh, I.-C. (1998) - High Performance Concrete Laboratory Dataset ๐Ÿ–ฑ๏ธ Portfolio: View Full Repo


📌 Project Overview

This project explores the factors influencing concrete compressive strength using statistical and machine learning methods. Using 1,030 lab observations, we investigate how well ingredient proportions (e.g., Cement, Water, Fly Ash) and curing age predict strength. The approach spans data cleaning, EDA, regression modeling, log transformations, and ML algorithms such as Random Forest and XGBoost.


🎯 Objectives


๐Ÿ—’๏ธ Dataset Description

Variable Description
Cement, Slag, Fly Ash Binder materials (kg/mยณ)
Water, Superplasticizer Fluid & admixtures (kg/mยณ)
Coarse/Fine Aggregate Fill materials (kg/mยณ)
Age Days since casting (1โ€“365)
Concrete Category Based on aggregate ratio (Categorical)
Contains Fly Ash Binary (TRUE/FALSE)
Compressive Strength Target variable (MPa)

๐Ÿ” Data Preparation & Exploration

๐Ÿ“Œ Key Steps:


Load Libraries

Installing and loading libraries for data analysis.
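As a rough sketch of this step, the snippet below checks which packages are missing and attaches the ones already installed. The package list comes from the "Tools Used" line plus readxl (for `read_excel`); the helper name `missing_packages` is hypothetical and not taken from the original code.

```r
# Packages named in "Tools Used", plus readxl for read_excel().
required <- c("readxl", "ggplot2", "caret", "randomForest", "xgboost")

# Hypothetical helper: which of the requested packages are not installed?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  # Installation needs network access, so it is only suggested here.
  message("Run once: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
}

# Attach every required package that is already available.
for (pkg in setdiff(required, to_install)) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}
```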


Load Dataset

The dataset was imported with read_excel() from the readxl package.
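A minimal sketch of the import might look like the following. The file name `Concrete_Data.xls` and the object name `concrete` are assumptions for illustration; point `data_path` at your own copy of the workbook.

```r
# Hypothetical file name; replace with the path to your copy of the dataset.
data_path <- "Concrete_Data.xls"

if (requireNamespace("readxl", quietly = TRUE) && file.exists(data_path)) {
  concrete <- readxl::read_excel(data_path)  # returns a tibble
  str(concrete)                              # column names and types
  print(dim(concrete))                       # rows x columns
} else {
  message("readxl or the data file is not available; skipping the import.")
}
```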


Check Duplicates

The duplicate check found 78 duplicates.
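One common way to run such a check is base R's `duplicated()`. The sketch below uses a small made-up data frame (`toy`), not the real dataset, so the count it prints is unrelated to the 78 reported here.

```r
# Toy data: row 2 repeats row 1 exactly; rows 3 and 4 differ in Age.
toy <- data.frame(
  Cement = c(540, 540, 332.5, 332.5, 198.6),
  Water  = c(162, 162, 228, 228, 192),
  Age    = c(28, 28, 270, 365, 360)
)

dup_flags <- duplicated(toy)    # TRUE for each repeat of an earlier row
n_duplicates <- sum(dup_flags)  # number of duplicate rows
print(n_duplicates)
print(toy[dup_flags, ])         # inspect the duplicated rows

# Optional: keep only the first occurrence of each row.
toy_unique <- toy[!dup_flags, ]
```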


Rename Columns

Variable names were rewritten with colnames() for ease of use during analysis.
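The renaming can be sketched as below. The long column names are illustrative stand-ins for the verbose headers the raw workbook ships with, and the short names are assumptions chosen to match the variable table.

```r
# Toy frame with long, awkward headers (illustrative, not the full dataset).
toy <- data.frame(
  `Cement (component 1)(kg in a m^3 mixture)` = c(540, 332.5),
  `Water (component 4)(kg in a m^3 mixture)`  = c(162, 228),
  `Age (day)`                                 = c(28, 270),
  check.names = FALSE  # keep the headers exactly as written
)

# Assign short names in the same column order.
colnames(toy) <- c("Cement", "Water", "Age")
print(colnames(toy))
```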


Detect Outliers (IQR)

Outliers in the numeric columns were detected with the interquartile range (IQR) rule.
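For reference, a typical IQR check flags values more than 1.5 × IQR below the first quartile or above the third. The sketch below assumes that conventional multiplier (the page does not state which one was used) and runs on made-up values; `iqr_outliers` is a hypothetical helper name.

```r
# Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the usual default.
iqr_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Toy data with one extreme value in each column.
toy <- data.frame(
  Water = c(160, 170, 175, 180, 185, 190, 300),
  Age   = c(3, 7, 14, 28, 28, 56, 365)
)

# Apply the check only to numeric columns.
numeric_cols <- toy[sapply(toy, is.numeric)]
outlier_flags <- lapply(numeric_cols, iqr_outliers)
print(lapply(outlier_flags, which))  # positions of flagged values per column
```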


Outlier Count Per Column

Plotting the counts visualises the number of outliers in each variable.
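A per-column count and bar chart could be produced roughly as follows. This sketch uses base graphics as a stand-in for whatever plotting code the project used (ggplot2 is listed among the tools), the conventional 1.5 × IQR rule, and toy data.

```r
# Same assumed 1.5 * IQR rule as a standalone helper (hypothetical name).
iqr_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

toy <- data.frame(
  Cement = c(200, 250, 300, 350, 400, 450, 500),
  Water  = c(160, 170, 175, 180, 185, 190, 300),
  Age    = c(3, 7, 14, 28, 28, 56, 365)
)

# Count flagged values in each column.
outlier_counts <- sapply(toy, function(col) sum(iqr_outliers(col)))
print(outlier_counts)

# Bar chart of the counts, one bar per variable.
barplot(outlier_counts,
        ylab = "Number of outliers",
        main = "IQR outliers per variable")
```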


Replace Outliers with Median

This step replaces outliers with the median value of their respective columns.
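A sketch of median replacement is shown below. It assumes the 1.5 × IQR rule and computes each median over the full column, outliers included; the page does not say which variant the original code used. The helper name and data are made up.

```r
# Replace IQR-flagged values with the column median (computed on all values).
replace_outliers_with_median <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  is_outlier <- x < q[1] - fence | x > q[2] + fence
  x[is_outlier] <- median(x, na.rm = TRUE)
  x
}

toy <- data.frame(
  Water = c(160, 170, 175, 180, 185, 190, 300),
  Age   = c(3, 7, 14, 28, 28, 56, 365)
)

# Apply column by column and rebuild the data frame.
cleaned <- as.data.frame(lapply(toy, replace_outliers_with_median))
print(cleaned)
```

Median replacement keeps the row count unchanged, at the cost of pulling extreme but possibly genuine measurements toward the centre of the distribution.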


Variable Distribution (EDA)

Histograms and density plots visualise the distributions of all continuous variables.
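As an illustration, the loop below draws a histogram with a density curve overlaid for each column. It uses base graphics rather than ggplot2 to stay dependency-free, and the data are simulated, so treat it as a sketch of the idea rather than the project's plotting code.

```r
set.seed(1)

# Simulated stand-ins for two continuous variables.
toy <- data.frame(
  Cement = rnorm(200, mean = 280, sd = 100),
  Water  = rnorm(200, mean = 180, sd = 20)
)

# One panel per variable: histogram on the density scale plus a density curve.
old_par <- par(mfrow = c(1, ncol(toy)))
for (v in names(toy)) {
  hist(toy[[v]], freq = FALSE, main = v, xlab = v)
  lines(density(toy[[v]]), lwd = 2)
}
par(old_par)  # restore the previous plotting layout
```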


📈 Correlation & Exploratory Analysis


Correlation Analysis

The code calculates and displays the correlation matrix for the numerical variables.
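A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below runs on simulated data with made-up effect sizes, so the printed values do not correspond to the project's results.

```r
set.seed(42)
n <- 100

# Simulated data: Strength rises with Cement and falls with Water.
toy <- data.frame(
  Cement = runif(n, 100, 540),
  Water  = runif(n, 120, 250)
)
toy$Strength <- 30 + 0.08 * toy$Cement - 0.10 * toy$Water + rnorm(n, sd = 5)

# Pearson correlations between all numeric columns.
cor_matrix <- cor(toy[sapply(toy, is.numeric)])
print(round(cor_matrix, 2))
```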



📊 Regression Modeling

1️⃣ Simple Linear Regression (SLR)

Strength = 13.44 + 0.08 * Cement
R² = 0.25
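A model of this form is fitted with `lm()`. The sketch below simulates data from the reported equation with an arbitrary noise level (an assumption), so the fitted coefficients land near 13.44 and 0.08 but the R² will not match the reported 0.25.

```r
set.seed(10)
n <- 500

# Simulated data generated from the reported equation plus noise.
toy <- data.frame(Cement = runif(n, 100, 540))
toy$Strength <- 13.44 + 0.08 * toy$Cement + rnorm(n, sd = 14)

# Simple linear regression of Strength on Cement.
slr_fit <- lm(Strength ~ Cement, data = toy)
print(round(coef(slr_fit), 2))    # intercept and slope
print(summary(slr_fit)$r.squared) # share of variance explained
```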

2๏ธโƒฃ Multiple Linear Regression (MLR)


Correlation Analysis

The variables are selected by their column position in the dataset to produce a matrix showing the linear relationships between the independent variables (IVs) and the target variable.


Strength = 0.07*Cement + 1.11*Superplasticizer + 0.10*Age - 0.08*Water
Adjusted R² = 0.58

/screenshots/residuals_plot.png - Residual distribution
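For illustration, a four-predictor model like the one reported can be fitted and inspected as follows. The data are simulated from the reported coefficients with an assumed noise level and no intercept term, so the output only resembles the reported fit.

```r
set.seed(7)
n <- 500

# Simulated predictors over plausible ranges (assumed, not from the dataset).
toy <- data.frame(
  Cement           = runif(n, 100, 540),
  Superplasticizer = runif(n, 0, 30),
  Age              = sample(c(3, 7, 14, 28, 56, 90), n, replace = TRUE),
  Water            = runif(n, 120, 250)
)
toy$Strength <- 0.07 * toy$Cement + 1.11 * toy$Superplasticizer +
  0.10 * toy$Age - 0.08 * toy$Water + rnorm(n, sd = 8)

# Multiple linear regression with all four predictors.
mlr_fit <- lm(Strength ~ Cement + Superplasticizer + Age + Water, data = toy)
print(round(coef(mlr_fit), 2))
print(summary(mlr_fit)$adj.r.squared)

# Residual distribution: roughly symmetric around zero for a well-specified model.
hist(residuals(mlr_fit), main = "MLR residuals", xlab = "Residual")
```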


📊 Log Transformation Model

/screenshots/log_model_result.png - Log-transformed residual plots


🤖 Machine Learning Models

✅ Random Forest


ML Model (RF)

IncNodePurity


The variable importance plot ranks each variable by IncNodePurity (the increase in node purity). The Random Forest model has an RMSE of 2.628103.


/screenshots/rf_importance.png - RF Variable importance
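The sketch below shows one way to fit a Random Forest regressor, read off IncNodePurity, and compute a test RMSE. The data are simulated, and the 80/20 split and `ntree = 200` are assumptions, so the RMSE it prints is unrelated to the 2.628103 reported here.

```r
set.seed(123)
n <- 300

# Simulated data with made-up effects for three predictors.
toy <- data.frame(
  Cement = runif(n, 100, 540),
  Water  = runif(n, 120, 250),
  Age    = sample(c(3, 7, 14, 28, 56, 90), n, replace = TRUE)
)
toy$Strength <- 0.08 * toy$Cement - 0.10 * toy$Water +
  0.10 * toy$Age + rnorm(n, sd = 5)

# Assumed 80/20 train/test split.
train_idx <- sample(n, size = round(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

if (requireNamespace("randomForest", quietly = TRUE)) {
  rf_fit <- randomForest::randomForest(Strength ~ ., data = train, ntree = 200)
  print(randomForest::importance(rf_fit, type = 2))  # type 2 = IncNodePurity
  randomForest::varImpPlot(rf_fit)
  cat("Test RMSE:", rmse(test$Strength, predict(rf_fit, newdata = test)), "\n")
}
```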

✅ XGBoost Regressor

/screenshots/xgboost_metrics.png - XGBoost performance
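A gradient-boosted regressor can be trained along the following lines. The data are simulated, and the hyperparameters (`max_depth = 4`, 100 rounds) and the choice of RMSE and R² as metrics are assumptions; the page does not list the settings or metric definitions used in the project.

```r
set.seed(321)
n <- 300

# Simulated data with made-up effects for three predictors.
toy <- data.frame(
  Cement = runif(n, 100, 540),
  Water  = runif(n, 120, 250),
  Age    = sample(c(3, 7, 14, 28, 56, 90), n, replace = TRUE)
)
toy$Strength <- 0.08 * toy$Cement - 0.10 * toy$Water +
  0.10 * toy$Age + rnorm(n, sd = 5)

# Assumed 80/20 train/test split.
train_idx <- sample(n, size = round(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
features <- c("Cement", "Water", "Age")

# Two common regression metrics.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

if (requireNamespace("xgboost", quietly = TRUE)) {
  # xgboost expects numeric matrices wrapped in xgb.DMatrix objects.
  dtrain <- xgboost::xgb.DMatrix(data = as.matrix(train[features]),
                                 label = train$Strength)
  dtest  <- xgboost::xgb.DMatrix(data = as.matrix(test[features]),
                                 label = test$Strength)
  xgb_fit <- xgboost::xgb.train(
    params  = list(objective = "reg:squarederror", max_depth = 4),
    data    = dtrain,
    nrounds = 100
  )
  pred <- predict(xgb_fit, dtest)
  cat("Test RMSE:", rmse(test$Strength, pred), "\n")
  cat("Test R^2 :", r_squared(test$Strength, pred), "\n")
}
```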


📊 Hypothesis Testing





✅ Conclusion




XGBoost was the top-performing model with 99.7% accuracy. Cement, Superplasticizer, and Age had a positive effect on strength, while excess Water had a negative effect. This analysis can help civil engineers design optimized concrete mixtures for stronger, more durable structures.


🔗 Access the Full Portfolio