Concrete Compressive Strength: Advanced Statistical & Machine Learning Analysis
Date: June 2024
Domain: Civil Engineering | Statistical Modeling | Machine Learning
Tools Used: R, RStudio, ggplot2, caret, xgboost, randomForest
Dataset: Yeh, I.-C. (1998) - High Performance Concrete Laboratory Dataset
Portfolio: View Full Repo
Project Overview
This project explores the factors that influence concrete compressive strength using statistical and machine learning methods. Using 1,030 laboratory observations, we investigate how well ingredient proportions (e.g., Cement, Water, Fly Ash) and curing age predict strength. The workflow spans data cleaning, exploratory data analysis (EDA), regression modeling, log transformations, and ML algorithms such as Random Forest and XGBoost.
Objectives
- Explore how material compositions impact concrete strength.
- Identify significant predictors using regression and hypothesis testing.
- Develop and evaluate predictive models using ML algorithms.
- Visualize data relationships, assumptions, and model performance.
Dataset Description
| Variable | Description |
|---|---|
| Cement, Slag, Fly Ash | Binder materials (kg/m³) |
| Water, Superplasticizer | Fluid & admixtures (kg/m³) |
| Coarse/Fine Aggregate | Fill materials (kg/m³) |
| Age | Days since casting (1–365) |
| Concrete Category | Based on aggregate ratio (categorical) |
| Contains Fly Ash | Binary (TRUE/FALSE) |
| Compressive Strength | Target variable (MPa) |
Data Preparation & Exploration
Key Steps:
- Removed 78 duplicates
- Treated skewed outliers using median replacement
- Converted categorical variables to factors
- Handled type inconsistencies and formatting

Installing and loading libraries for data analysis.

The dataset was imported with the 'read_excel' function from the readxl package.

The duplicate check found 78 duplicate rows.
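A duplicate check of this kind can be sketched in base R as follows. The toy data frame and its values are purely illustrative stand-ins for the real dataset, and `concrete` is a hypothetical name.

```r
# Toy data frame standing in for the concrete dataset (illustrative values only)
concrete <- data.frame(
  Cement   = c(540, 540, 332.5, 332.5, 198.6),
  Water    = c(162, 162, 228, 228, 192),
  Age      = c(28, 28, 270, 270, 360),
  Strength = c(79.99, 79.99, 40.27, 40.27, 44.30)
)

# duplicated() flags every row that repeats an earlier row
n_duplicates <- sum(duplicated(concrete))

# Keep only the first occurrence of each row and reset the row names
concrete_unique <- concrete[!duplicated(concrete), ]
rownames(concrete_unique) <- NULL

n_duplicates
nrow(concrete_unique)
```

Counting before dropping makes it easy to report how many rows were removed.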

Variable names were shortened with 'colnames' to make them easier to use during analysis.
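As a minimal sketch of the renaming step: the long headers below are hypothetical examples of raw spreadsheet column names, not necessarily the exact ones in the source file.

```r
# Hypothetical long column names, as raw spreadsheet headers often look
raw <- data.frame(
  "Cement (component 1)(kg in a m^3 mixture)" = c(540, 332.5),
  "Water (component 4)(kg in a m^3 mixture)"  = c(162, 228),
  "Age (day)"                                 = c(28, 270),
  check.names = FALSE  # keep the headers exactly as written
)

# Assign short names by position; the order must match the column order
colnames(raw) <- c("Cement", "Water", "Age")

colnames(raw)
```

Because the assignment is positional, it is worth printing the old names first to confirm the order before overwriting them.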

Outliers were then detected in each numeric column.
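One common way to count outliers per numeric column and replace them with the median is the 1.5 × IQR rule, sketched below. The IQR criterion is an assumption, since the detection rule used in the project is not stated here. The helper names (`is_outlier`, `replace_with_median`) and the toy data are hypothetical.

```r
# IQR rule (assumed criterion): TRUE for values more than 1.5 * IQR beyond the quartiles
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Replace flagged values with the column median
replace_with_median <- function(x) {
  x[is_outlier(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy data (illustrative values only)
toy <- data.frame(
  Water = c(160, 162, 165, 170, 175, 180, 400),
  Age   = c(3, 7, 14, 28, 28, 56, 56)
)

numeric_cols <- vapply(toy, is.numeric, logical(1))

# Number of outliers in each numeric column
outlier_counts <- vapply(toy[numeric_cols], function(x) sum(is_outlier(x)), integer(1))

# Cleaned copy with outliers replaced by the median
toy_clean <- toy
toy_clean[numeric_cols] <- lapply(toy[numeric_cols], replace_with_median)

outlier_counts
```

Median replacement keeps the row count unchanged, which matters when the dataset is small, but it also shrinks the variance of the affected columns.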

Plotting these results visualises the number of outliers counted in each variable.


Histograms and density plots visualise the distributions of all continuous variables.
Correlation & Exploratory Analysis

The code computes and displays the correlation matrix for the numerical variables.
- Strongest correlation: Cement vs Strength (r = 0.50)
- Superplasticizer and Age: moderate positive correlation with Strength
- Water: negative correlation with Strength (r = -0.22)
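A correlation matrix like the one summarised above can be computed with base R's `cor()`. This sketch uses synthetic data. The generating coefficients are arbitrary assumptions, so the printed values will not match the reported ones.

```r
set.seed(1)
n <- 200

# Synthetic stand-in data; coefficients are arbitrary, not fitted values
sim <- data.frame(
  Cement = runif(n, 100, 540),
  Water  = runif(n, 120, 250),
  Age    = sample(c(3, 7, 28, 90), n, replace = TRUE)
)
sim$Strength <- 0.08 * sim$Cement - 0.2 * sim$Water + 0.1 * sim$Age + rnorm(n, sd = 8)

# Keep numeric columns only, then compute Pearson correlations
numeric_vars <- sim[vapply(sim, is.numeric, logical(1))]
cor_matrix <- round(cor(numeric_vars, use = "complete.obs"), 2)

# Correlation of each predictor with the target, largest first
predictors <- setdiff(colnames(cor_matrix), "Strength")
target_cor <- sort(cor_matrix[predictors, "Strength"], decreasing = TRUE)

print(cor_matrix)
print(target_cor)
```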
Regression Modeling
1. Simple Linear Regression (SLR)
Strength = 13.44 + 0.08 * Cement
R² = 0.25
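A simple linear regression of this form can be fitted with `lm()`. The sketch below runs on synthetic data whose intercept, slope, and noise level are arbitrary assumptions, so its output will differ from the reported fit.

```r
set.seed(42)

# Synthetic stand-in data; generating values are arbitrary assumptions
cement   <- runif(300, 100, 540)
strength <- 13 + 0.08 * cement + rnorm(300, sd = 12)
sim <- data.frame(Cement = cement, Strength = strength)

# Fit Strength as a linear function of Cement
slr <- lm(Strength ~ Cement, data = sim)

coef(slr)               # intercept and slope
summary(slr)$r.squared  # share of variance explained

# Predicted strength for a hypothetical mix with 300 kg/m^3 of cement
predict(slr, newdata = data.frame(Cement = 300))
```

Applying the reported equation by hand, a mix with 300 kg/m³ of cement gives 13.44 + 0.08 × 300 = 37.44 MPa.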
2. Multiple Linear Regression (MLR)

A scatterplot matrix, built by selecting variables by their column position in the dataset, shows how linearly each independent variable (IV) relates to the target variable.
Strength = 0.07*Cement + 1.11*Superplasticizer + 0.10*Age - 0.08*Water
Adjusted R² = 0.58
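The multiple regression step might look like the following sketch. The data are synthetic, and the generating coefficients are arbitrary assumptions chosen only to give the predictors plausible signs. The column positions passed to `pairs()` are likewise illustrative.

```r
set.seed(7)
n <- 300

# Synthetic stand-in data (arbitrary generating coefficients)
sim <- data.frame(
  Cement           = runif(n, 100, 540),
  Superplasticizer = runif(n, 0, 30),
  Age              = sample(c(3, 7, 14, 28, 56, 90), n, replace = TRUE),
  Water            = runif(n, 120, 250)
)
sim$Strength <- 20 + 0.07 * sim$Cement + 1.0 * sim$Superplasticizer +
  0.10 * sim$Age - 0.08 * sim$Water + rnorm(n, sd = 8)

# Scatterplot matrix with columns chosen by position (drawn only in interactive sessions)
if (interactive()) pairs(sim[, c(1, 2, 3, 4, 5)])

# Fit the multiple linear regression
mlr <- lm(Strength ~ Cement + Superplasticizer + Age + Water, data = sim)

round(coef(mlr), 3)
summary(mlr)$adj.r.squared
```

Adjusted R² is the better figure to compare against the single-predictor model, because it penalises the extra predictors.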
/screenshots/residuals_plot.png
- Residual distribution
- R² = 0.79
- Improved variance, passed assumptions
/screenshots/log_model_result.png
- Log-transformed residual plots
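A log-transformed regression can be sketched as below, assuming the transformation is applied to the response (and, in this example, to Age as well). Which variables the project actually transformed is not stated here, so treat the formula as hypothetical. The data are synthetic, with multiplicative noise.

```r
set.seed(11)
n <- 300

# Synthetic data with multiplicative noise, so spread grows with the mean (assumed setup)
age      <- sample(c(3, 7, 14, 28, 56, 90), n, replace = TRUE)
cement   <- runif(n, 100, 540)
strength <- exp(1.5 + 0.003 * cement + 0.3 * log(age) + rnorm(n, sd = 0.2))
sim <- data.frame(Cement = cement, Age = age, Strength = strength)

# Model on the log scale
log_fit <- lm(log(Strength) ~ Cement + log(Age), data = sim)

summary(log_fit)$r.squared

# Shapiro-Wilk test on the residuals as one check of the normality assumption
shapiro.test(residuals(log_fit))$p.value

# Residuals vs fitted values, to inspect variance (drawn only in interactive sessions)
if (interactive()) plot(fitted(log_fit), residuals(log_fit))
```

Note that R² from a log-scale model is measured on log(Strength), so it is not directly comparable with R² from a model fitted to Strength itself.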
Machine Learning Models
Random Forest


The variable importance plot ranks each predictor by its increase in node purity (IncNodePurity). The model's RMSE is 2.628103, rounded to 2.63 below.
- R² = 0.92
- RMSE = 2.63
- Top features: Cement, Age, Superplasticizer
/screenshots/rf_importance.png
- RF Variable importance
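RMSE and R² figures like those above can be computed with two small helpers. The function names `rmse` and `r_squared` are hypothetical, and the Random Forest part is a sketch on synthetic data that only runs if the randomForest package is installed.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Coefficient of determination: 1 - SSE / SST
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Small hand-checkable example (illustrative numbers)
actual    <- c(30, 40, 50, 60)
predicted <- c(32, 38, 51, 59)
rmse(actual, predicted)
r_squared(actual, predicted)

# Optional: Random Forest on synthetic data (arbitrary generating coefficients)
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(123)
  n <- 300
  sim <- data.frame(
    Cement = runif(n, 100, 540),
    Age    = sample(c(3, 7, 28, 90), n, replace = TRUE)
  )
  sim$Strength <- 10 + 0.08 * sim$Cement + 0.15 * sim$Age + rnorm(n, sd = 5)

  # 80/20 train/test split
  train_rows <- sample(n, size = floor(0.8 * n))
  train <- sim[train_rows, ]
  test  <- sim[-train_rows, ]

  rf <- randomForest::randomForest(Strength ~ ., data = train,
                                   ntree = 200, importance = TRUE)
  print(randomForest::importance(rf))  # includes the IncNodePurity column
  print(rmse(test$Strength, predict(rf, newdata = test)))
}
```

Scoring on held-out rows rather than the training rows gives a fairer picture of predictive error.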
XGBoost Regressor
/screenshots/xgboost_metrics.png
- XGBoost performance
Hypothesis Testing

- Superplasticizer: p < 0.001
- Water: p < 0.001 (negative effect)
- Fly Ash: weak positive correlation
- ANOVA: No significant difference across categories
- Interaction: Fly Ash × Category is significant
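The ANOVA and interaction tests listed above can be run with `aov()`. The sketch below uses synthetic data. The category labels are hypothetical, and an interaction is deliberately built into the data so the test has something to detect.

```r
set.seed(5)
n <- 240

# Synthetic factors; the category labels are hypothetical placeholders
sim <- data.frame(
  Category       = factor(sample(c("Coarse-heavy", "Balanced", "Fine-heavy"), n, replace = TRUE)),
  ContainsFlyAsh = factor(sample(c(TRUE, FALSE), n, replace = TRUE))
)

# Built-in interaction (arbitrary assumption): fly ash helps only in the "Balanced" category
boost <- ifelse(sim$ContainsFlyAsh == "TRUE" & sim$Category == "Balanced", 10, 0)
sim$Strength <- 35 + boost + rnorm(n, sd = 5)

# One-way ANOVA: does mean strength differ across categories?
one_way <- aov(Strength ~ Category, data = sim)
summary(one_way)

# Two-way ANOVA with interaction: ContainsFlyAsh * Category expands to
# both main effects plus the ContainsFlyAsh:Category term
two_way <- aov(Strength ~ ContainsFlyAsh * Category, data = sim)
summary(two_way)

# p-value of the interaction term (third row of the ANOVA table)
p_interaction <- summary(two_way)[[1]][["Pr(>F)"]][3]
p_interaction
```

A significant interaction means the effect of fly ash depends on the category, so the main effects should not be interpreted on their own.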
Conclusion

XGBoost was the top-performing model, with a reported accuracy of 99.7%. Cement, Superplasticizer, and Age had a positive effect on strength, while excess Water had a negative effect. These findings can help civil engineers design optimized concrete mixtures for stronger, more durable structures.
Access the Full Portfolio
- GitHub Repo
- Place all visuals in the /screenshots/ directory for automatic rendering