Data Science, Data Analytics
DigiCO Data Wrangling & Predictive Analytics
End-to-end cleaning, NLP & predictive modelling on 10,000+ retail records
Built end-to-end data cleaning, preprocessing, sentiment analysis, outlier detection, and predictive modelling workflows across 10,000+ retail transaction records using Python, Pandas, NumPy, Scikit-learn, and NLP techniques.
Python, Pandas, NumPy, Scikit-learn, NLP, Data Cleaning, Predictive Modelling, Statistical Analysis
Problem
A retail dataset spanning customer, warehouse, pricing, delivery, and transaction tables was incomplete, noisy, and unfit for predictive modelling.
My role
Data Scientist — owned data cleaning, NLP sentiment analysis, outlier detection, and predictive model development.
Solution
Built a reproducible Pandas pipeline that validated and standardised each source, restored incomplete records via median/mode and geospatial inference, applied VADER sentiment analysis to free-text reviews, and trained Scikit-learn regression models with residual-based outlier detection and statistical transformations.
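The residual-based outlier step can be illustrated with a minimal sketch. It assumes a plain linear regression and IQR fences on the residuals; the project's actual model, features, and cut-off are not stated here, so the function name `flag_residual_outliers`, the multiplier `k`, and the synthetic quantity/price data are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def flag_residual_outliers(X, y, k=1.5):
    """Return a boolean mask of rows whose residual lies outside the IQR fences.

    Hypothetical helper: fits ordinary least squares, then treats any residual
    more than k * IQR beyond the first or third quartile as an outlier.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)
    q1, q3 = np.percentile(residuals, [25, 75])
    iqr = q3 - q1
    return (residuals < q1 - k * iqr) | (residuals > q3 + k * iqr)


# Synthetic data (an assumption, not the project's dataset):
# price is roughly linear in quantity, with one deliberately corrupted row.
rng = np.random.default_rng(0)
quantity = rng.integers(1, 20, size=200).reshape(-1, 1)
price = 12.5 * quantity.ravel() + rng.normal(0, 1.0, size=200)
price[17] += 500.0  # inject an obvious pricing error

mask = flag_residual_outliers(quantity, price)
print("Flagged row indices:", np.flatnonzero(mask))
```

Fitting first and thresholding the residuals means a row is judged against what the other features predict for it, not against the raw column distribution. A large but consistent order is therefore not flagged merely for being large.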
Challenges
- Heterogeneous schemas across customer, warehouse, pricing, delivery, and transaction tables.
- Missing values requiring statistical, geospatial, and sentiment-based imputation.
- Highly skewed numeric features hurting model performance.
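To make the skewness challenge concrete, here is a small sketch that measures skewness before and after a transformation. It assumes a `log1p` transform on a non-negative, right-skewed column; the specific transformations used in the project are not named, and the `order_total` column and its synthetic values are hypothetical.

```python
import numpy as np
import pandas as pd


def skew_before_after(series):
    """Return (skew_before, skew_after, transformed) for a log1p transform.

    Assumes the series is non-negative, since log1p is undefined below -1
    and is only a sensible skew correction for right-skewed, positive data.
    """
    before = series.skew()
    transformed = np.log1p(series)
    after = transformed.skew()
    return before, after, transformed


# Synthetic right-skewed values (an assumption, not the project's data).
rng = np.random.default_rng(42)
order_total = pd.Series(
    rng.lognormal(mean=3.0, sigma=1.0, size=5000), name="order_total"
)

before, after, _ = skew_before_after(order_total)
reduction = 1 - abs(after) / abs(before)
print(f"skew before: {before:.2f}, after: {after:.2f}, reduction: {reduction:.0%}")
```

Reporting the reduction as a ratio of absolute skewness values is one plausible way to arrive at a figure such as "up to 80%"; the exact definition used in the project is not given.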
Key features
- Cleaned and validated 10,000+ retail transaction records across customer, warehouse, pricing, delivery, and transaction datasets.
- Restored 1,000+ incomplete records using median, mode, geospatial inference, sentiment analysis, and rule-based reconstruction.
- Applied VADER sentiment analysis to classify customer satisfaction from unstructured review text.
- Developed regression-based outlier detection and predictive modelling using Scikit-learn and residual analysis.
- Achieved above 97% model accuracy and reduced feature skewness by up to 80% through statistical transformations.
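The VADER step listed above might look roughly like the sketch below. It assumes the third-party `vaderSentiment` package and the conventional ±0.05 cut-offs on the compound score; the project's own thresholds and label names are not stated, so `label_from_compound` and `score_reviews` are hypothetical helpers.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label.

    The 0.05 threshold is the commonly used convention, assumed here.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_reviews(reviews):
    """Return (review, compound, label) tuples for an iterable of review strings.

    Requires the `vaderSentiment` package; it is imported lazily so the
    labelling logic above can be used and tested without it installed.
    """
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    results = []
    for text in reviews:
        compound = analyzer.polarity_scores(text)["compound"]
        results.append((text, compound, label_from_compound(compound)))
    return results


# The labelling rule on its own, with hand-picked scores:
for score in (0.62, 0.0, -0.41):
    print(score, "->", label_from_compound(score))
```

With the package installed, `score_reviews(["Great service, fast delivery!", "Item arrived broken."])` returns one tuple per review. A derived label of this kind could then fill a missing satisfaction flag, which is one reading of the sentiment-based reconstruction mentioned above.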
Results
- Predictive model achieving above 97% accuracy in project evaluation.
- Up to 80% reduction in feature skewness after transformations.
- Reusable wrangling and modelling templates for future retail datasets.
Above 97% model accuracy (project evaluation)
Skewness reduced by up to 80%
1,000+ records restored
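As an illustration of the median/mode part of record restoration, a reusable helper could be sketched as follows. It covers only the statistical imputation, not the geospatial or rule-based reconstruction, and the function name `impute_median_mode` and the column names `order_price` and `season` are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_median_mode(df):
    """Return a copy with numeric gaps filled by the column median and
    non-numeric gaps filled by the most frequent value in the column."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to restore in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:  # leave the column alone if it is entirely missing
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Tiny illustrative frame (made-up values, not project data).
orders = pd.DataFrame(
    {
        "order_price": [120.0, np.nan, 95.0, 310.0],
        "season": ["Winter", "Winter", None, "Summer"],
    }
)
clean = impute_median_mode(orders)
print(clean)
```

The median is preferred over the mean for numeric columns here because skewed retail prices would pull a mean towards extreme values.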