Data Science · Data Analytics

DigiCO Data Wrangling & Predictive Analytics

End-to-end cleaning, NLP & predictive modelling on 10,000+ retail records

Built end-to-end data cleaning, preprocessing, sentiment analysis, outlier detection, and predictive modelling workflows across 10,000+ retail transaction records using Python, Pandas, NumPy, Scikit-learn, and NLP techniques.

Python · Pandas · NumPy · Scikit-learn · NLP · Data Cleaning · Predictive Modelling · Statistical Analysis

Case Study

Problem

A retail dataset spanning customer, warehouse, pricing, delivery, and transaction tables was incomplete, noisy, and unfit for predictive modelling.

My role

Data Scientist: owned data cleaning, NLP sentiment analysis, outlier detection, and predictive model development.

Solution

Built a reproducible Pandas pipeline that validated and standardised each source, restored incomplete records via median/mode and geospatial inference, applied VADER sentiment analysis to free-text reviews, and trained Scikit-learn regression models with residual-based outlier detection and statistical transformations.
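As a rough illustration of the median/mode step described above, here is a minimal sketch. The function name `impute_median_mode`, the toy `orders` frame, and its column names are hypothetical and not taken from the project; the actual pipeline may have applied different rules per column.

```python
import numpy as np
import pandas as pd


def impute_median_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median and other gaps with the column mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to restore in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                # mode() can return ties; take the first value for determinism
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Hypothetical toy data; column names are illustrative only.
orders = pd.DataFrame({
    "order_price": [120.0, np.nan, 80.0, 100.0],
    "season": ["Winter", "Summer", None, "Winter"],
})
cleaned = impute_median_mode(orders)
```

The median is used for numeric columns because it is less sensitive to extreme values than the mean, which matters when the features are skewed.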

Challenges

›Heterogeneous schemas across customer, warehouse, pricing, delivery, and transaction tables.
›Missing values requiring statistical, geospatial, and sentiment-based imputation.
›Highly skewed numeric features hurting model performance.
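To make the skewness challenge concrete, the sketch below measures skewness before and after a log transform on synthetic data. The choice of `np.log1p`, the helper name `skew_reduction`, and the `delivery_charges` series are assumptions for illustration; the project does not state which transformations it used.

```python
import numpy as np
import pandas as pd


def skew_reduction(values: pd.Series) -> tuple:
    """Return skewness before and after log1p, plus the fractional drop in |skew|."""
    before = float(values.skew())
    # log1p assumes non-negative inputs; it compresses the long right tail
    after = float(np.log1p(values).skew())
    reduction = 1.0 - abs(after) / abs(before)
    return before, after, reduction


# Synthetic right-skewed data standing in for a real feature.
rng = np.random.default_rng(0)
delivery_charges = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))
before, after, reduction = skew_reduction(delivery_charges)
print(f"skew before={before:.2f}, after={after:.2f}, reduction={reduction:.0%}")
```

Comparing the absolute skewness before and after is one simple way to quantify how much a transformation helped for a given feature.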

Key features

›Cleaned and validated 10,000+ retail transaction records across customer, warehouse, pricing, delivery, and transaction datasets.
›Restored 1,000+ incomplete records using median, mode, geospatial inference, sentiment analysis, and rule-based reconstruction.
›Applied VADER sentiment analysis to classify customer satisfaction from unstructured review text.
›Developed regression-based outlier detection and predictive modelling using Scikit-learn and residual analysis.
›Achieved above 97% model accuracy and reduced feature skewness by up to 80% through statistical transformations.
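One way regression-based outlier detection of the kind listed above can work is to fit a model and flag rows whose residuals are unusually large. The sketch below assumes a plain `LinearRegression`, a robust scale based on the median absolute deviation, and a threshold of 3; the function name `residual_outliers` and the synthetic data are hypothetical, and the project may have used a different model or cut-off.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def residual_outliers(X: np.ndarray, y: np.ndarray, z_thresh: float = 3.0) -> np.ndarray:
    """Flag rows whose residual is more than z_thresh robust deviations from the median residual."""
    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)
    median = np.median(residuals)
    mad = np.median(np.abs(residuals - median))
    scale = 1.4826 * mad  # rescales MAD to match the standard deviation under normality
    if scale == 0:
        return np.zeros(len(y), dtype=bool)  # no spread, so nothing can be flagged
    return np.abs(residuals - median) > z_thresh * scale


# Synthetic linear data with two injected anomalies (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(200, 1))
y = 2.5 * X[:, 0] + 10 + rng.normal(0, 1.0, size=200)
y[[5, 50]] += 40.0
mask = residual_outliers(X, y)
print("flagged rows:", np.flatnonzero(mask))
```

Flagged rows would normally be inspected rather than dropped automatically, since a large residual can indicate either a data error or a genuine but rare case.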

Results

›Predictive model achieving above 97% accuracy in project evaluation.
›Up to 80% reduction in feature skewness after transformations.
›Reusable wrangling and modelling templates for future retail datasets.

Above 97% model accuracy (project evaluation)

Skewness reduced by up to 80%

1,000+ records restored