Data Science, Data Analytics

DigiCO Data Wrangling & Predictive Analytics

End-to-end cleaning, NLP & predictive modelling on 10,000+ retail records

Built end-to-end data cleaning, preprocessing, sentiment analysis, outlier detection, and predictive modelling workflows across 10,000+ retail transaction records using Python, Pandas, NumPy, Scikit-learn, and NLP techniques.

Python, Pandas, NumPy, Scikit-learn, NLP, Data Cleaning, Predictive Modelling, Statistical Analysis

Problem

A retail dataset spanning customer, warehouse, pricing, delivery, and transaction tables was incomplete, noisy, and unfit for predictive modelling.

My role

Data Scientist: owned data cleaning, NLP sentiment analysis, outlier detection, and predictive model development.

Solution

Built a reproducible Pandas pipeline that validated and standardised each source, restored incomplete records via median/mode and geospatial inference, applied VADER sentiment analysis to free-text reviews, and trained Scikit-learn regression models with residual-based outlier detection and statistical transformations.
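
As a rough illustration of the residual-based outlier detection mentioned above, the sketch below fits a Scikit-learn linear regression and flags rows whose residual lies far from the rest. The function name, the 3-standard-deviation cutoff, and the synthetic data are assumptions made for this example; the project's actual features and thresholds are not stated here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def flag_residual_outliers(X, y, z_thresh=3.0):
    """Return a boolean mask of rows whose residual z-score exceeds z_thresh.

    Hypothetical helper: the cutoff of 3 standard deviations is an assumption.
    """
    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)
    # Standardise residuals so the cutoff is expressed in standard deviations.
    z = (residuals - residuals.mean()) / residuals.std(ddof=1)
    return np.abs(z) > z_thresh


# Synthetic stand-in for a numeric feature and target (not project data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 1))
y = 5.0 + 2.0 * X[:, 0] + rng.normal(0, 1.0, size=200)
y[10] += 50.0  # inject one gross error so there is something to detect

mask = flag_residual_outliers(X, y)
print("flagged rows:", np.flatnonzero(mask))
```

Flagged rows would normally be reviewed or corrected before refitting, since a large outlier also inflates the residual spread used for the cutoff.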

Challenges

  • Heterogeneous schemas across customer, warehouse, pricing, delivery, and transaction tables.
  • Missing values requiring statistical, geospatial, and sentiment-based imputation.
  • Highly skewed numeric features hurting model performance.
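
For the statistical part of the missing-value problem, a minimal Pandas sketch might look like the following: numeric columns are filled with the column median and all other columns with the most frequent value. The column names and sample rows are hypothetical, and the geospatial and sentiment-based rules are not covered by this example.

```python
import numpy as np
import pandas as pd


def impute_median_mode(df):
    """Fill numeric gaps with the median and other gaps with the mode.

    Hypothetical helper for illustration; returns a copy of the input.
    """
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode()
            # mode() is empty when every value is missing; leave such columns alone.
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Hypothetical order rows with one numeric and one categorical gap.
orders = pd.DataFrame({
    "order_price": [120.0, np.nan, 80.0, 100.0],
    "season": ["Winter", "Winter", None, "Summer"],
})
clean = impute_median_mode(orders)
print(clean)
```

The median is preferred over the mean here because it is less sensitive to the skewed numeric features listed above.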

Key features

  • Cleaned and validated 10,000+ retail transaction records across customer, warehouse, pricing, delivery, and transaction datasets.
  • Restored 1,000+ incomplete records using median, mode, geospatial inference, sentiment analysis, and rule-based reconstruction.
  • Applied VADER sentiment analysis to classify customer satisfaction from unstructured review text.
  • Developed regression-based outlier detection and predictive modelling using Scikit-learn and residual analysis.
  • Achieved above 97% model accuracy and reduced feature skewness by up to 80% through statistical transformations.
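
The sentiment step in the list above can be sketched as a mapping from VADER's compound score to a label. The sketch takes the score as input so it runs without the VADER package; in practice the score would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`. The cutoffs of 0.05 and -0.05 are the ones suggested in VADER's documentation, and whether the project used them is an assumption.

```python
def label_sentiment(compound, pos_cut=0.05, neg_cut=-0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    Hypothetical helper; the cutoffs follow VADER's documented convention
    and are an assumption about this project.
    """
    if compound >= pos_cut:
        return "positive"
    if compound <= neg_cut:
        return "negative"
    return "neutral"


# Illustrative compound scores (made up, not taken from the dataset).
for score in (0.62, 0.0, -0.41):
    print(score, label_sentiment(score))
```

A label like this could then stand in for a missing satisfaction flag when a review text is available.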

Results

  • Predictive model achieving above 97% accuracy in project evaluation.
  • Up to 80% reduction in feature skewness after transformations.
  • Reusable wrangling and modelling templates for future retail datasets.
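
One way a skewness reduction could be measured is sketched below. It assumes a `log1p` transform and defines the reduction as 1 - |skew after| / |skew before|; the source does not say which transformation or definition was used, so both are assumptions, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def skew_reduction(values):
    """Return (skew_before, skew_after, relative_reduction) for a log1p transform.

    Hypothetical helper; log1p is an assumed transform, not the project's stated one.
    """
    s = pd.Series(values, dtype=float)
    before = s.skew()
    after = np.log1p(s).skew()
    reduction = 1.0 - abs(after) / abs(before)
    return before, after, reduction


# Synthetic right-skewed feature standing in for a price column.
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=3.0, sigma=1.0, size=5000)

before, after, reduction = skew_reduction(prices)
print(f"skew before: {before:.2f}, after: {after:.2f}, reduction: {reduction:.0%}")
```

`log1p` is only defined for values above -1, so a feature with negative values would need a different transform, such as Yeo-Johnson.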

Above 97% model accuracy (project evaluation)

Skewness reduced by up to 80%

1,000+ records restored