Data Science, Data Analytics

Flickr Photo Dataset — Exploratory Data Analysis

EDA on 64,000+ photo records across XML & JSON sources

Performed end-to-end exploratory data analysis on a Flickr photo dataset of 64,000+ records, drawn from XML and JSON sources, to identify behavioural, geographic, and publishing-pattern insights.

Python, Pandas, EDA, XML, JSON, Data Cleaning, Data Visualisation, Statistical Analysis

Problem

A large, heterogeneous Flickr metadata corpus across XML and JSON needed unified parsing and exploratory analysis to surface meaningful behavioural and geographic patterns.

My role

Data Analyst — owned ingestion, parsing, cleaning, EDA, and visualisation.

Solution

Built a reproducible Pandas pipeline to parse and merge XML/JSON sources, then ran univariate, bivariate, and multivariate analyses with structured visualisations to summarise behavioural, geographic, and publishing-latency insights.
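
A minimal sketch of what the parse-and-merge step could look like. The field names (`id`, `photo_id`, `datetaken`, `dateupload`, `country`, `tags`), the sample records, and the assumption that both sources share a photo identifier are all hypothetical; the project's actual schemas are not reproduced here.

```python
import json
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical samples standing in for the real XML and JSON sources.
XML_SAMPLE = """<photos>
  <photo id="101" owner="u1" datetaken="2019-03-01 10:15:00" country="Australia"/>
  <photo id="102" owner="u2" datetaken="2019-03-02 08:00:00" country="Japan"/>
</photos>"""

JSON_SAMPLE = """[
  {"photo_id": "101", "dateupload": "2019-03-03 09:00:00", "tags": "beach sunset"},
  {"photo_id": "102", "dateupload": "2019-03-02 20:30:00", "tags": "temple"}
]"""


def parse_xml(text):
    """Turn each <photo> element's attributes into one DataFrame row."""
    root = ET.fromstring(text)
    rows = [dict(photo.attrib) for photo in root.iter("photo")]
    # Align the key name with the JSON source (assumed naming).
    return pd.DataFrame(rows).rename(columns={"id": "photo_id"})


def parse_json(text):
    """Load a JSON array of flat records into a DataFrame."""
    return pd.DataFrame(json.loads(text))


def build_frame(xml_text, json_text):
    """Merge both sources on the assumed shared key and parse the date columns."""
    merged = parse_xml(xml_text).merge(
        parse_json(json_text),
        on="photo_id",
        how="outer",              # keep photos that appear in only one source
        validate="one_to_one",    # raise if a photo ID is duplicated in either source
    )
    for col in ("datetaken", "dateupload"):
        # Unparseable timestamps become NaT instead of raising.
        merged[col] = pd.to_datetime(merged[col], errors="coerce")
    return merged


photos = build_frame(XML_SAMPLE, JSON_SAMPLE)
```

The outer join keeps records that exist in only one source, so coverage gaps show up as missing values during cleaning rather than being dropped silently.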

Challenges

  • Reconciling XML and JSON schemas into one analytical view.
  • Handling temporal, geographic, and tag dimensions consistently.
  • Translating multi-dimensional patterns into clear narratives.

Key features

  • Parsed and merged XML and JSON datasets into a single analysis-ready structure.
  • Conducted univariate, bivariate, and multivariate analysis across temporal, geographic, tag, and publishing-latency dimensions.
  • Analysed photo metadata across 20+ countries, with a state-level breakdown for Australia.
  • Identified behavioural insights around users in minor cities, tag distribution, and publication latency.
  • Generated downstream ML question ideas for geo-tag quality control, recommendation, and latency prediction.
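
As an illustration of the publishing-latency dimension listed above, the sketch below assumes latency means the gap between when a photo was taken and when it was uploaded. That definition, the column names, and the sample rows are assumptions for demonstration, not the project's actual data.

```python
import pandas as pd

# Hypothetical rows; the real dataset has 64,000+ records.
photos = pd.DataFrame(
    {
        "photo_id": ["101", "102", "103"],
        "country": ["Australia", "Australia", "Japan"],
        "datetaken": pd.to_datetime(
            ["2019-03-01 10:00:00", "2019-03-02 08:00:00", "2019-03-05 09:00:00"]
        ),
        "dateupload": pd.to_datetime(
            ["2019-03-01 12:00:00", "2019-03-03 08:00:00", "2019-03-05 15:00:00"]
        ),
    }
)


def add_latency_hours(df):
    """Return a copy with latency (upload time minus capture time) in hours."""
    out = df.copy()
    delta = out["dateupload"] - out["datetaken"]
    out["latency_hours"] = delta.dt.total_seconds() / 3600
    return out


def latency_by_group(df, group_col):
    """Bivariate view: latency summary statistics per category."""
    return df.groupby(group_col)["latency_hours"].agg(["count", "median", "max"])


with_latency = add_latency_hours(photos)
summary = latency_by_group(with_latency, "country")
print(summary)
```

The median is used here because upload delays tend to be heavily right-skewed, which makes the mean a poor summary; that is a general modelling choice rather than something taken from the original report.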

Results

  • Cohesive EDA report across 64,000+ records.
  • Documented behavioural insights and downstream ML question ideas.
  • Reusable parsing and EDA templates.

64,000+ records analysed

20+ countries covered

Multi-dimensional EDA report