
Homework 3 — Exploring & Visualising Data

Overview

In HW3 you will:

  • Perform exploratory data analysis (EDA) on three datasets.
  • Produce clear tables/plots and concise interpretations.
  • Practice data cleaning (types, missing values, normalisation).
  • Deliver a reproducible report.

Tools: Use anything you like (Python/R/SQL/editors). Your report must be readable on GitHub (README.md or .ipynb). Commit all code needed to reproduce outputs.

Data repo: https://github.com/su-mt4007/data
Files used here: IRIS.csv, artportalen.csv, stroke-data.csv, cell_phones_total.csv.

Report Requirements (markdown or notebook)

Keep it short and scannable:

  • Title & brief intro (datasets + goals)
  • Sections per task with headings
  • Methods (a few sentences + code)
  • Tables/figures with captions and readable labels
  • Notes on parsing/cleaning (e.g., separators, decimals, type fixes)
  • How to run (e.g., pip install -r requirements.txt → open notebook)

Avoid committing large raw data; give download instructions instead.

A. Exploratory Data Analysis

A1) IRIS

File: IRIS.csv (three species; sepal/petal lengths & widths)

  1. Distributions: produce a distribution figure (box/violin/histograms; boxplots are fine).

    • Compare distributions of sepal/petal features across species.
    • One or two concise conclusions.
  2. Pairs plot: produce a pairs plot (scatter matrix) coloured by species; a starter sketch follows this list.

    • Show relationships between sepal and petal dimensions, colour/shape by species.
    • Note 2–3 relationships (e.g., strongest separations, near-linear pairs).
    • Briefly state what the figure shows (2–4 sentences).
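If you work in Python, a minimal sketch along these lines could serve as a starting point. The path data/IRIS.csv and the species column name are assumptions; adjust them to match the actual file.

```python
import pandas as pd
import seaborn as sns

iris = pd.read_csv("data/IRIS.csv")  # assumed path; adjust as needed

# Distributions: one boxplot per numeric feature, grouped by species.
features = iris.select_dtypes("number").columns
long = iris.melt(id_vars="species", value_vars=features,
                 var_name="feature", value_name="cm")
g = sns.catplot(data=long, x="species", y="cm", col="feature",
                kind="box", col_wrap=2, sharey=False)
g.savefig("iris_distributions.png", dpi=150)

# Pairs plot: scatter matrix coloured by species.
p = sns.pairplot(iris, hue="species")
p.savefig("iris_pairs.png", dpi=150)
```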

Acceptance (IRIS): correct file load, labelled axes/legends, and concise interpretations for both sub-tasks.

A2) Birdwatching (Artportalen 2022, Royal National Park)

File: artportalen.csv (bird sightings; citizen-science)

  1. Familiarise & describe

    • Show the first rows and a short data dictionary (your inferred meaning of key columns).
  2. Most prevalent species

    • Define prevalent (e.g., most observations, unique days with sightings, or unique locations). State your choice.
    • Report the top N (N≥5) in a tidy table (a pandas sketch follows this list).
  3. Monthly distribution (top 3)

    • For your top 3 species, show a monthly time series or bar chart (one figure or small multiples).
    • One sentence per species describing the pattern.
  4. Rarest species

    • Define rarest (consistent with your prevalence definition); show a small table (e.g., 5–10 entries).
    • Brief note on potential reporting bias (citizen science).
  5. Your own questions (≥3)

    • Pose three non-trivial questions and answer them with brief text + a supporting table/plot each.
    • Each question should surface a distinct property of the data (seasonality, spatiality, observer effects, etc.).
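A possible pandas starting point for the prevalence table and the monthly pattern, assuming Artportalen-style column names (Artnamn for species, Startdatum for the observation date); verify these against the real headers before relying on them.

```python
import pandas as pd
import matplotlib.pyplot as plt

obs = pd.read_csv("data/artportalen.csv")  # assumed path; adjust as needed
obs["date"] = pd.to_datetime(obs["Startdatum"], errors="coerce")

# Prevalence defined here as the number of observation rows per species;
# state whichever definition you actually use.
top = obs["Artnamn"].value_counts().head(5)
print(top.to_frame("observations"))

# Monthly counts for the three most prevalent species, as small multiples.
top3 = list(top.index[:3])
monthly = (obs[obs["Artnamn"].isin(top3)]
           .groupby([obs["date"].dt.month.rename("month"), "Artnamn"])
           .size()
           .unstack("Artnamn", fill_value=0))
monthly.plot(subplots=True, figsize=(8, 6), xlabel="month")
plt.savefig("monthly_top3.png", dpi=150, bbox_inches="tight")
```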

Acceptance (Birdwatching): explicit prevalence/rarity definitions, at least one figure for monthly patterns, three self-posed Q&As with evidence.

A3) Predicting Strokes (EDA only)

File: stroke-data.csv (individual attributes + stroke outcome)

  • Produce one to three well-labelled plots that explore relationships between selected features and the stroke outcome (a starter sketch follows this list).
  • For at least one categorical feature, compare stroke rates across groups (e.g., bar with percentages).
  • For at least one numeric feature, compare distributions by outcome (e.g., KDE/box/ECDF).
  • Write clear, non-causal conclusions (descriptive, not predictive; note potential confounders and class imbalance).
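One way to compute group rates and compare a numeric distribution by outcome, assuming columns named stroke, hypertension, and age (hypothetical names; substitute whatever stroke-data.csv actually contains):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data/stroke-data.csv")  # assumed path; adjust as needed

# Categorical feature: stroke rate per group (percentages, not raw counts,
# so unequal group sizes don't distort the comparison).
rates = df.groupby("hypertension")["stroke"].mean().mul(100)
rates.plot(kind="bar", xlabel="hypertension", ylabel="stroke rate (%)")
plt.savefig("stroke_rate_by_group.png", dpi=150, bbox_inches="tight")

# Numeric feature: age distribution by outcome. common_norm=False scales
# each class separately, which matters given the class imbalance.
plt.figure()
sns.kdeplot(data=df, x="age", hue="stroke", common_norm=False)
plt.savefig("age_by_outcome.png", dpi=150, bbox_inches="tight")
```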

Acceptance (Strokes): ≥2 plots (categorical + numeric), group rates correctly computed, 2–4 sentences of careful interpretation.

B. Data Preparation

B1) Cleaning cell_phones_total.csv

Dataset: total mobile phones per country (1960–2019). Values may be strings with suffixes k=1e3, M=1e6, B=1e9, mixed decimal signs, and missing values.

Tasks

  1. Type normalisation: Convert all relevant value cells to numeric (a parsing sketch follows this list).

    • Parse suffixes k/M/B robustly (case-insensitive).
    • Handle decimal marks (. or ,) consistently.
  2. Missing values: Treat missing data appropriately.

    • Choose sensible imputation, or leave values as NaN if justified; explain briefly.
    • You do not need to treat all years/countries identically.
  3. Tidy output table: Produce a wide table with columns: iso-3, 2015, 2016, 2017, 2018, 2019.

    • All parsing is done; values are numeric or NaN.
    • Sort by the 2015 column descending (ties arbitrary).
    • Show the first ~10 rows in the report.
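A sketch of the suffix/decimal parsing, assuming the raw file is wide with an iso-3 country column and one column per year; the path and column names are assumptions to adapt to the real header.

```python
import numpy as np
import pandas as pd

raw = pd.read_csv("data/cell_phones_total.csv")  # assumed path

SUFFIX = {"k": 1e3, "m": 1e6, "b": 1e9}

def parse_value(x):
    """Turn strings like '12,5M' into floats (12_500_000.0); blanks -> NaN."""
    if pd.isna(x):
        return np.nan
    s = str(x).strip().replace(",", ".")  # unify the decimal mark
    if not s:
        return np.nan
    mult = SUFFIX.get(s[-1].lower())  # case-insensitive suffix lookup
    if mult is not None:
        return float(s[:-1]) * mult
    return float(s)

years = ["2015", "2016", "2017", "2018", "2019"]
tidy = raw.set_index("iso-3")[years].apply(lambda col: col.map(parse_value))
tidy = tidy.sort_values("2015", ascending=False)
print(tidy.head(10))
```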

Acceptance (Cleaning): demonstrated parsing of suffixes + decimals, documented missing-value choice, correctly typed final table.

Acceptance Criteria (overall)

  • Clarity: Report is organised with headings; figures/tables have titles, axes, legends.
  • Correctness: Code reads files correctly (separators/decimals), computations match definitions you stated.
  • Completeness: All A1–A3 and B1 deliverables present.
  • Reproducibility: Code and minimal environment notes included; random seeds set where relevant.

Submission

  1. Push your work to your username-hw-3 repository.

  2. Open an Issue titled HW3 – Submission (optional label: ready-for-grading). Include:

    • Link to your report (HW3/README.md or HW3/HW3.ipynb)
    • 2–3 lines summarising results (one per dataset is fine)
    • Notes on any parsing/cleaning decisions

Assignment deadline: Monday 23:59 (Europe/Stockholm)

Peer Review (after the deadline)

Comment under your partner’s HW3 – Submission Issue. Copy this checklist:

  • Coverage: Are all required tasks (A1–A3, B1) completed?
  • Definitions: Are prevalence/rarity and cleaning choices stated and applied consistently?
  • Figures/tables: Labels, readability, and match to claims?
  • Reproducibility: Can you see how to run the code and reproduce outputs?
  • One highlight & one suggestion: Specific and actionable.
  • Discussion: When cleaning the data for the "Birdwatching" task, has the author written concise code, or have they cleaned the data in multiple steps? Which do you prefer and why?

Peer-review deadline: Thursday 23:59 (Europe/Stockholm)

Grading

Per-homework scale U / G / VG based on:

  • Completeness (all tasks + submission + peer review)
  • Clarity (well-structured report; concise, precise writing)
  • Correctness & Reproducibility (parsing/cleaning done properly; code produces shown outputs)

Notes

  • Late submissions/reviews require an extra task and are graded Pass/Fail only (no VG).
  • Keep claims descriptive, not causal; cite any external sources or code you adapt.