Homework 3 — Exploring & Visualising Data
Overview
In HW3 you will:
- Perform exploratory data analysis (EDA) on three datasets.
- Produce clear tables/plots and concise interpretations.
- Practice data cleaning (types, missing values, normalisation).
- Deliver a reproducible report.
Tools: Use anything you like (Python/R/SQL/editors). Your report must be readable on GitHub (README.md or .ipynb). Commit all code needed to reproduce outputs.
Data repo: https://github.com/su-mt4007/data
Files used here: IRIS.csv, artportalen.csv, stroke-data.csv, cell_phones_total.csv.
Report Requirements (markdown or notebook)
Keep it short and scannable:
- Title & brief intro (datasets + goals)
- Sections per task with headings
- Methods (a few sentences + code)
- Tables/figures with captions and readable labels
- Notes on parsing/cleaning (e.g., separators, decimals, type fixes)
- How to run (e.g., pip install -r requirements.txt → open notebook)
Avoid committing large raw data; give download instructions instead.
A. Exploratory Data Analysis
A1) IRIS
File: IRIS.csv (three species; sepal/petal lengths & widths)
Distributions (box/violin/histograms): Produce a distribution figure (boxplots are fine).
- Compare distributions of sepal/petal features across species.
- One or two concise conclusions.
Pairs plot: Produce a pairs plot (scatter matrix) with species colouring.
- Show relationships between sepal and petal dimensions, colour/shape by species.
- Note 2–3 relationships (e.g., strongest separations, near-linear pairs).
- Briefly state what the figure shows (2–4 sentences).
Acceptance (IRIS): correct file load, labelled axes/legends, three concise interpretations total (one per sub-task).
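As a numeric companion to the two figures, the sketch below computes the per-species quartiles that a boxplot draws and the feature correlations that a pairs plot makes visible. The column names (sepal_length, petal_length, species) and the six inline rows are made-up stand-ins, not the contents of IRIS.csv; check the real header before adapting it.

```python
import pandas as pd

# Hypothetical stand-in rows; replace with pd.read_csv("IRIS.csv") and
# adjust the column names to match the real header.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.5, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
})

features = ["sepal_length", "petal_length"]

# Quartiles per species: the same numbers a boxplot displays.
quartiles = df.groupby("species")[features].quantile([0.25, 0.5, 0.75])

# Pearson correlations between features: a quick check on near-linear pairs.
corr = df[features].corr()

print(quartiles)
print(corr.round(2))
```

For the figures themselves, pandas.plotting.scatter_matrix and seaborn.pairplot are two common choices; either works as long as species are distinguishable and axes are labelled.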
A2) Birdwatching (Artportalen 2022, Royal National Park)
File: artportalen.csv (bird sightings; citizen-science)
Familiarise & describe
- Show the first rows and a short data dictionary (your inferred meaning of key columns).
Most prevalent species
- Define prevalent (e.g., most observations, unique days with sightings, or unique locations). State your choice.
- Report the top N (N≥5) in a tidy table.
Monthly distribution (top 3)
- For your top 3 species, show a monthly time series or bar chart (one figure or small multiples).
- One sentence per species describing the pattern.
Rarest species
- Define rarest (consistent with your prevalence definition); show a small table (e.g., 5–10 entries).
- Brief note on potential reporting bias (citizen science).
Your own questions (≥3)
- Pose three non-trivial questions and answer them with brief text + a supporting table/plot each.
- Each question should surface a distinct property of the data (seasonality, spatiality, observer effects, etc.).
Acceptance (Birdwatching): explicit prevalence/rarity definitions, at least one figure for monthly patterns, three self-posed Q&As with evidence.
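The choice of prevalence definition can change the ranking, which is why it must be stated. The sketch below contrasts two of the definitions mentioned above and then tabulates monthly counts for the top species. The column names (species, date) and all rows are hypothetical; artportalen.csv may use different names, separators and date formats.

```python
import pandas as pd

# Hypothetical stand-in for artportalen.csv; inspect the real file first.
obs = pd.DataFrame({
    "species": ["Great tit"] * 5 + ["Mallard"] * 4 + ["Osprey"],
    "date": ["2022-01-05"] * 3 + ["2022-03-10"] * 2
            + ["2022-01-05", "2022-02-11", "2022-04-02", "2022-04-02"]
            + ["2022-06-20"],
})
obs["date"] = pd.to_datetime(obs["date"])

# Definition 1: number of observation rows per species.
by_rows = obs.groupby("species").size().sort_values(ascending=False)

# Definition 2: number of distinct days with at least one sighting.
by_days = obs.groupby("species")["date"].nunique().sort_values(ascending=False)

# Monthly counts for the top 3 species under definition 1
# (rows = month number, columns = species, missing combinations = 0).
top3 = by_rows.index[:3]
sub = obs[obs["species"].isin(top3)]
monthly = pd.crosstab(sub["date"].dt.month, sub["species"], rownames=["month"])

print(by_rows, by_days, monthly, sep="\n\n")
```

In this toy data the two definitions disagree on the top species (most rows versus most days), so the same reversal is worth checking for in the real file. Reusing the same definition, sorted ascending, gives a consistent notion of rarest.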
A3) Predicting Strokes (EDA only)
File: stroke-data.csv (individual attributes + stroke outcome)
- Produce two to three well-labelled plots that explore relationships between selected features and the stroke outcome.
- For at least one categorical feature, compare stroke rates across groups (e.g., bar with percentages).
- For at least one numeric feature, compare distributions by outcome (e.g., KDE/box/ECDF).
- Write clear, non-causal conclusions (descriptive, not predictive; note potential confounders and class imbalance).
Acceptance (Strokes): ≥2 plots (categorical + numeric), group rates correctly computed, 2–4 sentences of careful interpretation.
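One way to compute group rates correctly is to take the mean of the 0/1 outcome within each group and report the group size next to it. The sketch below assumes a categorical column named smoking_status and a binary column named stroke; both names and all rows are illustrative and may not match stroke-data.csv.

```python
import pandas as pd

# Hypothetical stand-in rows; column names in stroke-data.csv may differ.
df = pd.DataFrame({
    "smoking_status": ["smokes", "smokes", "never", "never", "never",
                       "formerly", "formerly", "never"],
    "stroke": [1, 0, 0, 0, 1, 0, 0, 0],
})

# Stroke rate per group = mean of the 0/1 outcome. Keep n alongside it,
# because a rate computed from a handful of rows is unreliable.
rates = (
    df.groupby("smoking_status")["stroke"]
    .agg(n="size", strokes="sum", rate="mean")
    .sort_values("rate", ascending=False)
)
rates["rate_pct"] = (100 * rates["rate"]).round(1)

# Overall base rate: useful context when the outcome is imbalanced.
base_rate = df["stroke"].mean()

print(rates)
print(f"overall stroke rate: {base_rate:.1%}")
```

Plotting rate_pct as a bar chart, with n in the labels or caption, keeps the comparison descriptive and makes small groups visible.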
B. Data Preparation
B1) Cleaning cell_phones_total.csv
Dataset: total mobile phones per country (1960–2019). Values may be strings with suffixes (k = 1e3, M = 1e6, B = 1e9) and mixed decimal marks, and some values are missing.
Tasks
Type normalisation: Convert all relevant value cells to numeric.
- Parse suffixes k/M/B robustly (case-insensitive).
- Handle decimal marks (. or ,) consistently.
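A minimal sketch of such a parser, using only the standard library, is shown below. The function name parse_count is made up for illustration, and the code assumes that a comma is always a decimal mark and never a thousands separator; verify that against the actual file before relying on it.

```python
import math
import re

# Multipliers for the suffixes named in the task (k=1e3, M=1e6, B=1e9).
_MULTIPLIERS = {"": 1.0, "k": 1e3, "m": 1e6, "b": 1e9}
_PATTERN = re.compile(r"^\s*([0-9]+(?:[.,][0-9]+)?)\s*([kmb]?)\s*$", re.IGNORECASE)


def parse_count(cell):
    """Convert a cell such as '1.2M' or '3,5k' to float; NaN if unparseable."""
    if cell is None:
        return math.nan
    if isinstance(cell, (int, float)):
        return float(cell)  # already numeric (this includes float NaN)
    match = _PATTERN.match(str(cell))
    if match is None:
        return math.nan
    number, suffix = match.groups()
    # Assumption: a comma is a decimal mark, never a thousands separator.
    value = float(number.replace(",", "."))
    return value * _MULTIPLIERS[suffix.lower()]


for raw in ["1.2M", "3,5k", "2b", "950", "", "n/a"]:
    print(repr(raw), "->", parse_count(raw))
```

Returning NaN for anything unparseable, rather than raising, keeps the conversion step separate from the missing-value decision; counting how many cells fall into that branch is a useful check to report.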
Missing values: Treat missing data appropriately.
- Choose sensible imputation or leave as NaN if justified; explain briefly.
- You do not need to treat all years/countries identically.
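As one possible choice (not the required one), the sketch below fills only interior gaps in each country's series by linear interpolation and leaves leading and trailing gaps as NaN. The country codes and values are invented, and the data is assumed to be numeric already.

```python
import pandas as pd

NAN = float("nan")

# Hypothetical, already-numeric values (rows = countries, columns = years).
wide = pd.DataFrame(
    {"2015": [10.0, NAN, 5.0],
     "2016": [NAN, NAN, 6.0],
     "2017": [14.0, 2.0, NAN]},
    index=["AAA", "BBB", "CCC"],
)

# Linear interpolation along each row, restricted to interior gaps:
# leading and trailing NaN are left untouched rather than extrapolated.
filled = wide.interpolate(method="linear", axis=1, limit_area="inside")

# Record which cells were imputed so the report can state it.
imputed = wide.isna() & filled.notna()

print(filled)
print(int(imputed.sum().sum()), "cell(s) imputed")
```

Interpolating only inside observed ranges avoids inventing values for years before a country had any recorded subscriptions; whether that is sensible for a given country is the kind of justification the report should state.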
Tidy output table: Produce a wide table with columns: iso-3, 2015, 2016, 2017, 2018, 2019.
- All parsing is done; values are numeric or NaN.
- Sort by the 2015 column descending (ties arbitrary).
- Show the first ~10 rows in the report.
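The final selection and sort could look like the sketch below. It assumes the cleaned frame already has an iso-3 column and year columns with string labels (as read from a CSV header); the rows are invented, and placing NaN last in the sort is a choice made here, not something the task prescribes.

```python
import pandas as pd

NAN = float("nan")

# Hypothetical cleaned data: an iso-3 code plus numeric year columns.
clean = pd.DataFrame({
    "iso-3": ["AAA", "BBB", "CCC", "DDD"],
    "2014": [1.0, 2.0, 3.0, 4.0],
    "2015": [5e6, NAN, 9e6, 7e6],
    "2016": [6e6, 1e3, 9.5e6, 7e6],
    "2017": [6e6, 2e3, 1e7, 8e6],
    "2018": [7e6, 3e3, 1e7, 8e6],
    "2019": [7e6, 4e3, 1.1e7, NAN],
})

columns = ["iso-3", "2015", "2016", "2017", "2018", "2019"]
tidy = (
    clean[columns]
    # Rows with NaN in 2015 are placed last (an explicit choice here).
    .sort_values("2015", ascending=False, na_position="last")
    .reset_index(drop=True)
)

# Every year column should be numeric after parsing.
assert all(pd.api.types.is_numeric_dtype(tidy[c]) for c in columns[1:])

print(tidy.head(10))
```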
Acceptance (Cleaning): demonstrated parsing of suffixes + decimals, documented missing-value choice, correctly typed final table.
Acceptance Criteria (overall)
- Clarity: Report is organised with headings; figures/tables have titles, axes, legends.
- Correctness: Code reads files correctly (separators/decimals); computations match the definitions you stated.
- Completeness: All A1–A3 and B1 deliverables present.
- Reproducibility: Code and minimal environment notes included; random seeds set where relevant.
Submission
Push your work to username-hw-3.
Open an Issue titled HW3 – Submission (optional label: ready-for-grading). Include:
- Link to your report (HW3/README.md or HW3/HW3.ipynb)
- 2–3 lines summarising results (one per dataset is fine)
- Notes on any parsing/cleaning decisions
Assignment deadline: Monday 23:59 (Europe/Stockholm)
Peer Review (after the deadline)
Comment under your partner’s HW3 – Submission Issue. Copy this checklist:
- Coverage: Are all required tasks (A1–A3, B1) completed?
- Definitions: Are prevalence/rarity and cleaning choices stated and applied consistently?
- Figures/tables: Labels, readability, and match to claims?
- Reproducibility: Can you see how to run the code and reproduce outputs?
- One highlight & one suggestion: Specific and actionable.
- Discussion: When cleaning the data for the "Birdwatching" task, did the author write one concise block of code, or clean the data in multiple steps? Which do you prefer and why?
Peer-review deadline: Thursday 23:59 (Europe/Stockholm)
Grading
Per-homework scale U / G / VG based on:
- Completeness (all tasks + submission + peer review)
- Clarity (well-structured report; concise, precise writing)
- Correctness & Reproducibility (parsing/cleaning done properly; code produces shown outputs)
Notes
- Late submissions/reviews require an extra task and are graded Pass/Fail only (no VG).
- Keep claims descriptive, not causal; cite any external sources or code you adapt.