Homework 4 — SQL, Regex & “Phones per Capita”

Overview

In HW4 you will:

  • Reshape & join multiple datasets to compute cell phones per capita.
  • Create a comparison table and a growth plot for top countries.
  • Write pure SQL queries against an SQLite database.
  • Write pure regular expressions to extract and match hashtags.

Tools: Use anything you like (Python/R/SQL/editors). Your report must be readable on GitHub (README.md or .ipynb) and include all code needed to reproduce results.

Data repo: https://github.com/su-mt4007/data Files used here: cell_phones_total.csv, pop_data.csv, country_data.csv, user_actions.db, comments.txt.

Report Requirements (Markdown or Notebook)

Keep it concise and reproducible:

  • Title & brief intro (what you computed and why)
  • Sections: Joining Data → SQL → Regex
  • Short methods descriptions with code blocks/cells
  • Figures/tables with captions and clear labels
  • Reproducibility notes (how to run; environment/requirements)

Part A — Joining Data (Phones per Capita)

Use your cleaned cell_phones_total.csv from HW3 as the starting point.

A1) Prepare & reshape

  1. Load population data: pop_data.csv. Handle missing values before joining (state your approach briefly).

  2. Reshape both datasets to tidy long form with one value per row:

    • cell_phones_long: columns iso-3, year, n_cellphones
    • population_long: columns iso-3, year, population
      (Hint: pivot/melt as needed; ensure numeric types.)
  3. Join on iso-3 + year to obtain:

    iso-3 | year | n_cellphones | population

    (Show the first ~10 rows.)
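A minimal sketch of the reshape-and-join step, run here on toy wide-format frames (in the homework you would load cell_phones_total.csv and pop_data.csv instead; all values and the exact column layout below are illustrative assumptions):

```python
import pandas as pd

# Toy wide-format data: countries as rows, years as columns.
phones_wide = pd.DataFrame(
    {"iso-3": ["SWE", "NOR"], "2018": [12_000_000, 5_500_000], "2019": [12_500_000, 5_600_000]}
)
pop_wide = pd.DataFrame(
    {"iso-3": ["SWE", "NOR"], "2018": [10_200_000, 5_300_000], "2019": [10_300_000, 5_350_000]}
)

# Wide -> tidy long: one value per row.
cell_phones_long = phones_wide.melt(id_vars="iso-3", var_name="year", value_name="n_cellphones")
population_long = pop_wide.melt(id_vars="iso-3", var_name="year", value_name="population")

# Ensure the join key has a consistent numeric type in both frames.
for df in (cell_phones_long, population_long):
    df["year"] = df["year"].astype(int)

# Join on iso-3 + year.
joined = cell_phones_long.merge(population_long, on=["iso-3", "year"])
print(joined.head(10))
```

With real data you would also decide how to treat rows that fail to match (e.g. `how="inner"` vs `how="left"` in `merge`) as part of your missing-value strategy.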

A2) Map country names & compute per-capita

  1. Load country_data.csv to map iso-3 → country_name.

  2. Add phones_per_capita = n_cellphones / population.

  3. Build a wide comparison table for 2015–2019 with rows = year, columns = top 5 countries by phones_per_capita in 2019 (like the example).

    • If your top-5 differ due to missing-data decisions, that’s fine—document your approach.
    • Show the full 2015–2019 block for those countries.
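A sketch of the per-capita and wide-table step, assuming a tidy frame like the one produced in A1 plus a name mapping (the data below is made up, and only a top-2 is taken since the toy frame has two countries; use `nlargest(5, ...)` and the 2015–2019 years on the real data):

```python
import pandas as pd

# Illustrative tidy join result and country-name mapping.
joined = pd.DataFrame({
    "iso-3": ["SWE", "SWE", "NOR", "NOR"],
    "year": [2018, 2019, 2018, 2019],
    "n_cellphones": [12_000_000, 12_500_000, 5_500_000, 5_600_000],
    "population": [10_200_000, 10_300_000, 5_300_000, 5_350_000],
})
names = pd.DataFrame({"iso-3": ["SWE", "NOR"], "country_name": ["Sweden", "Norway"]})

df = joined.merge(names, on="iso-3")
df["phones_per_capita"] = df["n_cellphones"] / df["population"]

# Rank countries by their 2019 per-capita value.
top = df[df["year"] == 2019].nlargest(2, "phones_per_capita")["country_name"]

# Wide comparison table: rows = year, columns = top countries.
wide = (df[df["country_name"].isin(top)]
        .pivot(index="year", columns="country_name", values="phones_per_capita"))
print(wide)
```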

A3) Growth over the last decade

Create a plot showing phones_per_capita trajectories for your top-5 countries over the last 10 available years (ending in 2019).

  • If some years are missing, choose the longest continuous window ≥ 8 years and state the window used.
  • In the caption, report each country’s relative change: $\text{growth} = \frac{\text{value}_{\text{end}}}{\text{value}_{\text{start}}} - 1$. (No need for CAGR unless you prefer it; be consistent.)
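The relative-change figure for the caption can be computed directly from each country's series; a small sketch on made-up numbers (drop missing years first so the window endpoints are real observations):

```python
import pandas as pd

# Illustrative phones-per-capita series for one country, indexed by year.
series = pd.Series(
    [0.95, 1.02, 1.10, 1.21],
    index=[2016, 2017, 2018, 2019],
    name="phones_per_capita",
)

# Restrict to observed years, then apply growth = value_end / value_start - 1.
window = series.dropna()
growth = window.iloc[-1] / window.iloc[0] - 1
print(f"{window.index[0]}-{window.index[-1]} relative change: {growth:+.1%}")
```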

Acceptance (Part A)

  • Clear long-format join table with correct keys and numeric columns.
  • Correct per-capita calculation; wide table for 2015–2019 using 2019 top-5.
  • Growth plot over the last decade (or stated window), with labelled axes/legend and reported relative change.
  • Brief, justified handling of missing values.

Part B — SQL (SQLite)

File: user_actions.db (SQLite). Table: user_actions. Inspect the schema first (e.g., PRAGMA table_info(user_actions); or SELECT * FROM user_actions LIMIT 5;).

Rule: Each task must be solved only with SQL (SQLite dialect). You may run queries via CLI/DB Browser/pandas—but the logic must be SQL.

  1. Sign-ups Retrieve the usernames of all users who performed the action "signup".

  2. Log counts per user For each user, return user_id, username, and count of log entries.

  3. Same-day login & signup Identify users who performed both login and signup on the same calendar day. Return user_id, username.

    • If timestamps include time-of-day, cast/truncate to date in SQL.
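One way to run the queries while keeping all the logic in SQL is via Python's sqlite3 module. The sketch below builds a tiny in-memory table to illustrate task 3; the column names (user_id, username, action, timestamp) are assumptions — verify them against the real schema with PRAGMA table_info:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # connect to "user_actions.db" for the real data

# Toy schema and rows, for illustration only.
con.executescript("""
    CREATE TABLE user_actions (user_id INTEGER, username TEXT,
                               action TEXT, timestamp TEXT);
    INSERT INTO user_actions VALUES
        (1, 'ada',  'signup', '2024-01-05 09:00:00'),
        (1, 'ada',  'login',  '2024-01-05 10:30:00'),
        (2, 'alan', 'signup', '2024-01-06 12:00:00'),
        (2, 'alan', 'login',  '2024-01-07 08:00:00');
""")

# Task 3 sketch: self-join, truncating timestamps to calendar days with date().
same_day = con.execute("""
    SELECT DISTINCT a.user_id, a.username
    FROM user_actions AS a
    JOIN user_actions AS b
      ON a.user_id = b.user_id
     AND date(a.timestamp) = date(b.timestamp)
    WHERE a.action = 'login' AND b.action = 'signup';
""").fetchall()
print(same_day)
```

Here only 'ada' qualifies: her login and signup fall on the same date, while 'alan' logged in a day after signing up. Note the DISTINCT, which prevents duplicate rows when a user has several same-day logins.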

What to include in the report

  • The exact SQL for each task.
  • A small result preview (e.g., first ~10 rows) or a screenshot (optional).

Acceptance (Part B)

  • Queries are valid SQLite and run as-is.
  • Results match the stated criteria (distinct users where appropriate; correct grouping; correct same-day logic).

Part C — Regex (hashtags)

File: comments.txt — one user comment per line. Assume hashtags have the form #tag where a tag is letters/numbers/underscore.

Rule: Solve using only regular expressions (e.g., Python re or your tool’s regex engine). No tokenisers or extra parsing logic.

  1. Extract hashtags from a single comment Write a regex that returns all hashtags from a given comment. Example goal: for “Comment 1” it should return ["#programming", "#tips"].

  2. Comments mentioning both #programming and #python Write a regex that matches whole comments that contain both hashtags (in any order, anywhere in the line).

  3. (Optional) Unique hashtags in the whole file Using your extraction regex, collect unique hashtags across all lines and output them (sorted is a plus).

Hints

  • A robust hashtag token often looks like: #[A-Za-z0-9_]+
  • For task 2, consider lookaheads to require both tags: e.g., (?=.*#programming\b)(?=.*#python\b).* (the \b stops #python from also matching inside a longer tag such as #pythonista).

Acceptance (Part C)

  • Provide the regex patterns (not just code calling them) and a short note on flags used (e.g., multiline).
  • Show example matches for at least one comment; list of unique tags for the optional task if attempted.

Submission

  1. Push your work to username-hw-4.

  2. Open an Issue titled HW4 – Submission (optional label: ready-for-grading). Include:

    • Link to your report (HW4/README.md or HW4/HW4.ipynb)
    • 2–4 lines summarising your results (table + plot + query/regex notes)
    • Any notes on missing-data handling and date parsing

Assignment deadline: Monday 23:59 (Europe/Stockholm)

Peer Review (after the deadline)

Comment under your partner’s HW4 – Submission Issue. Copy this checklist:

  • Coverage: Part A (join, per-capita, table, plot), Part B (3 SQL queries), Part C (regex tasks)
  • Correctness: Joins/keys, per-capita formula, SQL grouping and date logic, regex behaviour
  • Clarity: Well-labelled table/figure; readable queries/regex; concise explanations
  • Reproducibility: Code and minimal environment notes included
  • One highlight & one suggestion: Specific and actionable
  • Extra question: Can the regex handle all cases? Can you come up with a (possibly malformed) example where the regex produces an unexpected result?

Peer-review deadline: Thursday 23:59 (Europe/Stockholm)

Grading

Per-homework scale U / G / VG based on:

  • Completeness (all tasks + submission + peer review)
  • Clarity (clean structure, labels, brief justifications)
  • Correctness & Reproducibility (joins, calculations, SQL/regex logic; code runs or is clearly explained)

Notes

  • Late submissions/reviews require an extra task and are graded Pass/Fail only (no VG).
  • Document any deviations (e.g., shorter year window due to missing values).