Appearance
Homework 4 — SQL, Regex & “Phones per Capita”
Overview
In HW4 you will:
- Reshape & join multiple datasets to compute cell phones per capita.
- Create a comparison table and a growth plot for top countries.
- Write pure SQL queries against an SQLite database.
- Write pure regular expressions to extract and match hashtags.
Tools: Use anything you like (Python/R/SQL/editors). Your report must be readable on GitHub (README.md or .ipynb) and include all code needed to reproduce results.
Data repo: https://github.com/su-mt4007/data Files used here: cell_phones_total.csv, pop_data.csv, country_data.csv, user_actions.db, comments.txt.
Report Requirements (Markdown or Notebook)
Keep it concise and reproducible:
- Title & brief intro (what you computed and why)
- Sections: Joining Data → SQL → Regex
- Short methods descriptions with code blocks/cells
- Figures/tables with captions and clear labels
- Reproducibility notes (how to run; environment/requirements)
Part A — Joining Data (Phones per Capita)
Use your cleaned
cell_phones_total.csvfrom HW3 as the starting point.
A1) Prepare & reshape
Load population data:
pop_data.csv. Handle missing values before joining (state your approach briefly).Reshape both datasets to tidy long form with one value per row:
cell_phones_long: columnsiso-3,year,n_cellphonespopulation_long: columnsiso-3,year,population(Hint: pivot/melt as needed; ensure numeric types.)
Join on
iso-3+yearto obtain:iso-3 | year | n_cellphones | population(Show the first ~10 rows.)
A2) Map country names & compute per-capita
Load
country_data.csvto mapiso-3→country_name.Add
phones_per_capita = n_cellphones / population.Build a wide comparison table for 2015–2019 with rows =
year, columns = top 5 countries by phones_per_capita in 2019 (like the example).- If your top-5 differ due to missing-data decisions, that’s fine—document your approach.
- Show the full 2015–2019 block for those countries.
A3) Growth over the last decade
Create a plot showing phones_per_capita trajectories for your top-5 countries over the last 10 available years (ending in 2019).
- If some years are missing, choose the longest continuous window ≥ 8 years and state the window used.
- In the caption, report each country’s relative change: [ \text{growth} = \frac{\text{value}{end}}{\text{value}{start}} - 1 ] (No need for CAGR unless you prefer it—be consistent.)
Acceptance (Part A)
- Clear long-format join table with correct keys and numeric columns.
- Correct per-capita calculation; wide table for 2015–2019 using 2019 top-5.
- Growth plot over the last decade (or stated window), with labelled axes/legend and reported relative change.
- Brief, justified handling of missing values.
Part B — SQL (SQLite)
File: user_actions.db (SQLite). Table: user_actions. Inspect the schema first (e.g., PRAGMA table_info(user_actions);, SELECT * LIMIT 5;).
Rule: Each task must be solved only with SQL (SQLite dialect). You may run queries via CLI/DB Browser/pandas—but the logic must be SQL.
Sign-ups Retrieve the usernames of all users who performed the action
"signup".Log counts per user For each user, return user_id, username, and count of log entries.
Same-day login & signup Identify users who performed both
loginandsignupon the same calendar day. Return user_id, username.- If timestamps include time-of-day, cast/truncate to date in SQL.
What to include in the report
- The exact SQL for each task.
- A small result preview (e.g., first ~10 rows) or a screenshot (optional).
Acceptance (Part B)
- Queries are valid SQLite and run as-is.
- Results match the stated criteria (distinct users where appropriate; correct grouping; correct same-day logic).
Part C — Regex (hashtags)
File: comments.txt — one user comment per line. Assume hashtags have the form #tag where a tag is letters/numbers/underscore.
Rule: Solve using only regular expressions (e.g., Python
reor your tool’s regex engine). No tokenisers or extra parsing logic.
Extract hashtags from a single comment Write a regex that returns all hashtags from a given comment. Example goal: for “Comment 1” it should return
["#programming", "#tips"].Comments mentioning both
#programmingand#pythonWrite a regex that matches whole comments that contain both hashtags (in any order, anywhere in the line).(Optional) Unique hashtags in the whole file
(*)Using your extraction regex, collect unique hashtags across all lines and output them (sorted is a plus).
Hints
- A robust hashtag token often looks like:
#[A-Za-z0-9_]+ - For task 2, consider lookaheads to require both tags: e.g.,
(?=.*#programming)(?=.*#python).*
Acceptance (Part C)
- Provide the regex patterns (not just code calling them) and a short note on flags used (e.g., multiline).
- Show example matches for at least one comment; list of unique tags for the optional task if attempted.
Submission
Push your work to
username-hw-4.Open an Issue titled
HW4 – Submission(optional label:ready-for-grading). Include:- Link to your report (
HW4/README.mdorHW4/HW4.ipynb) - 2–4 lines summarising your results (table + plot + query/regex notes)
- Any notes on missing-data handling and date parsing
- Link to your report (
Assignment deadline: Monday 23:59 (Europe/Stockholm)
Peer Review (after the deadline)
Comment under your partner’s HW4 – Submission Issue. Copy this checklist:
- Coverage: Part A (join, per-capita, table, plot), Part B (3 SQL queries), Part C (regex tasks)
- Correctness: Joins/keys, per-capita formula, SQL grouping and date logic, regex behaviour
- Clarity: Well-labelled table/figure; readable queries/regex; concise explanations
- Reproducibility: Code and minimal environment notes included
- One highlight & one suggestion: Specific and actionable
- Extra question: Can the regex handle all cases? Can you come up with a (possibly malformed) example where the regex produces an unexpected result?
Peer-review deadline: Thursday 23:59 (Europe/Stockholm)
Grading
Per-homework scale U / G / VG based on:
- Completeness (all tasks + submission + peer review)
- Clarity (clean structure, labels, brief justifications)
- Correctness & Reproducibility (joins, calculations, SQL/regex logic; code runs or is clearly explained)
Notes
- Late submissions/reviews require an extra task and are graded Pass/Fail only (no VG).
- Document any deviations (e.g., shorter year window due to missing values).