# Homework 5 — Data from the Web (REST & Scraping)

## Overview
In HW5 you will:
- Fetch JSON from a public REST API (Nobel Prize).
- Parse nested JSON and build a word-cloud from prize motivations.
- Scrape a small website (Books to Scrape) across multiple pages and assemble a tidy table.
Tools: Use anything you like (Python/R/etc.). Your report must be readable on GitHub (README.md or .ipynb). Commit all code needed to reproduce results.
## Report Requirements (Markdown or Notebook)
Keep it concise and reproducible:
- Title & brief intro (what you fetched/scraped and why)
- Sections: REST API → Web Scraping
- Short methods with code blocks/cells
- Figures/tables with captions and clear labels
- Reproducibility notes (how to run; environment/requirements; where raw data is saved)
Avoid committing large binary files; you may cache raw JSON/HTML as small text files.
## Part A — REST API (Nobel Prize)
Goal: Fetch Physics prize data via the Nobel Prize API (v2.x), extract all motivations, and visualise word frequencies with a word cloud.
### A1) Fetch JSON
- Query the Nobel API for Nobel Prizes in Physics (all available years).
- Handle pagination if present (iterate until all pages are retrieved).
- Save the raw response(s) to disk (e.g., `data/nobel_physics.json`) for reproducibility.
Tip: When APIs return a top-level object with lists, capture the list that contains individual prizes/entries. Record any query parameters you used (category, year range, limit, offset, etc.).
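As a sketch, the fetch-and-cache step might look like the following. It assumes the v2.x endpoint and the `nobelPrizeCategory`/`limit`/`offset` query parameters from the Nobel API documentation — verify both against the current docs before relying on them:

```python
import json
import time
import urllib.parse
import urllib.request
from pathlib import Path

API = "https://api.nobelprize.org/2.1/nobelPrizes"  # v2.x endpoint (check docs)

def fetch_all_physics(limit=25):
    """Page through the API until a short page signals the last batch."""
    prizes, offset = [], 0
    while True:
        query = urllib.parse.urlencode(
            {"nobelPrizeCategory": "phy", "limit": limit, "offset": offset}
        )
        with urllib.request.urlopen(f"{API}?{query}", timeout=30) as resp:
            batch = json.load(resp).get("nobelPrizes", [])
        prizes.extend(batch)
        if len(batch) < limit:   # fewer items than requested: no more pages
            return prizes
        offset += limit
        time.sleep(0.5)          # be gentle with the API

if __name__ == "__main__":
    prizes = fetch_all_physics()
    Path("data").mkdir(exist_ok=True)
    Path("data/nobel_physics.json").write_text(json.dumps(prizes, indent=2))
    print(f"saved {len(prizes)} prize records")
```

Caching the merged list (rather than each raw page) keeps the repository tidy; either is acceptable as long as the report says which you did.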
### A2) Parse & extract motivations
- From the JSON, extract every motivation string associated with the Physics prizes (include all laureates’ motivations).
- Clean the text (lowercase; strip punctuation/whitespace; remove stopwords; stemming/lemmatisation is optional).
- Keep a short list of domain stopwords (e.g., “nobel”, “prize”, “physics”, “prizes”, “laureate”, “motivation”) so the cloud isn’t dominated by boilerplate.
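A minimal extraction/cleaning sketch, assuming the v2-style JSON shape in which each prize carries a `laureates` list and each laureate a language-keyed `motivation` dict — check the field names against your cached JSON, and extend both stopword sets as needed:

```python
import re

STOPWORDS = {"the", "of", "and", "for", "in", "a", "to", "his", "her",
             "their", "on", "with", "by", "which", "as"}
DOMAIN_STOPWORDS = {"nobel", "prize", "prizes", "physics", "laureate", "motivation"}

def extract_motivations(prizes):
    """Collect every laureate-level English motivation string."""
    out = []
    for prize in prizes:
        for laureate in prize.get("laureates", []):
            text = laureate.get("motivation", {}).get("en")
            if text:
                out.append(text)
    return out

def clean(text):
    """Lowercase, keep alphabetic tokens only, drop generic and domain stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS | DOMAIN_STOPWORDS]
```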
### A3) Word cloud
- Generate and display a word cloud from the cleaned motivations.
- Include the top 20 terms with counts in a small table next to or below the cloud.
- Briefly interpret (2–4 sentences): What themes recur? Any surprises?
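Counting the top terms needs only the standard library; the cloud itself is typically drawn with the third-party `wordcloud` package (shown commented, since styling choices are yours):

```python
from collections import Counter

def top_terms(tokens, n=20):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)

# Cloud rendering (requires: pip install wordcloud matplotlib)
# from wordcloud import WordCloud
# import matplotlib.pyplot as plt
# wc = WordCloud(width=800, height=400, background_color="white")
# wc.generate_from_frequencies(Counter(tokens))
# plt.imshow(wc, interpolation="bilinear")
# plt.axis("off")
# plt.savefig("wordcloud.png", dpi=150)
```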
### Acceptance (Part A)
- API call(s) shown with parameters; pagination handled if needed.
- Motivations extracted for all Physics entries; cleaning steps stated.
- Word cloud + top-terms table included, with a short interpretation.
- Raw JSON cached locally and referenced in the report.
## Part B — Web Scraping (Books to Scrape)

Site: https://books.toscrape.com/

Scope: first three catalogue pages (pages 1–3) → 20 books per page → 60 rows total.
Target table (exact columns):
`upc` | `title` | `price` | `rating`

Definitions:
- upc: Product page UPC string.
- title: Book title.
- price: Price string as shown (e.g., “£51.77”).
- rating: Star rating as a word (e.g., “One”, “Two”, … “Five”).
### B1) Strategy

- Start from catalogue page 1 and follow the pagination links to pages 2 and 3.
- For each book, follow the link to its detail page to extract UPC (and confirm price/rating if needed).
- Assemble a single DataFrame/table with exactly 60 rows and the four columns above.
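The catalogue pages appear to follow a `catalogue/page-N.html` pattern (verify in your browser), and book links on them are relative, so a small stdlib helper keeps URL handling in one place:

```python
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/"

def catalogue_url(page):
    """URL of catalogue page `page` (pattern observed on the site)."""
    return urljoin(BASE, f"catalogue/page-{page}.html")

def detail_url(catalogue_page_url, relative_href):
    """Resolve a book's relative link against the catalogue page it came from."""
    return urljoin(catalogue_page_url, relative_href)
```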
### B2) Robustness & etiquette
- Set a custom User-Agent header.
- Add a small delay between requests (e.g., 0.5–1.0s).
- Handle unexpected HTML gracefully (missing fields → skip or record `NA` with a note).
- If the site uses relative links, resolve them to absolute URLs safely.
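These points can be folded into one polite fetch helper — a sketch using `requests` (any HTTP client works; the User-Agent string below is just an example placeholder):

```python
import time

import requests

HEADERS = {"User-Agent": "hw5-scraper (student exercise)"}  # example value

def polite_get(session, url, delay=0.7):
    """Fetch a page with a custom User-Agent and a fixed delay between requests."""
    time.sleep(delay)            # throttle before every request
    resp = session.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()      # fail loudly on HTTP errors
    return resp.text

# usage sketch:
# with requests.Session() as s:
#     html = polite_get(s, "https://books.toscrape.com/catalogue/page-1.html")
```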
### B3) Deliverable
- Show the first 5 rows as a preview and the overall row count (should be 60).
- Save the final table to `data/books_page1-3.csv` (or similar).
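Assuming the scraper collects rows as dicts, the preview and CSV export are short with pandas (the row below is a hypothetical stand-in; your scraper supplies the real 60):

```python
from pathlib import Path

import pandas as pd

# stand-in row for illustration; the scraper should produce 60 of these
rows = [
    {"upc": "0000000000000001", "title": "Example Book",
     "price": "£51.77", "rating": "Three"},
]

df = pd.DataFrame(rows, columns=["upc", "title", "price", "rating"])
print(df.head())        # preview of the first rows
print(len(df))          # should print 60 for the full scrape
Path("data").mkdir(exist_ok=True)
df.to_csv("data/books_page1-3.csv", index=False)
```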
### Acceptance (Part B)
- Exactly 60 rows with the specified columns and non-empty UPCs.
- Clear method (pagination, per-book detail fetch, delays).
- Clean, readable code; a brief note on any anomalies encountered.
## Submission

Push your work to `username-homework-5`. Open an Issue titled `HW5 – Submission` (optional label: `ready-for-grading`). Include:

- Link to your report (`HW5/README.md` or `HW5/HW5.ipynb`).
- 2–4 lines summarising results (API + cloud; scraper table size).
- Any notes on rate limiting, pagination, or HTML quirks.
Assignment deadline: Tuesday 23:59 (Europe/Stockholm)
## Peer Review (after the deadline)

Comment under your partner’s `HW5 – Submission` Issue. Copy this checklist:
- Coverage: API fetch + motivations + word cloud; scraper with 60 rows and required columns
- Reproducibility: Raw JSON cached; scraper code deterministic with delays; environment notes included
- Clarity: Report structure, labelled figure/table, concise explanations
- Correctness: All Physics motivations included; UPC/price/rating correctly captured
- One highlight & one suggestion: Specific and actionable
Peer-review deadline: Thursday 23:59 (Europe/Stockholm)
## Grading
Per-homework scale U / G / VG based on:
- Completeness (all tasks + submission + peer review)
- Clarity (clean structure; readable outputs; brief, precise writing)
- Correctness & Reproducibility (API handling, parsing/cleaning, scraper reliability; code runs or is clearly explained)
## Notes
- Late submissions/reviews require an extra task and are graded Pass/Fail only (no VG).
- Ethics: Scrape politely (headers, delays) and stay within the stated page scope.
- Citations: If you adapt code (e.g., for word clouds), cite your source briefly in the report.