Homework 5 — Data from the Web (REST & Scraping)

Overview

In HW5 you will:

  • Fetch JSON from a public REST API (Nobel Prize).
  • Parse nested JSON and build a word-cloud from prize motivations.
  • Scrape a small website (Books to Scrape) across multiple pages and assemble a tidy table.

Tools: Use anything you like (Python/R/etc.). Your report must be readable on GitHub (README.md or .ipynb). Commit all code needed to reproduce results.

Report Requirements (Markdown or Notebook)

Keep it concise and reproducible:

  • Title & brief intro (what you fetched/scraped and why)
  • Sections: REST API → Web Scraping
  • Short methods with code blocks/cells
  • Figures/tables with captions and clear labels
  • Reproducibility notes (how to run; environment/requirements; where raw data is saved)

Avoid committing large binary files; you may cache raw JSON/HTML as small text files.

Part A — REST API (Nobel Prize)

Goal: Fetch Physics prize data via the Nobel Prize API (v2.x), extract all motivations, and visualise word frequencies with a word cloud.

A1) Fetch JSON

  • Query the Nobel API for Nobel Prizes in Physics (all available years).
  • Handle pagination if present (iterate until all pages are retrieved).
  • Save the raw response(s) to disk (e.g., data/nobel_physics.json) for reproducibility.

Tip: When APIs return a top-level object with lists, capture the list that contains individual prizes/entries. Record any query parameters you used (category, year range, limit, offset, etc.).
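The offset/limit loop above can be sketched as follows. This is a minimal example, not a reference implementation: the endpoint path, the `nobelPrizeCategory`/`limit`/`offset` parameter names, and the top-level `nobelPrizes` key are assumptions based on the v2 API — verify them against the official documentation before relying on them.

```python
import json
import urllib.request

# Assumed v2 endpoint and parameter names -- check the Nobel API docs.
BASE = "https://api.nobelprize.org/2.1/nobelPrizes"

def fetch_page(offset, limit=25):
    """Fetch one page of Physics prizes as parsed JSON."""
    url = f"{BASE}?nobelPrizeCategory=phy&limit={limit}&offset={offset}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def collect_all(get_page, limit=25):
    """Generic offset/limit pagination: keep requesting pages until a
    short (or empty) page signals that we've reached the end."""
    items, offset = [], 0
    while True:
        page = get_page(offset, limit)
        batch = page.get("nobelPrizes", [])  # list key assumed from the response shape
        items.extend(batch)
        if len(batch) < limit:
            return items
        offset += limit
```

Passing the page-fetcher in as an argument (`collect_all(fetch_page)`) keeps the pagination logic testable without network access.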

A2) Parse & extract motivations

  • From the JSON, extract every motivation string associated with the Physics prizes (include all laureates’ motivations).
  • Clean the text (lowercase, strip punctuation/whitespace, remove stopwords); stemming/lemmatisation is optional.
  • Keep a short list of domain stopwords (e.g., “nobel”, “prize”, “physics”, “prizes”, “laureate”, “motivation”) so the cloud isn’t dominated by boilerplate.
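A cleaning step along these lines is enough for the word cloud; the stopword list below is illustrative (a handful of common English words plus the domain stopwords suggested above), and you would normally extend it or use a library list.

```python
import re

# Illustrative stopword list: a few common English words plus the
# domain stopwords suggested in the assignment text.
STOPWORDS = {
    "the", "of", "and", "for", "in", "his", "her", "their", "to",
    "a", "an", "on", "by", "with",
    "nobel", "prize", "prizes", "physics", "laureate", "motivation",
}

def tokenize(motivation):
    """Lowercase, keep only alphabetic runs, and drop stopwords."""
    words = re.findall(r"[a-z]+", motivation.lower())
    return [w for w in words if w not in STOPWORDS]
```

For example, `tokenize("for the discovery of the law of the photoelectric effect")` keeps only the content words `discovery`, `law`, `photoelectric`, `effect`.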

A3) Word cloud

  • Generate and display a word cloud from the cleaned motivations.
  • Include the top 20 terms with counts in a small table next to or below the cloud.
  • Briefly interpret (2–4 sentences): What themes recur? Any surprises?
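The top-terms table is just a frequency count over the cleaned tokens; `collections.Counter` handles it directly. For the cloud itself, the third-party `wordcloud` package (`WordCloud().generate(" ".join(tokens))`) is one common choice, but any library you cite is fine.

```python
from collections import Counter

def top_terms(tokens, n=20):
    """Return the n most frequent (term, count) pairs from a token list."""
    return Counter(tokens).most_common(n)
```

The resulting list of pairs drops straight into a small Markdown table next to the cloud.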

Acceptance (Part A)

  • API call(s) shown with parameters; pagination handled if needed.
  • Motivations extracted for all Physics entries; cleaning steps stated.
  • Word cloud + top-terms table included, with a short interpretation.
  • Raw JSON cached locally and referenced in the report.

Part B — Web Scraping (Books to Scrape)

Site: https://books.toscrape.com/
Scope: First three catalogue pages (pages 1–3) → 20 books per page = 60 rows total.

Target table (exact columns):

upc | title | price | rating

Definitions

  • upc: Product page UPC string.
  • title: Book title.
  • price: Price string as shown (e.g., “£51.77”).
  • rating: Star rating as a word (e.g., “One”, “Two”, … “Five”).

B1) Strategy

  • Start from the catalogue page 1 and follow pagination to pages 2 and 3.
  • For each book, follow the link to its detail page to extract UPC (and confirm price/rating if needed).
  • Assemble a single DataFrame/table with exactly 60 rows and the four columns above.
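Two small helpers cover the fiddly parts of this strategy: building the three catalogue-page URLs, and turning the site's star-rating markup into the required word. The `catalogue/page-N.html` pattern is an assumption taken from the site's pagination links — confirm it in your browser; the `star-rating Three`-style CSS class is how the site encodes ratings.

```python
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/"

def catalogue_urls(pages=3):
    """URLs for the first `pages` catalogue pages. Page 1 is the site root;
    later pages follow the catalogue/page-N.html pattern (assumed from the
    site's pagination links -- verify before use)."""
    urls = [BASE]
    urls += [urljoin(BASE, f"catalogue/page-{i}.html") for i in range(2, pages + 1)]
    return urls

def rating_word(css_classes):
    """Extract the rating word from classes like ['star-rating', 'Three']."""
    return next(c for c in css_classes if c != "star-rating")
```

With BeautifulSoup, `rating_word(tag["class"])` on the `<p class="star-rating Three">` element yields the word for the `rating` column.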

B2) Robustness & etiquette

  • Set a custom User-Agent header.
  • Add a small delay between requests (e.g., 0.5–1.0s).
  • Handle unexpected HTML gracefully (missing fields → skip or record NA with a note).
  • If the site uses relative links, resolve to absolute URLs safely.
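The etiquette points above fit into a small wrapper: a custom User-Agent on every request, a pause after each fetch, and `urllib.parse.urljoin` for resolving the catalogue pages' relative detail links. The contact address in the header is a placeholder — put your own details there.

```python
import time
import urllib.request
from urllib.parse import urljoin

# Identify yourself; the contact address is a placeholder to replace.
HEADERS = {"User-Agent": "hw5-scraper/0.1 (student project; contact: you@example.org)"}

def polite_get(url, delay=0.8):
    """Fetch a URL with a custom User-Agent, then pause before returning
    so consecutive calls stay within the 0.5-1.0 s etiquette window."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8")
    time.sleep(delay)
    return html

# Relative detail links resolve against the page they appear on
# (the book slug below is a made-up example):
detail = urljoin("https://books.toscrape.com/catalogue/page-2.html",
                 "some-book_1/index.html")
```

`urljoin` handles both relative and already-absolute hrefs safely, so it is a better choice than string concatenation.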

B3) Deliverable

  • Show the first 5 rows as a preview and the overall row count (should be 60).
  • Save the final table to data/books_page1-3.csv (or similar).
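If you collect each book as a dict, writing the deliverable needs nothing beyond the standard library; a sketch with `csv.DictWriter` (pandas `to_csv` works just as well):

```python
import csv
from pathlib import Path

COLUMNS = ["upc", "title", "price", "rating"]

def save_table(rows, path="data/books_page1-3.csv"):
    """Write the scraped rows (dicts with the four required keys) to CSV,
    creating the data/ directory if it doesn't exist yet."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return out
```

A quick `len(rows) == 60` assertion before saving catches pagination or parsing slips early.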

Acceptance (Part B)

  • Exactly 60 rows with the specified columns and non-empty UPCs.
  • Clear method (pagination, per-book detail fetch, delays).
  • Clean, readable code; a brief note on any anomalies encountered.

Submission

  1. Push your work to username-homework-5.

  2. Open an Issue titled HW5 – Submission (optional label: ready-for-grading). Include:

    • Link to your report (HW5/README.md or HW5/HW5.ipynb).
    • 2–4 lines summarising results (API + cloud; scraper table size).
    • Any notes on rate limiting, pagination, or HTML quirks.

Assignment deadline: Tuesday 23:59 (Europe/Stockholm)

Peer Review (after the deadline)

Comment under your partner’s HW5 – Submission Issue. Copy this checklist:

  • Coverage: API fetch + motivations + word cloud; scraper with 60 rows and required columns
  • Reproducibility: Raw JSON cached; scraper code deterministic with delays; environment notes included
  • Clarity: Report structure, labelled figure/table, concise explanations
  • Correctness: All Physics motivations included; UPC/price/rating correctly captured
  • One highlight & one suggestion: Specific and actionable

Peer-review deadline: Thursday 23:59 (Europe/Stockholm)

Grading

Per-homework scale U / G / VG based on:

  • Completeness (all tasks + submission + peer review)
  • Clarity (clean structure; readable outputs; brief, precise writing)
  • Correctness & Reproducibility (API handling, parsing/cleaning, scraper reliability; code runs or is clearly explained)

Notes

  • Late submissions/reviews require an extra task and are graded Pass/Fail only (no VG).
  • Ethics: Scrape politely (headers, delays) and stay within the stated page scope.
  • Citations: If you adapt code (e.g., for word clouds), cite your source briefly in the report.