Appearance
Reproducible Analysis with GitHub, Jupyter, and Functional Programming
Who this is for. Anyone doing data analysis in Python or R who wants results to be easy to rerun, check, and share.
You will learn to
- set up an isolated environment for a project,
- write runnable notebooks with clear narrative,
- version your work with Git/GitHub,
- apply functional programming (FP) ideas for predictability.
Resources (Markdown, Git, setup)
Resources (functional programming)
This post is hands-on: follow along once, then use it as a reference.
Why reproducibility?
Reproducibility means someone else (or future-you) can run your analysis and get the same numbers. It builds trust, makes debugging easier, and lets others build on your work.
We’ll combine:
- Python or R for the analysis,
- Jupyter Lab for literate, executable documents,
- Git (local) + GitHub (remote) for versioning and collaboration.
Functional programming (FP) ties it together by encouraging pure, modular code with fewer surprises.
Quick admin
Please register your GitHub username on the Moodle course page.
No GitHub yet? Create an account → then register it on Moodle. We’ll review homework via GitHub.
Install the essentials
Git & GitHub Desktop
What & why
- Git tracks versions of your code and notebooks.
- GitHub hosts your repository online for backup, sharing, and feedback.
- GitHub Desktop is a simple GUI that bundles Git.
Install
- Download GitHub Desktop: https://desktop.github.com/
- Sign in with your GitHub account.
Tip: Learn by doing. We’ll create a repo and push within minutes.
Miniconda & Jupyter Lab
What & why
- (Mini)Conda manages packages and isolated environments (one clean setup per project).
- Jupyter Lab runs notebooks that mix code, text, and plots.
Install Miniconda
- Go to https://docs.conda.io/projects/miniconda/en/latest/
- Download for your OS and run the installer.
Create a course environment
Pick Python or R and run in your terminal:
conda config --add channels conda-forge && conda create --name py_env pandas numpy matplotlib jupyterlab -yconda config --add channels conda-forge && conda create --name renv r-essentials r-base=4.1.3 -yConda environments keep dependencies from clashing across projects.
Launch Jupyter Lab
Activate your environment and start Jupyter:
conda activate ENV_NAME
jupyter labUse py_env for Python or renv for R.
Markdown in notebooks
Use Markdown for narrative: explain goals, assumptions, and decisions near the code that implements them. Start with the “Basics” link above.
Git basics you’ll actually use
Docs & guides
Ignore noisy files
Create a .gitignore in the repo root (or via GitHub Desktop → Repository Settings → Ignored Files). At minimum:
.ipynb_checkpointsFork vs clone (which do I pick?)
- Fork: your own copy, detached from the original (great for starting from a template or contributing to open source).
- Clone: a local copy of a repo you already collaborate on (keeps the link to the original).
Branches (safe experiments)
A branch lets you try ideas without risking main.
In GitHub Desktop:
- New branch: Current Branch → New Branch…
- Publish to send it to GitHub.
- Merge: Current Branch → Choose a branch to merge into… → Create merge commit
- Delete: right-click the branch → Delete
Functional programming (FP) for predictable analyses
FP treats programs as pure functions over immutable data. The payoffs:
- Same input → same output (fewer “works on my machine” bugs)
- Easier testing and parallelization
- Clear, composable pipelines (
map/filter/reduce)
Core ideas at a glance
- Immutability: don’t change values in place; return new ones.
- Pure functions: output depends only on inputs; avoid side effects.
- Higher-order functions: functions that take/return functions.
- Composition: build bigger behavior by chaining small functions.
Examples (Python & R)
Pure vs. impure
Python
# Impure: mutates external state
total = 0
def add_to_total(value):
global total
total += value
# Pure: depends only on inputs
def add(a, b):
return a + bR
# Impure: mutates external state
total <- 0
add_to_total <- function(value) {
total <<- total + value
}
# Pure: depends only on inputs
add <- function(a, b) {
a + b
}Higher-order functions
Python
numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x * x, numbers)) # map
even_numbers = list(filter(lambda x: x % 2 == 0, numbers)) # filterR
library(purrr)
numbers <- c(1, 2, 3, 4, 5)
squared <- map(numbers, ~ .x * .x)
even_numbers <- keep(numbers, ~ .x %% 2 == 0)Immutability
Python
immutable_list = [1, 2, 3]
new_list = [*immutable_list, 4] # original unchangedR
immutable_vector <- c(1, 2, 3)
new_vector <- c(immutable_vector, 4) # original unchangedComposition
Python
def add_one(x): return x + 1
def square(x): return x * x
square_and_add_one = lambda x: add_one(square(x))
square_and_add_one(3) # 10R
add_one <- function(x) x + 1
square <- function(x) x * x
square_and_add_one <- purrr::compose(add_one, square)
square_and_add_one(3) # 10Map + reduce
Python
numbers = [1, 2, 3, 4, 5]
squared_and_sum = sum(map(lambda x: x * x, numbers)) # 55R
library(purrr)
numbers <- c(1, 2, 3, 4, 5)
squared_and_sum <- numbers %>% map_dbl(~ .x * .x) %>% reduce(`+`) # 55Quick win: first repo + first notebook (5 steps)
- Create environment:
py_envorrenv. - Start Jupyter:
jupyter lab, makeweek1.ipynb. - Write: one Markdown cell (“Goal”), one code cell that prints
2 + 2. - Init repo (GitHub Desktop → File → New repository…).
- Commit & publish: write a clear message (“Add week1 notebook”), then Publish repository.
From now on: small, frequent commits with clear messages beat giant, infrequent ones.
Common pitfalls (and fixes)
“My notebook runs differently each time.”
Use pure functions and fix random seeds; restart kernel and Run All before committing.“Conflicting packages.”
Use a fresh Conda environment per project; avoid installing globally.“Gigantic notebooks in Git.”
Keep outputs lightweight; avoid committing large data files—store data in/data/and add it to.gitignore(or use Git LFS if needed).
Mini-exercises
- Markdown warm-up: write a short intro cell with a heading, a list, and a link.
- Environment check: import
pandas/numpy(Python) ortidyverse(R) and print their versions. - Pure function: write a function that standardizes a numeric vector/array and returns the standardized copy (don’t mutate inputs).
- Git practice: make a branch
feat/add-standardize, add your function, commit twice (once for function, once for tests/usage), merge intomain, delete the branch.
Putting it all together (your workflow)
- Create a clean environment and launch Jupyter Lab.
- Start a GitHub repo (use GitHub Desktop).
- Write notebooks with clear Markdown and pure, testable functions.
- Commit often; use branches to experiment; maintain a
.gitignore. - Push to GitHub so your work is transparent and reproducible.
Your first task
Create your first notebook and push it to GitHub. See homework 1 for specifics.