
Homework 3 — Exploring & Visualising Data

Overview

In HW3 you will:

  • Perform exploratory data analysis (EDA) on three datasets.
  • Produce clear tables/plots and concise interpretations.
  • Practice data cleaning (types, missing values, normalisation).
  • Deliver a reproducible report.

Tools: Use anything you like (Python/R/SQL/editors). Your report must be readable on GitHub (README.md or .ipynb). Commit all code needed to reproduce outputs.

Data repo: https://github.com/su-mt4007/data
Files used here: IRIS.csv, artportalen.csv, stroke-data.csv, cell_phones_total.csv.

Report Requirements (markdown or notebook)

Keep it short and scannable:

  • Title & brief intro (datasets + goals)
  • Sections per task with headings
  • Methods (a few sentences + code)
  • Tables/figures with captions and readable labels
  • Notes on parsing/cleaning (e.g., separators, decimals, type fixes)
  • How to run (e.g., pip install -r requirements.txt → open notebook)

Avoid committing large raw data; give download instructions instead.

A. Exploratory Data Analysis

A1) IRIS

File: IRIS.csv (three species; sepal/petal lengths & widths)

  1. Distributions: produce a distribution figure (box/violin/histograms; boxplots are fine).

    • Compare distributions of sepal/petal features across species.
    • One or two concise conclusions.
  2. Pairs plot: produce a pairs plot (scatter matrix) coloured by species; a starter sketch follows this list.

    • Show relationships between sepal and petal dimensions, colour/shape by species.
    • Note 2–3 relationships (e.g., strongest separations, near-linear pairs).
    • Briefly state what the figure shows (2–4 sentences).
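If you work in Python, a minimal sketch along these lines could serve as a starting point. The path data/IRIS.csv and the species column name are assumptions; adjust them to match the actual file.

```python
import pandas as pd
import seaborn as sns

iris = pd.read_csv("data/IRIS.csv")  # assumed path; adjust as needed

# Distributions: one boxplot per numeric feature, grouped by species.
features = iris.select_dtypes("number").columns
long = iris.melt(id_vars="species", value_vars=features,
                 var_name="feature", value_name="cm")
g = sns.catplot(data=long, x="species", y="cm", col="feature",
                kind="box", col_wrap=2, sharey=False)
g.savefig("iris_distributions.png", dpi=150)

# Pairs plot: scatter matrix coloured by species.
p = sns.pairplot(iris, hue="species")
p.savefig("iris_pairs.png", dpi=150)
```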

Acceptance (IRIS): correct file load, labelled axes/legends, and concise interpretations for both sub-tasks.

A2) Birdwatching (Artportalen 2022, Royal National Park)

File: artportalen.csv (bird sightings; citizen-science)

  1. Familiarise & describe

    • Show the first rows and a short data dictionary (your inferred meaning of key columns).
  2. Most prevalent species

    • Define prevalent (e.g., most observations, unique days with sightings, or unique locations). State your choice.
    • Report the top N (N≥5) in a tidy table (a pandas sketch follows this list).
  3. Monthly distribution (top 3)

    • For your top 3 species, show a monthly time series or bar chart (one figure or small multiples).
    • One sentence per species describing the pattern.
  4. Rarest species

    • Define rarest (consistent with your prevalence definition); show a small table (e.g., 5–10 entries).
    • Brief note on potential reporting bias (citizen science).
  5. Your own questions (≥3)

    • Pose three non-trivial questions and answer them with brief text + a supporting table/plot each.
    • Each question should surface a distinct property of the data (seasonality, spatiality, observer effects, etc.).
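A possible pandas starting point for the prevalence table and the monthly pattern, assuming Artportalen-style column names (Artnamn for species, Startdatum for the observation date); verify these against the real headers before relying on them.

```python
import pandas as pd
import matplotlib.pyplot as plt

obs = pd.read_csv("data/artportalen.csv")  # assumed path; adjust as needed
obs["date"] = pd.to_datetime(obs["Startdatum"], errors="coerce")

# Prevalence defined here as the number of observation rows per species;
# state whichever definition you actually use.
top = obs["Artnamn"].value_counts().head(5)
print(top.to_frame("observations"))

# Monthly counts for the three most prevalent species, as small multiples.
top3 = list(top.index[:3])
monthly = (obs[obs["Artnamn"].isin(top3)]
           .groupby([obs["date"].dt.month.rename("month"), "Artnamn"])
           .size()
           .unstack("Artnamn", fill_value=0))
monthly.plot(subplots=True, figsize=(8, 6), xlabel="month")
plt.savefig("monthly_top3.png", dpi=150, bbox_inches="tight")
```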

Acceptance (Birdwatching): explicit prevalence/rarity definitions, at least one figure for monthly patterns, three self-posed Q&As with evidence.

A3) Predicting Strokes (EDA only)

File: stroke-data.csv (individual attributes + stroke outcome)

  • Produce one to three well-labelled plots that explore relationships between selected features and the stroke outcome (a starter sketch follows this list).
  • For at least one categorical feature, compare stroke rates across groups (e.g., bar with percentages).
  • For at least one numeric feature, compare distributions by outcome (e.g., KDE/box/ECDF).
  • Write clear, non-causal conclusions (descriptive, not predictive; note potential confounders and class imbalance).
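One way to compute group rates and compare a numeric distribution by outcome, assuming columns named stroke, hypertension, and age (hypothetical names; substitute whatever stroke-data.csv actually contains):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data/stroke-data.csv")  # assumed path; adjust as needed

# Categorical feature: stroke rate per group (percentages, not raw counts,
# so unequal group sizes don't distort the comparison).
rates = df.groupby("hypertension")["stroke"].mean().mul(100)
rates.plot(kind="bar", xlabel="hypertension", ylabel="stroke rate (%)")
plt.savefig("stroke_rate_by_group.png", dpi=150, bbox_inches="tight")

# Numeric feature: age distribution by outcome. common_norm=False scales
# each class separately, which matters given the class imbalance.
plt.figure()
sns.kdeplot(data=df, x="age", hue="stroke", common_norm=False)
plt.savefig("age_by_outcome.png", dpi=150, bbox_inches="tight")
```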

Acceptance (Strokes): ≥2 plots (categorical + numeric), group rates correctly computed, 2–4 sentences of careful interpretation.

B. Data Preparation

B1) Cleaning cell_phones_total.csv

Dataset: total mobile phones per country (1960–2019). Values may be strings with suffixes k=1e3, M=1e6, B=1e9, mixed decimal signs, and missing values.

Tasks

  1. Type normalisation: Convert all relevant value cells to numeric (a parsing sketch follows this list).

    • Parse suffixes k/M/B robustly (case-insensitive).
    • Handle decimal marks (. or ,) consistently.
  2. Missing values: Treat missing data appropriately.

    • Choose sensible imputation, or leave values as NaN if justified; explain briefly.
    • You do not need to treat all years/countries identically.
  3. Tidy output table: Produce a wide table with columns: iso-3, 2015, 2016, 2017, 2018, 2019.

    • All parsing is done; values are numeric or NaN.
    • Sort by the 2015 column descending (ties arbitrary).
    • Show the first ~10 rows in the report.
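A sketch of the suffix/decimal parsing, assuming the raw file is wide with an iso-3 country column and one column per year; the path and column names are assumptions to adapt to the real header.

```python
import numpy as np
import pandas as pd

raw = pd.read_csv("data/cell_phones_total.csv")  # assumed path

SUFFIX = {"k": 1e3, "m": 1e6, "b": 1e9}

def parse_value(x):
    """Turn strings like '12,5M' into floats (12_500_000.0); blanks -> NaN."""
    if pd.isna(x):
        return np.nan
    s = str(x).strip().replace(",", ".")  # unify the decimal mark
    if not s:
        return np.nan
    mult = SUFFIX.get(s[-1].lower())  # case-insensitive suffix lookup
    if mult is not None:
        return float(s[:-1]) * mult
    return float(s)

years = ["2015", "2016", "2017", "2018", "2019"]
tidy = raw.set_index("iso-3")[years].apply(lambda col: col.map(parse_value))
tidy = tidy.sort_values("2015", ascending=False)
print(tidy.head(10))
```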

Acceptance (Cleaning): demonstrated parsing of suffixes + decimals, documented missing-value choice, correctly typed final table.

Acceptance Criteria (overall)

  • Clarity: Report is organised with headings; figures/tables have titles, axes, legends.
  • Correctness: Code reads files correctly (separators/decimals), computations match definitions you stated.
  • Completeness: All A1–A3 and B1 deliverables present.
  • Reproducibility: Code and minimal environment notes included; random seeds set where relevant.

Submission

  1. Push your work to your username-hw-3 repository.

  2. Open an Issue titled HW3 – Submission (optional label: ready-for-grading). Include:

    • Link to your report (HW3/README.md or HW3/HW3.ipynb)
    • 2–3 lines summarising results (one per dataset is fine)
    • Notes on any parsing/cleaning decisions

Assignment deadline: Monday 23:59 (Europe/Stockholm)

Peer Review (after the deadline)

Comment under your partner’s HW3 – Submission Issue. Copy this checklist:

  • Coverage: Are all required tasks (A1–A3, B1) completed?
  • Definitions: Are prevalence/rarity and cleaning choices stated and applied consistently?
  • Figures/tables: Labels, readability, and match to claims?
  • Reproducibility: Can you see how to run the code and reproduce outputs?
  • One highlight & one suggestion: Specific and actionable.
  • Discussion: When cleaning the data for the "Birdwatching" task, has the author written concise code, or have they cleaned the data in multiple steps? Which do you prefer and why?

Peer-review deadline: Thursday 23:59 (Europe/Stockholm)

Grading

Per-homework scale U / G / VG based on:

  • Completeness (all tasks + submission + peer review)
  • Clarity (well-structured report; concise, precise writing)
  • Correctness & Reproducibility (parsing/cleaning done properly; code produces shown outputs)

Notes

  • Late submissions/reviews require an extra task and are graded Pass/Fail only (no VG).
  • Keep claims descriptive, not causal; cite any external sources or code you adapt.