Skip to content

Homework 4

Instructions

Make sure of the following:

  • Your solutions should be in form of a report in .md or .ipynb format.
  • You have documented your procedure properly.
  • Your answers are clear and concise.

When you are finished push you results to Github and raise an issue, just as you have done in previous homeworks. To pass the homework you will have to complete the assigments below and also finish the peer-review.

Feel free to contact me if anything is unclear.

Joining Data

Use the cleaned data from the previous homework!

Now that we have cleaned our cell phones data, we want to analyse the number of cell phones per capita. To make this possible, we need to expand our original dataset. The file pop_data.csv contains data about the population size of countries over years. Once again there will be missing values. Deal with them before proceeding.

To combine the cell phone data with the population data, we need to reshape both datasets. This can be done by pivotting the tables into the right format. Remember the 3 rules from week 3. Present a similar table to below the one below.

iso-3yearn_cellphonespopulation
ABW1960054608
AFG196008.62247e+06
AGO196005.3572e+06
ALB196001.6088e+06
AND196009443

The iso-3 format is nice to work with but for presentation it is better to work with real names of the countries. A mapping between iso-3 and country name can be found in the country_data.csv file.

Combine the cell phone, population and country data into a single table. Calcualte the number of phones per capita. I.e n_cellphones/population_size. Present a similar table to below. Here, the coutries are the ones with highest number of cell phones per capita in 2019.

yearMacaoHong KongUnited Arab EmiratesAntigua and BarbudaSeychelles
20153.088232.29042.007421.956841.58426
20163.143512.398932.212521.987541.60546
20173.21012.488772.183432.019341.74243
20183.348742.670212.199082.008161.8499
20193.284852.863652.127741.997461.9872

Note that you might get a slightly different table based on how you deal with missing data. This is okay! make sure to document your procedure.

Analyse the country with highest growth rate of cell phones (per capita) in the last 10 years. That is, create a plot that illustrates the growth rate of the countries in the table above over the last 10 years.

Hint: You can pivot the tables to get the desired output.

SQL

The file user_actions.db in the data repo contains an SQLite database. In the database there is a table named user_actions. Analyse the table and solve the following tasks.

  • Retrieve the usernames of all users who have performed the "signup" action.
  • Find the total number of log entries for each user. Display the user_id, username, and the count of log entries.
  • Identify users who have both logged in (action = 'login') and signed up (action = 'signup') on the same day. Display the user_id and username.

Each task should be solved only using SQL.

Regex

The file comments.txt in the data repo contains lines of text, each representing a user comment. Users sometimes include tags in their comments using the format "#tag". Analyse the file and solve the following tasks.

  • Write a regular expression to extract all hashtags from a given comment. For example, applying the regex to comment 1 should return ["#programming", "#tips"].
  • Create a regular expression to find comments that mention both "#programming" and "#python". Apply the regex to comment 2 and check if it matches.
  • Using your regular expression, extract all unique hashtags from the entire text file. (*)

The last task is optional (*)!

Each task should be solved only using regex.

Good Luck!