Appearance
Lecture 6 Joining Data, SQL and Introduction to RegEx
Resources
Joining Data
SQL
- W3, SQL Basics
- R4DS chapter 19
- SQLite Python, R
RegEx
- R4DS chapter 14
- P4DA chapter 7.4
- RegEx Cheat Sheet
You can find examples and motivation in the resources.
Summary
In this lecture, we explore key topics in data manipulation and querying. First, we discuss the concept of joins, which allow us to combine different sources of data into a single dataframe. This is a crucial skill for data scientists, as data is often scattered across multiple sources such as databases, the internet, and experimental datasets. Joining enables us to consolidate these scattered datasets into one.
Next, we cover the basics of SQL, focusing on the fundamental syntax for making queries. This includes both retrieving and storing data in a database, laying the foundation for effective database management.
Finally, we delve into the use of RegEx for pattern matching in strings. RegEx is a powerful tool for bulk searching patterns and is an industry standard. It can be applied across various tools we frequently use, such as Git, Python/R, and SQL, making it an essential skill for efficient data processing.
Joins
Joining tables is the process of combining data from multiple tables into a single table. This operation is fundamental in data analysis and database management, as it allows us to consolidate scattered datasets. There are several types of joins, each with unique implications for the structure and size of the resulting table. The four most common types are left join
, right join
, inner join
, and outer join
. Each serves a specific purpose, depending on how we want to merge the tables.
The following image visually illustrates these operations:
Image from: https://www.alphacodingskills.com/sql/img/sql-join.PNG
In this context, we are always referring to combining two tables at a time. When working with more than two tables, we can achieve the desired result by combining them iteratively, two tables at a time.
A common approach is to iteratively expand a dataset by incorporating additional datasets through left join, which is widely used due to its practicality. However, the other types of joins— right join
, inner join
, and outer join
—are equally important for creating more intricate and specific dataset combinations.
Mastering the art of joining tables is essential for working effectively with SQL-based databases. Joins play a critical role in SQL, as efficient data storage typically involves splitting data into multiple related tables. Understanding how to reassemble this data through joins is a key skill, which we will explore further in the upcoming lecture.
SQL
What is SQL? Structured Query Language (SQL) is a computer language that enables us to communicate and manipulate SQL databases(db) and it is one of the most widely used languages for db management.
There are many flavours (versions) of SQL. However, they all share the basic syntax as below and often only differ in the way the basic syntax is extended.
SQL is a Relational Database Management System (RDBMS), meaning that the data is stored in tables that are seperated based on the scope of the database. A nice illustration of how data is stored can be seen in the figure below.
Image borrowed from:https://r4ds.hadley.nz/
Here, the tables are seperated to reflect differring objects. We also see that the tables are linked or related to each other through specific variables (keys). This is where the name relational database comes from.
Basic Syntax
Writing an SQL query(a question or a requests) is analagous to speaking to the database. Once we get a grasp of the basic syntax we will be able to query the database and retrieve data.
The syntax will be familiar from the tools we have already learnt.
We use SELECT
to determine the variables we want and FROM
to specify the table we want it from. Following the image above, we can make the following query.
SQL
SELECT carrier, dest, tailnum
FROM flights;
Make sure to end the query with ;
We can use SELELCT *
to get all variables in the table. The query above is the simplest query one can make in SQL.
We can add a layer of complexity by filtering the table. We can do this by using the WHERE
command. Continuing the example, we can make the following query.
SQL
SELECT carrier, dest, tailnum
FROM flights
WHERE dest='sweden';
Furthermore, we can use the logical operators AND
, OR
, NOT
to perform logical chaining of conditions. The basic syntax is the following.
SQL
SELECT variable1, variable2, ... -- * if you want all variables
FROM table
WHERE condition1 AND/OR/NOT condition2 AND/OR/NOT condition3 ...;
To aggregate multiple tables into a single table we can use ... JOIN
. The syntax is similar to the previous lecture. ON
specifies the key(s) to join on. Continuing the example above the syntax is the following.
SQL
SELECT carrier, dest, tailnum, airlines.names
FROM flights
LEFT JOIN airlines ON flights.carrier = airlines.carrier;
Using the syntax described above, we can chain multiple operations to generate the desired table. You can see more operations on W3.
Since the SQL syntax has been adopted by both pandas
and tidyverse
, the content above will not be completely new to learn.
Storing Data
To start storing data, we need to create a table to store it in. We can do this using the CREATE
command. The query is as follows.
SQL
CREATE TABLE table_name (
...
);
To our table, we need to add columns (variable), we can do this by specifying the name of the column and datatype. We can also specify constraints on the variable. The query is as follows.
SQL
CREATE TABLE table_name (
Variable1 datatype constraint,
Variable2 datatype constraint,
...
);
Common constraints are NOT NULL
, UNIQUE
, PRIMARY KEY
, FOREIGN KEY
. There are more constraints that can be added. They can be found on W3.
The unique identifier (id) of a row should be constrained with UNIQUE
and it must also be filled (NOT NULL
). The PRIMARY KEY
incorporates both constraints. Therefore, it is good practice to specify the row id using the PRIMARY KEY
constraint.
An example of creating a table can look something like the following query.
SQL
CREATE TABLE Persons (
name VARCHAR(255) NOT NULL,
age INT, -- It does not need to be specified
id_number VARCHAR PRIMARY KEY
);
Specific datatypes can be looked up on W3.
To change a table that already exists we can use the ALTER
command. Specific ways to alter tables can be found on W3.
Putting it all together, the SQL query looks as follows.
SQL
CREATE TABLE Persons (
Name VARCHAR(255) NOT NULL,
Age INT, -- It does not need to be specified
ID_number VARCHAR(255) PRIMARY KEY
);
CREATE TABLE Computers (
Computer_id INT PRIMARY KEY,
Model VARCHAR(255),
Brand VARCHAR(255),
Owner VARCHAR(255) FOREIGN KEY REFERENCES Persons(ID_number)
)
Here, we create two tables and use FOREIGN KEY
to link the owner of the computer to a specific person using the id-number. If we later realise that we want a person to have a variable specifying occupation, we can use the following command.
SQL
ALTER TABLE Persons
ADD Occupation VARCHAR(255);
With a table created, we can add entries to the table. This can be done using the INSERT
command. The full syntax looks as follows.
SQL
INSERT INTO table_name (variable1, variable2, ...)
VALUES (value1, value2, ...)
Note that the order of the values should follow the order of the variables. Furthermore, if the constraints are violated we will get an error.
In the example above the query may look like
SQL
INSERT INTO Persons (Name, ID_number)
VALUES ('Taariq', '19000101-xxxx');
Since, the Age variable does not need to be filled in, the query will populate the Persons
table and have an empty cell for the Age column.
We are only scratching the surface of SQL, there is a lot more to learn. However, with this basic knowledge we will be able to interact with databases. You can find more information in the listed resources (W3).
There are various libraries in Python and R to interact with SQL databases. We will primarily be working with SQLite databases since they are more accessible. Popular libraries are sqlite3 for Python and dbplyr for R.
RegEx
Regular Expressions (RegEx) is a powerful tool used in computer science and programming for pattern matching within strings. It provides a concise and flexible means of searching, matching, and manipulating text based on patterns.
A regular expression is a sequence of characters (a type of query) that defines a search pattern. These patterns can include a variety of elements such as literal characters, metacharacters (special characters with specific meanings), and quantifiers (to specify the number of occurrences). Regex is commonly used in tasks like text searching, validation, and text manipulation.
Before diving in to the syntax of RegEx, let's look at a simple example.
Consider a scenario where you want to extract email addresses from a text, for instance the following text.
Please contact support@example.com for assistance. For general inquiries, you can email info@company.com.
We can use the following RegEx to extract the emails in this text by matching them to a specific format.
regex
\b[\w._%+-]+@[\w.-]+\.[A-Za-z]{2,4}\b
The expression above might look daunting but it will make sense when you get thet gist of RegEx. Lets break down this expression.
\b
: Word boundary to ensure that the match is a whole word and not part of a larger sequence. For instance,\bcat\b
will match the word cat but not scattered.[\w._%+-]+
: Matches the username part of the email address, allowing alphanumeric characters, dots, underscores, percent signs, plus signs, and hyphens. Here\w
is a metacharacter which is short forA-Za-z0-9
.+
outside of the brackets matches 1 or more of the proceeding character.@
: Matches the @ symbol.[\w.-]+
: Matches the domain name, allowing alphanumeric characters, dots, and hyphens.\.
: Matches the dot before the top-level domain.[A-Z|a-z]{2,4}
: Matches the top-level domain (eg. .com) with at least two and at most 4 characters .\b
: Word boundary to complete the match.
To summarise The following image illustrates what we have done.
Image from: https://kottke.org/21/07/a-history-of-regular-expressions-and-artificial-intelligence
Even if it does not make sense yet that is fine. You will get the hang of it when you start using it.
Basic Syntax
Some important definitions in regex are the following.
Literals
Characters in a regex pattern that match themselves. For example, the regex abc
will match the string "abc" in the input.
Metacharacters
Special characters with a specific meaning in regex. Some common metacharacters include:
.
(dot): Matches any single character except a newline.^
: Anchors the regex at the start of the string.$
: Anchors the regex at the end of the string.*
: Matches 0 or more occurrences of the preceding character or group.+
: Matches 1 or more occurrences of the preceding character or group.?
: Matches 0 or 1 occurrence of the preceding character or group.|
: Acts like a logical OR, allowing alternatives. For example,a|b
matches either "a" or "b".()
: Groups characters together. For example,(abc)+
matches one or more occurrences of "abc".
Character Classes
[ ]
: Defines a character class. For example,[aeiou]
matches any vowel.[^ ]
: Negates a character class. For example,[^0-9]
matches any non-digit character.
Quantifiers
Control the number of occurrences of a character or group.
{n}
: Matches exactly n occurrences.{n,}
: Matches n or more occurrences.{n,m}
: Matches between n and m occurrences.
Escape sequences
Use a backslash \
to escape a metacharacter, allowing it to be treated as a literal character. For example, \.
matches a literal period.
Predefined character classes
\d
: Matches any digit (equivalent to[0-9]
).\D
: Matches any non-digit.\w
: Matches any word character (alphanumeric + underscore).\W
: Matches any non-word character.\s
: Matches any whitespace character.\S
: Matches any non-whitespace character.
Anchors
Specify the position in the string where a match must occur.
\b
: Word boundary.\B
: Non-word boundary.^
: Start of a line.$
: End of a line.
Modifiers
i
: Case-insensitive matching.g
: Global matching (find all matches, not just the first).
Wildcard
.*
is a common pattern to match any character (except newline) zero or more times.
I strongly believe that you learn regex by examples. So let's look a typical example of regex.
Example: Extracting Email Addresses from a List
Suppose you have a list of email addresses:
txt
john.doe@example.com
jane.smith@gmail.com
alice.jones@example.com
bob.miller@yahoo.com
Now, let's say you want to extract all the email addresses from the domain example.com. You can use the following regex:
regex
\b[A-Za-z0-9._%+-]+@example\.com\b
Explanation:
\b: Word boundary to ensure that we match the entire domain, not just a part of it. [A-Za-z0-9._%+-]+: Matches the username part of the email address, allowing letters, numbers, dots, underscores, percent signs, plus signs, and hyphens. @example.com: Matches the domain part, specifically example.com.
This pattern will match the following strings.
['john.doe@example.com', 'alice.jones@example.com']
This is a simple example of regex that is meant to illustrate the power of it.
Exercises
Here are some excersises, try them out on your own!
Simple Email Validation
txt
john.doe@example.com
jane.smith@gmail.com
alice.jones123@yahoo.com
invalid.email@domain
Output should be:
txt
Valid Email Addresses:
- john.doe@example.com
- jane.smith@gmail.com
- alice.jones123@yahoo.com
Invalid Email Addresses:
- invalid.email@domain
Extracting Phone Numbers
txt
Phone numbers:
123-456-7890,
(555) 987-6543,
9876543210, 555-1234
Invalid:
12-345-6789,
555-98765,
abcdefgh
Ouput should be:
txt
Valid Phone Numbers:
- 123-456-7890
- (555) 987-6543
- 9876543210
- 555-1234
Invalid Phone Numbers:
- 12-345-6789
- 555-98765
- abcdefgh
Extracting HTML Tags and Attributes
txt
<p class="intro">This is a <strong>sample</strong> paragraph.</p>
<p>No class here.</p>
<div id="container" class="main">
<h1>Title</h1>
<p>Content</p>
</div>
Output should be:
txt
HTML Tags and Attributes:
- <p class="intro">
- <strong>
Attributes in <div>:
- id="container"
- class="main"
You can execute RegEx using Python or R. Look at the resources for how to do this!