Relational Databases

Image of a cuneiform tablet recording bread and flour distributions, Ur III Period, c. 2100-2000 BC (Harvard Semitic Museum, Cambridge, MA)

Tools

For our work with databases, we’ll be using three tools:

  • SQLite (database)
  • sqlite3 (Python module)
  • DB Browser (GUI application)

Tools: SQLite

  • lightweight relational database (RDBMS)
  • runs locally and stores data in files
  • no separate database server required
  • already installed on your system (as part of Python)
  • project website: https://www.sqlite.org

Tools: sqlite3

  • module in the standard library
  • facilitates interaction with the database by:
    1. managing connections
    2. passing queries to the database system
    3. making results of queries accessible to Python
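A quick sketch of those three roles, using an in-memory database (covered in more detail below):

# a minimal sketch of the three roles listed above
import sqlite3

conn = sqlite3.connect(':memory:')         # 1. manage a connection
cursor = conn.cursor()
cursor.execute('SELECT sqlite_version()')  # 2. pass a query to the database
print(cursor.fetchone())                   # 3. result available to Python, e.g. ('3.39.0',)
conn.close()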

Tools: DB Browser for SQLite

  • graphical application (GUI)
  • allows you to view the contents of a SQLite database
  • an optional convenience for this course
  • downloadable from https://sqlitebrowser.org

Databases

Image of a woman working with the card catalog in the Library of Congress, 1942

The Relational Model

  • Invented by E. F. Codd in the 1970s
  • Data stored in tables (relations)
  • Relations consist of rows (tuples)
  • Tuples have columns (attributes) that hold values
  • The term “database” now often connotes this type of relational database

Other DB Systems

Database design/theory is a much larger topic. In fact, it is an entire course (INST 327)! Here are just a few database systems you may encounter:

  • Relational:
    • MySQL/MariaDB
    • PostgreSQL
  • Non-relational (“NoSQL”):
    • Document: MongoDB (JSON), BaseX (XML)
    • Graph: Neo4j, Fuseki

‘Normalization’

  • Through a design process, data are “normalized”
  • Normalization means applying a set of rules (called “normal forms”) to increase efficiency and reduce redundancy
  • Many specifics of database design and normalization are outside the scope of this course

SQL

  • a standardized language is used to interact with a relational database
  • this is known as the Structured Query Language (SQL)
  • SQL has different flavors but generally works the same way across database systems

CRUD

Basic database operations are called “CRUD”:

  • CREATE: add data to the database
  • READ: retrieve data from one or more tables
  • UPDATE: make changes to the data
  • DELETE: remove rows from a table (potentially with a cascade effect)
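Each of these operations corresponds to a SQL statement, all of which appear in the examples below. A minimal sketch of the correspondence (the books table and values are illustrative, and cursor is a cursor object like the one created below):

# CREATE -> INSERT
cursor.execute("INSERT INTO books VALUES ('Dune', 'Frank Herbert', 1965)")
# READ -> SELECT
cursor.execute("SELECT * FROM books")
# UPDATE -> UPDATE
cursor.execute("UPDATE books SET year=1965 WHERE title='Dune'")
# DELETE -> DELETE
cursor.execute("DELETE FROM books WHERE title='Dune'")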

Using a Database with Python

Connecting to a database: in-memory

# import the sqlite3 module
>>> import sqlite3

# ':memory:' is special sqlite3 syntax
>>> conn = sqlite3.connect(':memory:')
>>> conn
<sqlite3.Connection object at 0x10507b110>

Connecting to a database: database file

# connect to a database file (created if it does not exist)
>>> conn2 = sqlite3.connect('test.sqlite')
>>> conn2
<sqlite3.Connection object at 0x10507b030>
# a file will appear in the working directory

The connection object

  • This object manages the database connection
  • In addition, we need a cursor to manage state
  • The cursor sends queries and contains results
# imagine we are creating a bibliographic db
conn = sqlite3.connect('biblio.sqlite')
# to query this db we must create a cursor
cursor = conn.cursor()

Setting up the database

# create a table to hold some data with CREATE
# triple-quoted strings are commonly used for queries
cq = '''CREATE TABLE books (
        title TEXT, author TEXT, year INTEGER
        )'''
cursor.execute(cq)

Create entries

# Next add a row to the table with INSERT
iq = '''INSERT INTO books VALUES (
     '2001: A Space Odyssey',
     'Arthur C. Clarke', 1968
     )'''
cursor.execute(iq)

Scaling up

  • This works fine, but can be impractical at scale
  • Larger numbers of records can be created with executemany()
# First, create some data
# (in real life you might read this from a file)
# Structure it as a list of tuples
data = [("I, Robot", "Isaac Asimov", 1950),
        ("The Martian", "Andy Weir", 2012),
        ("The Left Hand Of Darkness",
            "Ursula K. Le Guin",1969)]

Scaling up (continued)

# Next, create a query with placeholders
imq = '''INSERT INTO books VALUES (?,?,?)'''

# Finally, pass query & data to executemany()
cursor.executemany(imq, data)

Read

# Next, to view the data, use a SELECT query
sq = '''SELECT title FROM books'''
# Execute the query as before, chaining fetchall(),
# which returns the results as a list of tuples
books = cursor.execute(sq).fetchall()
print(books)
[('2001: A Space Odyssey',),
    ('I, Robot',),
    ('The Martian',),
    ('The Left Hand Of Darkness',)]

Update

  • Now, imagine we need to update some data.
  • The Martian was in fact published in 2011, so let’s make that correction.
uq = """UPDATE books
        SET year=2011
        WHERE title='The Martian'"""
cursor.execute(uq)
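Placeholders also work with execute(), which avoids embedding values directly in the query string; a sketch of the same update written that way:

# the same update, with the values passed as parameters
cursor.execute("UPDATE books SET year=? WHERE title=?", (2011, 'The Martian'))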

Verify Update

# now query just that row to verify
vq = """SELECT * FROM books
        WHERE title='The Martian'"""
cursor.execute(vq)
print(cursor.fetchall())
[('The Martian', 'Andy Weir', 2011)]

Delete

  • Now, let’s look at deleting data
  • This works much as you would expect
dq = """DELETE FROM books
        WHERE author='Isaac Asimov'"""
cursor.execute(dq)

Committing Changes

  • So far, our changes exist only in an open transaction (not yet saved to the file)
  • To persist, changes must be committed (saved)
# Normally you do this after completing each change
conn.commit()
# With changes saved, the connection can be closed
conn.close()
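Alternatively, the connection can be used as a context manager, in which case sqlite3 commits automatically when the block succeeds (and rolls back on an error). A sketch using the same biblio.sqlite file:

with sqlite3.connect('biblio.sqlite') as conn:
    conn.execute("UPDATE books SET year=2011 WHERE title='The Martian'")
# the with-block commits (or rolls back), but does not close the connection
conn.close()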

Summary

  • We have just covered a lot of ground
  • Connections, cursors, queries, CRUD
  • There is also a lot we have not touched
  • Next time we will look at normalization
  • Specifically: primary keys, foreign keys, and joins

Exercise

Exercise 1: Load a large dataset

The data in this CSV file (books.csv) consists of a list of titles, authors, and dates of important works of fiction. The data are similar to the data used in the above examples.

The first task is to create a program that can read the data in the attached file and load it into a database.
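One possible skeleton for this program is sketched below; it assumes the CSV has title, author, and date columns and writes to a hypothetical books.sqlite file, so adjust the names to match the actual data.

import csv
import sqlite3

conn = sqlite3.connect('books.sqlite')   # hypothetical output file
cursor = conn.cursor()
cursor.execute('CREATE TABLE books (title TEXT, author TEXT, year INTEGER)')

# assumes column headers named title, author, date -- check the real file
with open('books.csv', newline='') as f:
    rows = [(r['title'], r['author'], r['date']) for r in csv.DictReader(f)]

cursor.executemany('INSERT INTO books VALUES (?,?,?)', rows)
conn.commit()
conn.close()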

Normalizing and Joining Data

Why Normalize?

  • There are many reasons related to optimization
  • But the simplest way to think about it is this:
    1. Consider that one author can write many books
    2. Conversely one book can have many authors
    3. To model such relationships effectively, author data should be stored apart from book data

Identifiers (Keys)

  • The first requirement for modeling relationships between tables is to have unambiguous identifiers
  • These identifiers, called keys, allow data to be looked up
  • The unique id for a particular row in a table is called a primary key

Using Keys to Create Joins

  • Rows can also reference rows in other tables – this cross-reference is called a foreign key
  • For example, the row “Hamlet” in the plays table might reference “William Shakespeare” in the authors table
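A sketch of how these keys might be declared in SQLite, using the plays/authors example (table and column names are illustrative, and an open connection and cursor are assumed as in the earlier examples):

cursor.execute('''CREATE TABLE authors (
        id INTEGER PRIMARY KEY,
        name TEXT)''')
cursor.execute('''CREATE TABLE plays (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(id))''')
# the "Hamlet" row would store Shakespeare's id in its author_id column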

Creating Normalized Data

  • When designing a database, tables and their relationships should be mapped out before any code is written
  • The diagram created during this mapping process is called an ERD
  • This stands for Entity-Relationship Diagram
  • In addition to mapping out relationships, you need to create code to analyze the data and write it to the correct locations

Creating Normalized Data (continued)

  • For example, in working with our list of books and authors, you might (see the sketch after this list):
    1. Store the authors’ names in a separate table
    2. As you read the data file, look up the author
    3. If the author is present already, get the id
    4. If the author is not present, add the author and get the id
    5. Add the book to the books table, referencing the author’s id
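A minimal sketch of that lookup-or-insert step, assuming an open cursor and authors/books tables shaped like the ones shown on the next slides:

def get_author_id(cursor, name):
    # return the author's id, inserting the author first if necessary
    row = cursor.execute('SELECT id FROM authors WHERE name=?', (name,)).fetchone()
    if row:
        return row[0]
    cursor.execute('INSERT INTO authors (name) VALUES (?)', (name,))
    return cursor.lastrowid

# for each record read from the data file:
author_id = get_author_id(cursor, 'John Barth')
cursor.execute('INSERT INTO books (title, author_id, year) VALUES (?,?,?)',
               ('Chimera', author_id, 1972))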

Flat Data

title                author          year
Things Fall Apart    Chinua Achebe   1958
Chimera              John Barth      1972
The Sot-Weed Factor  John Barth      1960
Under the Volcano    Malcolm Lowry   1947

Add Primary Keys

id  title                author          year
1   Things Fall Apart    Chinua Achebe   1958
2   Chimera              John Barth      1972
3   The Sot-Weed Factor  John Barth      1960
4   Under the Volcano    Malcolm Lowry   1947

Move Authors to Own Table

books:
id  title                author_id  year
1   Things Fall Apart    1          1958
2   Chimera              2          1972
3   The Sot-Weed Factor  2          1960
4   Under the Volcano    3          1947

authors:
id  name
1   Chinua Achebe
2   John Barth
3   Malcolm Lowry

Selecting Normalized Data

  • To look up normalized data, you can use SQL’s JOIN syntax
  • You specify the fields to match on (linking foreign key to primary key)
jq = '''SELECT authors.name, books.title, books.year
        FROM books JOIN authors
        ON books.author_id=authors.id'''
books = cursor.execute(jq).fetchall()

Deleting Normalized Data

  • Normalizing data introduces some additional complications
  • Consider our authors and books examples
  • If you remove a row from the authors table, what happens to the author’s books?
  • There is a danger that orphaned rows will clutter the database

Deleting Normalized Data (continued)

  • In order to control the creation of bad data, SQL allows you to specify constraints in your database schema
  • Among these constraints is ON DELETE CASCADE
  • In essence, this constraint tells SQLite to remove an author’s books automatically whenever that author is deleted
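A sketch of how that constraint might be declared, using the books/authors schema from earlier; note that SQLite only enforces foreign keys when the foreign_keys pragma is switched on:

cursor.execute('PRAGMA foreign_keys = ON')
cursor.execute('''CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(id) ON DELETE CASCADE,
        year INTEGER)''')
# deleting a row from authors now also deletes that author's books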

Exercise 2: Normalization in action

Consider again the bibliographic database and note that there are multiple titles in the attached file written by a single author. To normalize this data, the author names should be moved into their own table and related to the book data through a foreign key.

How can the authors data be related to the book titles? Can you create a program that will manage the normalization process at load time?