Relational Databases

Image of a cuneiform tablet recording bread and flour distributions, Ur III Period, c. 2100-2000 BC (Harvard Semitic Museum, Cambridge, MA)

Tools

For our work with databases, we’ll be using three tools:

  • SQLite (database)
  • sqlite3 (Python module)
  • DB Browser (GUI application)

Tools: SQLite

  • lightweight relational database (RDBMS)
  • runs locally and stores data in files
  • no separate database server required
  • already installed on your system (as part of Python)
  • project website: https://www.sqlite.org

Tools: sqlite3

  • module in the standard library
  • facilitates interaction with the database by:
    1. managing connections
    2. passing queries to the database system
    3. making results of queries accessible to Python
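A quick sketch of those three roles, using an in-memory database (covered in more detail below):

# a minimal sketch of the three roles listed above
import sqlite3

conn = sqlite3.connect(':memory:')         # 1. manage a connection
cursor = conn.cursor()
cursor.execute('SELECT sqlite_version()')  # 2. pass a query to the database
print(cursor.fetchone())                   # 3. result available to Python, e.g. ('3.39.0',)
conn.close()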

Tools: DB Browser for SQLite

  • graphical application (GUI)
  • allows you to view the contents of a SQLite database
  • an optional convenience for this course
  • downloadable from https://sqlitebrowser.org

Databases

Image of a woman working with the card catalog in the Library of Congress, 1942

The Relational Model

  • Invented by E. F. Codd in the 1970s
  • Data stored in tables (relations)
  • Relations consist of rows (tuples)
  • Tuples have columns (attributes) that hold values
  • The term “database” now often connotes this type of relational database

Other DB Systems

Database design/theory is a much larger topic. In fact, it is an entire course (INST 327)! Here are just a few database systems you may encounter:

  • Relational:
    • MySQL/MariaDB
    • PostgreSQL
  • Non-relational (“NoSQL”):
    • Document: MongoDB (JSON), BaseX (XML)
    • Graph: Neo4j, Fuseki

‘Normalization’

  • Through a design process, data are “normalized”
  • Normalization means applying a set of rules (called “normal forms”) to increase efficiency and reduce redundancy
  • Many specifics of database design and normalization are outside the scope of this course

SQL

  • a standardized language is used to interact with a relational database
  • this is known as the Structured Query Language (SQL)
  • SQL has different flavors but generally works the same way across database systems

CRUD

Basic database operations are called “CRUD”:

  • CREATE: add data to the database
  • READ: retrieve data from one or more tables
  • UPDATE: make changes to the data
  • DELETE: remove rows from a table (potentially with a cascade effect)
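Each of these operations corresponds to a SQL statement, all of which appear in the examples below. A minimal sketch of the correspondence (the books table and values are illustrative, and cursor is a cursor object like the one created below):

# CREATE -> INSERT
cursor.execute("INSERT INTO books VALUES ('Dune', 'Frank Herbert', 1965)")
# READ -> SELECT
cursor.execute("SELECT * FROM books")
# UPDATE -> UPDATE
cursor.execute("UPDATE books SET year=1965 WHERE title='Dune'")
# DELETE -> DELETE
cursor.execute("DELETE FROM books WHERE title='Dune'")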

Using a Database with Python

Connecting to a database: in-memory

# import the sqlite3 module
>>> import sqlite3

# ':memory:' is special sqlite3 syntax
>>> conn = sqlite3.connect(':memory:')
>>> conn
<sqlite3.Connection object at 0x10507b110>

Connecting to a database: database file

# connect to a database file (created if it does not exist)
>>> conn2 = sqlite3.connect('test.sqlite')
>>> conn2
<sqlite3.Connection object at 0x10507b030>
# a file will appear in the working directory

The connection object

  • This object manages the database connection
  • In addition, we need a cursor to manage state
  • The cursor sends queries and contains results
# imagine we are creating a bibliographic db
conn = sqlite3.connect('biblio.sqlite')
# to query this db we must create a cursor
cursor = conn.cursor()

Setting up the database

# create a table to hold some data with CREATE
# triple-quoted strings are commonly used for queries
cq = '''CREATE TABLE books (
        title TEXT, author TEXT, year INTEGER
        )'''
cursor.execute(cq)

Create entries

# Next add a row to the table with INSERT
iq = '''INSERT INTO books VALUES (
     '2001: A Space Odyssey',
     'Arthur C. Clarke', 1968
     )'''
cursor.execute(iq)

Scaling up

  • This works fine, but can be impractical at scale
  • Larger numbers of records can be created with executemany()
# First, create some data
# (in real life you might read this from a file)
# Structure it as a list of tuples
data = [("I, Robot", "Isaac Asimov", 1950),
        ("The Martian", "Andy Weir", 2012),
        ("The Left Hand Of Darkness",
            "Ursula K. Le Guin",1969)]

Scaling up (continued)

# Next, create a query with placeholders
imq = '''INSERT INTO books VALUES (?,?,?)'''

# Finally, pass query & data to executemany()
cursor.executemany(imq, data)

Read

# Next, to view the data, use a SELECT query
sq = '''SELECT title FROM books'''
# Execute the query as before, chaining fetchall(),
# which returns the results as a list of tuples
books = cursor.execute(sq).fetchall()
print(books)
[('2001: A Space Odyssey',),
    ('I, Robot',),
    ('The Martian',),
    ('The Left Hand Of Darkness',)]

Update

  • Now, imagine we need to update some data.
  • The Martian was in fact published in 2011, so let’s make that correction.
uq = """UPDATE books
        SET year=2011
        WHERE title='The Martian'"""
cursor.execute(uq)
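Placeholders also work with execute(), which avoids embedding values directly in the query string; a sketch of the same update written that way:

# the same update, with the values passed as parameters
cursor.execute("UPDATE books SET year=? WHERE title=?", (2011, 'The Martian'))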

Verify Update

# now query just that row to verify
vq = """SELECT * FROM books
        WHERE title='The Martian'"""
cursor.execute(vq)
print(cursor.fetchall())
[('The Martian', 'Andy Weir', 2011)]

Delete

  • Now, let’s look at deleting data
  • This works much as you would expect
dq = """DELETE FROM books
        WHERE author='Isaac Asimov'"""
cursor.execute(dq)

Committing Changes

  • So far, our changes exist only in an open transaction (not yet saved to the file)
  • To persist, changes must be committed (saved)
# Normally you do this after completing each change
conn.commit()
# With changes saved, the connection can be closed
conn.close()
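Alternatively, the connection can be used as a context manager, in which case sqlite3 commits automatically when the block succeeds (and rolls back on an error). A sketch using the same biblio.sqlite file:

with sqlite3.connect('biblio.sqlite') as conn:
    conn.execute("UPDATE books SET year=2011 WHERE title='The Martian'")
# the with-block commits (or rolls back), but does not close the connection
conn.close()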

Summary

  • We have just covered a lot of ground
  • Connections, cursors, queries, CRUD
  • There is also a lot we have not touched
  • Next time we will look at normalization
  • Specifically: primary keys, foreign keys, and joins

Exercise

Exercise 1: Load a large dataset

The data in this CSV file (books.csv) consists of a list of titles, authors, and dates of important works of fiction. The data are similar to the data used in the above examples.

The first task is to create a program that can read the data in the attached file and load it into a database.
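One possible skeleton for this program is sketched below; it assumes the CSV has title, author, and date columns and writes to a hypothetical books.sqlite file, so adjust the names to match the actual data.

import csv
import sqlite3

conn = sqlite3.connect('books.sqlite')   # hypothetical output file
cursor = conn.cursor()
cursor.execute('CREATE TABLE books (title TEXT, author TEXT, year INTEGER)')

# assumes column headers named title, author, date -- check the real file
with open('books.csv', newline='') as f:
    rows = [(r['title'], r['author'], r['date']) for r in csv.DictReader(f)]

cursor.executemany('INSERT INTO books VALUES (?,?,?)', rows)
conn.commit()
conn.close()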

Normalizing and Joining Data

Why Normalize?

  • There are many reasons related to optimization
  • But the simplest way to think about it is this:
    1. Consider that one author can write many books
    2. Conversely one book can have many authors
    3. To model such relationships effectively, author data should be stored apart from book data

Identifiers (Keys)

  • The first requirement for modeling relationships between tables is to have unambiguous identifiers
  • These identifiers, called keys, allow data to be looked up
  • The unique id for a particular row in a table is called a primary key

Using Keys to Create Joins

  • Rows can also reference rows in other tables – this cross-reference is called a foreign key
  • For example, the row “Hamlet” in the plays table might reference “William Shakespeare” in the authors table
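A sketch of how these keys might be declared in SQLite, using the plays/authors example (table and column names are illustrative, and an open connection and cursor are assumed as in the earlier examples):

cursor.execute('''CREATE TABLE authors (
        id INTEGER PRIMARY KEY,
        name TEXT)''')
cursor.execute('''CREATE TABLE plays (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(id))''')
# the "Hamlet" row would store Shakespeare's id in its author_id column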

Creating Normalized Data

  • When designing a database, tables and their relationships should be mapped out before any code is written
  • The diagram created during this mapping process is called an ERD
  • This stands for Entity-Relationship Diagram
  • In addition to mapping out relationships, you need to create code to analyze the data and write it to the correct locations

Creating Normalized Data (continued)

  • For example, in working with our list of books and authors, you might (see the sketch after this list):
    1. Store the authors’ names in a separate table
    2. As you read the data file, look up the author
    3. If the author is present already, get the id
    4. If the author is not present, add the author and get the id
    5. Add the book to the books table, referencing the author’s id
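A minimal sketch of that lookup-or-insert step, assuming an open cursor and authors/books tables shaped like the ones shown on the next slides:

def get_author_id(cursor, name):
    # return the author's id, inserting the author first if necessary
    row = cursor.execute('SELECT id FROM authors WHERE name=?', (name,)).fetchone()
    if row:
        return row[0]
    cursor.execute('INSERT INTO authors (name) VALUES (?)', (name,))
    return cursor.lastrowid

# for each record read from the data file:
author_id = get_author_id(cursor, 'John Barth')
cursor.execute('INSERT INTO books (title, author_id, year) VALUES (?,?,?)',
               ('Chimera', author_id, 1972))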

Flat Data

title                author          year
Things Fall Apart    Chinua Achebe   1958
Chimera              John Barth      1972
The Sot-Weed Factor  John Barth      1960
Under the Volcano    Malcolm Lowry   1947

Add Primary Keys

id  title                author          year
1   Things Fall Apart    Chinua Achebe   1958
2   Chimera              John Barth      1972
3   The Sot-Weed Factor  John Barth      1960
4   Under the Volcano    Malcolm Lowry   1947

Move Authors to Own Table

books:
id  title                author_id  year
1   Things Fall Apart    1          1958
2   Chimera              2          1972
3   The Sot-Weed Factor  2          1960
4   Under the Volcano    3          1947

authors:
id  name
1   Chinua Achebe
2   John Barth
3   Malcolm Lowry

Selecting Normalized Data

  • To look up normalized data, you can use SQL’s JOIN syntax
  • You specify the fields to match on (linking foreign key to primary key)
jq = '''SELECT authors.name, books.title, books.year
        FROM books JOIN authors
        ON books.author_id=authors.id'''
books = cursor.execute(jq).fetchall()

Deleting Normalized Data

  • Normalizing data introduces some additional complications
  • Consider our authors and books examples
  • If you remove a row from the authors table, what happens to the author’s books?
  • There is a danger that orphaned rows will clutter the database

Deleting Normalized Data (continued)

  • In order to control the creation of bad data, SQL allows you to specify constraints in your database schema
  • Among these constraints is ON DELETE CASCADE
  • In essence, this constraint tells SQLite to remove an author’s books automatically whenever that author is deleted
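A sketch of how that constraint might be declared, using the books/authors schema from earlier; note that SQLite only enforces foreign keys when the foreign_keys pragma is switched on:

cursor.execute('PRAGMA foreign_keys = ON')
cursor.execute('''CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(id) ON DELETE CASCADE,
        year INTEGER)''')
# deleting a row from authors now also deletes that author's books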

Exercise 2: Normalization in action

Consider again the bibliographic database and note that there are multiple titles in the attached file written by a single author. To normalize this data, the author names should be moved into their own table and related to the book data through a foreign key.

How can the authors data be related to the book titles? Can you create a program that will manage the normalization process at load time?