Module 5 Exercises

Background

Regular expressions are particularly useful for finding patterns in large amounts of textual data, such as email addresses, phone numbers, or Social Security numbers. In the archival profession, a common reason to locate personally identifiable information (PII) is to redact it before making content available to researchers.
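As a rough illustration of the idea above, the following sketch uses Python's built-in `re` module to find and redact three kinds of PII. The patterns, the `PII_PATTERNS` dictionary, the `redact` function, and the sample sentence are all assumptions made for this example. They cover only simple US-style formats and are not a complete or authoritative PII detector.

```python
import re

# Illustrative patterns only (assumptions for this sketch); real data will
# contain many formats that these simple expressions miss.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text):
    """Replace every match of each pattern with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text


# Hypothetical sample text, not taken from any real dataset.
sample = "Contact jane.doe@example.com or 713-555-0142; SSN 123-45-6789."
print(redact(sample))
# prints: Contact [EMAIL REDACTED] or [PHONE REDACTED]; SSN [SSN REDACTED].
```

Keeping each pattern under its own label makes it easy to add or tighten one pattern without touching the others, and the placeholder shows a researcher what kind of information was removed.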

The Enron Email Dataset is a body of approximately 500,000 email messages from top executives in the Enron Corporation, a Texas-based energy company that went bankrupt in 2001 and was revealed to have engaged in massive accounting fraud. The emails became evidence in the ensuing investigation and were subsequently released into the public domain.

PII finder