Fielding History (Bauer) Fall 2011
Fielding History: Relational Databases and Prose
by Jean Bauer
¶ 1 It wasn’t until I started writing the introduction to my dissertation, “Revolution-Mongers: Launching the U.S. Foreign Service, 1775-1825,” that I realized how much buildingThe Early American Foreign Service Database and its underlying open source software package, Project Quincy, influenced how I understand and explain my research. At that point the EAFSD had been live for four months, and I had been telling people that the two projects have a symbiotic relationship. My dissertation contains the stories, quirky situations, and historiographical analysis necessary to bring the past to scrutiny and life. The database provides the backdrop which showcases particular moments as quintessential or unusual, but it is also a standalone secondary resource, a separate publication in its own right. All of this was and remains true, but as I started that introduction, I became conscious of another way the two projects inform each other. As I described the nature of late eighteenth-century diplomacy — the difference between diplomats and consuls, the geopolitical realities of empire, the personal and commercial connections between Foreign Service officers — I found my description replicating the data structure I had built into the EAFSD, because that structure was the best way to get my background knowledge of my topic on paper.
¶ 2 When I realized this overlap I gave a little cheer, because I knew I had designed the EAFSD properly. Databases are normative statements about reality. If all data are theory-laden, then data structures are theories in and of themselves. When you design a database you are making proclamations about what matters (and, by implication, what can be safely ignored), and because relational databases are particularly constricting in how you can represent and link data, you are forced to be very explicit and systematic in your choices. This constriction has lead some historians to abandon relational databases for more flexible data structures, like XML or semantic linking. Some of this rejection is fueled by the fact that databases and statistical packages were adopted by historians before the technology was sufficiently advanced to handle historical sources with the nuance they require. We should remember that eighty-eight hole punch cards frustrated the cliometricians themselves, as well as their readers. In my opinion, much of the reaction against relational databases is simply another symptom of the split among historians that goes back to the very beginning. As a rule of thumb, if you prefer Herodotus to Thucydides you probably want XML. It all depends on your sources and temperament. Relational databases are powerful tools, but they work best when the data you want to record and analyze consists of discrete pieces of information with clear connections between them. However, you have to be careful while designing your database to ensure that you accurately model your field of study without feeding your own preconceptions back into your analysis.
¶ 3Designing a Database
Good decision support database design involves breaking the metadata description of a data set (and therefore its logical organization) into the smallest viable components and then linking those components back together to facilitate complex analysis. This process, known as normalization, helps keep the data set free of duplicates and protects the data from being unintentionally deleted or unevenly updated. These components are known as entities, and the links are called relationships. Each entity represents something in the “real world” which is modeled in the database. Entities contain fields, discrete pieces of data, each with a designated name and datatype (ex. “start_year” “integer”). Entities are sometimes referred to as tables and fields are also called attributes. Entities and relationships only make sense when discussed together, because they take their form from each other. Relationships connect entities, and entities are constructed based on how they relate to each other. But while the analytic power and stability of relational databases comes from its basis in relational algebra, the conceptions can be hard to grasp in the abstract. So, let us turn to a concrete example: The Early American Foreign Service Database.
¶ 4 The heart of my dissertation is concerned with tracing written information flows to and among American Foreign Service officers who served from 1775 to 1825. The database was created to help me track these flows, which are preserved in the historical record as letters. This brings up another crucial part of designing databases for historical projects: you need to think long and hard about the nature of the sources you are using and what data you need to analyze. For the network/prosopographical analysis I am doing, I do not want to record the full text of the letters, although I do use the database to determine which letters should be read in full. The best databases point you back to the original sources for more information. So the database structure had to begin with the information that can be extracted from a letter.
¶ 5Figure 1
¶ 6 Figure 1 illustrates the fielded data typically contained in a letter. Letters have the names of the sender and recipient. Letter writers usually indicate where they are writing and where they want to send the letter (whether the recipient is there when the letter arrives is, of course, another issue entirely). Letters also have a number of dates associated with them. There is the date the letter was begun, the date the letter was finished (with additional dates for addenda and enclosures), and, if you are very lucky, the date when the letter was received and then another date for when it was entered into an archive. So, if we are to model the data extracted from a letter, the resulting entity might look something like the second graphic.
¶ 7Figure 2: List of fields in an database entity designed to model a letter.
¶ 8 Letters can be sent to and from individuals or organizations (two or more people acting together). They are sent to and from locations on particular dates (more on this later). Letters are given titles for when you need to cite them, and in case the same letter is sent to more than one person, you can mark it as a “circular,” with the term ‘boolean’ meaning that the field can only have the values ‘true’ or ‘false.’ The Letters entity also has the ever-useful “Notes” field for any information that does not fit nicely into one of the pre-chosen fields. Notice also how many of the fields are marked as “foreign keys.” A foreign key means that the field in question is in fact one end of a relationship with another entity.
¶ 9 This means that in order to accurately trace a correspondence network the database needs to have entities for Individuals, Organizations, and Locations. Everything else is specific to a particular letter, including the title, the notes field, and the dates. How you choose to record information about people, places, and groups depends on what information you think you will be able to reliably gather about most of the members of each category. You want to strike a careful balance between the uneven richness of sources and a relatively uniform standard for comparison. Just because a person or organization left behind more surviving documentation does not automatically make them more important, just easier to study.
¶ 10 As you are designing entities to describe other parts of the database, it is often helpful to create tables that hold subject keywords you want to use for classifying and later searching. Pre-selected keywords often work best when a clearly defined set of people are in charge of marking up the content. They are great for searching, and if indexed in a hierarchical structure, can provide semantically powerful groupings (especially for geographical information). As a historian, however, I am wary of keywords that are imposed on a text. If someone calls himself a “justice,” I balk at calling him a “judge” even if it means a more efficient search.
¶ 11 Of course, it all depends on your data and what you want to do with it, but my preferred solution is have, at minimum, two layers of keywords. The bottom layer reflects the language in the text (similar to tagging), but those terms are then grouped into pre-selected types. You can fake hierarchies with tags, but it requires a far more careful attention to tag choices than I typically associate with that methodology. For example, in the EAFSD I have an entity called AssignmentTitles that contains all the titles given to U.S. Foreign Service officers by the various American governments. However, there were forty-five distinct titles used between 1775 and 1825, and without highly specialized knowledge it is difficult to understand how they related to each other. So I created another entity, AssigmentTypes, which groups those titles into three distinct types: “diplomatic,” “consular,” and “support staff,” allowing for ease of searching among similar appointments without having to remember every term for consul, or those performing consular functions, used by the Continental Congress, the Congress of the Confederation, and the State Department. It was this three-part distinction that I unconsciously replicated in the introduction to my dissertation, which made me realize the two publications were more intimately linked than I had previously understood.
¶ 12Modeling Time
When designing databases for historical research and teaching it is crucial to remember that these databases are works of history. One of the great challenges of digital history, but also one of our field’s most important contributions to digital humanities in general, is the careful representation of time. Our sources do not exist in some eternal present, but are bound to the past in ways that computers find hard to understand. Computers record time in ways that are simply ridiculous when you are trying to bring the past alive. Who thinks in date-time stamps? True, someone’s life can change in the blink of an eye, but fractional seconds are not helpful in recording human experiences. In fact, they impose an anachronistic, hyper-precise gloss on events that creates an unnecessary barrier to comprehension. While building the EAFSD there was a harrowing week when I could not enter dates prior to 1999, and any date field left blank reverted to today’s date. I could not concentrate on anything else while the two historical dates I had entered into the database were wrong.
¶ 13 Even so, relational databases have very powerful analytic tools for analyzing dates and date ranges that can be very useful for historical purposes. The trick, therefore, is to massage the strict date-time formats to hold your data in ways that are properly formatted, but also intellectually honest. Interface design is your friend in this case, because you can set a whole range of options for how you want your dates to be displayed. However, it is still important to think long and hard about how you want to record dates in the database.
¶ 14 How you record dates will depend on what sorts of dates your sources provide. While PostgreSQL (and other relational database packages) do not know how to handle dates that are not in the Julian calendar, with the appropriate settings they can record dates back to the fifth millennium B.C.E.Figure out how you want to map your dates to the Julian calendar, and explain that process clearly on your site and any documentation you provide. Depending on the age and completeness of your sources, you may need to record partial or fuzzy dates. Partial dates are dates that are missing pieces of information (ex. June 1922). Fuzzy dates are date ranges (ex. January 5-7, 1789). Neither are officially supported, but can (with some ease) be built into the data structure. For partial dates, you can choose to enter only the data you have (month and year) and leave day as 1. Then add a series of boolean flags called “day known,” “month known,” and “year known.” Depending on which of those fields are true, the system can display the dates appropriately. This means that on average you will have a fifteen-day margin of error on any of your partial dates, but can still use all the default date calculators. For date ranges, you can have start_date and end_date fields, or the fields can be labeled “no earlier than,” and “no later than,” which is how TEI (Text Encoding Initiative) handles date ranges. Keep in mind that the more elaborate the solution, the harder it will be to extract date information. The simplest solution that can be mapped to your sources is your best bet. Once the dates are in your system, you can decide how best to display them.
¶ 15Historical Prose
So, how does all of this affect the writing of history? One answer is that standalone secondary source databases are already a major form of publishing historical research. While I am not submitting the EAFSD as my dissertation, it is a publication in its own right. As more and more history finds its way online, databases will structure future research in ways that we need to be very careful and thoughtful about. Making data structures (and the theoretical decisions that underly them) transparent through good documentation is a first step toward educating our colleagues and students about the material they are likely to find available in digital formats. There are not nearly enough digital resources for historical sources that carefully explain the reasons why the designers built their databases the way they did.
¶ 16 Databases can also be used for note taking, which as Ansley Erickson has shown, is a powerful tool for research. But designing databases brings a whole new set of issues to the forefront of the researchers mind: What are the structural similarities of my sources? What are the most important elements of the world I study? What are the key relationships between those elements? How do I need to represent time? It is my belief that investigating these questions in a systematic way deepens the historian’s understanding of their own source material and analytic framework. How that is represented in their prose (if any is generated), will depend largely on the historian and the historical subject under investigation. At a bare minimum, finding the contours of your subjects’ reality will sharpen your own understanding of what is worth including in a narrative analysis, and what is best left aggregated in the database. Earlier uses of databases by cliometricians in the 1960s and 1970s focused on large-scale analysis to discover the average experience of people in different walks of life, whether in New England townships or the U.S. Army. In contrast, working with a database allows me to privilege the mistakes and missed communications of individual Foreign Service Officers. I have found that one of the greatest benefits of a data structure as constricting as a relational database, is its ability to place the downright weird in historical context. While I was drawn to the topic because of the Foreign Service’s ability to function despite being run entirely by amateurs who, at best, learned while doing, the database has allowed me to see where the especially interesting gaps or overcompensations occurred. By making it easier to find the overall trends, I am free to explore, without overstating, any anomalies I find in the course of my research. For those of us who work on trans-Atlantic and even global topics, that freedom can prove invaluable as we sculpt arguments from an ever expanding set of potential sources.
¶ 17About the author: Jean Bauer is the Digital Humanities Librarian at Brown University. She is finishing her dissertation, “Revolution-Mongers: Creating the U.S. Foreign Service, 1775-1825,” in the Corcoran Department of History at the University of Virginia. www.jeanbauer.com
A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model invented by Edgar F. Codd, of IBM's San Jose Research Laboratory. Most databases in widespread use are based on the relational database model.
RDBMSs have been a common choice for the storage of information in new databases used for financial records, manufacturing and logistical information, personnel data, and other applications since the 1980s. Relational databases have often replaced legacy hierarchical databases and network databases because they are easier to understand and use. However, relational databases have received unsuccessful challenge attempts by object database management systems in the 1980s and 1990s (which were introduced trying to address the so-called object-relational impedance mismatch between relational databases and object-oriented application programs) and also by XML database management systems in the 1990s. Despite such attempts, RDBMSs keep most of the market share, which has also grown over the years.
According to DB-Engines, in May 2017, the most widely used systems are Oracle, MySQL (open source), Microsoft SQL Server, PostgreSQL (open source), IBM DB2, Microsoft Access, and SQLite (open source).
According to research company Gartner, in 2011, the five leading commercial relational database vendors by revenue were Oracle (48.8%), IBM (20.2%), Microsoft (17.0%), SAP including Sybase (4.6%), and Teradata (3.7%).
In 1974, IBM began developing System R, a research project to develop a prototype RDBMS. However, the first commercially available RDBMS was Oracle, released in 1979 by Relational Software, now Oracle Corporation. Other examples of an RDBMS include DB2, SAP Sybase ASE, and Informix. In 1984, the first RDBMS for Macintosh began being developed, code-named Silver Surfer, it was later released in 1987 as 4th Dimension and known today as 4D.
Historical usage of the term
The term "relational database" was invented by E. F. Codd at IBM in 1970. Codd introduced the term in his research paper "A Relational Model of Data for Large Shared Data Banks". In this paper and later papers, he defined what he meant by "relational". One well-known definition of what constitutes a relational database system is composed of Codd's 12 rules. However, many of the early implementations of the relational model did not conform to all of Codd's rules, so the term gradually came to describe a broader class of database systems, which at a minimum:
- Present the data to the user as relations (a presentation in tabular form, i.e. as a collection of tables with each table consisting of a set of rows and columns);
- Provide relational operators to manipulate the data in tabular form.
The first systems that were relatively faithful implementations of the relational model were from:
- University of Michigan -- Micro DBMS (1969)
- Massachusetts Institute of Technology (1971)
- IBM UK Scientific Centre at Peterlee -- IS1 (1970–72) and its successor, PRTV (1973–79)
The first system sold as an RDBMS was Multics Relational Data Store (1978). Ingres and IBM BS12 followed.
The most common definition of an RDBMS is a product that presents a view of data as a collection of rows and columns, even if it is not based strictly upon relational theory. By this definition, RDBMS products typically implement some but not all of Codd's 12 rules.
A second school of thought argues that if a database does not implement all of Codd's rules (or the current understanding on the relational model, as expressed by Christopher J Date, Hugh Darwen and others), it is not relational. This view, shared by many theorists and other strict adherents to Codd's principles, would disqualify most DBMSs as not relational. For clarification, they often refer to some RDBMSs as truly-relational database management systems (TRDBMS), naming others pseudo-relational database management systems (PRDBMS).
As of 2009, most commercial relational DBMSs employ SQL as their query language.
Alternative query languages have been proposed and implemented, notably the pre-1996 implementation of Ingres QUEL.
- ^Pratt, Philip J.; Last, Mary Z. (2014-09-08). Concepts of Database Management (8 ed.). Course Technology. p. 29. ISBN 9781285427102.
- ^"DB-Engines Ranking of Relational DBMS". Retrieved 2017-05-31.
- ^"Oracle the clear leader in $24 billion RDBMS market". 2012-04-12. Retrieved 2013-03-01.
- ^Funding a Revolution: Government Support for Computing Research. National Academies Press. 8 Jan 1999. ISBN 0309062780.
- ^Sumathi, S.; Esakkirajan, S. (13 Feb 2008). Fundamentals of Relational Database Management Systems. Springer. ISBN 3540483977.
- ^"Oracle Timeline"(PDF). Profit Magazine. Oracle. 12 (2): 26. May 2007. Retrieved 2013-05-16.
- ^"New Database Software Program Moves Macintosh Into The Big Leagues". tribunedigital-chicagotribune. Retrieved 2016-03-17.
- ^"A Relational Model of Data for Large Shared Data Banks"
- ^SIGFIDET '74 Proceedings of the 1974 ACM SIGFIDET (now SIGMOD) workshop on Data description, access and control
- ^Ramakrishnan, Raghu; Donjerkovic, Donko; Ranganathan, Arvind; Beyer, Kevin S.; Krishnaprasad, Muralidhar (1998). "SRQL: Sorted Relational Query Language"(PDF). e Proceedings of SSDBM.