Append-only LMS
© 2023-08-15 Luther Tychonievich
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
An infrastructure I built for multiple CS courses that uses append-only files as its database

Learning Management System (LMS) is a term I have never enjoyed; it means a website that provides lecture notes, videos, quizzes, assignment submission, and grades. Those sites don’t manage learning, they are just tooling to help run a course. I usually call them course tools, but since others call them LMSes I thought it worth mentioning both terms at the opening of this post.

I have written many course tools, and when other faculty use them (as about a dozen have) they sometimes note with surprise the lack of a database. I want to share a few thoughts related to that here.

Databases

The goal of a database, or more precisely a Database Management System or DBMS, is to organize, store, distribute, and manage access to data. I’ve used many DBMSes, including SQLite, OrientDB, PostgreSQL, MySQL, MariaDB, MongoDB, Neo4J, and Redis. But I’ve never used any of them extensively enough to feel fully relaxed working with them, so when I start a small side project I generally don’t add a DBMS to my code.

DBMSes vary in what they provide, but generally they manage concurrent access to data so that many programs can access it at the same time without stepping on one another and provide rapid searches and queries so that the data can be accessed in multiple ways. They achieve these goals by imposing some structure on what the data looks like, organizing it as a set of interconnected tables, independent data blobs, or interlinked nodes.

The concurrency bit deserves a bit more discussion. When multiple users or programs access the same data at the same time, there’s a risk that one will change the data halfway through another’s reading of it, causing nonsensical results; or that an update will read data, compute new values, and write them back without realizing that someone else modified what it had read in the meantime. Because these problems depend on the timing of multiple accesses, they are called race conditions, and they are a common bugaboo of concurrent programming.
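
To make the lost-update problem concrete, here is a minimal Python sketch (illustrative only, not code from my course tools): two threads each read a counter from a file, compute a new value, and write it back, and without any locking one of the increments can silently disappear.

    # A lost-update race, sketched for illustration.  Two threads each
    # read a counter from a file, add one, and write the result back;
    # without locking, both can read the old value and one increment is lost.
    import threading, time

    PATH = "counter.txt"

    def increment():
        with open(PATH) as f:          # read the current value
            value = int(f.read())
        time.sleep(0.01)               # widen the race window
        with open(PATH, "w") as f:     # write back, possibly clobbering
            f.write(str(value + 1))    # a concurrent update

    with open(PATH, "w") as f:
        f.write("0")

    threads = [threading.Thread(target=increment) for _ in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()

    with open(PATH) as f:
        print(f.read())                # hoped for 2, but usually prints 1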

My course tools

I started writing LMSes as a small side project. First it was just a simple web form that let students upload code and graders download it. Then I added a grade-online interface, then automated feedback for students, then an office hour queuing system, group project management, quizzes, paper test grading, spot-checking grades, regrades, extensions, excused absences, specification grading, and so on and so forth. But the key bit is that I always thought I was just adding a few features, maybe a hundred lines of code, and never set out to engineer a full LMS.

From the outset, I needed concurrent data access. But I didn’t think I was building something large enough to be worth my while wrangling a DBMS, so I used a feature every file system I know of provides that guarantees the lack of race conditions: append-only files.

File systems are technically a type of database, organized on the assumption that you’ll have individual files and that the file system has no idea what’s inside them. They tend to provide a few concurrency protections, such as ensuring that you can’t make two files with the same name in the same directory. One of these protections is that they allow files to be opened in append-only mode; in this mode, when you give them some data to write they guarantee it will be placed contiguously (with no other writes interrupting it) at the end of the file (though it might not remain the end if some other write comes along after it).
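
As an illustration of how this looks in practice (a sketch, not the actual course-tool code), appending one event per line in Python can be done as below; opening with O_APPEND asks the kernel to position each write at the current end of the file, and writing the whole record in one call keeps it contiguous.

    # Sketch of appending one event per line (illustrative; the real
    # record format is not shown in this post).  O_APPEND makes the
    # kernel move to end-of-file for each write, even if several
    # processes are appending to the same log at once.
    import json, os, time

    def log_event(path, event):
        line = json.dumps(event) + "\n"
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            os.write(fd, line.encode("utf-8"))   # one write, one contiguous record
        finally:
            os.close(fd)

    log_event("actions.log", {"when": time.time(), "who": "grader1",
                              "type": "set-score", "student": "mst3k",
                              "value": 92})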

Append-only files are simple and concurrency-safe, but they never let you modify data you’ve previously written. How could I use an append-only file to store grader decisions and still let the graders change their minds? By putting into the file not the current grade but rather the grader actions that have taken place. Then to compute the current grade I read the entire file, compute the compound effect of all actions, and display that to the user.
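
A minimal sketch of that replay idea (the action names and record format here are invented for illustration, not the real tools’ format): each line of the log is one grader action, and the current grade is whatever falls out of folding all of the actions in order.

    # Sketch of replaying an action log to recover current state.
    # Later actions override or adjust earlier ones, so graders can
    # change their minds without ever rewriting old records.
    import json

    def current_grades(path):
        grades = {}
        with open(path) as log:
            for line in log:
                action = json.loads(line)
                student = action["student"]
                if action["type"] == "set-score":
                    grades[student] = action["value"]
                elif action["type"] == "adjust-score":
                    grades[student] = grades.get(student, 0) + action["value"]
                elif action["type"] == "retract":
                    grades.pop(student, None)
        return grades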

When I initially implemented this it felt like a hack, a temporary measure to prototype things before picking a DBMS for the real system later. But I was wrong: after ten years of use by a dozen instructors and ten thousand students, and even with scores of features I’d never anticipated building, my course tools still use append-only files as their only database.

Append-only files work

Why do my simple append-only files work so well?

First, there just isn’t that much data being produced. I noted in my previous post that keystrokes and clicks just can’t be made that quickly; the append-only files simply never get very large. Most users of my tools only generate a few events a minute, a few short bursts of typing excepted. Fairly early on I stopped even trying to store the non-typing information efficiently: I can use 50 bytes per event, adding extra information to make the action logs readable by a human, and still never come close to any kind of space limitation.

Second, they’re very fast. In practice, writes are as fast as can matter: even a cheap rotating magnetic disk can do the append-only writes faster than a mid-level network switch can provide it with data. Reads are also rapid, but they require re-running the stored actions, which I initially worried would be slow. To mitigate that, I started storing every user’s actions in a separate file so that no single replay would take too long; that proved to be a poor decision from an efficiency standpoint. The thing that is actually slow is opening, reading, and closing a large number of different files. The only operation in my LMSes with a noticeable delay is providing the course instructor with a full gradebook, because that might involve opening 10,000 files; but I’ve left that operation as is because it still only takes a few seconds, is something I only do a few times a semester, and hasn’t been worth redesigning for speed. Also, once I realized that file count was the bottleneck I started designing my tools with fewer distinct files, combining more operations into each file, and as a consequence my later tools could show a 400-person class’s 20-assignment gradebook in under a second.
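
The expensive version looks roughly like the sketch below (the directory layout and record format are again invented for illustration): with one log per student per assignment, building the full gradebook spends most of its time in the open/read/close overhead of thousands of tiny files rather than in the replay itself.

    # Sketch of the gradebook bottleneck.  One log per student per
    # assignment means the per-file open/read/close cost, repeated
    # thousands of times, dominates; the replay work itself is cheap.
    import glob, json, os

    def full_gradebook(log_dir):
        book = {}
        for path in glob.glob(os.path.join(log_dir, "*", "*.log")):
            assignment = os.path.basename(os.path.dirname(path))
            student = os.path.splitext(os.path.basename(path))[0]
            score = None
            with open(path) as log:        # one open/close per tiny file
                for line in log:
                    action = json.loads(line)
                    if action["type"] == "set-score":
                        score = action["value"]
            book.setdefault(student, {})[assignment] = score
        return book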

Third, they support operations I haven’t invented yet. I have a log of every action every user took. Come up with a new action type? Simply start appending it to the logs, no reformatting needed. Come up with a new report to generate? The data to do so is present already because all the data is present already. Want to inspect a strange occurrence and lack the tools to do so? I can simply open the logs and do it by hand: the full story is there if I have the patience to read it.

Where from here?

I believe in SQLite’s mission and design. I am intrigued by OrientDB’s multi-modal structure. Gremlin and TinkerPop make me happy. If an SQLite-like single-file graph database with C linkage came along, I’d likely use it for my next project, particularly if it was designed with asynchronous I/O in mind and thus easy to integrate with vibe.d.

But even if I one day switch to a DBMS, I’ll likely keep the append-only files around too. After a decade of using them, it is hard to imagine not having all the unprocessed data available and human-readable to enable future extensions and one-off inspections. If only the many other off-the-cuff design decisions I made in the interest of rapid prototyping had worked out half as well!