File sizes and types
© 2023-08-14 Luther Tychonievich
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
Reflections on what I copied when transferring my files to a new computer.

I recently changed operating systems on several of my computers, and because I had not set two of them up to handle OS changes smoothly (I’d put / and /home on the same logical volume, meaning my data and the OS’s data were all mixed up), I had to copy my user-created files to the new OS after installation (I didn’t have to copy anything off first, because I make regular backups). When doing that, I was struck by how many of the files I’d created were dramatically larger than the information I’d put into them.

To understand this gap, let’s consider three types of file-creating programs. Most respond to keys, buttons, and menu items. Some illustration programs also capture pointer stroke data. And recording programs capture microphone and camera data.

Recordings

Recording data, especially video, is mind-bogglingly large. Your microphone probably captures 16-bit sound pressure @ 44.1 kHz for about 86 KiB per second, meaning 2.4 GiB per 8-hour work day. Your webcam probably captures somewhere between 720p @ 30 Hz for 79 MiB per second and 1080p @ 60 Hz for 356 MiB per second, meaning between 2.2 and 9.8 TiB per 8-hour work day.
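
For the curious, the back-of-the-envelope arithmetic behind those figures looks like this (assuming uncompressed 16-bit mono audio and 3 bytes per pixel of uncompressed video, which are reasonable stand-ins for raw capture):

```python
# Rough raw-capture data rates; assumes 16-bit mono audio and 3-byte RGB pixels.
KIB, MIB, GIB, TIB = 1024, 1024**2, 1024**3, 1024**4
WORKDAY = 8 * 60 * 60  # seconds in an 8-hour work day

audio = 2 * 44_100                    # 2 bytes per sample at 44.1 kHz
video_720p30 = 1280 * 720 * 3 * 30    # bytes per second
video_1080p60 = 1920 * 1080 * 3 * 60  # bytes per second

print(f"audio:    {audio / KIB:.1f} KiB/s, {audio * WORKDAY / GIB:.1f} GiB/day")
print(f"720p@30:  {video_720p30 / MIB:.0f} MiB/s, {video_720p30 * WORKDAY / TIB:.1f} TiB/day")
print(f"1080p@60: {video_1080p60 / MIB:.0f} MiB/s, {video_1080p60 * WORKDAY / TIB:.1f} TiB/day")
```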

My recording apps store much less data than that in the files they produce. They apply what is technically called lossy compression: they analyze the data produced, reorganize it in a way that stores the broad outlines separately from the fiddly details, and discard those details that are unlikely to interest me. Video, audio, and image compression are ongoing research areas, but the fairly mainstream algorithms my tools use compress audio 5–10-fold and video over 1000-fold.
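
As a toy illustration of that “broad outlines plus coarsened details” idea (not the actual codecs my tools use, which are far more sophisticated), here is a sketch that stores one average per block of samples plus residuals rounded to a coarse step:

```python
# Toy lossy coder: per block, keep the mean (the broad outline) plus residuals
# quantized to a coarse step (the fiddly details, at reduced precision).
def lossy_encode(samples, block=8, step=256):
    blocks = []
    for i in range(0, len(samples), block):
        chunk = samples[i:i + block]
        mean = sum(chunk) // len(chunk)
        residuals = [round((s - mean) / step) for s in chunk]  # precision discarded here
        blocks.append((mean, residuals))
    return blocks

def lossy_decode(blocks, step=256):
    out = []
    for mean, residuals in blocks:
        out.extend(mean + r * step for r in residuals)
    return out  # close to the original, but the discarded precision is gone for good
```

For smooth audio the residuals are small numbers that fit in far fewer bits than the original 16-bit samples, which is where the space savings come from.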

Illustrations

Pointer stroke data is much more reasonable. My art tablets (I have several different Wacom tablets; the Intuos Pro produces the most data) produce 6-byte packets of information @ 240 Hz for about 1.4 KiB per second, or about 40 MiB per 8-hour work day.

In principle, illustration apps could apply two different kinds of compression to reduce file sizes. First, the input is produced by a physical hand, which is subject to inertia, meaning there is a very strong correlation between sequential packets. By encoding how much each new packet differs from what inertia would have predicted, I’ve achieved a 5-fold compression without discarding any information (i.e. a lossless compression), and I’m no expert in this space, so I’d not be surprised if a 10–20-fold compression were possible. Additionally, much of the data doesn’t matter: I often move the pointer between strokes, and the fiddly details mostly reflect limitations in my motor control and can be discarded; together I’d guess these enable another 2–5-fold lossy compression.
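
Here is a minimal sketch of that inertia-based predictive coding, assuming each packet is just a tuple of integers like (x, y, pressure); the real packet layout, and whatever compression my tools actually apply, will differ:

```python
# Lossless predictive coding: store only how far each packet lands from where
# constant-velocity motion (inertia) would have put it.
def encode(packets):
    residuals = list(packets[:2])  # first two packets stored verbatim
    for prev2, prev1, cur in zip(packets, packets[1:], packets[2:]):
        predicted = tuple(2 * a - b for a, b in zip(prev1, prev2))  # inertial guess
        residuals.append(tuple(c - p for c, p in zip(cur, predicted)))
    return residuals

def decode(residuals):
    packets = list(residuals[:2])
    for res in residuals[2:]:
        predicted = tuple(2 * a - b for a, b in zip(packets[-1], packets[-2]))
        packets.append(tuple(p + r for p, r in zip(predicted, res)))
    return packets  # exactly the original packets: nothing was discarded
```

For smooth strokes the residuals hover near zero, so a general-purpose entropy coder can then pack them into far fewer than 6 bytes per packet.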

All that said, the files produced by my illustration software tend to be around the same size that a raw, uncompressed recording of the stroke data would be. The software does some compression, but more importantly it stores the picture the stylus motions created rather than the stylus data itself.
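
A rough comparison shows why the sizes land in the same neighborhood (the canvas dimensions here are an illustrative assumption, not a measurement of my actual documents):

```python
# An uncompressed raster canvas vs. a full day of raw stroke packets.
canvas = 4000 * 3000 * 4           # pixels times 4 bytes (RGBA): about 45.8 MiB
strokes = 6 * 240 * 8 * 60 * 60    # 6-byte packets @ 240 Hz for 8 hours: about 39.6 MiB
print(f"canvas: {canvas / 2**20:.1f} MiB, strokes: {strokes / 2**20:.1f} MiB")
```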

Most apps

Most applications I use respond to keystrokes and to button and menu item clicks. A skilled typist can press around 500 intentional keystrokes a second on a keyboard with around 26 keys, for around 400 B per second or 11 MiB per 8-hour work day. Most of us operate at well under half that speed. But even a slow typist far outstrips a mouse user: the fastest mouse user I’ve seen clicks fewer than 10 things per second, with far fewer than 26 options visible at a time. Apps that use clicks to position things in space could, for a fast mouse user, approach the data production of a keyboard.

Once again, this can be usefully compressed. Lossy compression can recognize that a sequence like “e, Backspace” is the same as nothing at all; how much that saves will depend on the user, but it probably isn’t much. Lossless compression can take advantage of the predictability of text to get roughly a 5-fold compression. (I’m extrapolating from some tests pioneered by Claude Shannon, where you show someone some text cut off in the middle and ask them to guess the next keystroke. The more accurate the guesses, the less information (or, in Shannon’s term, entropy) is contained in each keystroke and the more it can be compressed. He did this with English, but I’ve had students do this with various languages as class assignments, and 1 bit per keystroke is a pretty common result.)
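
You can get a feel for that lossless figure by running a general-purpose compressor over any large chunk of plain English; the filename below is a placeholder, and purpose-built text models do better still, approaching Shannon’s roughly 1 bit per character:

```python
# Compress a plain-English file and report the ratio and bits per input byte.
import lzma
from pathlib import Path

text = Path("some_plain_english.txt").read_bytes()  # placeholder: any large text file
packed = lzma.compress(text, preset=9)
print(f"{len(text)} -> {len(packed)} bytes, "
      f"{len(text) / len(packed):.1f}-fold, "
      f"{8 * len(packed) / len(text):.2f} bits per input byte")
```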

Combining these observations, I’d expect around 1 MiB per day of actual information produced. But in practice these apps put much more data than that in the files they produce. Highly efficient apps like the one I’m using to type this blog post store roughly 6–10× the data I produce, doing the lossy compression but storing what’s left in a slightly padded, uncompressed way. But many other content creation tools I use – Blender, LibreOffice, GraphViz, and so on – store several-fold more data than that. Like my art programs, they are computing what my keystrokes and clicks mean and storing the result in the file, but without compressing it first. In the case of office tools like word processors and spreadsheets, this results in so much data that the files are compressed as part of their creation with general-purpose lossless compression algorithms, and they still end up several times larger than the raw uncompressed event-stream data would have been.

Derivatives and Cruft

I write a lot of code, compile it, and run the compiled versions. Those compiled versions are an example of a derivative file: one I can deterministically re-create from the source file. Other example derivatives on my computers include sheet music and audio files created from Lilypond sources, HTML generated from Markdown, PDF created from HTML and ODT, JSON created from YAML and GEDCOM, PNG created from SVG, and so on. The size of derivatives varies widely: I have source code that compiled to half its size and source code that compiled to 40 times its size. That said, whether small or large, the derivatives are unnecessary to back up if the source is backed up.

Often there are files created along the way when interacting with an app or generating derivatives that are of only short-term interest. They may facilitate recovery if the app crashes or store the results of intermediate operations to make future operations faster. Some of these temporary derivatives are discarded automatically, but many linger long after they have any obvious use. The colloquial computing jargon for such left-over files is cruft, and a lot of it accumulates. For example, I’ve been using my work desktop computer for about 10 months, and in that time the cruft I know how to search for comes to almost 200,000 files totaling around 28 GiB, or roughly one quarter of all the data on my hard drive.
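
As a sketch of the kind of search I mean (the patterns here are a few common examples, not my actual list), something like this tallies it up:

```python
# Tally files matching a few common cruft patterns under a directory.
from pathlib import Path

PATTERNS = ["*.o", "*.pyc", "*~", "*.aux", "*.log", "*.tmp", "*.bak"]

def tally_cruft(root="."):
    seen, total = set(), 0
    for pattern in PATTERNS:
        for path in Path(root).rglob(pattern):
            if path.is_file() and path not in seen:
                seen.add(path)
                total += path.stat().st_size
    return len(seen), total

count, size = tally_cruft(Path.home())
print(f"{count} files, {size / 2**30:.2f} GiB")
```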

I wish my computer had a consistent way of distinguishing between source files containing the app-processed version of what I created; product files containing usable derivatives; and intermediate files containing derivatives of short-term interest. But to do that would require consistent behavior from all file-creating applications, which in turn would require that every application I use, every application they use, and every shared library each of those uses all adopt this new behavior; on my work desktop computer that comes to some 70,000 programs or libraries. (I don’t actually use all of these: some came with the OS but I’ve never opened them, though I don’t have good tracking of which ones. There are also many that don’t create any files, though exactly how many I don’t know.) I doubt it will ever happen, but I still wish…