Why Putting JSON in a Database Column Is a Bad Idea. Or, Why Would You Ever Put JSON in a Database Column?

When I was a beginner at databases, I was tempted to put JSON in a string column. It’s a weirdly common thing for relational database beginners to want to do, and a lot of tutorials aimed at beginners suggest it for some reason. When people show up on Stack Overflow wanting to know why their code isn’t working, and that code happens to be putting JSON in a relational database column, sometimes someone will tell them to quit doing that, but it’s only about a 50/50 shot. Pretty much every other questionable beginner practice attracts hordes of smartasses, some of them barely more than beginners themselves, who strongly and sometimes stridently censure the beginner for contemplating it. I attribute the popularity of Mongo and other NoSQL databases in part to this instinct: Mongo is basically a database with nothing but string columns that have JSON in them, plus some search functionality for the JSON in those string columns.

If you’re sometimes tempted to put JSON in a string column, I hope I can explain today why it’s a bad idea and you shouldn’t do it, in a way that will make sense. On the other hand, maybe you’re one of those relational database savants who understood third normal form right away, and you don’t understand why anyone would ever want to put JSON in a string column instead of having a real schema. If so, I hope I can explain to you why someone might want to put JSON in a string column.

Strong Static Typing, Relational Data Modeling, and Escape Hatches

In “Is Weak Typing Strong Enough?”, Steve Yegge describes (buried somewhere in the point he was actually making about programming languages) how teams he worked with at Amazon were boxed in by the strong static typing and strict schemas that relational data models imposed on them, and resorted to tactics like an untyped name/value system and passing XML as a string parameter through a CORBA interface. (CORBA was a standard for an early style of distributed services that let you treat objects running on remote servers as if they were in the same address space and could call each other’s methods. It was also language-agnostic, so you could run Java on one server and APL on another, and they could use their CORBA implementations to call each other’s methods. At least, that’s what Wikipedia said; CORBA predates my programming experience by some years.)

The point here is that relational database models are a sort of strong static typing. No matter how much you love strong static typing, sometimes it’s not flexible enough, and you need escape hatches. In 2005 when Steve Yegge wrote his piece, it was XML strings through CORBA. On a programming language level, it’s things like downcasting from Object in Java or using the dynamic keyword in C#. Storing JSON in a string column is another one of these escape hatches. What you get is flexibility. Let’s keep this in mind as we go into our big example, which I hope will demonstrate both why you shouldn’t put JSON in string columns and why you might want to sometimes.

Hugs and Warm Fuzzies: An Example

This example is loosely based on a real system I worked on that had JSON in string columns.

Let’s say we’re working on an MMORPG called Crayon Art Online. Unlike most MMORPGs, which are all about killing things, Crayon Art Online is all about love and friendship and positivity.

Every few hours, one of our servers needs to kick off a job that will read a bunch of player actions from the game’s main database and calculate metrics on them, storing them in a separate metrics database. Since the game is all about doing friendly, happy things, the actions will be things like hugs_given, crayon_drawings_gifted, lunches_bought_for_others, pep_talks, and positive_affirmations. Some actions have extra associated metrics, like hug_warmth and affirmation_positivity and pep_talk_peppiness. Some metrics are simple Boolean checks, like sculpting_class_taken?, which is just true or false. We need to calculate all these metrics for all the players across different periods of time: the past 30 days, the past 90 days, the past year, and the entire period the player has existed. Then we need to store them in a database so another job that runs later can read them and distribute prizes to the players that reach certain goals, and messages of encouragement and hope to the others.

Let’s go into some aspects of our data that we’ll have to model. We have a ton of different actions–let’s say there are about 25 total. It’s not a fixed set; there will be new metrics if the developers add new actions to the game, and they might also remove actions, so we would lose the metrics associated with those. That means making a separate table for each metric is doable, but a huge pain. Every time we add a new metric, we have to add a new table. Reading all the metrics for a player means querying 25 tables. A simple manual check for data consistency between the main game’s database and ours requires us to write a select command that includes 25 tables. It seems there must be an easier way.

However, one table also clearly won’t work unless we store all the values as strings. We need a value column to be able to store integers (hugs_given), floats (average_positivity), Booleans (is_wearing_floppy_hat?), and strings (floppy_hat_color). So another option would be one table per value type: int_valued_metrics, real_valued_metrics, boolean_valued_metrics, and string_valued_metrics. But that’s pretty tacky. Your data model is supposed to model the data, like it says in the name. The data itself doesn’t naturally divide into these different types; they’re all just metrics, with the same fundamental features: a name, a time period, a user, and a value. Our database engine needing to store the value as an integer or float or whatever is an implementation detail, not an actual part of the domain, and it feels unnatural to spread the data model for the single metric concept across four tables (and counting, if you add types later—maybe at some point you have tuple-valued metrics, or BLOB-valued). This is better than 25 tables, and it’s better than one table, but it’s still not good.
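
To make the tackiness concrete, here’s roughly what that split looks like (a sketch; the names and column types are my own inventions): four near-identical tables that differ in exactly one column type.

create table int_valued_metrics (
    user_id     bigint unsigned not null,
    metric_name varchar(64)     not null,
    time_period varchar(16)     not null,
    value       bigint          not null   -- the one line that differs
);

create table real_valued_metrics (
    user_id     bigint unsigned not null,
    metric_name varchar(64)     not null,
    time_period varchar(16)     not null,
    value       double          not null   -- ...here it's a double
);

-- boolean_valued_metrics and string_valued_metrics: the same thing
-- again, with boolean and varchar value columns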

At this point, things look pretty dire. Relational data modeling has failed us. But wait, you realize. Every programming language nowadays, well except for Java anyway, has a powerful JSON parsing engine built in that can turn a string of JSON into a native map, with type conversions and everything! So, why don’t we just store metrics as a string of JSON per user?

Strings of JSON have lots of attractive qualities. They’re easy to understand; JSON is dead simple, by construction, whereas SQL and relational data models are complicated and hard to understand. Strings of JSON are easy to parse into whatever kind of map or hash thingy your language supports (unless you’re using Java or Scala), and once you’ve done that, you know exactly what you can do with them. Reading from a relational database usually requires some annoying library that binds the data to native data types, or else you can make raw SQL calls, get handed back the most useless, primitive thing your language supports, and do the conversion yourself. JSON is dynamically typed, so you don’t have to worry about finding the correct data type; just store the string, and when you read it back the parser will figure out what types everything should be. JSON is extremely flexible, letting you store anything from a simple glob of key-value pairs to a bramble of arbitrarily nested lists and maps, whereas with relational data models you’re stuck with two-dimensional tables.

So, let’s just forget all this relational modeling stuff and make a table that maps a user id to a string of JSON that looks like this:

{
    "hugs_given": {"30_days": 5, "90_days": 10, "1_year": 85, "all_time": 345},
    "has_taken_respect_seminar?": true,
    // ...
}

Now it’s easy to look up a user’s stats: just select the_json_string from metrics where user_id = 394, and you have a JSON string with all the metrics. Whenever you add a new metric, you just have your code calculate it and stick it in the JSON object before writing it back as a string. No schema changes, no data model cabal to get through, just code sticking things in JSON. If you decide integer-valued stats should also support a count for the last 42 days, just calculate it and stick it in everyone’s JSON.
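
Concretely, the whole cycle looks something like this (a sketch; the 42-day numbers are made up, and the JSON surgery itself happens in application code, since the database only ever sees an opaque string):

-- read the blob out
select the_json_string from metrics where user_id = 394;

-- parse it in application code, stick the new 42-day counts in,
-- re-serialize, and write the whole thing back
update metrics
set the_json_string = '{"hugs_given": {"30_days": 5, "42_days": 7, ...}}'
where user_id = 394;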

But here’s a problem: suppose you want to find all the users with more than 26 hugs given in the last 30 days. How do you query this data? SQL can’t read JSON, so you can’t do select * from metrics where the_json_string.hugs_given.30_days > 26. That JSON string is opaque to the relational database. (Unless you’re using PostgreSQL. We’ll pretend you’re not. The real project I worked on used MySQL, but I believe Oracle and MS SQL Server also lack a JSON type.) So how do you write this query? You can write your own terrible ad hoc JSON parser using regexes and LIKE right there in your database console, or you can write a script in some language like Python or Ruby that reads in the data, parses the JSON string, and does the check for you.

Writing this script can get surprisingly tricky. If the entire table doesn’t fit in memory, you can’t just read all the rows and loop over them. The simple approach would be to get the maximum id of the table and then select each row by id, skipping over misses, until you’re done. But this incurs a disk read on each iteration of the loop, plus possibly a network call if you’re not allowed to run the script right on the database server, so it’ll be agonizingly slow. So you’ll probably read in batches, and now you have to mess with the logic around batch sizes and offsets. It probably won’t take more than an hour or so to write this script. Still, a simple select is something you can dash off about as fast as you can type it.

And even if you write the script generically enough that you now have a general-purpose tool for querying based on values in the JSON string, you still don’t have all your tools available in one place. If you’re working in the database console and suddenly realize you need to query some data in the JSON string to proceed, you have to stop, go to another window, and run the script. There’s extra friction if you need to feed that data into an SQL query at the database console to move on. There’s this gaping hole where you can’t use SQL to find data that’s stored in your SQL database.
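
To give you the flavor of that first option, here’s a console-only stab at the hug query against the JSON format above, using MySQL’s string functions (purely illustrative: it assumes the exact key order and spacing shown earlier, and one stray space breaks it):

select user_id
from metrics
where cast(
        substring_index(        -- 3. ...then take the number after the ': '
          substring_index(      -- 2. ...cut it off at the first comma...
            substring_index(    -- 1. grab everything after "hugs_given"...
              the_json_string, '"hugs_given"', -1),
            ',', 1),
          ': ', -1)
      as unsigned) > 26;

It runs, but it’s string surgery over a full table scan, and it silently returns garbage the moment the stored JSON’s layout drifts.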

Putting data in strings also kills the validation benefits that a statically typed data model gives you. If someone accidentally writes {"hugs_given": "fish"} to the database, the database doesn’t know that that’s not how you measure hugs. If someone accidentally writes {"hugs_given": {"30_days": 5, "30_days": 10}} to the database, the database doesn’t know or care that you repeated a key.
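
To see the difference, compare a typed column (here, one of the hypothetical per-metric tables from earlier) with the string column, assuming MySQL running in strict mode:

-- against a typed column, nonsense bounces off: strict-mode MySQL
-- rejects this with an "Incorrect integer value" error
insert into hugs_given_metrics (user_id, value)
values (394, 'fish');

-- against the string column, anything goes: this succeeds, because
-- it's all just characters as far as the database is concerned
update metrics
set the_json_string = '{"hugs_given": "fish"}'
where user_id = 394;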

How to Proceed?

So now we know both the strengths and weaknesses of storing JSON in string columns. It’s flexible, easy to change, easy to write, easy to understand, easy to work with in code. On the other hand, it’s meaningless to the SQL database (unless you’re fortunate enough to be working in Postgres). It’s just this string of characters that might as well be the complete text of the classic novel Lady Chatterley’s Lover by D.H. Lawrence. You can’t query it from SQL in any meaningful way. You can’t put indexes on it. (Well, you could put an index on it, but it would be massive, and it would only help you find exact matches on the entire string, which is no help at all.) You lose the data validation benefits that a relational schema gives you. What do we do?

It’s possible that you feel you can live with the limitations of JSON strings in database columns. Maybe you can. I have a feeling they’ll eventually get in your way, though. They certainly got in mine.

If you have the option of switching data stores, you could just use Postgres, or some JSON-oriented document store like MongoDB, although I have reservations about that in general. In Crayon Art Online, as in the real application that inspired it, metrics are calculated by a separate process and stored in a different database than the main application, so even if your main application is on MySQL, it might be doable to switch your metrics application to Postgres. (It wasn’t for me, for a variety of boring non-technical reasons.)
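
For what it’s worth, here’s what the impossible hug query from earlier looks like in Postgres, assuming the same the_json_string text column (a sketch; with a native jsonb column you’d skip the cast):

select user_id
from metrics
where (the_json_string::jsonb -> 'hugs_given' ->> '30_days')::int > 26;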

But suppose you’re convinced that storing an opaque string of JSON won’t work for you, and you don’t have the option of switching data stores. You’re left with trying to find a better data model to fit this into a relational schema. In this particular case, where the nesting isn’t that deep, it’s not actually that hard to come up with an acceptable solution that’s better than JSON string in a column.

To start with, I’d represent each individual calculated metric as a row with a user id, type tag (hugs_given, positivity_quotient, etc.), a date range tag (30_day, 90_day, etc.), a value, and a created_at date to record when we calculated it.
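
As a sketch in MySQL-flavored DDL (the names and sizes are my own inventions, and the type of the value column is the open question we’ll get to in a moment):

create table metrics (
    id          bigint unsigned not null auto_increment primary key,
    user_id     bigint unsigned not null,
    metric_type varchar(64)     not null,  -- 'hugs_given', 'positivity_quotient', ...
    date_range  varchar(16)     not null,  -- '30_day', '90_day', '1_year', 'all_time'
    value       varchar(255)    not null,  -- see below; this is the hard part
    created_at  datetime        not null
);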

In my experience, rolling date ranges are a lot simpler than actual dates, so even though it might be tempting to get rid of the string keys like 30_day and 90_day and come up with some way of storing the date range based on actual dates, doing so could complicate things a lot. If you need to know the actual date range a row covers, you can count backwards from its created_at. If you expect to need this a lot, it might make sense to store the date keys as integers that represent the number of days covered by the metric, so we’d have a column days_covered, and then we can find the start date of the period by subtracting that from the created_at.
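
With that variant (swapping the date_range tag for an integer days_covered column), recovering the real start date is one function call away, sketched here with MySQL’s date arithmetic:

select user_id, metric_type, value,
       date_sub(created_at, interval days_covered day) as period_start
from metrics;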

The main hurdle with representing this in a schema is the fact that the value type can be an integer, real number, Boolean, string, or possibly other things. Making separate tables for each value type is tacky in my opinion, but you could go this route if you want. Then you could use some kind of wrapper at the data access layer to hide the four classes from your code and present one consistent interface for them.

An even tackier variant would be to make your value column store a foreign key into another table, and make four value tables that hold your values. So you’d have the five tables metric, int_value, boolean_value, real_value, and string_value; metric would have a type column that tells it which table to look for its values in, and a value_id column that points into one of the value tables. Not only does this have the same disadvantages as the four metrics tables, it’s also weird and complicated to understand.

Another option: store the value, and only the value, as a string, and implement some kind of conversion to the correct type at the data access layer. You can still run queries on the numeric values by casting the string to a number, so you’re not as out of luck as you were with opaque JSON strings. None of these are great solutions. This was the original hurdle that led us to putting JSON in a text column, and I don’t know of any really good way out of it. But I do think there are better ways than JSON in text columns.
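
That last option is what the value varchar column in the sketch above assumes, and the cast keeps queries like our hug search perfectly doable (again MySQL-flavored, assuming the days_covered variant):

select user_id
from metrics
where metric_type = 'hugs_given'
  and days_covered = 30
  and cast(value as unsigned) > 26;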

This is not a perfect, beautiful schema. It’s not a work of art. It’s not going to get written up in textbooks or research papers as amazing. But it is a real relational database schema which will work better for you in the long run than shoving a string of JSON into a column. (Or a string of CSV, or XML, or YAML, or any other kind of structured text format that isn’t supported natively by the database type system.) You can do queries on it, you can analyze it, you can put indexes on it, you can implement certain data consistency standards on it (e.g. a unique key on user id, type, days_covered, and created_at that prevents you from accidentally inserting the same stat for the same user and date range more than once). There are also opportunities to make things better as you discover problems. With text in a column, there’s not a lot else you can do with it when you discover problems.
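
That consistency rule is a one-liner against the table sketched above (the key name is arbitrary):

alter table metrics
    add unique key one_stat_per_period (user_id, metric_type, days_covered, created_at);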

This schema is also more flexible and manageable than separate tables for each metric. If you need to add a new stat, just start calculating it and add a new tag for it. If you want to support a new period of time, just make a tag for it and start storing it. No need for schema changes.
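
Adding a hypothetical new stat over a hypothetical new 42-day window, say, is just a matter of inserting rows (compliments_paid is made up for illustration):

insert into metrics (user_id, metric_type, days_covered, value, created_at)
values (394, 'compliments_paid', 42, '17', now());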

The Takeaway

If you came into this as an expert who loves relational databases and strongly typed schemas, I hope you can understand a little more the kind of problems that seem too hard to fit into the strictures of such a schema, driving beginners to want to put JSON in string columns.

If you came into this as someone who’s occasionally—or more than occasionally—thought the answer to all your problems was shoving JSON in a text column, I hope you now understand why that’s not really the answer to your problems. Or maybe you learned that you need to switch to another data store that supports shoving JSON into text columns better, and now, at every company you work for, you’ll lobby to switch to Postgres or Mongo. Someday that might even lead you to a job as a MongoDB ambassador who gets to travel around to conferences telling everyone why shoving JSON in text columns is terrible, and why they should instead shove JSON in MongoDB to solve all their problems. Just think: I could be starting careers here today.
