Why Putting JSON in a Database Column is a Bad Idea. Or, Why Would You Ever Put JSON in a Database Column?

When I was a beginner at databases, I was tempted to put JSON in a string column. It’s a weirdly common thing for beginners to relational databases to want to do. A lot of tutorials aimed at beginners suggest doing it for some reason. When people show up on Stack Overflow wanting to know why their code isn’t working, and that code happens to be putting JSON in a relational database column, sometimes someone will show up and tell them to quit doing that, but it’s only like a 50/50 shot, whereas pretty much every other questionable beginner practice will attract hordes of smartasses, some of them barely more than beginners themselves, who will strongly and sometimes stridently censure the beginner for contemplating it. I attribute the popularity of Mongo and other NoSQL databases in part to this instinct: Mongo is basically a database with nothing but string columns that have JSON in them, and some search functionality for JSON in those string columns.

If you’re sometimes tempted to put JSON in a string column, I hope I can explain today why it’s a bad idea and you shouldn’t do it, in a way that will make sense. On the other hand, maybe you’re one of those relational database savants who understood third normal form right away, and you don’t understand why anyone would ever want to put JSON in a string column instead of having a real schema. If so, I hope I can explain to you why someone might want to put JSON in a string column.

Strong Static Typing, Relational Data Modeling, and Escape Hatches

In “Is Weak Typing Strong Enough?”, Steve Yegge describes (buried somewhere in the point he was actually making about programming languages) how teams that he worked with at Amazon were too boxed in by the strong statically typed strict schema the relational data models imposed on them, and resorted to tactics like an untyped name / value system and passing an XML parameter as a string to a CORBA interface. (CORBA was a standard for an early style of distributed services that let you treat objects running on remote servers as if they were in the same address space and could call each others’ methods. It was also language agnostic, so you could run Java on one server and APL on another server and they could use their CORBA implementations to call each others methods. At least, that’s what Wikipedia said; CORBA predates my programming experience by some years.)

The point here is that relational database models are a sort of strong static typing. No matter how much you love strong static typing, sometimes it’s not flexible enough, and you need escape hatches. In 2005 when Steve Yegge wrote his piece, it was XML strings through CORBA. On a programming language level, it’s things like downcasting from Object in Java or using the dynamic keyword in C#. Storing JSON in a string column is another one of these escape hatches. What you get is flexibility. Let’s keep this in mind as we go into our big example, which I hope will demonstrate both why you shouldn’t put JSON in string columns and why you might want to sometimes.

Hugs and Warm Fuzzies: An Example

This example is loosely based on a real system I worked on that had JSON in string columns.

Let’s say we’re working on an MMORPG called Crayon Art Online. Unlike most MMORPGs, which are all about killing things, Crayon Art Online is all about love and friendship and positivity.

Every few hours, one of our servers needs to kick off a job that will read a bunch of player actions from the game’s main database and calculate metrics on them, storing them in a separate metrics database. Since the game is all about doing friendly, happy things, the actions will be things like hugs_given, crayon_drawings_gifted, lunches_bought_for_others, pep_talks, and positive_affirmations. There will be some actions that have extra associated metrics, like hug_warmth and affirmation_positivity and pep_talk_peppiness. There will be some conditions where we just check for a Boolean answer, like sculpting_class_taken?, which just has a true or false answer. We need to calculate all these metrics for all the players across different periods of time: the past 30 days, the past 90 days, the past year, and the entire period the player has existed. Then we need to store them in a database so another job that runs later can read them and distribute prizes to the players that reach certain goals, and messages of encouragement and hope to the others.

Let’s go into some aspects of our data that we’ll have to model. We have a ton of different actions–let’s say there are about 25 total. It’s not a fixed set; there will be new metrics if the developers add new actions to the game, and they might also remove actions, so we would lose the metrics associated with those. That means making a separate table for each metric is doable, but a huge pain. Every time we add a new metric, we have to add a new table. Reading all the metrics for a player means querying 25 tables. A simple manual check for data consistency between the main game’s database and ours requires us to write a select command that includes 25 tables. It seems there must be an easier way.

However, one table also clearly won’t work unless we store all the values as strings. We need a value column to be able to store integers (hugs_given), floats (average_positivity), Booleans (is_wearing_floppy_hat?), and strings (floppy_hat_color). So another option would be one table per value type: int_valued_metrics, real_valued_metrics, boolean_valued_metrics, and string_valued_metrics. But that’s pretty tacky. Your data model is supposed to model the data, like it says in the name. The data itself doesn’t naturally divide into these different types; they’re all just metrics, with the same fundamental features: a name, a time period, a user, and a value. Our database engine needing to store the value as an integer or float or whatever is an implementation detail, not an actual part of the domain, and it feels unnatural to spread the data model for the single metric concept across four tables (and counting, if you add types later—maybe at some point you have tuple-valued metrics, or BLOB-valued). This is better than 25 tables, and it’s better than one table, but it’s still not good.

At this point, things look pretty dire. Relational data modeling has failed us. But wait, you realize. Every programming language nowadays, well except for Java anyway, has a powerful JSON parsing engine built in that can turn a string of JSON into a native map, with type conversions and everything! So, why don’t we just store metrics as a string of JSON per user?

Strings of JSON have lots of attractive qualities. They’re easy to understand; JSON is dead simple, by construction, whereas SQL and relational data models are complicated and hard to understand. Strings of JSON are easy to parse into whatever kind of map or hash thingy your language supports (unless you’re using Java or Scala), and once you’ve done that, you know exactly what you can do with them. Reading from a relational database usually requires some annoying library that binds the data to native data types, or else you can make raw SQL calls, get handed back the most useless, primitive thing your language supports, and do the conversion yourself. JSON is dynamically typed, so you don’t have to worry about finding the correct data type; just store the string, and when you read it back the parser will figure out what types everything should be. JSON is extremely flexible, letting you store anything from a simple glob of key-value pairs to a bramble of arbitrarily nested lists and maps, whereas with relational data models you’re stuck with two dimensional tables.

So, let’s just forget all this relational modeling stuff and make a table that maps a user id to a string of JSON that looks like this:

{
    “hugs_given”: {“30_days”: 5, “90_days”: 10, “1_year”: 85, “all_time”: 345},
    “has_taken_respect_seminar?”: true,
    // ...
}

Now it’s easy to look up a user’s stats. Just select the_json_string from metrics where user_id = 394. Now you have a JSON string with all the metrics. Now, whenever you add a new metric, you just have your code calculate it and stick it in the JSON object before writing it back as a string. No schema changes, no data model cabal to get through, just code sticking things in JSON. If you decide integer-valued stats should also support a count for the last 42 days, just calculate it and stick it in everyone’s JSON.

But here’s a problem: suppose you want to find all the users with more than 26 hugs given in the last 30 days. How do you query this data? SQL can’t read JSON, so you can’t do select * from metrics where the_json_string.hugs_given.30_days > 26. That JSON type is opaque to the relational database. (Unless you’re using PostgreSQL. We’ll pretend you’re not. The real project I worked on used MySQL, but I believe Oracle and MS SQL Server also lack a JSON type.) So how do you write this query? You can write your own terrible ad hoc JSON parser using regexes and LIKE right there in your database console, or you can write a script in some language like Python or Ruby that reads in the data, parses the JSON string, and does the check for you. Writing this script can also get surprisingly tricky. If the entire table doesn’t fit in memory, you can’t just read all the rows and loop over them. The simple approach would be to get the maximum id of the table and then select each row by id, skipping over misses, until you’re done. But this incurs a disk read on each iteration of the loop, plus possibly a network call if you’re not allowed to run the script right on the database server, so it’ll be agonizingly slow. So you’ll probably read in batches. Now you have to mess with the logic around batch sizes and offsets. It probably won’t take more than an hour or so to write this script. Still, a simple select is something you can dash off about as fast as you can type it in. And even if you write the script generically enough that you now have a general purpose script for querying based on values in the JSON string, you still don’t have all the tools available in one place. If you’re working in the database console and suddenly realize you need to query some data in the JSON string to proceed, you have to stop, go to another window, and run the script. There’s extra friction if you need to feed that data into an SQL query at the database console to move on. There’s this gaping hole where you can’t use SQL to find data that’s stored in your SQL database.

Putting data in strings also kills the validation benefits that a statically typed data model gives you. If someone accidentally writes {"hugs_given": "fish"} to the database, the database doesn’t know that that’s not how you measure hugs. If someone accidentally writes {"hugs_given": {"30_day": 5, "30_day": 10} to the database, the database doesn’t know or care that you repeated a key.

How to Proceed?

So now we know both the strengths and weaknesses of storing JSON in string columns. It’s flexible, easy to change, easy to write, easy to understand, easy to work with in code. On the other hand, it’s meaningless to the SQL database (unless you’re fortunate enough to be working in Postgres). It’s just this string of characters that might as well be the complete text of the classic novel Lady Chatterley’s Lover by DH Lawrence. You can’t query it from SQL in any meaningful way. You can’t put indexes on it. (Well, you could put an index, but it would be massive and not very helpful.) You lose the data validation benefits that a relational schema gives you. What do we do?

It’s possible that you feel you can live with the limitations of JSON strings in database columns. Maybe you can. I have a feeling they’ll eventually get in your way, though. They certainly got in mine.

If you have the option of switching data stores, you could just use Postgres, or some JSON-oriented document store like MongoDB, although I have reservations about that in general. In Crayon Art Online, as in the real application that inspired it, metrics are calculated by a separate process and stored in a different database than the main application, so even if your main application is on MySQL, it might be doable to switch your metrics application to Postgres. (It wasn’t for me, for a variety of boring non-technical reasons.)

But suppose you’re convinced that storing an opaque string of JSON won’t work for you, and you don’t have the option of switching data stores. You’re left with trying to find a better data model to fit this into a relational schema. In this particular case, where the nesting isn’t that deep, it’s not actually that hard to come up with an acceptable solution that’s better than JSON string in a column.

To start with, I’d represent each individual calculated metric as a row with a user id, type tag (hugs_given, positivity_quotient, etc.), a date range tag (30_day, 90_day, etc.), a value, and a created_at date to record when we calculated it.

In my experience, rolling date ranges are a lot simpler than actual dates, so even though it might be tempting to get rid of the string keys for 30_day, 60_day etc. and come up with some way of storing the date range based on actual dates, it could also complicate things a lot. If you need to know the actual date range this covers, you can count backwards from the created_at. If you expect to need this a lot, it might make sense to store the date keys as integers that represent the number of days covered by this metric, so we’d have a column days_covered, and then we can find the start date of this period by subtracting that from the created_at.

The main hurdle with representing this in a schema is the fact that the value type can be an integer, real number, Boolean, string, or possibly other things. Making separate tables for each value type is tacky in my opinion, but you could go this route if you want. Then you could use some kind of wrapper at the data access layer to hide the four classes from your code and just present one consistent interface for them. An even more tacky variant would be to make your value column store a foreign key into another table, and make four value tables that hold your values. So you’d have the five tables metric, int_value, boolean_value, real_value, and string_value, and metric would have a type column that tells it which table to look for its values in, and a value_id column that points into one of the value tables. Not only does this have the same disadvantages as the four metrics tables, it’s also weird and complicated to understand. Another option: store the value, and only the value, as a string, and implement some kind of conversion to the correct type at the data access layer. You can still run queries on the numeric values by casting the string to a number, so you’re not as out of luck as you were with opaque JSON strings. None of these are great solutions. This was the original hurdle that led us to putting JSON in a text column, and I don’t know of any really good way out of it. But I do think there are better ways than JSON in text columns.

This is not a perfect, beautiful schema. It’s not a work of art. It’s not going to get written up in textbooks or research papers as amazing. But it is a real relational database schema which will work better for you in the long run than shoving a string of JSON into a column. (Or a string of CSV, or XML, or YAML, or any other kind of structured text format that isn’t supported natively by the database type system.) You can do queries on it, you can analyze it, you can put indexes on it, you can implement certain data consistency standards on it (e.g. a unique key on type, days_covered, and created_at that prevents you from accidentally inserting the same start for the same date range more than once). There are also opportunities to make things better as you discover problems. With text in a column, there’s not a lot else you can do with it when you discover problems.

This schema is also more flexible and manageable than separate tables for each metric. If you need to add a new stat, just start calculating it and add a new tag for it. If you want to support a new period of time, just make a tag for it and start storing it. No need for schema changes.

The Takeaway

If you came into this as an expert who loves relational databases and strongly typed schemas, I hope you can understand a little more the kind of problems that seem too hard to fit into the strictures of such a schema, driving beginners to want to put JSON in string columns.

If you came into this as someone who’s occasionally—or more than occasionally—thought the answer to all your problems was shoving JSON in a text column, I hope you now understand why that’s not really the answer to your problems. Or you learned that you need to switch to another data store that supports shoving JSON into text columns better, and now every company you work at, you’ll lobby to switch to Postgres or Mongo. Someday that might even lead you to a job as a MongoDB ambassador who gets to travel around to conferences telling everyone why shoving JSON in text columns is terrible, and they should instead shove JSON in MongoDB to solve all their problems. Just think; I could be starting careers here today.

Advertisements

JSON in Java: A Case Study in Pain

Dealing with JSON in most dynamically typed languages is dead simple: you have a function that can inflate a string into the language’s native hash map type, and you have another function that can serialize the language’s native hash map type into a string of JSON. These languages all support nested maps, some sort of list, and strings, numbers, and booleans well enough to automatically map from JSON’s simple data types into the language’s native types.

Dealing with JSON in the emphatically statically typed, the holdover that’s still hanging around like a bad hangover, the one, the only Java will eventually make you wish you were dead.

I’ll be up front with you: this isn’t going to be a deep or insightful read. It’s mostly going to be a rant, with lots of hate for Java and static typing, but mostly static typing as seen in Java, because I’ve never tried to work with JSON in languages like Haskell with more sophisticated type systems, so you Haskell lovers are free to stay smug in the knowledge that you work with a better class of type system, nay, not a class in sight, but rather a category. If you love Java and static typing, you might want to tune out, because I am not at all interested in hearing for the hundredth time how dynamic languages force you to write unit tests for the stuff that static languages check at compile time. This is also not meant as a top-to-bottom indictment of static typing; it’s an examination of how static typing utterly fails to solve one specific problem in an elegant way.

To start with, I’m going to present the problem. Then I’ll examine what I find to be the root causes of this problem, and then rant a little about some features Java will never get that I think could help this problem.

The Problem

There are two major things you need to be able to do with JSON in your programming language: turn a string of it into your language’s native data types so you can work with the data, and turn your language’s native data types into a string of it so you can send it across a network.

JSON—Javascript Object Notation—is based on a restriction of Javascript’s native data types. Looking at the standard, we see that it supports objects, arrays, strings, numbers, booleans, and null. On the face of it, it seems like we should be able to represent these types with native Java types: maps for objects, lists for arrays, and Java’s string, number, boolean, and null for those types. But maps and lists in Java have to be restricted to a single type, so to represent JSON values like {"status": true, "message": "Status normal"} or [23, 'hello', false], we have to use a Map<String, Object> and a List<Object>.

Unfortunately, this almost immediately leads to problems. Java lists and maps don’t support being stringified as valid JSON, so you have to write custom code to do it. Actually writing this code isn’t too hard as long as you remember that JSON supports arbitrarily nested lists and objects, so you have to make sure your serialization is applied recursively until you bottom out. But it has all the usual disadvantages that come when you write custom code for fundamental operations instead of getting it from a library: you have to test it, support it, and maintain it yourself, and it’s harder to bring into new projects since you can’t just let your build tool grab it from Maven Central.

Working with Objects in Java is rather painful; if you want to do anything with them, you have to cast them. This requires you, the programmer, to know what type every one of those objects actually is. Look at this rather ugly bit of code:

Map<String, Object> jsonMap = getJsonMap();
List<Object> jsonList = jsonMap.get("theList");
if ((Boolean)jsonList.get(2)) {
    Integer idNumber = (Integer)jsonList.get(0);
    User user = getUserById(idNumber);
    user.setPreferredGreeting((String)jsonList.get(1));
    user.save();
}

Not only is this code ugly, it also loses the much-vaunted type safety of Java. You could get runtime type errors if it turns out that the first item of jsonList is not actually an Integer or the second item is not actually a String.

We can solve the ugliness problem a bit by throwing out the built-in Java map and list and using a special set of classes to represent JSON data types. This is what org.json or Jackson and GSON’s tree structure use. Instead of a Map, we’ll use a JsonObject; instead of a List, we’ll use a JsonArray; and instead of using Java strings, numbers, and booleans, we’ll use a JsonPrimitive. We’ll make all of these classes implement the JsonType interface.

The JsonObject class will behave like a Map<String, JsonType>. The JsonArray will behave like a List<JsonType>. The JsonPrimitive will store its value as a string, or possibly an Object. It will have a set of three methods: getAsNumber, getAsString, getAsBoolean, and possibly corresponding isNumber, isString, isBoolean. We glossed over the extra complication that the Javascript/JSON number type can be either integral or floating point; we also glossed over the extra complication that JSON numbers don’t have a specified size, so what we get could potentially overflow an Integer or Double. We could handle those problems here either by making getAsNumber return a BigDecimal or making separate getAsInteger, getAsDouble, getAsBigInteger, getAsBigDecimal methods. Or we could decide that overflow will be rare enough that we’ll tolerate errors when it happens, and make getAsNumber return a Double.

With this new set of classes, we can rewrite the above example:

JsonObject root = getJsonMap();
JsonArray array = root.get("theList");
if (array.get(2).getAsBoolean()) {
    User user = getUserById(array.get(0).getAsInteger());
    user.setPreferredGreeting(array.get(1).getAsString());
    user.save();
}

It’s at least less ugly without all those casts and temporary variables, and since the types make it obvious that we’re dealing with JSON, we don’t have to name everything jsonWhatever. But we’ve still lost type safety; if we do array.get(2).getAsBoolean() and that item isn’t a boolean, we’ll still get a runtime type error. And these custom JSON types are inevitably less convenient to work with than the standard library data types. What options do we have if we want to iterate over a JsonArray? Does it implement Iterable<T>, which will let us use the “smart for-loop”, or are we stuck with the C-style for-loop? Can we get the entry set of a JsonObject and iterate over all its keys and values if we want to? Do these classes implement Collection<T>, making them streamable?

This solution with separate JsonWhatever classes is the quick and dirty way of solving the JSON problem in Java. Imagine that your car breaks down in the middle of the desert. You have a tarp, a roll of duct tape, and a long pole in the back, so you tape the tarp to the pole, tape the pole to your car, and you’ve got a sail. The JsonWhatever classes are the homemade car sail of JSON handling. It does nothing whatsoever to fix the fundamental problem, but it does kind of get you moving again, in a way that’s slightly better than walking or pushing the car.

Most high-powered JSON libraries in Java go with a more sophisticated approach, borrowed from ORMs such as Hibernate. The Java language really really wants you to specify everything precisely, categorically, and at a low level: this class has these exact five fields with this five exact names, and this one’s an Integer, and this one’s a String, and this one’s a NosehairInspectorFactory, and so on. So why not create a class that specifies what your JSON will look like, precisely, categorically, and at a low level?

This is what Jackson and GSON do. They include a class, called ObjectMapper in Jackson, that can take a normal Java object and turn it into a string of JSON, or take a string of JSON and a normal Java class and create an object of that class from the string. Here’s an example:

class MyObject {
    private Integer myId;
    private String myPreferredGreeting;
    private Boolean isHasBeenGreeted; // Note, this is what you get when you insist on starting all boolean names with "is".
    private MyOtherObject myOtherObject;
    private List<Integer> allMyBuddiesIds;

    // All that getter and setter nonsense that Java insists on...
}

Jackson can map between this JSON and an object of that class:

{
    "myId": 23,
    "myPreferredGreeting": "Whassup dorks?",
    "isHasBeenGreeted": false,
    "myOtherObject": {
        "myFavoriteDrink": "Rolling Rock",
        "myFavoriteFood": "Cheese pizza with the crust cut off",
        "myLifePhilosophy": "YOLO bros"
    },
    allMyBuddiesIds: [21, 33, 49]
}

You might be wondering exactly how it does that. The answer is magic. As previously mentioned, I hate magic. But Jackson is magic from top to bottom. Except when it’s not.

The basic idea, to give a little more detail, is that Jackson can parse a JSON string into some intermediate form (which may or may not be the tree model it also offers, its version of our JsonWhatever types above; I haven’t looked into Jackson’s source code far enough to know). From there it can use reflection, possibly assisted by annotations you put on your class, to look for a way to map from the JSON object to your class. If you’re familiar with dependency injection, it’s pretty much the same idea: Jackson looks for a constructor or some setter methods it can use to inject the values from the JSON into an object of the class you provide. Then we have a regular Java object, referred to in this context as a POJO (“Plain old Java object”) since it’s not a special JSON-whatever object.

This sounds amazing. And it is pretty amazing, for JSON with all primitive types and no nesting. Unfortunately, Jackson doesn’t play well with generic types, so you end up having to use raw types for maps and lists, bringing us back to our original problem where we have to know what types are stored in maps and lists and cast them when we extract them.

The other problem with this is that some APIs return JSON responses that look like this:

{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [

        ],
        "hashtags": [
          {
            "text": "freebandnames",
            "indices": [
              20,
              34
            ]
          }
        ],
        "user_mentions": [

        ]
      },
      "in_reply_to_user_id_str": null,
      "contributors": null,
      "text": "Aggressive Ponytail #freebandnames",
      "metadata": {
        "iso_language_code": "en",
        "result_type": "recent"
      },
      "retweet_count": 0,
      "in_reply_to_status_id_str": null,
      "id": 250075927172759552,
      "geo": null,
      "retweeted": false,
      "in_reply_to_user_id": null,
      "place": null,
      "user": {
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_sidebar_border_color": "C0DEED",
        "profile_background_tile": false,
        "name": "Sean Cummings",
        "profile_image_url": "http://a0.twimg.com/profile_images/2359746665/1v6zfgqo8g0d3mk7ii5s_normal.jpeg",
        "created_at": "Mon Apr 26 06:01:55 +0000 2010",
        "location": "LA, CA",
        "follow_request_sent": null,
        "profile_link_color": "0084B4",
        "is_translator": false,
        "id_str": "137238150",
        "entities": {
          "url": {
            "urls": [
              {
                "expanded_url": null,
                "url": "",
                "indices": [
                  0,
                  0
                ]
              }
            ]
          },
          "description": {
            "urls": [

            ]
          }
        },
        "default_profile": true,
        "contributors_enabled": false,
        "favourites_count": 0,
        "url": null,
        "profile_image_url_https": "https://si0.twimg.com/profile_images/2359746665/1v6zfgqo8g0d3mk7ii5s_normal.jpeg",
        "utc_offset": -28800,
        "id": 137238150,
        "profile_use_background_image": true,
        "listed_count": 2,
        "profile_text_color": "333333",
        "lang": "en",
        "followers_count": 70,
        "protected": false,
        "notifications": null,
        "profile_background_image_url_https": "https://si0.twimg.com/images/themes/theme1/bg.png",
        "profile_background_color": "C0DEED",
        "verified": false,
        "geo_enabled": true,
        "time_zone": "Pacific Time (US & Canada)",
        "description": "Born 330 Live 310",
        "default_profile_image": false,
        "profile_background_image_url": "http://a0.twimg.com/images/themes/theme1/bg.png",
        "statuses_count": 579,
        "friends_count": 110,
        "following": null,
        "show_all_inline_media": false,
        "screen_name": "sean_cummings"
      },
      "in_reply_to_screen_name": null,
      "source": "Twitter for Mac",
      "in_reply_to_status_id": null
    },
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Fri Sep 21 23:40:54 +0000 2012",
      "id_str": "249292149810667520",
      "entities": {
        "urls": [

        ],
        "hashtags": [
          {
            "text": "FreeBandNames",
            "indices": [
              20,
              34
            ]
          }
        ],
        "user_mentions": [

        ]
      },
      "in_reply_to_user_id_str": null,
      "contributors": null,
      "text": "Thee Namaste Nerdz. #FreeBandNames",
      "metadata": {
        "iso_language_code": "pl",
        "result_type": "recent"
      },
      "retweet_count": 0,
      "in_reply_to_status_id_str": null,
      "id": 249292149810667520,
      "geo": null,
      "retweeted": false,
      "in_reply_to_user_id": null,
      "place": null,
      "user": {
        "profile_sidebar_fill_color": "DDFFCC",
        "profile_sidebar_border_color": "BDDCAD",
        "profile_background_tile": true,
        "name": "Chaz Martenstein",
        "profile_image_url": "http://a0.twimg.com/profile_images/447958234/Lichtenstein_normal.jpg",
        "created_at": "Tue Apr 07 19:05:07 +0000 2009",
        "location": "Durham, NC",
        "follow_request_sent": null,
        "profile_link_color": "0084B4",
        "is_translator": false,
        "id_str": "29516238",
        "entities": {
          "url": {
            "urls": [
              {
                "expanded_url": null,
                "url": "http://bullcityrecords.com/wnng/",
                "indices": [
                  0,
                  32
                ]
              }
            ]
          },
          "description": {
            "urls": [

            ]
          }
        },
        "default_profile": false,
        "contributors_enabled": false,
        "favourites_count": 8,
        "url": "http://bullcityrecords.com/wnng/",
        "profile_image_url_https": "https://si0.twimg.com/profile_images/447958234/Lichtenstein_normal.jpg",
        "utc_offset": -18000,
        "id": 29516238,
        "profile_use_background_image": true,
        "listed_count": 118,
        "profile_text_color": "333333",
        "lang": "en",
        "followers_count": 2052,
        "protected": false,
        "notifications": null,
        "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/9423277/background_tile.bmp",
        "profile_background_color": "9AE4E8",
        "verified": false,
        "geo_enabled": false,
        "time_zone": "Eastern Time (US & Canada)",
        "description": "You will come to Durham, North Carolina. I will sell you some records then, here in Durham, North Carolina. Fun will happen.",
        "default_profile_image": false,
        "profile_background_image_url": "http://a0.twimg.com/profile_background_images/9423277/background_tile.bmp",
        "statuses_count": 7579,
        "friends_count": 348,
        "following": null,
        "show_all_inline_media": true,
        "screen_name": "bullcityrecords"
      },
      "in_reply_to_screen_name": null,
      "source": "web",
      "in_reply_to_status_id": null
    },
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Fri Sep 21 23:30:20 +0000 2012",
      "id_str": "249289491129438208",
      "entities": {
        "urls": [

        ],
        "hashtags": [
          {
            "text": "freebandnames",
            "indices": [
              29,
              43
            ]
          }
        ],
        "user_mentions": [

        ]
      },
      "in_reply_to_user_id_str": null,
      "contributors": null,
      "text": "Mexican Heaven, Mexican Hell #freebandnames",
      "metadata": {
        "iso_language_code": "en",
        "result_type": "recent"
      },
      "retweet_count": 0,
      "in_reply_to_status_id_str": null,
      "id": 249289491129438208,
      "geo": null,
      "retweeted": false,
      "in_reply_to_user_id": null,
      "place": null,
      "user": {
        "profile_sidebar_fill_color": "99CC33",
        "profile_sidebar_border_color": "829D5E",
        "profile_background_tile": false,
        "name": "Thomas John Wakeman",
        "profile_image_url": "http://a0.twimg.com/profile_images/2219333930/Froggystyle_normal.png",
        "created_at": "Tue Sep 01 21:21:35 +0000 2009",
        "location": "Kingston New York",
        "follow_request_sent": null,
        "profile_link_color": "D02B55",
        "is_translator": false,
        "id_str": "70789458",
        "entities": {
          "url": {
            "urls": [
              {
                "expanded_url": null,
                "url": "",
                "indices": [
                  0,
                  0
                ]
              }
            ]
          },
          "description": {
            "urls": [

            ]
          }
        },
        "default_profile": false,
        "contributors_enabled": false,
        "favourites_count": 19,
        "url": null,
        "profile_image_url_https": "https://si0.twimg.com/profile_images/2219333930/Froggystyle_normal.png",
        "utc_offset": -18000,
        "id": 70789458,
        "profile_use_background_image": true,
        "listed_count": 1,
        "profile_text_color": "3E4415",
        "lang": "en",
        "followers_count": 63,
        "protected": false,
        "notifications": null,
        "profile_background_image_url_https": "https://si0.twimg.com/images/themes/theme5/bg.gif",
        "profile_background_color": "352726",
        "verified": false,
        "geo_enabled": false,
        "time_zone": "Eastern Time (US & Canada)",
        "description": "Science Fiction Writer, sort of. Likes Superheroes, Mole People, Alt. Timelines.",
        "default_profile_image": false,
        "profile_background_image_url": "http://a0.twimg.com/images/themes/theme5/bg.gif",
        "statuses_count": 1048,
        "friends_count": 63,
        "following": null,
        "show_all_inline_media": false,
        "screen_name": "MonkiesFist"
      },
      "in_reply_to_screen_name": null,
      "source": "web",
      "in_reply_to_status_id": null
    },
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Fri Sep 21 22:51:18 +0000 2012",
      "id_str": "249279667666817024",
      "entities": {
        "urls": [

        ],
        "hashtags": [
          {
            "text": "freebandnames",
            "indices": [
              20,
              34
            ]
          }
        ],
        "user_mentions": [

        ]
      },
      "in_reply_to_user_id_str": null,
      "contributors": null,
      "text": "The Foolish Mortals #freebandnames",
      "metadata": {
        "iso_language_code": "en",
        "result_type": "recent"
      },
      "retweet_count": 0,
      "in_reply_to_status_id_str": null,
      "id": 249279667666817024,
      "geo": null,
      "retweeted": false,
      "in_reply_to_user_id": null,
      "place": null,
      "user": {
        "profile_sidebar_fill_color": "BFAC83",
        "profile_sidebar_border_color": "615A44",
        "profile_background_tile": true,
        "name": "Marty Elmer",
        "profile_image_url": "http://a0.twimg.com/profile_images/1629790393/shrinker_2000_trans_normal.png",
        "created_at": "Mon May 04 00:05:00 +0000 2009",
        "location": "Wisconsin, USA",
        "follow_request_sent": null,
        "profile_link_color": "3B2A26",
        "is_translator": false,
        "id_str": "37539828",
        "entities": {
          "url": {
            "urls": [
              {
                "expanded_url": null,
                "url": "http://www.omnitarian.me",
                "indices": [
                  0,
                  24
                ]
              }
            ]
          },
          "description": {
            "urls": [

            ]
          }
        },
        "default_profile": false,
        "contributors_enabled": false,
        "favourites_count": 647,
        "url": "http://www.omnitarian.me",
        "profile_image_url_https": "https://si0.twimg.com/profile_images/1629790393/shrinker_2000_trans_normal.png",
        "utc_offset": -21600,
        "id": 37539828,
        "profile_use_background_image": true,
        "listed_count": 52,
        "profile_text_color": "000000",
        "lang": "en",
        "followers_count": 608,
        "protected": false,
        "notifications": null,
        "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/106455659/rect6056-9.png",
        "profile_background_color": "EEE3C4",
        "verified": false,
        "geo_enabled": false,
        "time_zone": "Central Time (US & Canada)",
        "description": "Cartoonist, Illustrator, and T-Shirt connoisseur",
        "default_profile_image": false,
        "profile_background_image_url": "http://a0.twimg.com/profile_background_images/106455659/rect6056-9.png",
        "statuses_count": 3575,
        "friends_count": 249,
        "following": null,
        "show_all_inline_media": true,
        "screen_name": "Omnitarian"
      },
      "in_reply_to_screen_name": null,
      "source": "Twitter for iPhone",
      "in_reply_to_status_id": null
    }
  ],
  "search_metadata": {
    "max_id": 250126199840518145,
    "since_id": 24012619984051000,
    "refresh_url": "?since_id=250126199840518145&q=%23freebandnames&result_type=mixed&include_entities=1",
    "next_results": "?max_id=249279667666817023&q=%23freebandnames&count=4&include_entities=1&result_type=mixed",
    "count": 4,
    "completed_in": 0.035,
    "since_id_str": "24012619984051000",
    "query": "%23freebandnames",
    "max_id_str": "250126199840518145"
  }
}

That’s from Twitter’s API reference. Now imagine writing a Java class to represent that response. Now imagine doing it again for whatever other insane responses Twitter wants to fling at you.

Jackson has what feels like hundreds of bizarre annotations and configuration options that let you control the translation process between a JSON string and your class, or get around problems like JSON responses suddenly getting new fields and mutually referential objects causing stack overflow errors due to infinite recursion. It’s a huge, complex library, and while there’s plenty of information out there, it can be hard to word a search query in the right way to find a solution to your specific problem. At my company, we briefly looked at migrating from org.json to Jackson, but it was going to be a massive amount of work to figure out the right magical incantations to make Jackson not break and then either go mark up all our database objects with the right annotations or write new view-model classes to use for JSON translation. And since none of us were Jackson experts, we had a steep learning curve ahead of us.

Imagine your car breaks down in the middle of the desert, but you happen to be right next to an abandoned auto shop with a set of books and tools. You don’t know the first thing about cars, other than where the gas goes, but you figure you’ve got a good brain, so you sit down with a book and start reading. After six months of alternately reading and tinkering on some of the forgotten wrecks left on the hydraulic lifts of this abandoned auto shop, you’re ready to start on your car. You eventually get it kind of working, though it can’t go above 37 mph, makes a weird stuttering noise as you drive, and you have to stop every 15 miles and go tighten some bolts, and also you just spent six months camped out in an abandoned auto shop in the middle of the desert. Here you’ve addressed the fundamental problem that your car is broken, and what you came up with is definitely better than a homemade car sail, but was it really worth that six months you spent in that abandoned auto shop?

The root cause, and how to fix it

The root cause of all these problems is Java’s static typing. Well, to be fair, it’s also Javascript’s dynamic typing. JSON is based on Javascript, so it’s perfectly fine with lists of mixed types, maps of string to whatever you please, and arbitrarily nested self-referential objects. Languages like Python, Ruby, PHP, and Clojure, which are dynamically typed and have a similar set of primitives to Javascript, are also fine with that, so you can trivially map any JSON object to a native data type in those languages.

Java is not fine with any of that. It demands lists of a single type and maps of strings to a single type, and it asks that you please tell it beforehand just how deep you’re going to nest those self-referential objects. You can cheat your away around this by declaring the type as Object, but working with Objects is tedious, ugly, and loses you the only advantage to having all these restrictions.

How can we fix this? Here a few features that would definitely help. They’re listed in order of how likely they are to be added to the Java language, starting at “Maybe in Java 26” and ending at “You’ve got a better chance of convincing your company to rewrite everything in Erlang”.

Optional dynamic typing

C# supports a type keyword dynamic. This makes a variable dynamically typed; the compiler basically pretends that it can be any type and supports any operation, and allows a runtime exception to be thrown if you do the wrong thing with it.

This would, at the very least, make the JsonWhatever strategy a lot easier: instead of needing a JsonPrimitive with all those conversion methods, just make the JsonObject and JsonArray store values of type dynamic.

This would also make Jackson’s problem with generic types easier to deal with: instead of binding to lists of Object or maps of string to Object, it can bind to a list of dynamic or map of string to dynamic.

This approach still gives up type safety, but it makes giving up type safety a lot easier to work with and less painful.

Union types

Languages like Haskell and ML support something called union types. Essentially you can define a new type which can hold a value of any of a given set of types.

Imagine that Java had a syntax like type JsonVal = Map<String, JsonVal> | List<JsonVal> | Integer | Double | String | Boolean | null to define a new type called JsonVal. This declares that a JsonVal can be any of those listed types, but no other. If our code ever tries to assign a FizzingWhizbeeFactory to a JsonVal, the compiler will complain.

This can buy us both freedom from casting and arbitrary nesting. The compiler has a list of possible types, so it doesn’t need us to tell it exactly which one a given value is; if we call a method on a JsonVal, it knows where it can look for that method, and if it can’t find the method anywhere, it can flag a compile-time error. It doesn’t need us to cast for it. And the JsonVal type is defined recursively: a JsonVal can be a Map<String, JsonVal> or a List<JsonVal>. This gets us the arbitrary nesting we need to accurately represent JSON.

There are other challenges to getting union types right in Java. For instance, if I have a union type type MyType = A | B and C is a subtype of B, can MyType be an instance of C? Can I cast an instance of C to B and assign it to MyType? Can I then downcast an instance of MyType which is assigned an instance of B to C if that’s what it really is? Or do we just do what generics did, throw the whole mess out, and say that any type which isn’t explicitly listed in the declaration of the union type is invalid? But if we could get union types in Java, it would go a long way towards solving our JSON problem in a type-safe way.

A Dedicated JSON type

This is the easiest option for programmers, and the most deranged from a language design standpoint.

Scala, Visual Basic, and Kawa (a dialect of Scheme on the JVM) have XML literals. In Scala, for example, you can write this:

val people = 
    <people>
        <person firstName="Ted" 
                lastName="Bundy" 
                gender="M" />
        <person firstName="Jeffery" 
                lastName="Dahmer" 
                gender="M" />
        <person firstName="Charles" 
                lastName="Manson" 
                gender="M" />
        <person firstName="Rasmus" 
                lastName="Lerdorf" 
                gender="M" />
    </people>

So in Java, we could just modify the language to let you write this:

Json someJson = {
    "people": [
        { "firstName": "Adolf", "lastName": "Hitler" },
        { "firstName": "Idi", "lastName": "Amin" },
        { "firstName": "Pol", "lastName": "Pot"},
        { "firstName": "Hiro", "lastName": "Mashima"}
    ]
}

The Json type can have a static method, Json.fromString, that inflates a string into JSON, and its toString method can be overridden to serialize a Json into a string.

Scala can sort of do XML literals as part of the normal language, because Scala’s design is already deranged. In Java, the language would have to be modified. But there is precedent for this: arrays in Java are already just random tacked on garbage with special syntax and special behavior that applies to no other part of the language. And Java’s principal use today is as a server-side web programming language; given how important JSON is to web programming, it might be worth modifying the core language to add this.

Still, this is the least likely of the options. Adding lambdas and method references was a massive deal, and those had already proven their worth in dozens of other languages over a period of almost fifty years. Java is not a language that likes revolutionary features; it prefers to wait until every language in the universe, even C++, has a feature, and then panic and squeeze something on top of the existing language features that’s just good enough to disunite the opposition.

Conclusion

Dealing with JSON in Java is quite painful, and it’s because JSON is inherently dynamically typed due to its heritage in Javascript’s native object type, and Java’s static type system is too rigid to accommodate it in an elegant way.

Since JSON is used for communicating between systems and components of a system, the difficulty of working with JSON in Java can affect how you choose to design your API responses. You might choose to favor flat objects over nested, as we used to do at my company. Instead of something like this:

{
    "room": {
        "name": "library",
        "carpetType": "Persian rug"
    },
    "killer": {
        "name": "Colonel Mustard",
        "weapon": "lead pipe"
    }
}

You might choose this:

{
    "roomName": "library",
    "roomCarpetType": "Persian rug",
    "killerName": "Colonel Mustard",
    "killerWeapon": "lead pipe"
}

In Java, with Jackson, the second form requires a single class with four strings. The first form requires a class with two subclasses, each with two strings, and probably a few annotations to make sure everything happens correctly when you bind from a string. So up front, the first form is more complicated to work with in Java, and this extra complexity just gets more pronounced as your JSON responses get more complex.

But using the second form causes other problems. Since JSON keys have no guaranteed order, you could send the response to a client like a browser and have it print with the keys all randomized, which makes the response much harder to read if you need to debug. If a client of your API only cares about the room, they have to loop over all the keys and find the ones that start with room instead of being able to just grab the room object. You can write that annoying code every time, or you can write some stupid little helper function that you have to document and test to loop over a JSON object and find all the keys with a certain prefix. You’ll probably also have to duplicate this function across various code bases in your stack. This tradeoff might be worth it if you just have a bunch of Java code communicating (e.g. an Android app and a Spring API), but if you have other languages in your stack, like Javascript or Python, you’re forcing them to conform to the lowest common denominator.

If you’re writing web code nowadays, JSON is a fact of life, and it’s a great transmission format: it’s simple, readable, and powerful, easy to parse, able to represent almost any kind of data in human readable format. It’s elegant and easy to use, a big improvement in a lot of ways over RPC formats like COM and CORBA, unreadable binary formats, and XML. It’s a travesty that Java has such terrible support for something so elegant, widespread, and fundamental.