In a separate article, I explored Google Cloud’s implementation of a “native” JSON “data type” in its BigQuery data warehouse. In that article, I explained that BigQuery’s JSON “data type” seems to rely on an expansive definition of this term (i.e., “data type”), at least as it is nominally used in connection with programming languages or databases. I then described what BigQuery’s JSON “data type” is, what it is useful for, and how it differs from (and, in fact, is novel with respect to) competitive approaches.
My priority was to relate this feature to the core issues of value or usefulness as understood by customers. However, it seems to me that the criteria I refine and bring to bear in that analysis could be generally useful to customers as they struggle to evaluate the claims made by analytic DBMS vendors, or to make informed decisions in their selection of an analytic database.
In this article, I outline a proposed framework for doing just this. A methodological note: I use BigQuery’s JSON “data type” as an object lesson by which to exhibit and explicate this framework.
“Native” Is a Marketing Term
To start off, Google Cloud says BigQuery supports JSON “natively” by means of a JSON “data type.”
What does “native” support for a feature (a mark-up language, a data interchange format, etc.) actually mean as it relates to an analytic database?
For all intents and purposes, native support means that the database implements support for a feature in such a way that the feature is integral to the system.
That is the deadpan (or completely useless) answer to this question. The actual answer is that the term “native” is not especially useful as a criterion for evaluating the strengths or weaknesses of a particular feature as it is implemented in a particular analytic database. In point of fact, the “native-ness” of a feature does not tell us much, if anything, about its practical usefulness.
In marketing-speak, both the term “native” and arguments as to the “native-ness” of a specific feature often amount to misdirection -- and I mean this precisely in the magician’s technical sense of the term.
The upshot is that it is necessary to know not just how a feature is implemented in a DBMS, but how it is likely to be used. When I evaluate DBMS features and capabilities, I ask the following questions:
- Does the feature permit me to do useful work that I could not otherwise do?
- How does the way the DBMS feature is implemented compare with competitive implementations? What, if anything, is novel or different about it?
- Is the way the feature is implemented consistent with how the RDBMS platform is commonly used, i.e., with its standard constraints and conventions?
- If the implementation is inconsistent with the way the platform is used, can the vendor demonstrate a compelling reason as to why it has implemented the feature in this way?
- What is the practical takeaway for different kinds of customers? (For existing customers? For customers who use competitive products? For greenfield customers?)
Evaluation Framework for DBMS Features
The following will expand on each of the questions listed above.
1. Does the feature permit me to do useful work that I could not otherwise do?
In this specific instance, the “feature” is BigQuery’s implementation of a “native” JSON “data type.”
How is typing useful?
- First, because it reduces the amount of work the user must perform to access and perform operations on data in a database. Thanks to data types, the (R)DBMS engine “knows” what it can do with NUMBER, FLOAT, INT, CHAR, BLOB, etc., data. Moreover, it “knows” whether the results of the operations it has performed on data of a specific type are valid. This goes to the canonical definition of the data type as such: Typing is a labor-saving innovation par excellence.
- Second, because data types have historically been used to accommodate untraditional data formats. So, for example, the BLOB data type covers a broad category of arbitrarily structured binary files, such as images, audio files, video files, documents (including, notionally, JSON objects), and so on.
Among RDBMSs, both SQL Server and Snowflake use a kind of umbrella data type, VARIANT, to store data of virtually any type. The labor-saving dimension of typing is much less important here. For example, in the case of the VARIANT type, the database must usually be told what to do with this data.
The emphasis in this definition of data type goes to the issue of convenience: BLOB and similar types are primarily useful as a means to store data in the RDBMS irrespective of the data’s structure.
Google Cloud’s implementation of a JSON “data type” in BigQuery ticks both these boxes. First, it is labor-saving, in that BigQuery knows what to do with JSON data according to its type. Second, it is convenient, in that it gives customers a means to preserve and perform operations on data serialized in JSON objects. The implemenation permits an organization to ingest JSON-formatted messages into the RDBMS (BigQuery) and to preserve them intact. Access to raw JSON data could be valuable for future use cases. It also makes it much easier for users to access and manipulate this data. In these senses, it makes it possible for customers to perform useful work they could not otherwise perform.
The problem is that the boxes that BigQuery ticks tell us little about how its implementation (native or otherwise) compares to those of its competitors. Its support for this feature is only useful in context.
2. How does the way the DBMS feature is implemented compare with competitive implementations? What, if anything, is novel or different about it?
In the companion article referenced earlier, I described BigQuery’s implementation of a JSON “data type” as analogous to an improved version of the generic SQL VARIANT type, which Snowflake implements in its analytic RDBMS. The improvement comes by way of the fact that BigQuery knows what to do with the different data types embedded in the JSON objects it ingests and stores via its JSON “data type.” It “knows” to treat FLOAT data differently than INT data and to treat INT data differently than NCHAR data.
Under BigQuery’s covers, Google Cloud had to do extra work to implement its “native” JSON “data type.”
First, it had to create logic that permits BigQuery to track and manage each of the possible data types that can be serialized in a JSON object. Also, it had to create logic that automates what BigQuery does when it is told to operate on each of these data types. (This might simply involve automatically converting them to / declaring them as their original types.) Otherwise, BigQuery’s “native” JSON “data type” would be no more useful than the JSON implementations of some of its competitors.
That said, BigQuery JSON “data type” also imposes nonstandard constraints as to its usage.
For example, users must query JSON-type data using proprietary functions (such as JSON_QUERY and JSON_VALUE). (See my companion article for more on this.) This presupposes that the user has foreknowledge about the data she is querying -- namely, that it is BigQuery’s JSON “data type.”
What might Google Cloud have done differently?
The most extreme counterpoint to BigQuery’s implementation is IBM’s with its DB2 database. Instead of ingesting JSON objects as-is, DB2 first parses them and “shreds” the data it contains into rows and columns in one or more database tables. The benefit of this arrangement is that data shredded from JSON objects is (a) typed appropriately and (b) accessible via standard SQL query. The drawback is that DB2 discards the raw JSON object. (I say more about this in the final section of this article.)
Teradata’s implementation is a little different. It ingests and persists JSON objects, which it also makes accessible via standard SQL query functions. Teradata’s implementation, too, requires that users have some foreknowledge of the structure of the JSON data -- namely, that it contains fields that may be nested. (Hence, users must employ SQL dot notation -- e.g., “Table.foo” -- to go inside the fields in JSON column “foo” in a table called “Table.”) Teradata’s arrangement is faster than DB2 at ingest -- Teradata isn’t parsing and shredding JSON objects -- and also preserves the JSON message itself, but it is less transparent from the point of view of the SQL-query use case. Again, benefits and drawbacks.
3. Is the way the feature is implemented consistent with how the RDBMS platform is commonly used, i.e., with its standard constraints and conventions?
In BigQuery’s case, no. Users must know ahead of time that they are querying against BigQuery’s JSON “data type.” In itself, this is not unprecedented: Most implementations of in-database JSON presuppose that users have some foreknowledge that they are querying against JSON data.
However, BigQuery expects users to employ non-standard SQL functions, such as “JSON_VALUE,” to query against its JSON “data type.” In this sense, Google’s implementation breaks basic compatibility with extant technologies that expect to use standard SQL functions to manipulate data.
These include SQL query interfaces and code generators; business intelligence (BI), analytic discovery, and data visualization tools; as well as portals, web services, software libraries, application frameworks, software development kits, and so on. In other words, Google Cloud’s implementation of a “native” JSON “data type” breaks compatibility with the bread-and-butter SQL query use case. The lesson here is it is necessary to know not just how a feature is implemented in an analytic database, but how it is likely to be used.
In this sense, what Google Cloud has done with BigQuery’s “native” JSON “data type” is akin to what a vendor, provider, or project maintainer does when it makes fundamental changes to software APIs. On the plus side, tools vendors, cloud providers, project maintainers and others will catch up and adapt to this change, so this breakage isn’t permanent. But it will confuse and frustrate users in the short term.
4. If the feature’s implementation is inconsistent with how the platform is used, can the vendor demonstrate a compelling reason as to why it has implemented it in this way?
My sense is that, yes, Google Cloud can do this -- from a certain perspective.
In the first place, Google’s implementation gives its customers a new, notionally improved means of accommodating JSON data in BigQuery. This is its utilitarian justification.
In the second place, Google’s implementation gives its customers a new, notionally improved means of preserving raw JSON objects in BigQuery. This is its ideal justification.
Google’s decision to treat JSON-formatted objects as a distinct “data type” might infuriate technologists as incoherent (see my accompanying article), but it might also be linked to something else: namely, an applied, demonstrative argument that it is useful to expand the definition of the term “data type.” And this argument might be bound up with another, even more fundamental argument: namely, that it is useful to expand the scope, the capabilities, and the remit of the RDBMS itself.
Remember that, in the previous section, I contrasted Google’s approach with BigQuery’s JSON “data type” and the very different approach IBM pursued in accommodating JSON in DB2? What I didn’t say is that the divergence between these vendors on the question of in-database JSON gets at a fundamental schism in database design, analytics, and -- more basic still -- information management.
Ask yourself: Why would a company with IBM’s engineering pedigree implement JSON support in its flagship RDBMS in a way that discards the raw JSON object itself? One answer to this question is that the people who engineer DB2 (like good SQL purists) believe raw JSON objects do not belong in the RDBMS. Full stop. From the SQL purist’s perspective, if users must preserve raw JSON objects, the best (and cheapest) place to put them is in a document DBMS or in scalable object storage.
This issue is too complex, with too many nuances, to explore in this context, but let’s describe this as a division between two camps: the utilitarians and the idealists. Utilitarians think it is important to engineer features that address the ways in which people expect to work with data (and databases) today. They say features should be implemented such that they are consistent with the conventions and constraints of the platform -- an RDBMS -- and, in the main, support its most common use cases.
On these terms, the utilitarians look at JSON and see a format that is used to interchange data between (dis)similar schemas. Instead of ingesting this format (i.e., raw JSON messages) into the database and storing it as-is, utilitarians recognize that the data serialized in JSON messages can usually be reduced to a tabular schema. So, why not do what IBM does with DB2 -- i.e., “shred” each JSON message, extract its tabular data, and store this data in rows/columns in the RDBMS?
The idealistic camp recognizes this imperative but also emphasizes another imperative: namely, the importance of accommodating innovation (in this case, a new/different data interchange format) on its own terms. Yes, the different types of data in a JSON message can usually be mapped to a tabular schema, which today makes it possible to ingest and store this data in an RDBMS. But what about tomorrow? After all, idealists point out, JSON messages contain other data that gets discarded, and which is lost forever, in this process. This data, although not useful today, could be invaluable in future, especially from the perspective of experimental or exploratory users, such as data scientists, machine learning and AI engineers, etc.
On these terms, IBM’s implementation of in-database JSON is utilitarian, Google Cloud’s is idealistic. Either implementation is justifiable. Their value depends on what you need (or expect) to do with them.
5. What is the practical takeaway for different kinds of customers? (For existing customers? For customers who use competitive products? For greenfield customers?)
What does this mean for BigQuery customers? It’s a good thing, on balance, although some might legitimately kvetch that Google Cloud could have (or should have) done things differently.
What does it mean for users of competitive RDBMS offerings? Very little, so far as I can determine. Google Cloud’s implementation of in-database JSON is not in any sense a game-changer. It is “novel,” but this novelty comes at the cost of breaking compatibility with prevailing RDBMS conventions.
For greenfield users, BigQuery’s implementation of a distinct JSON “data type” could be attractive in that it gives them a cost-effective way to preserve raw JSON data in the data warehouse. This is in contradistinction to, for example, an approach such as IBM’s, which discards JSON messages. If DB2 users want to ingest JSON using IBM’s preferred implementation and preserve raw JSON messages in the DB2 context, they do have the option of storing them as CLOB, NCLOB, or even as BLOB types -- with the downside of having to pay for extra storage to manage mostly redundant data sets.
On the one hand, Google Cloud’s implementation is arguably inferior to that of (e.g.) Teradata, which supports transparent SQL query against raw, in-database JSON objects. On the other hand, most greenfield customers who look at this will compare BigQuery’s pricing to Teradata’s and decide … no contest.
I have suggested that you should cut through the marketing hype and focus your evaluation of DBMS features and capabilities on a handful of basic questions. These include: What does the DBMS feature enable you to do that is useful? Specifically, does the feature enable you to do new work or to improve upon how you do your existing work?
This leads to the question of context: What concrete differences exist between the way the feature is implemented in Database A as distinct to Databases B or C? Is the way Database A implements the feature itself novel -- i.e., inconsistent with how most people would expect to use any in-database feature? Can any conceivable “user” (not just ad hoc query users, but machine users, such as commercial and custom-built software/services) straightforwardly exploit the feature? If not, why not?
In other words, can the DBMS vendor demonstrate a compelling reason as to why it has implemented the feature in a novel way that breaks compatibility with standard or prevailing usage conventions? Especially if (as in this case) competitive implementations maintain compatibility with these conventions? Finally, is the vendor’s reason for doing this compelling to you, the customer?
These questions go to what philosopher William James would call the “cash value” of the feature.
In context, the answers to these questions determine whether the way a DBMS vendor implements one or more features is important enough to tip the scales -- or is, rather, an inert factor in a calculus.