Exploring Google BigQuery’s Implementation of a JSON ‘Data Type’

The backstory to this article began when Google Cloud announced that its BigQuery data warehouse-as-a-service offering “natively supports JSON data using the JSON data type.”

On LinkedIn, James Malone, product lead with platform-as-a-service data warehousing provider Snowflake, used the occasion of Google Cloud’s announcement to claim that Snowflake has offered support for in-database JSON “since day one.” Malone’s implication was that Google and BigQuery are the equivalent of JSON-come-latelies.

This prompted Google product manager Christopher Crosbie to return fire. In the comments of Malone’s LinkedIn post, Crosbie argued that “there is a difference between JSON as a native data type vs support via generic types like VARIANT (Snowflake), SUPER (Redshift), or the JSON abstraction that [BigQuery] offers today.”

Crosbie didn’t unpack this difference, so I thought I’d do it for him.

By the way, you will notice that I put “data type” in quotes when I refer to what Google Cloud has done in BigQuery. I do this to denote its unorthodoxy. As I explain in Section 3 of this article, Google Cloud is using an … expansive definition of the term. This isn’t pernicious, but it is something people should be aware of.

**1. What Does “Native” Support for a Database Feature Mean? What Should It Mean?**

Google Cloud says that BigQuery “natively” supports JSON data. It seems to make the case that BigQuery’s “native” support is somehow bound up with its implementation of a JSON “data type.”

Does this mean that competitive offerings that do not implement a JSON “data type” cannot claim to offer “native” support for JSON? The unequivocal answer to this question is, “[Censored] no.”

As I explain in the companion piece to this article, native is basically a marketing term and, hence, meaningless. Ironically, and from the point of view of traditionalist practitioners, BigQuery’s implementation of a “native” JSON “data type” is, in a sense, non-native: It does not work the way one would expect it to work (i.e., in an RDBMS) because it requires the use of non-standard SQL functions to query JSON data. Let’s put a pin in this last claim for the moment; I will return to it in Section 4.

That said, if “native” were a useful term with a precise, specific meaning in tech, what would it be?

Today, basically any RDBMS can ingest JSON data and store it as an instance of a CLOB data type. The problem with this is that the RDBMS does not “know” what to do with this data -- i.e., from the perspective of performing numerical mathematical operations on it, as distinct to set-based relational algebraic operations. The RDBMS does not “know” that the numerical values typed as a CLOB are FLOAT or INT values. As a result, if a user wants to access and perform numerical mathematical operations on this data, she must explicitly express (in SQL) what it is she wants the RDBMS to do.

This goes to Google’s Crosbie’s critique of the way some competitive RDBMS offerings implement in-database support for JSON -- i.e., as a generic type such as BLOB, or via the VARIANT type, which is supported in Snowflake, SQL Server, and Teradata, among others, or by means of SQL character types, such as VARCHAR, CLOB, NCLOB. (Note: Teradata does not use the VARIANT data type to accommodate in-database JSON. See my companion article for more on this.)

The thing to note is that BLOB and VARIANT are nebulous data types. For example, to perform numerical mathematical operations on data that is stored as a BLOB or VARIANT type, the user must tell the RDBMS what the data is -- that is, convert the data to its original type or express (in SQL) the operations the RDBMS is to perform on it.

If this counts as “in-database” support for JSON, it comes with an important caveat.

2. “Native” or Not, What Is Useful About Google’s Implementation of a JSON Data Type?

In theory, BigQuery’s implementation of a JSON “data type” addresses this issue. On the plus side, it makes it possible for BigQuery to ingest and preserve raw JSON objects, without (as with at least one competitive implementation) “shredding” data out of them and discarding them. It should expose the JSON data for standard RDBMS use cases, such as ad hoc query. What is more important, the JSON “data type” should abstract most of or all the complexity involved in declaring what the data is (INT, FLOAT, etc.), what characteristics it must have, and what operations BigQuery can perform on it.

Also, the user need not give BigQuery explicit instructions to access data of the JSON “type.”

Actually, this last bit isn’t quite true, because there is a caveat with BigQuery’s implementation of a JSON “data type.” You might call it a Catch-22. Spoiler alert: JSON query isn’t completely transparent.

But before I explain what this Catch-22 entails, let me say more about data types and why they are important. This next section unpacks what is so unorthodox (some might say tendentious!) about Google’s use of the term “data type” to describe its implementation of JSON in BigQuery. If this is not of interest to you, feel free to skip ahead to Section 4, ‘The Catch-22.’

3. Types, Typing, and Working With Data

It seems to me that what Google Cloud has done is a tendentious thing. At the very least, its implementation of a JSON “data type” expands on the canonical definition of that term. Again, this is not necessarily pernicious. But it is confusing in view of how typing is commonly understood -- and why it is useful.

Let me explain what I mean. In the first case, JSON is not a data type. Rather, it is a data interchange format. In the second case, the JSON format can be used to exchange or interchange data of many types between similar or dissimilar schemas. You can serialize FLOAT, INT, CHAR, and other data types "in” a JSON object. To classify a JSON object as a “data type” is roughly analogous to classifying a book as a “data type.” The book, too, is a format for exchanging and interchanging data. It is not a data type. To change analogies, JSON -- a data interchange format -- is a “type” in the same way that the set of all possible sets is itself … a set. JSON is a useful container in which to encapsulate data of different types; the universal set is a conceptual container in which to encapsulate any possible object.

Technologists who believe precise terminology is important will likely recoil at Google’s usage.

Think about it: No RDBMS vendors (or project maintainers) created a new HTML data type to accommodate hypertext data. Yes, at least one vendor (namely, IBM) did create a new RDBMS-specific XML “data type” implementation, but, alas, this did not catch on. (You could attribute this to the fact that, as with JSON, XML is a format, not a data type. But you could also attribute it to the fact that XML essentially fizzled out as a means of data interchange.) And it is worth noting that IBM did not do the same thing (make the same mistake?) again with JSON, which gives no indication of fizzling out.

That said, an expanded definition of the kind Google uses is attested both historically (e.g., IBM’s implementation of a DB2-specific XML data type) and in a few generic SQL data types. For example, several DBMS vendors implement the VARIANT data type, which can be used to store data of any type. (As Crosbie noted, Snowflake implements VARIANT.) The problem, as I have said, is that the RDBMS does not know what to do with VARIANT data. The user must either convert data stored as VARIANT to its original type before it does anything with it (the easiest route) or create SQL to tell the RDBMS engine what the data is, what kinds of operations it is permitted to perform on this data, what rules it must enforce in performing these operations, what the results of these operations must look like, etc.

This gets at what is most useful about typing: Namely, data types make it possible for the RDBMS to perform work in the absence of explicit, often detailed, instructions from the user. The same is true in programming, where typing permits a compiler to perform work in the absence of explicit, detailed instructions from the programmer. In both cases, typing functions as an essential labor-saving device.

With BigQuery’s JSON “data type,” it seems to me Google Cloud has implemented a kind of improved VARIANT type. As with the generic VARIANT type, you can use BigQuery’s JSON “data type” to ingest and store raw JSON objects, without regard for the specific data types (FLOAT, INT, CHAR, NCHAR, etc.) serialized in that object. Unlike the generic VARIANT type, BigQuery also tracks and manages the specific data types it ingests as JSON “type” data -- e.g., enforcing rules as to which operations can be performed on which data according to its type.

There is more to it than this, of course, but, as a general description of what Google Cloud has done, I think this is accurate enough.

Now, on to BigQuery’s JSON “type” and …

4. The Catch-22

This Catch-22 is that the BigQuery user must know she is querying against a JSON data type.

BigQuery documentation is clear on this. Depending on the user’s requirements, the user must use type-specific functions to query against JSON data in BigQuery.

For example, say that a user has a BigQuery table with a JSON-type column containing the names of a company’s salespeople, along with each of the cities in which they are based. Now, say the user wants to find out how many salespeople the company has in Detroit. To answer this question, the user would create the following query:

SELECT * FROM Table.foo WHERE JSON_VALUE(Based_city) = “Detroit”

(Note: Emphasis in “JSON_VALUE” is mine.)

In addition to JSON_VALUE, BigQuery specifies the following JSON-specific functions:

JSON_QUERY, JSON_QUERY_ARRAY, and JSON_VALUE_ARRAY

In other words, the user cannot use ANSI-standard SQL to query against this data.

As a counterpoint to this, consider IBM’s implementation of in-database JSON in DB2.

In the first place, DB2 does not ingest JSON objects as JSON objects. Instead, it parses and shreds the contents of each JSON object into the appropriate SQL data types. (So, floating-point data is typed as FLOAT, integer data as INT, and so on.) In the second place, DB2 does not put this data into a JSON-specific column, but, rather, into rows and/or columns in the appropriate RDBMS tables.

This invites the question: How does one query against data derived from JSON objects in DB2?

The answer is by constructing a standard SQL query:

SELECT * FROM Table.foo WHERE Based_city = “Detroit”

This SQL-purist approach is the exact reverse of what Google Cloud has done in BigQuery. Not only does it not define a JSON-specific data type, for better or worse, it discards the raw JSON object -- along with, notionally, whatever moiety of JSON data has not been shredded into the database.

DB2’s implementation is “lossy” in the same way that MP3 encoding is lossy: It discards information it does not determine to be significant. The downside to this lossy-ness is that this discarded data is gone forever. The upside is that DB2 users can use standard SQL to query against data derived from JSON sources. They need not know that the data they are querying against is of a JSON “type.”

The upside to Google Cloud’s implementation of a JSON “data type” in BigQuery is that it preserves the raw JSON object and makes the data that is serialized in it accessible via SQL query. (Also, the JSON object and its raw data may prove to be useful in future analyses.) The downside is that it specifies non-standard query functions. As I explain in the companion article, at least one of Google Cloud’s competitors -- Teradata -- implements in-database JSON in a way that (a) preserves JSON objects and (b) facilitates standard SQL query of the data nested in these objects. So, Google could have done this differently.

On balance, Google Cloud can probably make several points -- some compelling, some not s compelling -- in defense of its implementation of a JSON “data type” in BigQuery. It really depends on what one needs/wants from an RDBMS. I unpack this problem in considerable detail in my companion article.

Comments

Plain text