Category Archives: google cloud

MongoDB to Google BigQuery ETL Engine now available to public

At OneFold, we used Google BigQuery from day one. We chose Google BigQuery for the following reasons:

  1. Cost. OneFold is a startup and for that, it has very limited budget. We simply don’t have the money and the admin overhead to manage a 100-node Hadoop cluster. With BigQuery, we get access to thousands of machines while paying for only what we use.
  2. Speed. The speed of execution is phenomenal. Some of our clients have upwards of 20TB of data, and running complex queries on the entire dataset usually takes seconds. This is because of the unique architecture of Google BigQuery which splits the job load over thousands of machines.
  3. Ease-of-Use. BigQuery was super easy to get started and learning curve is low. We were able to load a sizable data collection and start writing queries within minutes. More importantly, since BigQuery is a managed service, we don’t have to spend cycles maintaining any hardware. The data is just there when we need it.

A few months ago, we at OneFold open-sourced its MongoDB to Hive ETL Engine. Today, we are happy to announce that we have now open-sourced our MongoDB to BigQuery ETL Engine also. The feature set for both of them are similar in that they solve similar challenges when one tried to move unstructured data like JSON or BSON to structured data warehouse like Hive or Google BigQuery. These challenges can be summarized as:

  1. JSON has weak data type. An attribute “zipcode” can have integer in one JSON object, and string in another JSON object. With data warehouses, the user needs to define the data type associated with a column ahead of time, and usually hard to change afterwards.
  2. JSON doesn’t have a schema. Because of this flexibility, new attributes are added over time, and data engineers usually need to play catch-up to add new columns to accomodate these new attributes.
  3. Nested and Array data types. JSON has nested and array structure that doesn’t translate well into a typical data warehouse table schema.

The ETL engine performs a scan over the data collection and creates a schema automatically. For Google BigQuery, since it supports certain nested and array data type, the ETL engine can either split those data into child tables or keep them inline.

You can download our MongoDB to Google BigQuery ETL engine here. Let us know what you think and how it can be improved.