is a company that has been making people’s lives easier and more convenient in the past 4 years. The idea is simple: you order a boost via the mobile app and you get high-quality gas at your parking spot while you are away. This is the wonderful team of engineers who make all this possible:
In the beginning, we had to make a choice about the DBMS around which we would build Booster: we needed flexibility, high availability and easy development. All of these were offered by MongoDB for free. To make our loyal users happy, we constantly improve our applications which sometimes causes changes in our data structures. MongoDB handles these situations well and lets us focus on different problems.
Booster quickly became part of people’s everyday lives in Silicon Valley, resulting with rapid growth which has not stopped since then.
Need for PostgreSQL
Just like many other companies, we recognized the importance of data analysis as we grew. Data analysis helps us increase our productivity and make better decisions for the future. We needed the ability to perform joins which means that we needed a relational system. Our data engineers were mostly familiar with SQL and we just needed to pick a RDBMS for our data warehouse.
Finding the right RDBMS is not an easy task and there are many database vendors to choose from. We wanted to keep our costs as low as possible.
We chose Postgres since it offered the most important things we were looking for:
- flexible (the JSON support lets us go schema-less whenever we want to)
Adventures with MoSQL
We used an ETL tool calledto transfer our data from MongoDB to Postgres. MoSQL was created by Stripe and announced in 2013: free, open-source and easy-to-use. It transfers all of the collections defined in a collection map and syncs with MongoDB for changes. We loved MoSQL but unfortunately it is no longer maintained. In 2018 we faced multiple issues: MoSQL crashed when there were schema changes and connectivity issues. This resulted with reimporting millions of documents from the beginning on a weekly basis. Before the reimporting has been started, all of the tables were dropped by MoSQL, making us unable to generate reports and make plans for the next days and weeks when necessary. A full reimport took more than 5 hours. As time passed, the issues were more and more frequent and we were left with error messages which did not describe what went wrong. We recognized the need to take things into our own hands.
A Solution that Works
We needed a service our analytics team could rely on. We created Purrito to do what MoSQL did right and handle the issues MoSQL could not handle anymore. We did not want to make drastic changes in our app logic because we wanted the trasition to be as smooth as possible. We kept the collection map and generated an output similar to the one that MoSQL generates. This way we could avoid refactoring most of our SQL queries and integrate our solution with a only a few changes. We also kept in mind the connectivity issues MoSQL could not always handle.
Comparison: Purrito vs MoSQL
- transfering data to a custom schema
- generate the schema based on MongoDB collections
- when restarting, update, drop or trucate the existing tables, create or empty the whole schema etc.
- in case of connectivity issues wait for a while, attempt to reconnect to PostgreSQL or MongoDB and start tailing the oplog again from the timestamp saved in the database
- update collection map without dropping or stopping anything
(cca 50GB data, 10 000 000 rows)
|~ 3.5 h||~ 5 h|
|Size||x MB||y MB|
|Schema changes coming from MongoDB||logs an error and continues tailing||-|
|Database unavailable||tries to reconnect to the database||-|
|Schema changes||update the collection map in the database||
update the collection map file and restart
tailing from any date and time from the past or
from the current timestamp
|tail from the current timestamp|
create a collection map or
choose auto typecheck
|create a collection|