Documentation

Purrito v1

1. Introduction

Purrito is an easy to use, open-source ETL tool which transfers data from MongoDB to PostgreSQL.

You can download it from PyPI if you want to

  • migrate from MongoDB to PostgreSQL
  • use BI tools like Periscope but want to keep using MongoDB at the same time

Purrito is released under the terms of the MIT License.

1.1 Features

  • transfer collections from MongoDB to Postgres tables
  • continue syncing with MongoDB after data transfer is finished
  • generate the schema based on existing MongoDB collections
  • update the schema while Purrito is syncing PostgreSQL with MongoDB
  • transfer data to a custom schema in Postgres
  • when (re)starting:
    • update, drop or trucate the existing tables,
    • create or empty the whole schema etc.
    • start syncing with MongoDB from a specific timestamp
  • view collection transfer statistics

2. Download & Install

The current stable release is v1.0

Install using pip

Purrito is available on PyPI. Make sure to have an up-to-date pip:

pip install -U pip

Install Purrito:

pip install purrito

Installing from source

You can also download Purrito from Github and install from source:

  • clone therepo
  • in the same folder where the setup.py file is located, type:

    python3 setup.py install

After the installation is done, check if you have the latest version of Purrito:

purrito -v

3. Quickstart

With Docker

We created an example for those who have Docker installed. Download the example and in purrito-examples/example1/ type

docker compose up to start the entire app

docker compose down to stop the app

Without Docker

  1. Install Purrito using pip:

    pip install purrito
  2. Start MongoDB as a replica set

    Note: this step is necessary if we want Purrito to sync the two databases. Otherwise Purrito will stop after transfering all the collections from the collection map.

  3. Generate the collection map that will describe the database schema:

    purrito -m -mdb mongodb://localhost:27017 -n db_name
  4. Start PostgreSQL
  5. To start the data transfer, type:

    purrito -cf collections.yml -pg postgres://localhost:5432/postgres -mdb mongodb://localhost:27017 -n db_name -t

Purrito transfers all collections to the public schema and starts tailing the oplog.

Changes to any documents in MongoDB should be applied to the corresponding rows in PostgreSQL.

4. The Collection Map

-cf, --collection-file

The collection map is a YAML file which contains information about the database and the collections that will be transferred. The collections that are not part of the map will be ignored when loading data into Postgres.

The map should have the following structure:

              mongo_database_name:
                Collection1:
                  :columns:
                    - column1:
                      :source: field1
                      :type: TEXT
                    - column2:
                      :source: field2
                      :type: DOUBLE PRECISION
                  :meta:
                    :table: collection1
                    :extra_props: JSONB
                Collection2:
                  :columns:
                    - column3:
                      :source: field3
                      :type: TEXT
                    - column4:
                      :source: field4
                      :type: TIMESTAMP
                    - ...
                  :meta:
                    :table: collection2
                    :extra_props: JSONB
            
Example setup.yml
              test_db:
                Cat:
                  :columns:
                    - id:
                      :source: _id
                      :type: TEXT
                    - name:
                      :source: name
                      :type: TEXT
                    - active:
                      :source: active
                      :type: BOOLEAN
                    - number_of_legs:
                      :source: numberOfLegs
                      :type: DOUBLE PRECISION
                    - created_at:
                      :source: createdAt
                      :type: TIMESTAMP
                  :meta:
                    :table: cat
                    :extra_props: JSONB
            

Explanation: When connecting to the MongoDB instance, Purrito looks for the database name in the collection map (my_mongo_database). Collection Cat will be transferred to table cat (described under :meta -> :table). Table cat will have 6 columns with the following types:

              id: text
              name: text
              active: boolean
              domains: jsonb
              created_at: text
              extra_props: jsonb
            

Data Types

Purrito uses 5 different data types when creating rows for a table:

  • boolean
  • double precision
  • text
  • timestamp
  • jsonb
Type Conversion

When starting Purrito, the collection map is inserted into a table called purr_transfer_info. After changing any column type in the map, Purrito tries to apply the changes to the tables in Postgres. This is sometimes rejected by Postgres and in that case, an error will be logged.

The following table shows which conversions can happen without an error

boolean double precision text timestamp jsonb
boolean
double precision
text
timestamp
jsonb

Conversions that have ✔, can be easily handled by Postgres.

Extra properties

It is not necessary to have definition for every key in the collection map. In that case, a column named _extra_properties with type JSONB can be created in PostgreSQL. Every key and value that lacks the definition can be merged together forming a new value and inserted into this column. Purrito does not transfer extra properties by default because it slows down the data transfer.

To include extra properties, start Purrito with option -ex.

Generating the map

Writing a collection map might be exhausting, especially when there are lots of collections to transfer. Generating the collection map can save a lot of time. Here is how it works:

When starting Purrito, by adding the flag -m (--map) and the Mongo connection information:

purrito -m -mdb mongodb://localhost:27017 -n db_name

Purrito generates the YAML file with the collection definitions. In case the data file can be changed before starting the data transfer Here are some things to keep in mind for generating a collection map:

  • Purrito expects collection and field names in camelCase and outputs them in snake_case. This may cause some issues when a document has _id and id fields, both will be converted to id. Having two columns with the same name is not possible in Postgres, therefore duplicates are not allowed in the collection map and any reoccurences of the same column names will be ignored.
                        name_db:
                          name_coll:
                            :columns:
                            - :source: _id
                              :type: TEXT
                              id: null
                            - :source: jellyFish
                              :type: TEXT
                              jelly_fish: null
                      
  • there is no definition for extra properties when generating the map

    Note: we may add them later

  • a generated collection map will have a column name equal to null which can be left out when writing the collection map alone:
                      name_db:
                        name_coll:
                          :columns:
                          - :source: _id
                            :type: TEXT
                            id: null
                    
  • collections with no documents are skipped
  • Purrito takes a collection sample which consists of the last 100 documents of each collection and checks each key and value it finds. After that, it determines the type for each value in a document and records it. If inconsistent data was found and values with different types belong to the same key, Purrito will choose the type which occured the most frequently.

After generating the collection map, start the data tranfer as explained earlier.

Updating the map

Sooner or later you will need to change your collection map. If you restart Purrito and keep your existing schema, your tables' fields in PG will be compared to the ones in your collection map.

  • Adding a new attribute will make Purrito update your entire collection.
  • Removing an attribute drops the column.
  • Changing a column name will drop the old column, create the new one and update your entire collection.
  • Changing a column type will result with an attempt to ALTER the column. If the attempt was unsuccessful, Purrito will try to convert your data to JSONB.

What about the tables in my Postgres database that were in the schema before using Purrito?

If you have any existing tables in your schema that you want to be left alone, make sure that you

  • don't include those tables in your collection map as a target (:meta: :table: cat)
  • set schema_reset to false
  • set table_drop to false
  • set table_truncate to false

5. Setup

There are multiple options which can be set on the command line but things can easily get messy. In some situations, it is better to organize these settings into a separate file.

Setup file

-h, --help show help message -sf, --setup-file

The setup file is a YAML file that contains:

  • settings for PostgreSQL
  • settings for MongoDB
  • general settings

Settings for PostgreSQL

  • db_name: name of database
  • connection: connection string to database
  • schema_name: name of schema where the collections will be transfered to
  • schema_reset:
    • true: drop the existing schema and create a new one with the same name
    • false: keep the existing schema
  • table_truncate: truncate all tables in the collection map before starting data transfer
  • table_drop: drop table before starting data transfer

Settings for MongoDB

  • db_name: name of database
  • connection: connection string to database

General Settings

  • tailing: keep tailing the oplog after collection transfer is finished
  • include_extra_props: include properties which are not described in the collection map

purrito -sf path/to/setup.yml -cf path/to/collections.yml

Command line options

You can set all the variables from the previous section using the command line. Passing connection strings to MongoDB and Postgres is mandatory.

  • -sf or --setup-file: path to the setup file if exists
  • -cf or --collection-file: path to the collection file if exists
  • -td or --table-drop: defaults to false
  • -tt or --table-truncate: defaults to false
  • -sr or --schema-reset: defaults to false
  • -sn or --schema-name: defaults to public
  • -pg or --pg-connection: connection string to PG database (*)
  • -mdb or --mongo-connection: connection string to Mongo database (*)
  • -n or --mongo-db-name: equivalent of db_name for MongoDB (*)
  • -t or --tail: equivalent of tailing; defaults to false
  • -s or --start: equivalent of tailing_from; type should be date, defaults to None
  • -tsdb or --start-from-tsdb: equivalent of tailing_from_db; defaults to false
  • -ex or --include-extra-properties: defaults to false

Starting Purrito without using a setup file:

purrito -cf path/to/collections.yml -pg postgres://127.0.0.1:5432/postgres -mdb mongodb://localhost:27017 -n mongo_db_name -t

Variables followed by (*) are mandatory.

Example setup.yml
               postgres: 
                  db_name: my_pg_database
                  connection: postgres://127.0.0.1:5432/postgres
                  schema_name: maine_coon
                  schema_reset: false
                  table_truncate: false
                  table_drop: false
               mongo:
                  db_name: my_mongo_database
                  connection: mongodb://localhost:27017
               tailing: true
               include_extra_props: true
              

6. Tailing

There are two types of tailing:

  • tailing which is preceded by collection transfer
  • tailing from specific timestamp (only if collections exist)

Tailing without collection transfer must have a starting point defined by the user. This starting point can be a timestamp which can be passed using -ts datetime. Purrito creates a table called purr_info where it saves the current Unix epoch time. This timestamp is updated every couple of minutes. The update happens only if the current document is successfully transfered. When an error occurs, you can be sure that Purrito saved the latest successful timestamp to purr_info and can continue tailing the oplog from -tsdb. purr_info is always created when you first start Purrito. If you do not add any timestamps to -s:

  • the current timestamp will be inserted
  • it will read the latest timestamp and check if the oplog has any new entries
  • if the timestamp is "too old", Purrito updates all collections

if the timestamp is "too old", Purrito updates all collections

  • the latest timestamp will be inserted into purr_info

7. Output

Logs

Purrito logs have the following form: [type] + [component] + message . Type can be info, warning, error, etc. Component is the part of the application that the log "belongs to".

Stat tables and recovery tables

Purrito creates several tables. These tables contain valuable information about the data transfer and in some cases help Purrito recover itself in case of a crash.

purr_info

Holds the timestamp of the latest successfully synced document and lets Purrito recover in case of a crash.

purr_oplog

This table records the documents that were tailed. Contains the operation, timestamp, and the document(s) which were synced after the collection transfer.

purr_error

  • purr_error contains detailed information about errors
  • which component of Purrito detected the error
  • message
  • timestamp
  • This table contains errors that happened during data transfer.

    purr_transfer_stats

    This table helps us see how our data grows over time, it contains information about the collection trasfer:

    • name of the relation,
    • action (INSERT, UPSERT or FULL VACUUM),
    • number of rows affected,
    • start and end of the transfer

    purr_collection_map

    Contains information about each relation:

    • name of the collection in MongoDB
    • name of the relation in PostgreSQL
    • column names and corresponding types
    • timestamp of the insert/update of the current row
    • the query that can be used to update the table definition

    Query Example
                SELECT 
                  relation, 
                  action, 
                  TO_TIMESTAMP(ts_end) - TO_TIMESTAMP(ts_start) duration
                FROM purr_transfer_stats 
                ORDER BY ts_start DESC;
              

    Connectivity issues

    When starting for the first time, Purrito creates a table called purr_info. This table contains a timestamp which is refreshed every couple of minutes in case of a successful transfer. If Purrito is disconnected from the database, it waits a couple of seconds before attempting to reconnect. If succeeded, Purrito first checks purr_info for the latest timestamp it managed to save and continues tailing from there.

    8. Contribute

    Purrito is an open-source project that was built using Python. If you would like to contribute, check out our guidelines

    Our most valuable contributors

    Towelie

    helps us think out of the box

    BIO

    Kenny McCormick

    specializes in removing dead code

    BIO

    Kyle Broflovski

    makes sure we learn something every day

    BIO

    Guidelines

    Contributing to Purrito is a great way to learn more about MongoDB, Python and PostgreSQL

    PEP 8 Conventions

    To make sure you follow PEP 8, use pycodestyle (previously pep8 ). Pycodestyle is a tool that checks your code against style conventions in PEP 8.

    Cleanup

    We like to clean up our code to make sure that no unused functions were left behind. We expect you to do the same so the code can be cleaner, easier to understand and also easier to maintain. Vulture is a great tool to eliminate unused code. Remember, YAGNI :)

                    def unused_function():
                      print("Delete me!")                     
                

    Code Metrics

    Radon is an awesome tool which computes various code metrics like raw metrics, cyclomatic complexity, etc. Check out their docs and see how you can write code that is easier to read and maintain.

    Commit Messages

    When it comes to commit messages, we love to follow rules defined by Angular so our commit messages make more sense and also easier to read:

                  feat(extractor): adding slack integration 
                  test(tailer): updating test for checking types
                

    Submitting a PR

    ...

    Plans

    In the future, we would like to add some new features to Purrito.

    • slack integration
    • data transfer from multiple MongoDB instances
    • exposing an API
    • ...