Documentation

Purrito v1

1. Introduction

Purrito is an easy to use, open-source ETL tool which transfers data from MongoDB to PostgreSQL.

You can download it from PyPI if you want to

migrate from MongoDB to PostgreSQL
use BI tools like Periscope but want to keep using MongoDB at the same time

Purrito is released under the terms of the MIT License.

1.1 Features

transfer collections from MongoDB to Postgres tables
continue syncing with MongoDB after data transfer is finished
generate the schema based on existing MongoDB collections
update the schema while Purrito is syncing PostgreSQL with MongoDB
transfer data to a custom schema in Postgres
when (re)starting:
- update, drop or trucate the existing tables,
- create or empty the whole schema etc.
- start syncing with MongoDB from a specific timestamp
view collection transfer statistics

2. Download & Install

The current stable release is v1.0

Install using pip

Purrito is available on PyPI. Make sure to have an up-to-date pip:


              pip install -U pip

Install Purrito:


              pip install purrito

Installing from source

You can also download Purrito from Github and install from source:

clone therepo
in the same folder where the setup.py file is located, type:
python3 setup.py install

After the installation is done, check if you have the latest version of Purrito:

purrito -v

3. Quickstart

With Docker

We created an example for those who have Docker installed. Download the example and in purrito-examples/example1/ type

docker compose up to start the entire app

docker compose down to stop the app

Without Docker

Install Purrito using pip:
pip install purrito
Start MongoDB as a replica set
Note: this step is necessary if we want Purrito to sync the two databases. Otherwise Purrito will stop after transfering all the collections from the collection map.
Generate the collection map that will describe the database schema:
purrito -m -mdb mongodb://localhost:27017 -n db_name
Start PostgreSQL
To start the data transfer, type:
purrito -cf collections.yml -pg postgres://localhost:5432/postgres -mdb mongodb://localhost:27017 -n db_name -t

Purrito transfers all collections to the public schema and starts tailing the oplog.

Changes to any documents in MongoDB should be applied to the corresponding rows in PostgreSQL.

4. The Collection Map


              -cf, --collection-file

The collection map is a YAML file which contains information about the database and the collections that will be transferred. The collections that are not part of the map will be ignored when loading data into Postgres.

The map should have the following structure:

              mongo_database_name:
                Collection1:
                  :columns:
                    - column1:
                      :source: field1
                      :type: TEXT
                    - column2:
                      :source: field2
                      :type: DOUBLE PRECISION
                  :meta:
                    :table: collection1
                    :extra_props: JSONB
                Collection2:
                  :columns:
                    - column3:
                      :source: field3
                      :type: TEXT
                    - column4:
                      :source: field4
                      :type: TIMESTAMP
                    - ...
                  :meta:
                    :table: collection2
                    :extra_props: JSONB

Example setup.yml

              test_db:
                Cat:
                  :columns:
                    - id:
                      :source: _id
                      :type: TEXT
                    - name:
                      :source: name
                      :type: TEXT
                    - active:
                      :source: active
                      :type: BOOLEAN
                    - number_of_legs:
                      :source: numberOfLegs
                      :type: DOUBLE PRECISION
                    - created_at:
                      :source: createdAt
                      :type: TIMESTAMP
                  :meta:
                    :table: cat
                    :extra_props: JSONB

Explanation: When connecting to the MongoDB instance, Purrito looks for the database name in the collection map (my_mongo_database). Collection Cat will be transferred to table cat (described under :meta -> :table). Table cat will have 6 columns with the following types:

              id: text
              name: text
              active: boolean
              domains: jsonb
              created_at: text
              extra_props: jsonb

Data Types

Purrito uses 5 different data types when creating rows for a table:

boolean
double precision
text
timestamp
jsonb

Type Conversion

When starting Purrito, the collection map is inserted into a table called purr_transfer_info. After changing any column type in the map, Purrito tries to apply the changes to the tables in Postgres. This is sometimes rejected by Postgres and in that case, an error will be logged.

The following table shows which conversions can happen without an error

	boolean	double precision	text	timestamp	jsonb
boolean	✓		✓		✓
double precision		✓	✓		✓
text			✓		✓
timestamp			✓	✓	✓
jsonb			✓		✓

Conversions that have ✔, can be easily handled by Postgres.

Extra properties

It is not necessary to have definition for every key in the collection map. In that case, a column named _extra_properties with type JSONB can be created in PostgreSQL. Every key and value that lacks the definition can be merged together forming a new value and inserted into this column. Purrito does not transfer extra properties by default because it slows down the data transfer.

To include extra properties, start Purrito with option -ex.

Generating the map

Writing a collection map might be exhausting, especially when there are lots of collections to transfer. Generating the collection map can save a lot of time. Here is how it works:

When starting Purrito, by adding the flag -m (--map) and the Mongo connection information:

purrito -m -mdb mongodb://localhost:27017 -n db_name

Purrito generates the YAML file with the collection definitions. In case the data file can be changed before starting the data transfer Here are some things to keep in mind for generating a collection map:

Purrito expects collection and field names in camelCase and outputs them in snake_case. This may cause some issues when a document has _id and id fields, both will be converted to id. Having two columns with the same name is not possible in Postgres, therefore duplicates are not allowed in the collection map and any reoccurences of the same column names will be ignored.
```
                    name_db:
                      name_coll:
                        :columns:
                        - :source: _id
                          :type: TEXT
                          id: null
                        - :source: jellyFish
                          :type: TEXT
                          jelly_fish: null
                  
```
there is no definition for extra properties when generating the map
Note: we may add them later

a generated collection map will have a column name equal to null which can be left out when writing the collection map alone:

                  name_db:
                    name_coll:
                      :columns:
                      - :source: _id
                        :type: TEXT
                        id: null

collections with no documents are skipped
Purrito takes a collection sample which consists of the last 100 documents of each collection and checks each key and value it finds. After that, it determines the type for each value in a document and records it. If inconsistent data was found and values with different types belong to the same key, Purrito will choose the type which occured the most frequently.

After generating the collection map, start the data tranfer as explained earlier.

Updating the map

Sooner or later you will need to change your collection map. If you restart Purrito and keep your existing schema, your tables' fields in PG will be compared to the ones in your collection map.

Adding a new attribute will make Purrito update your entire collection.
Removing an attribute drops the column.
Changing a column name will drop the old column, create the new one and update your entire collection.
Changing a column type will result with an attempt to ALTER the column. If the attempt was unsuccessful, Purrito will try to convert your data to JSONB.

What about the tables in my Postgres database that were in the schema before using Purrito?

If you have any existing tables in your schema that you want to be left alone, make sure that you

don't include those tables in your collection map as a target (:meta: :table: cat)
set schema_reset to false
set table_drop to false
set table_truncate to false

5. Setup

There are multiple options which can be set on the command line but things can easily get messy. In some situations, it is better to organize these settings into a separate file.

Setup file

-h, --help show help message -sf, --setup-file

The setup file is a YAML file that contains:

settings for PostgreSQL
settings for MongoDB
general settings

Settings for PostgreSQL

db_name: name of database
connection: connection string to database
schema_name: name of schema where the collections will be transfered to
schema_reset:
- true: drop the existing schema and create a new one with the same name
- false: keep the existing schema
table_truncate: truncate all tables in the collection map before starting data transfer
table_drop: drop table before starting data transfer

Settings for MongoDB

db_name: name of database
connection: connection string to database

General Settings

tailing: keep tailing the oplog after collection transfer is finished
include_extra_props: include properties which are not described in the collection map

purrito -sf path/to/setup.yml -cf path/to/collections.yml

Command line options

You can set all the variables from the previous section using the command line. Passing connection strings to MongoDB and Postgres is mandatory.

-sf or --setup-file: path to the setup file if exists
-cf or --collection-file: path to the collection file if exists
-td or --table-drop: defaults to false
-tt or --table-truncate: defaults to false
-sr or --schema-reset: defaults to false
-sn or --schema-name: defaults to public
-pg or --pg-connection: connection string to PG database (*)
-mdb or --mongo-connection: connection string to Mongo database (*)
-n or --mongo-db-name: equivalent of db_name for MongoDB (*)
-t or --tail: equivalent of tailing; defaults to false
-s or --start: equivalent of tailing_from; type should be date, defaults to None
-tsdb or --start-from-tsdb: equivalent of tailing_from_db; defaults to false
-ex or --include-extra-properties: defaults to false

Starting Purrito without using a setup file:

purrito -cf path/to/collections.yml -pg
              postgres://127.0.0.1:5432/postgres -mdb mongodb://localhost:27017
              -n mongo_db_name -t

Variables followed by (*) are mandatory.

Example setup.yml

               postgres: 
                  db_name: my_pg_database
                  connection: postgres://127.0.0.1:5432/postgres
                  schema_name: maine_coon
                  schema_reset: false
                  table_truncate: false
                  table_drop: false
               mongo:
                  db_name: my_mongo_database
                  connection: mongodb://localhost:27017
               tailing: true
               include_extra_props: true

6. Tailing

There are two types of tailing:

tailing which is preceded by collection transfer
tailing from specific timestamp (only if collections exist)

Tailing without collection transfer must have a starting point defined by the user. This starting point can be a timestamp which can be passed using -ts datetime. Purrito creates a table called purr_info where it saves the current Unix epoch time. This timestamp is updated every couple of minutes. The update happens only if the current document is successfully transfered. When an error occurs, you can be sure that Purrito saved the latest successful timestamp to purr_info and can continue tailing the oplog from -tsdb. purr_info is always created when you first start Purrito. If you do not add any timestamps to -s:

the current timestamp will be inserted
it will read the latest timestamp and check if the oplog has any new entries
if the timestamp is "too old", Purrito updates all collections

if the timestamp is "too old", Purrito updates all collections

the latest timestamp will be inserted into purr_info

7. Output

Logs

Purrito logs have the following form: [type] + [component] + message . Type can be info, warning, error, etc. Component is the part of the application that the log "belongs to".

Stat tables and recovery tables

Purrito creates several tables. These tables contain valuable information about the data transfer and in some cases help Purrito recover itself in case of a crash.

purr_info

Holds the timestamp of the latest successfully synced document and lets Purrito recover in case of a crash.

purr_oplog

This table records the documents that were tailed. Contains the operation, timestamp, and the document(s) which were synced after the collection transfer.

purr_error

purr_error contains detailed information about errors

which component of Purrito detected the error

message

timestamp

This table contains errors that happened during data transfer.

purr_transfer_stats

This table helps us see how our data grows over time, it contains information about the collection trasfer:

name of the relation,
action (INSERT, UPSERT or FULL VACUUM),
number of rows affected,
start and end of the transfer

purr_collection_map

Contains information about each relation:

name of the collection in MongoDB
name of the relation in PostgreSQL
column names and corresponding types
timestamp of the insert/update of the current row
the query that can be used to update the table definition

Query Example

            SELECT 
              relation, 
              action, 
              TO_TIMESTAMP(ts_end) - TO_TIMESTAMP(ts_start) duration
            FROM purr_transfer_stats 
            ORDER BY ts_start DESC;

Connectivity issues

When starting for the first time, Purrito creates a table called purr_info. This table contains a timestamp which is refreshed every couple of minutes in case of a successful transfer. If Purrito is disconnected from the database, it waits a couple of seconds before attempting to reconnect. If succeeded, Purrito first checks purr_info for the latest timestamp it managed to save and continues tailing from there.

8. Contribute

Purrito is an open-source project that was built using Python. If you would like to contribute, check out our guidelines

Our most valuable contributors

Towelie

helps us think out of the box

BIO

Kenny McCormick

specializes in removing dead code

BIO

Kyle Broflovski

makes sure we learn something every day

BIO

Guidelines

Contributing to Purrito is a great way to learn more about MongoDB, Python and PostgreSQL

PEP 8 Conventions

To make sure you follow PEP 8, use pycodestyle (previously pep8 ). Pycodestyle is a tool that checks your code against style conventions in PEP 8.

Cleanup

We like to clean up our code to make sure that no unused functions were left behind. We expect you to do the same so the code can be cleaner, easier to understand and also easier to maintain. Vulture is a great tool to eliminate unused code. Remember, YAGNI :)

                def unused_function():
                  print("Delete me!")

Code Metrics

Radon is an awesome tool which computes various code metrics like raw metrics, cyclomatic complexity, etc. Check out their docs and see how you can write code that is easier to read and maintain.

Commit Messages

When it comes to commit messages, we love to follow rules defined by Angular so our commit messages make more sense and also easier to read:

              feat(extractor): adding slack integration 
              test(tailer): updating test for checking types

Submitting a PR

...

Plans

In the future, we would like to add some new features to Purrito.

slack integration
data transfer from multiple MongoDB instances
exposing an API
...