Documentation
Purrito v1
1. Introduction
Purrito is an easy-to-use, open-source ETL tool that transfers data from MongoDB to PostgreSQL.
You can download it from PyPI if you want to:
- migrate from MongoDB to PostgreSQL
- use BI tools like Periscope while continuing to use MongoDB
Purrito is released under the terms of the MIT License.
1.1 Features
- transfer collections from MongoDB to Postgres tables
- continue syncing with MongoDB after data transfer is finished
- generate the schema based on existing MongoDB collections
- update the schema while Purrito is syncing PostgreSQL with MongoDB
- transfer data to a custom schema in Postgres
- when (re)starting:
  - update, drop or truncate the existing tables
  - create or empty the whole schema, etc.
- start syncing with MongoDB from a specific timestamp
- view collection transfer statistics

2. Download & Install
The current stable release is v1.0
Install using pip
Purrito is available on PyPI. Make sure to have an up-to-date pip:
pip install -U pip
Install Purrito:
pip install purrito
Installing from source
You can also download Purrito from Github and install from source:
- clone the repo
- in the same folder where the setup.py file is located, type:
python3 setup.py install
After the installation is done, check if you have the latest version of Purrito:
purrito -v

3. Quickstart
With Docker
In purrito-examples/example1/, type:
docker compose up
to start the entire app, and:
docker compose down
to stop it.
Without Docker
- Install Purrito using pip:
pip install purrito
- Start MongoDB as a replica set
Note: this step is necessary if we want Purrito to sync the two databases. Otherwise Purrito will stop after transferring all the collections from the collection map. (A quick way to check the replica set is shown at the end of this section.)
- Generate the collection map that will describe the database schema:
purrito -m -mdb mongodb://localhost:27017 -n db_name
- Start PostgreSQL
- To start the data transfer, type:
purrito -cf collections.yml -pg postgres://localhost:5432/postgres -mdb mongodb://localhost:27017 -n db_name -t
Purrito transfers all collections to the public schema and starts tailing the oplog.
Changes to any documents in MongoDB should be applied to the corresponding rows in PostgreSQL.
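A quick way to verify that MongoDB is actually running as a replica set (and therefore has an oplog for Purrito to tail) is to ask the server directly. This small check is not part of Purrito; it assumes pymongo is installed and MongoDB is listening on the default port:

from pymongo import MongoClient
from pymongo.errors import OperationFailure

client = MongoClient("mongodb://localhost:27017")
try:
    # replSetGetStatus only succeeds on a member of a replica set
    status = client.admin.command("replSetGetStatus")
    print("Replica set is running:", status["set"])
except OperationFailure:
    print("MongoDB is not running as a replica set; tailing will not work.")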

4. The Collection Map
-cf, --collection-file
The collection map is a YAML file which contains information about the database and the collections that will be transferred. The collections that are not part of the map will be ignored when loading data into Postgres.
The map should have the following structure:
mongo_database_name:
  Collection1:
    :columns:
      - column1:
          :source: field1
          :type: TEXT
      - column2:
          :source: field2
          :type: DOUBLE PRECISION
    :meta:
      :table: collection1
      :extra_props: JSONB
  Collection2:
    :columns:
      - column3:
          :source: field3
          :type: TEXT
      - column4:
          :source: field4
          :type: TIMESTAMP
      - ...
    :meta:
      :table: collection2
      :extra_props: JSONB
Example collections.yml
test_db:
  Cat:
    :columns:
      - id:
          :source: _id
          :type: TEXT
      - name:
          :source: name
          :type: TEXT
      - active:
          :source: active
          :type: BOOLEAN
      - number_of_legs:
          :source: numberOfLegs
          :type: DOUBLE PRECISION
      - created_at:
          :source: createdAt
          :type: TIMESTAMP
    :meta:
      :table: cat
      :extra_props: JSONB
Explanation: When connecting to the MongoDB instance, Purrito looks for the database name in the collection map (test_db). Collection Cat will be transferred to table cat (described under :meta -> :table). Table cat will have 6 columns with the following types:
- id: text
- name: text
- active: boolean
- number_of_legs: double precision
- created_at: timestamp
- extra_props: jsonb
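To make this concrete, here is a hypothetical Cat document and the row it would turn into. The values are made up; only the field-to-column mapping comes from the map above:

# A hypothetical document from the Cat collection
cat_document = {
    "_id": "5d2e0a7c9b1e8a0001a1b2c3",   # -> id (TEXT)
    "name": "Maine",                      # -> name (TEXT)
    "active": True,                       # -> active (BOOLEAN)
    "numberOfLegs": 4,                    # -> number_of_legs (DOUBLE PRECISION)
    "createdAt": "2019-07-16T12:00:00Z",  # -> created_at (TIMESTAMP)
}
# Illustrative row in table cat:
# id                       | name  | active | number_of_legs | created_at
# 5d2e0a7c9b1e8a0001a1b2c3 | Maine | true   | 4              | 2019-07-16 12:00:00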
Data Types
Purrito uses 5 different data types when creating columns for a table:
- boolean
- double precision
- text
- timestamp
- jsonb
Type Conversion
When starting Purrito, the collection map is inserted into a table
called purr_transfer_info
. After changing any column
type in the map, Purrito tries to apply the changes to the tables in
Postgres.
This is sometimes rejected by Postgres and in that case, an error
will be logged.
The following table shows which conversions can happen without an error:
| | boolean | double precision | text | timestamp | jsonb |
|---|---|---|---|---|---|
| boolean | ✓ | ✓ | ✓ | | |
| double precision | ✓ | ✓ | ✓ | | |
| text | ✓ | ✓ | | | |
| timestamp | ✓ | ✓ | ✓ | | |
| jsonb | ✓ | ✓ | | | |
Conversions marked with ✓ can be handled by Postgres without an error.
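To see what a rejected conversion looks like on the Postgres side, here is a standalone psycopg2 sketch. It is not code that Purrito runs; the connection string, table and column names are only examples:

import psycopg2

conn = psycopg2.connect("postgresql://localhost:5432/postgres")
conn.autocommit = True

with conn.cursor() as cur:
    try:
        # text -> double precision only works if every existing value can be cast
        cur.execute(
            "ALTER TABLE cat ALTER COLUMN name TYPE DOUBLE PRECISION "
            "USING name::double precision;"
        )
    except psycopg2.Error as exc:
        # Postgres rejects the cast as soon as it hits a non-numeric value
        print("Conversion rejected:", exc)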
Extra properties
It is not necessary to have a definition for every key in the collection map. In that case, a column named _extra_properties with type JSONB can be created in PostgreSQL. Every key and value that lacks a definition is merged into a single JSON value and inserted into this column.
Purrito does not transfer extra properties by default because it slows down the data transfer. To include extra properties, start Purrito with the option -ex.
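As an illustration, suppose the map only defines the columns id and name for the Cat collection. The remaining keys of a document (hypothetical values below) would then be collected into the extra-properties column:

# A document with fields that are not defined in the collection map
doc = {
    "_id": "5d2e0a7c9b1e8a0001a1b2c3",
    "name": "Maine",
    "favouriteToy": "laser pointer",
    "weight": 4.2,
}
# id and name are written to their own columns; the undefined keys are merged
# into one JSONB value for the extra-properties column:
# {"favouriteToy": "laser pointer", "weight": 4.2}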
Generating the map
Writing a collection map by hand can be tedious, especially when there are lots of collections to transfer. Generating the collection map can save a lot of time. Here is how it works:
Start Purrito with the flag -m (--map) and the Mongo connection information:
purrito -m -mdb mongodb://localhost:27017 -n db_name
Purrito generates the YAML file with the collection definitions. If needed, this file can be edited before starting the data transfer. Here are some things to keep in mind when generating a collection map:
- Purrito expects collection and field names in camelCase and outputs them in snake_case (see the naming sketch at the end of this section). This may cause issues when a document has both _id and id fields: both will be converted to id. Having two columns with the same name is not possible in Postgres, therefore duplicates are not allowed in the collection map and any reoccurrence of the same column name will be ignored.
name_db:
  name_coll:
    :columns:
      - :source: _id
        :type: TEXT
        id: null
      - :source: jellyFish
        :type: TEXT
        jelly_fish: null
- there is no definition for extra properties when generating the map
Note: we may add them later
- in a generated collection map, each column name is followed by the value null, which can be left out when writing the collection map by hand:
name_db:
  name_coll:
    :columns:
      - :source: _id
        :type: TEXT
        id: null
- collections with no documents are skipped
- Purrito takes a collection sample which consists of the last 100 documents of each collection and checks each key and value it finds. After that, it determines the type of each value in a document and records it. If inconsistent data is found and values with different types belong to the same key, Purrito will choose the type which occurred most frequently (a rough sketch of this idea follows below).
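The majority vote over sampled types can be pictured roughly like this. It is a minimal standalone sketch with simplified type names, not Purrito's actual implementation:

from collections import Counter

def guess_column_type(sample_values):
    """Pick the most frequent type among the values sampled for one key."""
    def type_of(value):
        if isinstance(value, bool):
            return "BOOLEAN"
        if isinstance(value, (int, float)):
            return "DOUBLE PRECISION"
        if isinstance(value, (dict, list)):
            return "JSONB"
        return "TEXT"
    counts = Counter(type_of(v) for v in sample_values)
    return counts.most_common(1)[0][0]

# Values seen for the same key across the sampled documents:
print(guess_column_type([4, 4.5, "unknown", 3]))  # DOUBLE PRECISION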
After generating the collection map, start the data transfer as explained earlier.
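The camelCase to snake_case renaming from the first point above can be sketched with a simple regex. This is only an illustration of the convention, not necessarily how Purrito implements it:

import re

def to_snake_case(field_name):
    """Convert a camelCase MongoDB field name to a snake_case column name."""
    field_name = field_name.lstrip("_")  # "_id" becomes "id"
    return re.sub(r"(?<!^)(?=[A-Z])", "_", field_name).lower()

print(to_snake_case("numberOfLegs"))  # number_of_legs
print(to_snake_case("_id"))           # id
print(to_snake_case("id"))            # id -> collides with "_id"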
Updating the map
Sooner or later you will need to change your collection map. If you restart Purrito and keep your existing schema, your tables' fields in PG will be compared to the ones in your collection map.
- Adding a new attribute will make Purrito update your entire collection.
- Removing an attribute drops the column.
- Changing a column name will drop the old column, create the new one and update your entire collection.
- Changing a column type will result in an attempt to ALTER the column. If the attempt is unsuccessful, Purrito will try to convert your data to JSONB.
What about the tables in my Postgres database that were in the schema before using Purrito?
If you have any existing tables in your schema that you want to be left alone, make sure that you
- don't include those tables in your collection map as a target (:meta: :table: cat)
- set schema_reset to false
- set table_drop to false
- set table_truncate to false

5. Setup
There are multiple options which can be set on the command line but things can easily get messy. In some situations, it is better to organize these settings into a separate file.
Setup file
-h, --help
show help message
-sf, --setup-file
The setup file is a YAML file that contains:
- settings for PostgreSQL
- settings for MongoDB
- general settings
Settings for PostgreSQL
- db_name: name of database
- connection: connection string to database
- schema_name: name of schema where the collections will be transferred to
- schema_reset:
  - true: drop the existing schema and create a new one with the same name
  - false: keep the existing schema
- table_truncate: truncate all tables in the collection map before starting data transfer
- table_drop: drop table before starting data transfer
Settings for MongoDB
- db_name: name of database
- connection: connection string to database
General Settings
- tailing: keep tailing the oplog after collection transfer is finished
- include_extra_props: include properties which are not described in the collection map
purrito -sf path/to/setup.yml -cf path/to/collections.yml
Command line options
You can set all the variables from the previous section using the command line. Passing connection strings to MongoDB and Postgres is mandatory.
- -sf or --setup-file: path to the setup file, if it exists
- -cf or --collection-file: path to the collection file, if it exists
- -td or --table-drop: defaults to false
- -tt or --table-truncate: defaults to false
- -sr or --schema-reset: defaults to false
- -sn or --schema-name: defaults to public
- -pg or --pg-connection: connection string to PG database (*)
- -mdb or --mongo-connection: connection string to Mongo database (*)
- -n or --mongo-db-name: equivalent of db_name for MongoDB (*)
- -t or --tail: equivalent of tailing; defaults to false
- -s or --start: equivalent of tailing_from; type should be date, defaults to None
- -tsdb or --start-from-tsdb: equivalent of tailing_from_db; defaults to false
- -ex or --include-extra-properties: defaults to false
Starting Purrito without using a setup file:
purrito -cf path/to/collections.yml -pg postgres://127.0.0.1:5432/postgres -mdb mongodb://localhost:27017 -n mongo_db_name -t
Variables followed by (*) are mandatory.
Example setup.yml
postgres:
  db_name: my_pg_database
  connection: postgres://127.0.0.1:5432/postgres
  schema_name: maine_coon
  schema_reset: false
  table_truncate: false
  table_drop: false
mongo:
  db_name: my_mongo_database
  connection: mongodb://localhost:27017
tailing: true
include_extra_props: true

6. Tailing
There are two types of tailing:
- tailing which is preceded by collection transfer
- tailing from a specific timestamp (only if collections exist)
Tailing without collection transfer must have a starting point defined by the user. This starting point can be a timestamp which can be passed as a datetime using -s. Purrito creates a table called purr_info where it saves the current Unix epoch time. This timestamp is updated every couple of minutes. The update happens only if the current document is successfully transferred. When an error occurs, you can be sure that Purrito saved the latest successful timestamp to purr_info and can continue tailing the oplog from there when started with -tsdb. purr_info is always created when you first start Purrito. If you do not pass any timestamp with -s:
- the current timestamp will be inserted
- it will read the latest timestamp and check if the oplog has any new entries
- if the timestamp is "too old", Purrito updates all collections
- the latest timestamp will be inserted into purr_info
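The resume-from-timestamp idea can be sketched with pymongo and psycopg2 as follows. This is only an illustration, not Purrito's actual code, and the purr_info column name used here (latest_ts) is a guess:

import psycopg2
from pymongo import CursorType, MongoClient
from bson.timestamp import Timestamp

# Read the last successfully synced epoch time from purr_info
# (the column name latest_ts is hypothetical).
pg = psycopg2.connect("postgresql://localhost:5432/postgres")
with pg.cursor() as cur:
    cur.execute("SELECT max(latest_ts) FROM purr_info;")
    epoch = int(cur.fetchone()[0])

# Tail the oplog from that point on.
oplog = MongoClient("mongodb://localhost:27017").local.oplog.rs
cursor = oplog.find(
    {"ts": {"$gt": Timestamp(epoch, 0)}},
    cursor_type=CursorType.TAILABLE_AWAIT,
    oplog_replay=True,
)
for entry in cursor:
    print(entry["op"], entry["ns"])  # operation type and affected namespace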

7. Output
Logs
Purrito logs have the following form: [type] + [component] + message. Type can be info, warning, error, etc. Component is the part of the application that the log "belongs to".
Stat tables and recovery tables
Purrito creates several tables. These tables contain valuable information about the data transfer and in some cases help Purrito recover itself in case of a crash.
purr_info
Holds the timestamp of the latest successfully synced document and lets Purrito recover in case of a crash.
purr_oplog
This table records the documents that were tailed. Contains the operation, timestamp, and the document(s) which were synced after the collection transfer.
purr_error
This table contains errors that happened during data transfer.
purr_transfer_stats
This table helps us see how our data grows over time; it contains information about the collection transfer:
- name of the relation,
- action (INSERT, UPSERT or FULL VACUUM),
- number of rows affected,
- start and end of the transfer
purr_collection_map
Contains information about each relation:
- name of the collection in MongoDB
- name of the relation in PostgreSQL
- column names and corresponding types
- timestamp of the insert/update of the current row
- the query that can be used to update the table definition
Query Example
SELECT relation,
       action,
       TO_TIMESTAMP(ts_end) - TO_TIMESTAMP(ts_start) duration
FROM purr_transfer_stats
ORDER BY ts_start DESC;
Connectivity issues
When starting for the first time, Purrito creates a table called purr_info. This table contains a timestamp which is refreshed every couple of minutes in case of a successful transfer. If Purrito is disconnected from the database, it waits a couple of seconds before attempting to reconnect. If it succeeds, Purrito first checks purr_info for the latest timestamp it managed to save and continues tailing from there.
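The reconnect behaviour described above is essentially a retry loop. Here is a simplified standalone sketch; the delay and connection string are assumptions, not Purrito's actual values:

import time
import psycopg2

def connect_with_retry(dsn, delay_seconds=5):
    """Keep trying to connect until Postgres is reachable again."""
    while True:
        try:
            return psycopg2.connect(dsn)
        except psycopg2.OperationalError:
            print("Postgres unreachable, retrying in", delay_seconds, "seconds")
            time.sleep(delay_seconds)

conn = connect_with_retry("postgresql://localhost:5432/postgres")
# After reconnecting, Purrito would look up the latest timestamp in purr_info
# and continue tailing from there (see the sketch in the Tailing section).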

8. Contribute
Purrito is an open-source project built with Python. If you would like to contribute, check out our guidelines below.
Our most valuable contributors
Towelie
helps us think out of the box

Kenny McCormick
specializes in removing dead code

Kyle Broflovski
makes sure we learn something every day

Guidelines
Contributing to Purrito is a great way to learn more about MongoDB, Python and PostgreSQL
PEP 8 Conventions
To make sure you follow PEP 8, use pycodestyle (previously pep8). Pycodestyle is a tool that checks your code against the style conventions in PEP 8.
Cleanup
We like to clean up our code to make sure that no unused functions were left behind. We expect you to do the same so the code can be cleaner, easier to understand and also easier to maintain. Vulture is a great tool to eliminate unused code. Remember, YAGNI :)
def unused_function():
    print("Delete me!")
Code Metrics
Radon is an awesome tool which computes various code metrics like raw metrics, cyclomatic complexity, etc. Check out their docs and see how you can write code that is easier to read and maintain.
Commit Messages
When it comes to commit messages, we love to follow the rules defined by Angular so that our commit messages make more sense and are easier to read:
feat(extractor): adding slack integration
test(tailer): updating test for checking types
Submitting a PR
...
Plans
In the future, we would like to add some new features to Purrito.
- slack integration
- data transfer from multiple MongoDB instances
- exposing an API
- ...
