Sunday, March 31, 2024

The Docker Compose of ETL: Meerschaum Compose | by Bennett Meares | Jun, 2023

An example Meerschaum Compose file.
An example Meerschaum Compose project for ETL on weather data.

Pipes

The humble pipe is Meerschaum's abstraction for incremental ETL. Pipes have input and output connectors and store parameters that configure the behavior of their syncing processes. These parameters may be as simple as a SQL query or may include custom keys for use in your plugins.

Meerschaum pipes created by Meerschaum Compose.
Pipes from the above Compose project displayed in the web UI.

Because pipes' metadata are stored alongside their tables, they're easily editable (whether via edit pipes or on the web UI), which facilitates prototyping. But this dynamic nature introduces the same problem described at the beginning of this article: in order to scale development, a Compose file is needed to define a project's components in a way that can be easily version-controlled.

According to the Meerschaum Compose specification, pipes are defined in a list under the keys sync:pipes. Each item defines the keys and parameters needed to construct the pipe, like a blueprint for what you expect the pipes in the database to reflect.

For example, the following snippet would define a pipe that syncs a table weather from a remote PostgreSQL database (defined below as sql:source) to a local SQLite file (sql:dest in this project).

sync:
  pipes:
    - connector: "sql:source"
      metric: "weather"
      target: "weather"
      columns:
        datetime: "timestamp"
        station: "station"
      parameters:
        fetch:
          backtrack_minutes: 1440
        query: |-
          SELECT timestamp, station, temperature
          FROM weather

config:
  meerschaum:
    instance: "sql:dest"
    connectors:
      sql:
        source: "postgresql://user:pass@host:5432/db"
        dest: "sqlite:////tmp/dest.db"

This example would incrementally update a table named weather, using the datetime axis timestamp for range bounding (with one day of backtracking). That column plus the ID column station would together make up a composite primary key used for de-duplication.
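To make the range bounding and de-duplication concrete, here is a minimal pure-Python sketch of the idea. It is not Meerschaum's implementation: the function name incremental_sync, the in-memory dict standing in for the target table, and the sample station ID are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def incremental_sync(target, source_rows, backtrack_minutes=1440):
    """Merge source rows into `target`, keyed on (timestamp, station).

    `target` is a dict standing in for the destination table.
    Returns the number of rows whose key was not already present.
    """
    # Range bounding: only consider rows newer than the latest synced
    # timestamp minus the backtrack window (everything on the first sync).
    begin = None
    if target:
        newest = max(ts for ts, _station in target)
        begin = newest - timedelta(minutes=backtrack_minutes)

    inserted = 0
    for row in source_rows:
        if begin is not None and row["timestamp"] < begin:
            continue  # older than the backtrack window, skip
        key = (row["timestamp"], row["station"])  # composite key
        if key not in target:
            inserted += 1
        target[key] = row  # existing keys are overwritten, never duplicated
    return inserted


# Hypothetical sample data: one station reporting hourly.
target = {}
batch_1 = [
    {"timestamp": datetime(2023, 6, 1, 0), "station": "KATL", "temperature": 70.0},
    {"timestamp": datetime(2023, 6, 1, 1), "station": "KATL", "temperature": 71.0},
]
first = incremental_sync(target, batch_1)

# The second batch overlaps the first; only the new hour is added.
batch_2 = batch_1 + [
    {"timestamp": datetime(2023, 6, 1, 2), "station": "KATL", "temperature": 72.0},
]
second = incremental_sync(target, batch_2)
```

Re-reading one day of overlap on every sync is what lets late-arriving rows be picked up, while the composite key keeps the overlap from producing duplicates.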

The URI is written literally here only as an example; if you're committing a compose file, reference either an environment variable (e.g. $SECRET_URI) or your host Meerschaum configuration (e.g. MRSM{meerschaum:connectors:sql:source}).

Connectors

First, a quick refresher on Meerschaum connectors: you can define connectors in several ways, the most popular of which is through environment variables. Suppose you define your connection secrets in an environment file:

export MRSM_SQL_REMOTE='postgresql://user:pass@host:5432/db'
export MRSM_FOO_BAR='{
    "username": "abc",
    "password": "def"
}'

The first environment variable, MRSM_SQL_REMOTE, would define the connector sql:remote. If you sourced this file, you could verify this connector with the command mrsm show connectors sql:remote.
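The naming convention can be sketched in a few lines of Python. This is an illustration of the MRSM_<TYPE>_<LABEL> pattern described above, not Meerschaum's actual parser: the function connectors_from_env, the RESERVED set, and the choice to store a URI under a "uri" key are assumptions.

```python
import json

# Variables that configure Meerschaum itself rather than define a connector.
# Only MRSM_PLUGINS_DIR comes from this article; the real tool has more.
RESERVED = {"MRSM_PLUGINS_DIR"}


def connectors_from_env(environ, prefix="MRSM_"):
    """Map MRSM_<TYPE>_<LABEL> variables to 'type:label' connector keys."""
    connectors = {}
    for name, value in environ.items():
        if not name.startswith(prefix) or name in RESERVED:
            continue
        parts = name[len(prefix):].split("_", 1)
        if len(parts) != 2 or not all(parts):
            continue  # not of the form MRSM_<TYPE>_<LABEL>
        conn_type, label = parts[0].lower(), parts[1].lower()
        value = value.strip()
        # A JSON object carries arbitrary attributes; anything else is a URI.
        attrs = json.loads(value) if value.startswith("{") else {"uri": value}
        connectors[f"{conn_type}:{label}"] = attrs
    return connectors


# A plain dict stands in for os.environ so the sketch is self-contained.
env = {
    "MRSM_SQL_REMOTE": "postgresql://user:pass@host:5432/db",
    "MRSM_FOO_BAR": '{"username": "abc", "password": "def"}',
    "MRSM_PLUGINS_DIR": "./plugins",
    "PATH": "/usr/bin",
}
found = connectors_from_env(env)
```

Passing os.environ instead of the sample dict would apply the same mapping to your real shell environment.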

The second variable is an example of how to define a custom FooConnector, which you could create using the @make_connector decorator in a plugin. Custom connectors are a powerful tool, but for now, here's the basic structure:

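from meerschaum.connectors import make_connector, Connector

@make_connector
class FooConnector(Connector):
    # Attributes that must be present when the connector is defined.
    REQUIRED_ATTRIBUTES = ['username', 'password']

    def fetch(self, pipe, **kwargs):
        # Return the documents (rows) to sync into the pipe.
        docs = []
        return docs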

So we've just reviewed how to define connectors in our host environment. Let's see how to make these host connectors available in a Meerschaum project. In the compose file, all of the connectors we need for our project are defined under config:meerschaum:connectors. Use the MRSM{} syntax to reference the keys from your host environment and pass them into the project.

config:
  meerschaum:
    instance: "sql:app"
    connectors:
      sql:
        app: MRSM{meerschaum:connectors:sql:remote}
      foo:
        bar: MRSM{meerschaum:connectors:foo:bar}
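Conceptually, each MRSM{} reference is a colon-separated path into the host configuration. The sketch below shows one way such references could be resolved. It is an assumption-laden illustration, not Meerschaum's own code: the helper names resolve_refs and lookup and the shape of the sample host configuration are hypothetical.

```python
import re

# Matches MRSM{key:key:...} references.
MRSM_REF = re.compile(r"MRSM\{([^{}]+)\}")


def lookup(config, path):
    """Follow a colon-separated path of keys through nested dicts."""
    node = config
    for key in path.split(":"):
        node = node[key]  # raises KeyError if the host lacks the key
    return node


def resolve_refs(value, host_config):
    """Recursively replace MRSM{...} references with host values."""
    if isinstance(value, dict):
        return {k: resolve_refs(v, host_config) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_refs(v, host_config) for v in value]
    if not isinstance(value, str):
        return value
    whole = MRSM_REF.fullmatch(value)
    if whole:
        # The whole string is one reference: keep the referenced type.
        return lookup(host_config, whole.group(1))
    # Otherwise substitute references embedded inside a longer string.
    return MRSM_REF.sub(lambda m: str(lookup(host_config, m.group(1))), value)


# Hypothetical host configuration holding the two connectors from earlier.
host = {
    "meerschaum": {
        "connectors": {
            "sql": {"remote": {"uri": "postgresql://user:pass@host:5432/db"}},
            "foo": {"bar": {"username": "abc", "password": "def"}},
        }
    }
}

# The connectors section of the compose file, as a Python dict.
project = {
    "config": {
        "meerschaum": {
            "instance": "sql:app",
            "connectors": {
                "sql": {"app": "MRSM{meerschaum:connectors:sql:remote}"},
                "foo": {"bar": "MRSM{meerschaum:connectors:foo:bar}"},
            },
        }
    }
}
resolved = resolve_refs(project, host)
```

The secrets stay in the host environment, so the compose file itself can be committed safely.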

Plugins

Meerschaum is easily extendable via plugins, which are Python modules. Plugins may fetch data, implement custom connectors, and/or extend Meerschaum (e.g. custom actions, flags, API endpoints, etc.).

Meerschaum supports multiple plugins directories (via MRSM_PLUGINS_DIR), which may be set under the plugins_dir key in mrsm-compose.yaml (the default is a directory named plugins).

Storing your plugins within a Compose project makes it clear how you expect your plugins to be used. For example, the Compose file within the MongoDBConnector project demonstrates how the custom connector is used both as a connector and as an instance.

Package Management

When you first start using Meerschaum Compose, the first thing you'll notice is that it starts installing a fair number of Python packages. Don't worry about your environment: everything is installed into virtual environments within your project's root subdirectory (a bit ironic, right?). You can install your plugins' dependencies with mrsm compose init.

To share packages between projects, set the key root_dir in mrsm-compose.yml to a new path. Deleting this root directory will effectively uninstall all of the packages that Compose downloaded, keeping your host environment intact.


