AWS Glue with custom Python libraries

Glue jobs are a great way to run serverless ETL jobs in AWS. A job runs on PySpark, which provides the ability to run work in parallel. The base is just a Python environment (Glue 0.9 = Python 2; Glue 2.0 and 3.0 = Python 3).

Glue provides a set of pre-installed Python packages such as boto3 and pandas. The full list can be found on the AWS website.

Custom Packages

There are multiple ways to add your own custom packages. The first one is the built-in installer, enabled with the following job argument:

--additional-python-modules = wantedmodule==1.1.1,anothermodule

Internally, this calls pip install with the listed packages as parameters.
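If the job is defined from code, the argument value can be assembled from a list of requirements. The helper below is only a sketch: the name additional_python_modules is made up for this example, and the check for commas inside a specifier is an assumption based on the value being comma-separated.

```python
def additional_python_modules(requirements):
    """Build the --additional-python-modules job argument from pip specifiers.

    Hypothetical helper, not part of any AWS library.
    """
    cleaned = [spec.strip() for spec in requirements if spec.strip()]
    for spec in cleaned:
        # Assumption: a comma inside one specifier (e.g. "pkg>=1,<2") would be
        # ambiguous, because the commas separate the modules in the value.
        if "," in spec:
            raise ValueError(f"specifier contains a comma: {spec!r}")
    return {"--additional-python-modules": ",".join(cleaned)}


# The two modules used as an example in this post.
default_arguments = additional_python_modules(["wantedmodule==1.1.1", "anothermodule"])
```

The resulting dictionary has the shape that boto3's glue.create_job accepts as DefaultArguments, so it can be merged with the other job arguments there.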

Advanced options

In situations where you want to use your own pypi.org mirror, or need a specific pip command-line option, there is another argument with little documentation:

--python-modules-installer-option = --index-url=https://my.local.mirror.com/simple/ --extra-index-url=https://my.local.mirror.com/sdp-python-snapshots/simple/ --trusted-host=my.local.mirror.com

This makes pip use the local repository my.local.mirror.com, which is useful when egress is restricted.
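The option string gets long quickly, so it can help to build it from its parts. This is a minimal sketch; installer_options is a hypothetical helper, and adding a --trusted-host entry for every mirror host simply mirrors the example above rather than a requirement.

```python
from urllib.parse import urlparse


def installer_options(index_url, extra_index_urls=(), trust_hosts=True):
    """Build the --python-modules-installer-option job argument.

    Hypothetical helper: it only concatenates standard pip options.
    """
    parts = [f"--index-url={index_url}"]
    parts += [f"--extra-index-url={url}" for url in extra_index_urls]
    if trust_hosts:
        hosts = []
        for url in (index_url, *extra_index_urls):
            host = urlparse(url).hostname
            # List every host once, in the order it first appears.
            if host and host not in hosts:
                hosts.append(host)
        parts += [f"--trusted-host={host}" for host in hosts]
    return {"--python-modules-installer-option": " ".join(parts)}


# Same mirror URLs as in the argument above.
mirror_arguments = installer_options(
    "https://my.local.mirror.com/simple/",
    ["https://my.local.mirror.com/sdp-python-snapshots/simple/"],
)
```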

Pre-package

If you want to control the whole process without external services, you can package all the required dependencies into a zip file and place it in an S3 bucket.

Example that downloads all the packages and creates modules.zip:

requirements.txt

wantedmodule==1.1.1
anothermodule

pip -q install --prefer-binary --no-deps --isolated --only-binary=:all: --platform linux_x86_64 --platform manylinux_2_17_x86_64 --platform manylinux_2_5_x86_64 --platform manylinux2014_x86_64 --platform any  --python-version 37  -r requirements.txt -t dependency/
mkdir -p artifacts
cd dependency && zip -qr9 ../artifacts/deps.zip * -x '*.whl' -x '*/__pycache__/*'
aws s3 cp ../artifacts/deps.zip s3://my-deploy-bucket/python_modules/modules.zip
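If the zip tool is not available on the build machine, the packaging step can also be done from Python. The sketch below assumes the same layout as the commands above (a dependency/ directory as input, an archive outside that directory as output). zip_dependencies is a name chosen for this example.

```python
import zipfile
from pathlib import Path


def zip_dependencies(source_dir, zip_path):
    """Zip an installed dependency directory, skipping wheels and bytecode caches."""
    source = Path(source_dir)
    target = Path(zip_path)
    # Create the output directory (e.g. artifacts/) when it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(
        target, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as archive:
        for path in sorted(source.rglob("*")):
            if not path.is_file():
                continue
            # Same exclusions as the zip command: *.whl and __pycache__ folders.
            if path.suffix == ".whl" or "__pycache__" in path.parts:
                continue
            # Store paths relative to the source so packages sit at the zip root.
            archive.write(path, path.relative_to(source).as_posix())
    return target


if __name__ == "__main__":
    zip_dependencies("dependency", "artifacts/deps.zip")
```

Keep the archive outside the source directory, otherwise the half-written zip could be picked up while walking the tree.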

Then provide the following argument to the Glue job:

--extra-py-files = s3://my-deploy-bucket/python_modules/modules.zip

The zip file will be extracted to /tmp/modules.zip/ and is available on the Python path. It works fine for source packages, but I had an issue loading binary libraries from the /tmp path.
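Because binary libraries were the problem here, it can be worth checking the archive for compiled files before uploading it. The snippet below is a rough sketch: find_binary_members is a hypothetical helper, and the list of suffixes is an assumption about what counts as a binary library.

```python
import zipfile

# Assumed suffixes of compiled extension modules and shared libraries.
BINARY_SUFFIXES = (".so", ".pyd", ".dll", ".dylib")


def find_binary_members(zip_path):
    """Return the archive members that look like compiled libraries."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    return sorted(
        name
        for name in names
        # ".so." also catches versioned shared objects such as libfoo.so.1.
        if name.lower().endswith(BINARY_SUFFIXES) or ".so." in name.lower()
    )
```

An empty result suggests the archive only contains source packages. Anything that is listed is a candidate for the built-in installer instead.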