AWS Glue with custom Python libraries


Glue jobs are a great way to run serverless ETL jobs in AWS. A job runs on PySpark, which provides the ability to run work in parallel. The base is just a Python environment (Glue 0.9 uses Python 2; Glue 2.0 and 3.0 use Python 3).

Glue provides a set of pre-installed Python packages such as boto3 and pandas. The full list can be found on the AWS website.

Custom Packages

There are multiple ways to add your own custom packages. The first is the built-in installer, enabled with the following job argument:

--additional-python-modules = wantedmodule==1.1.1,anothermodule

Internally, this calls pip install with the listed packages as parameters.
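When a job is started from code rather than the console, the same argument goes into the job's argument map. The following is a minimal sketch, assuming boto3 is used to start the job; the helper name `additional_modules_argument` and the job name `my-glue-job` are hypothetical.

```python
def additional_modules_argument(modules):
    """Build the Glue job argument for pip-installed modules.

    `modules` is a list of pip requirement strings. They are joined
    with commas and no spaces, matching the argument shown above.
    """
    cleaned = [m.strip() for m in modules if m.strip()]
    return {"--additional-python-modules": ",".join(cleaned)}


arguments = additional_modules_argument(["wantedmodule==1.1.1", "anothermodule"])
print(arguments)

# Assumed usage with boto3 (not executed here; "my-glue-job" is a placeholder):
# import boto3
# boto3.client("glue").start_job_run(JobName="my-glue-job", Arguments=arguments)
```

Pinning versions in the list (as with `wantedmodule==1.1.1`) keeps job runs reproducible, since pip would otherwise pick whatever version is newest at start-up.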

Advanced options

For situations where you want to use your own pypi.org mirror, or need a specific pip command-line option, there is another argument with little documentation:

--python-modules-installer-option = --index-url=https://my.local.mirror.com/simple/ --extra-index-url=https://my.local.mirror.com/sdp-python-snapshots/simple/ --trusted-host=my.local.mirror.com

This makes pip use the local repository my.local.mirror.com, which is useful when egress is restricted.
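The option string is easy to get wrong by hand, so it can help to assemble it from its parts. This is only a sketch: the helper name `installer_option` is hypothetical, and the URLs are the placeholder mirror from the example above.

```python
def installer_option(index_url, extra_index_urls=(), trusted_hosts=()):
    """Build the Glue job argument that forwards options to pip.

    The pip options are joined with single spaces into one string.
    """
    parts = [f"--index-url={index_url}"]
    parts += [f"--extra-index-url={url}" for url in extra_index_urls]
    parts += [f"--trusted-host={host}" for host in trusted_hosts]
    return {"--python-modules-installer-option": " ".join(parts)}


option = installer_option(
    "https://my.local.mirror.com/simple/",
    extra_index_urls=["https://my.local.mirror.com/sdp-python-snapshots/simple/"],
    trusted_hosts=["my.local.mirror.com"],
)
print(option["--python-modules-installer-option"])
```

The resulting dictionary can be merged with the other job arguments before the job is created or started.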

Pre-package

If you want to control the whole process without external services, you can package all the required dependencies into a zip file and place it in an S3 bucket.

The following example downloads all the packages and creates a modules.zip:

requirements.txt

wantedmodule==1.1.1
anothermodule

Terminal window

pip -q install --prefer-binary --no-deps --isolated --only-binary=:all: --platform linux_x86_64 --platform manylinux_2_17_x86_64 --platform manylinux_2_5_x86_64 --platform manylinux2014_x86_64 --platform any --python-version 37 -r requirements.txt -t dependency/
mkdir -p artifacts
cd dependency && zip -qr9 ../artifacts/deps.zip * -x '*.whl' -x '**/__pycache__/*'
aws s3 cp ../artifacts/deps.zip s3://my-deploy-bucket/python_modules/modules.zip
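If the zip tool is not available on the build machine, the packaging step can also be done with Python's standard library. The sketch below assumes the same layout as the commands above (a `dependency/` directory as input) and applies the same exclusions for wheels and `__pycache__` directories; the function name `build_modules_zip` is hypothetical.

```python
import zipfile
from pathlib import Path


def build_modules_zip(source_dir, zip_path):
    """Zip everything under source_dir, skipping wheels and bytecode caches.

    Returns the list of archive member names that were written.
    """
    source = Path(source_dir)
    target = Path(zip_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    added = []
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED, compresslevel=9) as archive:
        for path in sorted(source.rglob("*")):
            if not path.is_file():
                continue
            if path.suffix == ".whl" or "__pycache__" in path.parts:
                continue
            # Store paths relative to source_dir so packages sit at the zip root.
            relative = path.relative_to(source).as_posix()
            archive.write(path, relative)
            added.append(relative)
    return added


# Example call, assuming pip has already filled dependency/:
# build_modules_zip("dependency", "artifacts/deps.zip")
```

Keeping the packages at the root of the archive matters, because the zip itself is what ends up on the Python path.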

Then provide the following argument to the Glue job:

--extra-py-files = s3://my-deploy-bucket/python_modules/modules.zip

The zip file will be extracted in /tmp/modules.zip/ and is available on the Python path. It works fine for source packages, but I had an issue loading binary libraries from the /tmp path.
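One way to spot packages that might run into this before deploying is to scan the archive for compiled libraries. This is a rough sketch rather than a definitive check: the suffix list is an assumption about what counts as a binary library, and the function name `binary_members` is hypothetical.

```python
import zipfile

# Assumed set of file suffixes that indicate compiled libraries.
BINARY_SUFFIXES = (".so", ".pyd", ".dylib", ".dll")


def binary_members(zip_path):
    """Return the archive members that look like compiled libraries."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name
            for name in archive.namelist()
            # ".so." also catches versioned names such as libfoo.so.1
            if name.lower().endswith(BINARY_SUFFIXES) or ".so." in name
        ]


# Example call against the archive built earlier:
# print(binary_members("artifacts/deps.zip"))
```

An empty result suggests the archive holds only source packages; any hits are candidates for installing through the built-in installer instead.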