AWS Glue with custom Python libraries
Glue jobs are a great way to run serverless ETL jobs in AWS. A job runs on PySpark, which provides the ability to run work in parallel. The base is just a Python environment (Glue 0.9 uses Python 2; Glue 2.0 and 3.0 use Python 3).
Glue provides a set of pre-installed Python packages such as boto3 and pandas. The full list can be found on the AWS website.
Custom Packages
There are multiple ways to add your own custom packages. The first one is the built-in installer, enabled with the following argument to the job:
--additional-python-modules
= wantedmodule==1.1.1,anothermodule
Internally, this calls pip install with the packages as parameters.
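As a minimal sketch of how this argument could be assembled in Python: the helper name `build_module_arguments` is hypothetical, and the module pins are just the placeholders used above. It only builds the argument mapping; the commented lines show one assumed way to hand it to a job run through boto3.

```python
def build_module_arguments(modules):
    """Build the Glue job argument that asks Glue to pip-install extra modules.

    `modules` is a list of pip requirement strings (placeholder names below).
    """
    # The value is a single comma-separated string, as in the example above.
    return {"--additional-python-modules": ",".join(m.strip() for m in modules)}


arguments = build_module_arguments(["wantedmodule==1.1.1", "anothermodule"])
print(arguments)

# Assumed usage with boto3 (needs AWS credentials, so it is left commented out;
# "my-job" is a hypothetical job name):
# import boto3
# glue = boto3.client("glue")
# glue.start_job_run(JobName="my-job", Arguments=arguments)
```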
Advanced options
In situations where you want to use your own pypi.org mirror, or need a specific command-line option from pip, there is another argument with little documentation:
--python-modules-installer-option
= --index-url=https://my.local.mirror.com/simple/ --extra-index-url=https://my.local.mirror.com/sdp-python-snapshots/simple/ --trusted-host=my.local.mirror.com
This makes pip use the local repository my.local.mirror.com, which is useful in case of egress limitations.
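The option string can get long, so here is a small sketch that composes it from its parts. The helper `build_installer_option` is hypothetical; the URLs and host are the placeholder mirror from the example above, and the flags are ordinary pip options.

```python
def build_installer_option(index_url, extra_index_urls=(), trusted_hosts=()):
    """Compose the pip options that Glue passes through to its module installer."""
    parts = [f"--index-url={index_url}"]
    parts += [f"--extra-index-url={url}" for url in extra_index_urls]
    parts += [f"--trusted-host={host}" for host in trusted_hosts]
    # Glue takes the pip options as one space-separated string.
    return {"--python-modules-installer-option": " ".join(parts)}


installer_arguments = build_installer_option(
    "https://my.local.mirror.com/simple/",
    extra_index_urls=["https://my.local.mirror.com/sdp-python-snapshots/simple/"],
    trusted_hosts=["my.local.mirror.com"],
)
print(installer_arguments["--python-modules-installer-option"])
```

The resulting mapping can be merged with the `--additional-python-modules` argument into the same set of job arguments.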
Pre-package
If you want to control the whole process without external services, it’s possible to package all the required dependencies into a zip file and place it in an S3 bucket.
For example, download all the packages listed in requirements.txt and bundle them into a modules.zip.
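A minimal sketch of that packaging step in Python, assuming a requirements.txt in the working directory and a scratch directory `build/modules` (both paths, and the two function names, are illustrative rather than anything Glue requires). It installs the requirements into a target directory with pip and then zips that directory so the packages sit at the root of the archive.

```python
import subprocess
import sys
import zipfile
from pathlib import Path


def install_requirements(requirements_file, target_dir):
    """Install everything listed in requirements_file into target_dir using pip."""
    subprocess.run(
        [sys.executable, "-m", "pip", "install",
         "--requirement", str(requirements_file),
         "--target", str(target_dir)],
        check=True,  # raise if pip exits with an error
    )


def zip_directory(source_dir, zip_path):
    """Zip the contents of source_dir, with paths relative to source_dir."""
    source_dir = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(source_dir).as_posix())
    return zip_path


# Assumed usage (runs pip, so it needs access to a package index):
# install_requirements("requirements.txt", "build/modules")
# zip_directory("build/modules", "modules.zip")
```

The resulting modules.zip still has to be uploaded to the bucket, for example with the AWS CLI or boto3.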
Then provide the following argument to the Glue job:
--extra-py-files
= s3://my-deploy-bucket/python_modules/modules.zip
The zip file will be extracted in /tmp/modules.zip/
and is available on the Python path.
It works fine for source (pure-Python) packages, but I had an issue loading binary libraries from the /tmp path.
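One way to spot such packages before deploying is to scan the archive for compiled files. This is only a sketch, under the assumption that binary libraries can be recognised by their file extension; the helper name and the suffix list are illustrative, not exhaustive.

```python
import zipfile

# File endings that usually indicate compiled code (assumed, not exhaustive).
BINARY_SUFFIXES = (".so", ".pyd", ".dll", ".dylib")


def find_binary_members(zip_path):
    """Return the archive entries that look like compiled libraries."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name for name in archive.namelist()
            # Also match versioned shared objects such as libfoo.so.1
            if name.lower().endswith(BINARY_SUFFIXES) or ".so." in name.lower()
        ]


# Assumed usage against the archive built earlier:
# for member in find_binary_members("modules.zip"):
#     print("binary file, may not load from the zip:", member)
```

If the list is not empty, installing those packages through the built-in installer instead is probably the safer route.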