Python and modules for ETL jobs on AWS
Python is a powerfull tool to run ETL jobs on AWS. The 3 main ways are Lambda, Glue (pyspark) and Glue Pythonshell. Lambda can run max 15 minutes. Glue Pythonshell is a simple environment for run python scripts, the Glue Spark provides a full serverless PySpark environment to run complex scripts.
Other systems
There are many ways to run scripts on AWS, there are out of scope for now.
- AWS Batch
- Fargate
- EKS
- Amazon Managed Workflows for Apache Airflow
- anything custom.
Adding packages options
While python is powerfull, you need additional libraries for your code. The most common for ETL processing are NumPy and Pandas. But many others are available for retrieving api’s, parsing files, etc.
The installation differs among the job types. On your local development you might use pip or poetry. This will download the files from the PyPi repo and in some cases this comes with precompiled libraries or c/rust code that’s required to compile.
These are the available methods:
Type | Description |
---|---|
Zip File | Package additional libraries as a zip file |
C/Binary | Add Packages with Binaries/Compiled (like TeraData or pandas) |
pip install | Run a pip install before start |
pip install private repo | Use Pip install with a private repo |
Library | Add a managed library with a set of package |
This table shows the support options:
Type | Zip file | C/binary | install | install private repo | Library |
---|---|---|---|---|---|
Lambda | :white_check_mark: | :white_check_mark: | :x: | :x: | any layer |
Glue 2 | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: | |
Glue 3 | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: | |
Glue Pythonshell 3.6 | :white_check_mark: | :x: | :white_check_mark: | :x: | |
Glue Pythonshell 3.9 | :white_check_mark: | :x: | :white_check_mark: | :x: | analytics |
Available packages
Luckily some packages are installed by default. I compiled an overview for each of the types:
As per 28th Oct 2022 the following packages are available in the runtimes by default. AWSSDKPandas is a lambda layer (formally AwsWrangler) that is now available as an AWSLayer.
Package | glue2 | glue3 | glue pythonshell | glue pythonshell + analytics | lambda3.7 | lambda3.8 | lambda3.9 | AWSSDKPandas-2.17 |
---|---|---|---|---|---|---|---|---|
Python | 3.7 | 3.7 | 3.6 | 3.9 | 3.7 | 3.8 | 3.9 | |
Cython | 0.29.15 | |||||||
PyMySQL | 0.9.3 | 1.0.2 | 1.0.2 | |||||
SQLAlchemy | 1.4.36 | |||||||
aenum | 3.1.11 | |||||||
aiobotocore | 1.4.2 | |||||||
aiohttp | 3.8.1 | 3.8.1 | ||||||
aioitertools | 0.10.0 | |||||||
aiosignal | 1.2.0 | 1.2.0 | ||||||
asn1crypto | 1.5.1 | |||||||
async-timeout | 4.0.2 | 4.0.2 | ||||||
asynctest | 0.13.0 | |||||||
attrs | 21.4.0 | 22.1.0 | ||||||
avro | 1.11.0 | |||||||
avro-python3 | 1.10.2 | |||||||
awscli | 1.23.5 | 1.23.5 | ||||||
awsgluecustomconnectorpython | 1.0 | |||||||
awsgluedataplanepython | 1.0 | |||||||
awsgluemlentitydetectorwrapperpython | 1.0 | |||||||
awswrangler | 2.15.1 | 2.17.0 | ||||||
backoff | 2.1.2 | |||||||
beautifulsoup4 | 4.11.1 | |||||||
boto3 | 1.12.4 | 1.18.50 | 1.22.5 | 10.20.32 | 10.20.32 | 10.20.32 | ||
botocore | 1.15.4 | 1.21.50 | 1.23.5 | 1.23.5 | 1.23.32 | 1.23.32 | 1.23.32 | |
certifi | 2019.11.28 | 2021.5.30 | 2022.9.14 | |||||
chardet | 3.0.4 | 3.0.4 | ||||||
charset-normalizer | 2.1.0 | 2.0.12 | ||||||
click | 8.1.3 | |||||||
cycler | 0.10.0 | 0.10.0 | ||||||
cython | 0.29.4 | |||||||
decorator | 5.1.1 | |||||||
docutils | 0.15.2 | 0.17.1 | ||||||
elasticsearch | 8.2.0 | |||||||
enum34 | 1.1.9 | 1.1.10 | ||||||
et-xmlfile | 1.1.0 | |||||||
frozenlist | 1.3.0 | 1.3.1 | ||||||
fsspec | 0.6.2 | 2021.8.1 | ||||||
gremlinpython | 3.6.1 | |||||||
idna | 2.9 | 2.10 | 3.4 | |||||
importlib-metadata | 4.12.0 | |||||||
isodate | 0.6.1 | |||||||
jmespath | 0.9.4 | 0.10.0 | 1.0.1 | |||||
joblib | 0.14.1 | 1.0.1 | ||||||
jsonpath-ng | 1.5.3 | |||||||
kiwisolver | 1.1.0 | 1.3.2 | ||||||
lxml | 4.9.1 | |||||||
matplotlib | 3.1.3 | 3.4.3 | ||||||
mpmath | 1.1.0 | 1.2.1 | ||||||
multidict | 6.0.2 | 6.0.2 | ||||||
nest-asyncio | 1.5.5 | |||||||
nltk | 3.6.3 | |||||||
numpy | 1.18.1 | 1.19.5 | 1.22.3 | 1.23.3 | ||||
openpyxl | 3.0.10 | |||||||
opensearch-py | 2.0.0 | |||||||
packaging | 21.3 | 21.3 | ||||||
pandas | 1.0.1 | 1.3.2 | 1.4.2 | 1.5.0 | ||||
patsy | 0.5.1 | 0.5.1 | ||||||
pg8000 | 1.29.1 | |||||||
pillow | 9.1.1 | |||||||
pip | 22.1.2 | |||||||
ply | 3.11 | |||||||
pmdarima | 1.5.3 | 1.8.2 | ||||||
progressbar2 | 4.0.0 | |||||||
psycopg2 | 2.9.3 | |||||||
ptvsd | 4.3.2 | 4.3.2 | ||||||
pyarrow | 0.16.0 | 5.0.0 | 8.0.0 | |||||
pyathena | 2.5.3 | |||||||
pydevd | 1.9.0 | 2.5.0 | ||||||
pyhocon | 0.3.54 | 0.3.58 | ||||||
pymysql | 1.0.2 | |||||||
pyodbc | 4.0.32 | |||||||
pyorc | 0.6.0 | |||||||
pyparsing | 2.4.6 | 2.4.7 | 3.0.9 | |||||
python-dateutil | 2.8.1 | 2.8.2 | 2.8.2 | |||||
python-utils | 3.3.3 | |||||||
pytz | 2019.3 | 2021.1 | 2022.1 | |||||
pyyaml | 5.4.1 | |||||||
redshift-connector | 2.0.907 | 2.0.908 | ||||||
regex | 2022.6.2 | |||||||
requests | 2.23.0 | 2.23.0 | 2.27.1 | 2.28.0 | ||||
requests-aws4auth | 1.1.2 | |||||||
s3fs | 0.4.0 | 2021.8.1 | 2022.3.0 | |||||
s3transfer | 0.3.3 | 0.5.0 | 0.6.0 | |||||
scikit-learn | 0.22.1 | 0.24.2 | 1.0.2 | |||||
scipy | 1.4.1 | 1.7.1 | 1.8.0 | |||||
scramp | 1.4.1 | |||||||
setuptools | 45.2.0 | 49.1.3 | ||||||
six | 1.14.0 | 1.16.0 | 1.16.0 | |||||
soupsieve | 2.3.2.post1 | |||||||
spark | 1.0 | |||||||
statsmodels | 0.11.1 | 0.12.2 | ||||||
subprocess32 | 3.5.4 | 3.5.4 | ||||||
sympy | 1.5.1 | 1.8 | ||||||
tbats | 1.0.9 | 1.1.0 | ||||||
threadpoolctl | 3.1.0 | |||||||
tqdm | 4.64.0 | |||||||
typing-extensions | 4.2.0 | |||||||
urllib3 | 1.25.8 | 1.25.11 | 1.26.12 | |||||
wheel | 0.37.0 | |||||||
wrapt | 1.14.1 | |||||||
yarl | 1.7.2 | 1.8.1 | ||||||
zipp | 3.8.0 |
Documentation:
- https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
- https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html
- https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html
- https://aws-sdk-pandas.readthedocs.io/en/2.17.0/install.html
Adding packages
Glue
—additional-python-modules
Both glue types support the parameter —additional-python-modules, this installs the python module before executing the code. It support the pip format for package.
Example: --additional-python-modules zipp,scikit-learn==0.21.3
:exclamation: in the aws documentation it mention also the option to install an S3 packages with this parameter, but this isn’t supported.
--additional-python-modules s3://aws-glue-native-spark/tests/j4.2/fbprophet-0.6-py3-none-any.whl
This results in can’t install /tmp/pyglue/s3://aws-glue-native-spark/tests/j4.2/fbprophet-0.6-py3-none-any.whl
—python-modules-installer-option
This supported all the paramaters from pip. It give the option to use a private pip repo.
Example: --python-modules-installer-option --index-url=https://user:[email protected]/artifactory/api/pypi/python-hosted/simple --extra-index-url=https://user:[email protected]/artifactory/api/pypi/python-hosted/simple --trusted-host=repo.local
:exclamation: this is only supported on Glue ETL jobs.
https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-python-extra-library
library-set
library-set
, no --
prefix, is a new parameter, for pythonshell jobs 3.9.
The only available library at the moment is: analytics
Modules.zip
Create a requirements.txt
Run this to get and build the packages wil the linux dependencies for python 3.9 If you run the pip install on a mac without docker, you end up with mac binaries that don’t work in the etl job.
Lambda Layers
AWSSDKPandas formally AWSWrangler is now available to select as a default AWSLayer, available in all regions.
Custom Layers
todo.