Tech, Cloud and Programming

Python and modules for ETL jobs on AWS


Python is a powerful tool for running ETL jobs on AWS. The three main options are Lambda, Glue (PySpark) and Glue Pythonshell. Lambda can run for at most 15 minutes per invocation. Glue Pythonshell is a simple environment for running Python scripts, while Glue Spark provides a full serverless PySpark environment for complex scripts.
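Because a Lambda invocation is cut off at 15 minutes, a handler that works through a batch can watch its remaining time and stop early. This is a minimal sketch, assuming the event carries a hypothetical `items` list and that `process_item` stands in for the real transform step; `context.get_remaining_time_in_millis()` is the method the Lambda Python runtime offers on the context object.

```python
SAFETY_MARGIN_MS = 30_000  # assumed buffer: stop when less than 30 s remain


def process_item(item):
    """Hypothetical placeholder for the real transform/load step."""
    return str(item).upper()


def handler(event, context):
    """Process items until done or until the time budget is nearly used up."""
    items = event.get("items", [])
    done = []
    for item in items:
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break
        done.append(process_item(item))
    # Return what is left so a follow-up invocation can continue.
    return {"processed": done, "remaining": items[len(done):]}
```

Returning the unprocessed items is only one possible design; work that regularly runs into the limit is probably a better fit for one of the Glue job types.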

Other systems

There are many other ways to run scripts on AWS; these are out of scope for now.

  • AWS Batch
  • Fargate
  • EKS
  • Amazon Managed Workflows for Apache Airflow
  • anything custom.

Options for adding packages

While Python is powerful, you usually need additional libraries for your code. The most common for ETL processing are NumPy and Pandas, but many others are available for calling APIs, parsing files, etc.

The installation differs among the job types. In local development you might use pip or Poetry. These download the files from the PyPI repository; some packages come with precompiled libraries, others with C/Rust source code that has to be compiled.
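Whether a download contains compiled code can be read from the wheel filename, whose last three dash-separated fields are the Python tag, the ABI tag and the platform tag. The sketch below only parses filenames; the function names are made up for this illustration, and the pandas filename is an example of a Linux build rather than something taken from a job.

```python
def wheel_tags(filename):
    """Split a wheel filename into its python, abi and platform tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel name: " + filename)
    # The last three fields are always python tag, abi tag, platform tag.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {"python": python_tag, "abi": abi_tag, "platform": platform_tag}


def is_pure_python(filename):
    """A wheel tagged 'none-any' has no compiled code and runs on any platform."""
    tags = wheel_tags(filename)
    return tags["abi"] == "none" and tags["platform"] == "any"


if __name__ == "__main__":
    for name in (
        "fbprophet-0.6-py3-none-any.whl",
        "pandas-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    ):
        print(name, "pure python" if is_pure_python(name) else "platform specific")
```

A platform-specific wheel has to match the operating system and Python version of the job runtime, which is why the installation method matters.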

These are the available methods:

Type                        Description
Zip file                    Package additional libraries as a zip file
C/Binary                    Add packages with binaries/compiled code (like Teradata or pandas)
pip install                 Run a pip install before the job starts
pip install private repo    Use pip install with a private repo
Library                     Add a managed library with a set of packages

This table shows the support options:

Type                    Zip file    C/binary    pip install    pip install private repo    Library
Lambda                                                                                     any layer
Glue 2
Glue 3
Glue Pythonshell 3.6
Glue Pythonshell 3.9                                                                       analytics

Available packages

Luckily some packages are installed by default. I compiled an overview for each of the types:

As of 28 October 2022, the following packages are available in the runtimes by default. AWSSDKPandas (formerly AwsWrangler) is a Lambda layer that is now available as an AWS layer.

Columns, left to right: glue2, glue3, glue pythonshell, glue pythonshell + analytics, lambda3.7, lambda3.8, lambda3.9, AWSSDKPandas-2.17
Each row lists the versions in column order; runtimes that do not include the package are skipped.

Python: 3.7, 3.7, 3.6, 3.9, 3.7, 3.8, 3.9
Cython: 0.29.15
PyMySQL: 0.9.3, 1.0.2, 1.0.2
SQLAlchemy: 1.4.36
aenum: 3.1.11
aiobotocore: 1.4.2
aiohttp: 3.8.1, 3.8.1
aioitertools: 0.10.0
aiosignal: 1.2.0, 1.2.0
asn1crypto: 1.5.1
async-timeout: 4.0.2, 4.0.2
asynctest: 0.13.0
attrs: 21.4.0, 22.1.0
avro: 1.11.0
avro-python3: 1.10.2
awscli: 1.23.5, 1.23.5
awsgluecustomconnectorpython: 1.0
awsgluedataplanepython: 1.0
awsgluemlentitydetectorwrapperpython: 1.0
awswrangler: 2.15.1, 2.17.0
backoff: 2.1.2
beautifulsoup4: 4.11.1
boto3: 1.12.4, 1.18.50, 1.22.5, 1.20.32, 1.20.32, 1.20.32
botocore: 1.15.4, 1.21.50, 1.23.5, 1.23.5, 1.23.32, 1.23.32, 1.23.32
certifi: 2019.11.28, 2021.5.30, 2022.9.14
chardet: 3.0.4, 3.0.4
charset-normalizer: 2.1.0, 2.0.12
click: 8.1.3
cycler: 0.10.0, 0.10.0
cython: 0.29.4
decorator: 5.1.1
docutils: 0.15.2, 0.17.1
elasticsearch: 8.2.0
enum34: 1.1.9, 1.1.10
et-xmlfile: 1.1.0
frozenlist: 1.3.0, 1.3.1
fsspec: 0.6.2, 2021.8.1
gremlinpython: 3.6.1
idna: 2.9, 2.10, 3.4
importlib-metadata: 4.12.0
isodate: 0.6.1
jmespath: 0.9.4, 0.10.0, 1.0.1
joblib: 0.14.1, 1.0.1
jsonpath-ng: 1.5.3
kiwisolver: 1.1.0, 1.3.2
lxml: 4.9.1
matplotlib: 3.1.3, 3.4.3
mpmath: 1.1.0, 1.2.1
multidict: 6.0.2, 6.0.2
nest-asyncio: 1.5.5
nltk: 3.6.3
numpy: 1.18.1, 1.19.5, 1.22.3, 1.23.3
openpyxl: 3.0.10
opensearch-py: 2.0.0
packaging: 21.3, 21.3
pandas: 1.0.1, 1.3.2, 1.4.2, 1.5.0
patsy: 0.5.1, 0.5.1
pg8000: 1.29.1
pillow: 9.1.1
pip: 22.1.2
ply: 3.11
pmdarima: 1.5.3, 1.8.2
progressbar2: 4.0.0
psycopg2: 2.9.3
ptvsd: 4.3.2, 4.3.2
pyarrow: 0.16.0, 5.0.0, 8.0.0
pyathena: 2.5.3
pydevd: 1.9.0, 2.5.0
pyhocon: 0.3.54, 0.3.58
pymysql: 1.0.2
pyodbc: 4.0.32
pyorc: 0.6.0
pyparsing: 2.4.6, 2.4.7, 3.0.9
python-dateutil: 2.8.1, 2.8.2, 2.8.2
python-utils: 3.3.3
pytz: 2019.3, 2021.1, 2022.1
pyyaml: 5.4.1
redshift-connector: 2.0.907, 2.0.908
regex: 2022.6.2
requests: 2.23.0, 2.23.0, 2.27.1, 2.28.0
requests-aws4auth: 1.1.2
s3fs: 0.4.0, 2021.8.1, 2022.3.0
s3transfer: 0.3.3, 0.5.0, 0.6.0
scikit-learn: 0.22.1, 0.24.2, 1.0.2
scipy: 1.4.1, 1.7.1, 1.8.0
scramp: 1.4.1
setuptools: 45.2.0, 49.1.3
six: 1.14.0, 1.16.0, 1.16.0
soupsieve: 2.3.2.post1
spark: 1.0
statsmodels: 0.11.1, 0.12.2
subprocess32: 3.5.4, 3.5.4
sympy: 1.5.1, 1.8
tbats: 1.0.9, 1.1.0
threadpoolctl: 3.1.0
tqdm: 4.64.0
typing-extensions: 4.2.0
urllib3: 1.25.8, 1.25.11, 1.26.12
wheel: 0.37.0
wrapt: 1.14.1
yarl: 1.7.2, 1.8.1
zipp: 3.8.0
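Since the preinstalled versions change over time, it can help to print what a runtime actually ships before relying on a list like this one. A minimal sketch, assuming Python 3.8 or newer (where `importlib.metadata` is part of the standard library); the helper name `installed_versions` is made up for this example.

```python
from importlib import metadata  # standard library from Python 3.8


def installed_versions(names):
    """Return {name: version or None} for each distribution name."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this runtime.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions(["pandas", "numpy", "boto3"]).items():
        print(name, version or "not installed")
```

On the Python 3.6 and 3.7 runtimes this module is not in the standard library, so a different lookup would be needed there.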

Documentation:

Adding packages

Glue

--additional-python-modules

Both Glue types support the parameter --additional-python-modules, which installs the listed Python modules before executing the code. It accepts package names in the pip format.

Example: --additional-python-modules zipp,scikit-learn==0.21.3

❗ The AWS documentation also mentions the option to install a package from S3 with this parameter, but this isn't supported. --additional-python-modules s3://aws-glue-native-spark/tests/j4.2/fbprophet-0.6-py3-none-any.whl results in: can't install /tmp/pyglue/s3://aws-glue-native-spark/tests/j4.2/fbprophet-0.6-py3-none-any.whl
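When jobs are defined from code, the comma-separated value can be built from a list of requirements. This is a sketch with a made-up helper name; the resulting dict is meant to be passed as job arguments, for example as `DefaultArguments` in boto3's `create_job` or as `Arguments` in `start_job_run`.

```python
def additional_modules_argument(requirements):
    """Join pip-style requirement strings into the value for --additional-python-modules."""
    cleaned = [req.strip() for req in requirements if req.strip()]
    for req in cleaned:
        # A comma inside one requirement (e.g. "pandas>=1.0,<2.0") would be
        # read as a separator, so this sketch refuses it.
        if "," in req:
            raise ValueError("requirement contains a comma: " + req)
    return {"--additional-python-modules": ",".join(cleaned)}


if __name__ == "__main__":
    print(additional_modules_argument(["zipp", "scikit-learn==0.21.3"]))
```

Pinning exact versions, as in the example above, keeps the job from silently picking up a newer release on the next run.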

--python-modules-installer-option

This supports all the parameters of pip, which gives the option to use a private pip repo.

Example: --python-modules-installer-option --index-url=https://user:[email protected]/artifactory/api/pypi/python-hosted/simple --extra-index-url=https://user:[email protected]/artifactory/api/pypi/python-hosted/simple --trusted-host=repo.local

❗ this is only supported on Glue ETL jobs.
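The option string can be assembled in the same way as the module list. In this sketch the helper name and the host `pypi.example.internal` are hypothetical; credentials are left out on purpose, since job arguments are visible in the job definition.

```python
def installer_option_argument(index_url, extra_index_url=None, trusted_host=None):
    """Build the value for --python-modules-installer-option from pip options."""
    options = ["--index-url=" + index_url]
    if extra_index_url:
        options.append("--extra-index-url=" + extra_index_url)
    if trusted_host:
        options.append("--trusted-host=" + trusted_host)
    # The pip options are passed as one space-separated string.
    return {"--python-modules-installer-option": " ".join(options)}


if __name__ == "__main__":
    print(
        installer_option_argument(
            "https://pypi.example.internal/simple",  # hypothetical private index
            trusted_host="pypi.example.internal",
        )
    )
```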

https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-python-extra-library

library-set

library-set (no -- prefix) is a new parameter for Pythonshell 3.9 jobs. The only available library at the moment is: analytics
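As an illustration, a Pythonshell 3.9 job definition with this parameter could be put together as below. The helper name, job name, role ARN and script location are placeholders; the dict follows the keyword arguments of boto3's Glue `create_job`, and the call itself is left as a comment so the sketch runs without AWS credentials.

```python
def pythonshell_job_definition(name, role_arn, script_location):
    """Build keyword arguments for a Pythonshell 3.9 job using the analytics library set."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",
            "PythonVersion": "3.9",
            "ScriptLocation": script_location,
        },
        # Note: this key is written without the usual "--" prefix.
        "DefaultArguments": {"library-set": "analytics"},
    }


if __name__ == "__main__":
    definition = pythonshell_job_definition(
        "example-etl-job",  # hypothetical job name
        "arn:aws:iam::123456789012:role/example-glue-role",  # placeholder role
        "s3://example-bucket/scripts/job.py",  # placeholder script path
    )
    print(definition)
    # With credentials configured, the job could then be created with:
    #   import boto3
    #   boto3.client("glue").create_job(**definition)
```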

Modules.zip

Create a requirements.txt

pandas==1.4.2

Run this to download and build the packages with the Linux dependencies for Python 3.9. If you run pip install on a Mac without Docker, you end up with Mac binaries that don't work in the ETL job.

docker run --rm -v ${PWD}:/var/task \
  public.ecr.aws/sam/build-python3.9 \
  sh -c "pip install --no-deps -r requirements.txt -t python/lib/python3.9/site-packages/; exit"

zip -r ../module.zip python
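Before uploading, the archive can be checked for binaries that were built for the wrong platform. This sketch only looks at file names: it treats `.dylib`, `.dll` and `.pyd` files, and `.so` files with `darwin` in the name, as suspicious. That naming heuristic is an assumption, not an official check.

```python
import zipfile

# Suffixes of compiled files that are not Linux builds.
NON_LINUX_SUFFIXES = (".dylib", ".dll", ".pyd")


def non_linux_binaries(zip_path):
    """List archive members that look like macOS or Windows binaries."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    suspicious = []
    for name in names:
        lower = name.lower()
        # macOS extension modules are typically named like "lib.cpython-39-darwin.so".
        if lower.endswith(NON_LINUX_SUFFIXES) or (
            lower.endswith(".so") and "darwin" in lower
        ):
            suspicious.append(name)
    return suspicious
```

An empty result from `non_linux_binaries("module.zip")` does not prove the packages will import, but a non-empty one is a strong hint that pip ran outside the container.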

Lambda Layers

AWSSDKPandas, formerly AWSWrangler, is now available to select as a default AWS layer in all regions.
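Besides selecting the layer in the console, it can be attached from code. The sketch below only prepares the list of layer ARNs; the helper name is made up and the ARNs use a placeholder account ID, so copy the real layer ARN for your region from the console. The list would then go into the `Layers` parameter of boto3's Lambda `update_function_configuration`.

```python
MAX_LAYERS = 5  # Lambda allows at most five layers per function


def add_layer(current_layer_arns, new_layer_arn):
    """Return the layer list with new_layer_arn added, replacing other versions of that layer."""
    # A layer version ARN ends in ":<version>"; strip it to compare layers.
    base = new_layer_arn.rsplit(":", 1)[0]
    kept = [arn for arn in current_layer_arns if arn.rsplit(":", 1)[0] != base]
    layers = kept + [new_layer_arn]
    if len(layers) > MAX_LAYERS:
        raise ValueError("a function can use at most %d layers" % MAX_LAYERS)
    return layers


if __name__ == "__main__":
    # Placeholder ARNs for illustration only.
    current = ["arn:aws:lambda:eu-west-1:123456789012:layer:AWSSDKPandas-Python39:1"]
    new = "arn:aws:lambda:eu-west-1:123456789012:layer:AWSSDKPandas-Python39:2"
    print(add_layer(current, new))
    # With credentials configured, the result could be applied with:
    #   import boto3
    #   boto3.client("lambda").update_function_configuration(
    #       FunctionName="example-function", Layers=add_layer(current, new))
```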

AWS layer

Custom Layers

todo.