AWS Glue vulnerabilities in default packages


Securing AWS Glue: A Guide to Identifying and Fixing Python Package Vulnerabilities

Introduction

Did you know that the default Python packages in AWS Glue contain a number of known vulnerabilities? While instances, containers, and Lambda functions are often scanned by tools like Amazon Inspector, Trivy, and Snyk, data pipelines are frequently overlooked. Whether by accident or design, many data pipelines, often laden with Python code, interact with external systems and APIs to ingest data. Securing these pipelines is therefore just as important as securing any other part of your infrastructure.

In this post, I’ll walk you through how to enhance the security of your AWS Glue data pipelines. The first issue I encountered is that these pipelines often combine system and runtime dependencies with application code. AWS Glue and Apache Airflow both provide Python environments with pre-installed packages, along with the option to add custom ones.

AWS Glue

For this post, I’ll focus specifically on the AWS Glue environment.

AWS Glue allows you to create three types of jobs:

  1. Glue ETL (PySpark):
    Glue ETL Python Libraries

  2. Python Shell:
    Python Shell Jobs in AWS Glue

  3. Ray (not supported)

The Glue ETL job spins up an on-demand Spark environment, while the Python Shell job is more akin to a Lambda function. A Python Shell job doesn’t have Lambda’s 15-minute time limit, but it does have limited capacity.

Exporting System Requirements

While browsing the Glue documentation, I came across tables listing the pre-installed Python packages. I wrote a small program to parse these tables and export them to a requirements.txt file.

For a Python Shell job using Python 3.9, this is the output:

awscli==1.23.5
botocore==1.23.5

For Python Shell jobs, there’s also an option to set the library-set to analytics, which provides a set of commonly used packages, including the useful AWS SDK for pandas (awswrangler). However, note that the included version is fairly outdated:

avro==1.11.0
awscli==1.23.5
awswrangler==2.15.1
botocore==1.24.21
boto3==1.21.21
elasticsearch==8.2.0
numpy==1.22.3
pandas==1.4.2
psycopg2==2.9.3
pyathena==2.5.3
PyMySQL==1.0.2
pyodbc==4.0.32
pyorc==0.6.0
redshift-connector==2.0.907
requests==2.27.1
scikit-learn==1.0.2
scipy==1.8.0
SQLAlchemy==1.4.36
s3fs==2022.3.0

Now we have the system dependencies in a workable format.
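
As a rough illustration of the parsing step, here is a minimal sketch using only the Python standard library. It is not the program I actually used. It assumes a simplified table with the package name in the first column and the version in the second; the real documentation tables may be laid out differently, and the names TableRows, table_to_requirements, and SAMPLE are made up for this example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_requirements(html):
    """Turn (name, version) table rows into requirements.txt lines."""
    parser = TableRows()
    parser.feed(html)
    lines = []
    for row in parser.rows:
        # Skip header rows and anything whose second cell is not a version.
        if len(row) >= 2 and row[1][:1].isdigit():
            lines.append(f"{row[0]}=={row[1]}")
    return "\n".join(lines)


# Hypothetical, simplified stand-in for a documentation table.
SAMPLE = """
<table>
  <tr><th>Library</th><th>Version</th></tr>
  <tr><td>awscli</td><td>1.23.5</td></tr>
  <tr><td>botocore</td><td>1.23.5</td></tr>
</table>
"""

print(table_to_requirements(SAMPLE))
```

Writing the returned string to a requirements.txt file gives the same kind of listing as shown above.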

Run-Time Dependencies

AWS Glue also allows you to install additional packages at runtime using pip. You can extend or override the pre-installed Python packages as needed.

For more details, check the official AWS Glue Programming Python Libraries documentation.
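
To make this concrete: for Glue ETL jobs, runtime packages are typically passed as a comma-separated list in the --additional-python-modules job parameter. The sketch below shows one way to read that list from a job's arguments. The function name and the example values are made up for illustration, and the dictionary stands in for what you would find under Job.DefaultArguments in the response of boto3's Glue get_job call.

```python
def runtime_modules(default_arguments):
    """Return the modules listed in --additional-python-modules."""
    raw = default_arguments.get("--additional-python-modules", "")
    # The value is a comma-separated list; drop empty entries and whitespace.
    return [item.strip() for item in raw.split(",") if item.strip()]


# Example arguments as they might appear on a job definition (hypothetical values).
example_arguments = {
    "--additional-python-modules": "urllib3==2.3.0, cryptography==44.0.0",
    "--job-language": "python",
}

print(runtime_modules(example_arguments))
```

A job without the parameter simply yields an empty list, which means it runs on the pre-installed packages only.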

Glue Inspector

With the above information, I created a tool called Glue Inspector. It downloads the AWS system dependencies, caches them locally, and then retrieves runtime dependencies. These are merged into a list and exported as a CycloneDX Software Bill of Materials (SBOM) in JSON format.
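
The merge-and-export step can be sketched roughly as follows. This is a simplified illustration rather than Glue Inspector's actual code: it assumes plain name==version pins, lets runtime pins override system pins with the same name, and emits only the minimal CycloneDX fields. All function names are hypothetical.

```python
import json


def parse_requirements(text):
    """Map package name -> pinned version for simple 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def build_sbom(system_reqs, runtime_reqs):
    """Merge both requirement sets and wrap them in a minimal CycloneDX document."""
    merged = parse_requirements(system_reqs)
    merged.update(parse_requirements(runtime_reqs))  # runtime overrides system
    components = []
    for name, version in sorted(merged.items()):
        # Package URLs for PyPI use lowercase names with dashes.
        purl_name = name.lower().replace("_", "-")
        components.append({
            "type": "library",
            "name": name,
            "version": version,
            "purl": f"pkg:pypi/{purl_name}@{version}",
        })
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",
        "version": 1,
        "components": components,
    }


sbom = build_sbom(
    "urllib3==1.25.10\nbotocore==1.34.131",  # system dependencies
    "urllib3==2.3.0",                         # runtime override
)
print(json.dumps(sbom, indent=2))
```

Keying on the package name keeps the sketch short, but it cannot represent a runtime that ships the same package twice in different versions, so a real implementation needs a richer data model.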

To use Glue Inspector:

  1. Set your AWS credentials in the environment.
  2. Run the following command to inspect a Glue job:
glue-inspector inspect mygluejob --output mygluejob-sbom.json

You can then use the resulting SBOM to manage the software supply chain with tools like DependencyTrack, or scan for vulnerabilities using tools like Trivy:

trivy sbom mygluejob-sbom.json --scanners vuln,license --list-all-pkgs -d --format cyclonedx --output mygluejob-sbom-trivy.json

I’ve just released version 0.2.0 of Glue Inspector.

AWS Vulnerabilities in Glue

While working on this tool, I was surprised by the number of critical and high-severity vulnerabilities present in the default packages. I filed a report with AWS Security, and after weeks of waiting, I was told that the runtime is isolated and therefore not considered an AWS system issue. However, users are encouraged to update their packages as needed.

I believe more awareness is needed in this area.

Glue Runtime Vulnerabilities

Here’s an overview of vulnerabilities in the Glue runtimes:

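| Runtime                   | Critical | High | Medium | Low |
|---------------------------|----------|------|--------|-----|
| glueetl-2.0               | 5        | 12   | 12     | 1   |
| glueetl-3.0               | 4        | 16   | 20     | 2   |
| glueetl-4.0               | 4        | 14   | 18     | 2   |
| glueetl-5.0               | 0        | 6    | 11     | 3   |
| pythonshell-3.6           | 1        | 1    | 6      | 0   |
| pythonshell-3.9           | 0        | 0    | 0      | 0   |
| pythonshell-3.9-analytics | 1        | 1    | 3      | 0   |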
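
Per-severity counts like these can be derived from a CycloneDX document that contains a vulnerabilities section, such as the one Trivy writes. The sketch below is one possible way to do it, not necessarily how the numbers above were produced. It assumes each vulnerability carries a ratings list with a severity field and counts each vulnerability once under its worst rating. That choice is an assumption, and a scanner's own summary may pick a different rating. The sample data uses placeholder ids.

```python
from collections import Counter

SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def count_severities(bom):
    """Count vulnerabilities per severity, using the worst rating of each."""
    counts = Counter()
    for vuln in bom.get("vulnerabilities", []):
        severities = {
            str(rating.get("severity") or "").lower()
            for rating in vuln.get("ratings", [])
        }
        # SEVERITY_ORDER runs from worst to least severe, so stop at the first hit.
        for level in SEVERITY_ORDER:
            if level in severities:
                counts[level] += 1
                break
    return {level: counts[level] for level in SEVERITY_ORDER}


# Placeholder data in the shape of a CycloneDX vulnerabilities section.
sample_bom = {
    "vulnerabilities": [
        {"id": "EXAMPLE-1", "ratings": [{"severity": "high"}]},
        {"id": "EXAMPLE-2", "ratings": [{"severity": "medium"}, {"severity": "low"}]},
        {"id": "EXAMPLE-3", "ratings": [{"severity": "low"}]},
    ]
}

print(count_severities(sample_bom))
```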

Vulnerabilities in AWS Glue 5.0 GlueETL

Here are the known vulnerabilities in the newly released Glue ETL 5.0 runtime. None are rated critical, but six are high severity:

| Package | Severity | Id | Installed Version | Fixed Version | Title |
|---------|----------|----|-------------------|---------------|-------|
| Pygments | MEDIUM | CVE-2022-40896 | 2.7.4 | 2.15.0 | pygments: ReDoS in pygments |
| aiohttp | MEDIUM | CVE-2024-42367 | 3.10.1 | 3.10.2 | aiohttp: python-aiohttp: Compressed files as symlinks are not protected from path traversal |
| aiohttp | MEDIUM | CVE-2024-52304 | 3.10.1 | 3.10.11 | aiohttp: aiohttp vulnerable to request smuggling due to incorrect parsing of chunk extensions |
| cryptography | HIGH | CVE-2023-0286 | 36.0.1 | 39.0.1 | openssl: X.400 address type confusion in X.509 GeneralName |
| cryptography | HIGH | CVE-2023-50782 | 36.0.1 | 42.0.0 | python-cryptography: Bleichenbacher timing oracle attack against RSA decryption - incomplete fix for CVE-2020-25659 |
| cryptography | MEDIUM | CVE-2023-23931 | 36.0.1 | 39.0.1 | python-cryptography: memory corruption via immutable objects |
| cryptography | MEDIUM | CVE-2023-49083 | 36.0.1 | 41.0.6 | python-cryptography: NULL-dereference when loading PKCS7 certificates |
| cryptography | MEDIUM | CVE-2024-0727 | 36.0.1 | 42.0.2 | openssl: denial of service via null dereference |
| cryptography | LOW | GHSA-5cpq-8wj7-hf2v | 36.0.1 | 41.0.0 | Vulnerable OpenSSL included in cryptography wheels |
| cryptography | LOW | GHSA-jm77-qphf-c4w8 | 36.0.1 | 41.0.3 | pyca/cryptography’s wheels include vulnerable OpenSSL |
| cryptography | LOW | GHSA-v8gr-m533-ghj9 | 36.0.1 | 41.0.4 | Vulnerable OpenSSL included in cryptography wheels |
| idna | MEDIUM | CVE-2024-3651 | 2.10 | 3.7 | python-idna: potential DoS via resource consumption via specially crafted inputs to idna.encode() |
| pip | MEDIUM | CVE-2023-5752 | 21.3.1 | 23.3 | pip: Mercurial configuration injectable in repo revision when installing via pip |
| pip | MEDIUM | CVE-2023-5752 | 22.3.1 | 23.3 | pip: Mercurial configuration injectable in repo revision when installing via pip |
| setuptools | HIGH | CVE-2022-40897 | 59.6.0 | 65.5.1 | pypa-setuptools: Regular Expression Denial of Service (ReDoS) in package_index.py |
| setuptools | HIGH | CVE-2024-6345 | 59.6.0 | 70.0.0 | pypa/setuptools: Remote code execution via download functions in the package_index module in pypa/setuptools |
| urllib3 | HIGH | CVE-2021-33503 | 1.25.10 | 1.26.5 | python-urllib3: ReDoS in the parsing of authority part of URL |
| urllib3 | HIGH | CVE-2023-43804 | 1.25.10 | 2.0.6, 1.26.17 | python-urllib3: Cookie request header isn’t stripped during cross-origin redirects |
| urllib3 | MEDIUM | CVE-2023-45803 | 1.25.10 | 2.0.7, 1.26.18 | urllib3: Request body not stripped after redirect from 303 status changes request method to GET |
| urllib3 | MEDIUM | CVE-2024-37891 | 1.25.10 | 1.26.19, 2.2.2 | urllib3: proxy-authorization request header is not stripped during cross-origin redirects |

Mitigating Vulnerabilities

If your Glue jobs access external resources, be sure to update the required packages using the runtime installation option. However, this could lead to a “dependency hell” situation, so use your favorite tools or something like pur to help update the requirements.

Here’s an overview of some key packages that are outdated:

Updated aiobotocore: 2.13.1 -> 2.16.1
Updated aiohappyeyeballs: 2.3.5 -> 2.4.4
Updated aiohttp: 3.10.1 -> 3.11.11
Updated aioitertools: 0.11.0 -> 0.12.0
Updated aiosignal: 1.3.1 -> 1.3.2
Updated async-timeout: 4.0.3 -> 5.0.1
Updated attrs: 24.2.0 -> 24.3.0
Updated awscrt: 0.19.19 -> 0.23.6
Updated boto3: 1.34.131 -> 1.35.92
Updated botocore: 1.34.131 -> 1.35.92
Updated certifi: 2024.7.4 -> 2024.12.14
Updated cffi: 1.14.5 -> 1.17.1
Updated charset-normalizer: 3.3.2 -> 3.4.1
Updated colorama: 0.4.4 -> 0.4.6
Updated contourpy: 1.2.1 -> 1.3.1
Updated cryptography: 36.0.1 -> 44.0.0
Updated distlib: 0.3.1 -> 0.3.9
Updated distro: 1.5.0 -> 1.9.0
Updated docutils: 0.16 -> 0.21.2
Updated filelock: 3.0.12 -> 3.16.1
Updated fonttools: 4.53.1 -> 4.55.3
Updated frozenlist: 1.4.1 -> 1.5.0
Updated fsspec: 2024.6.1 -> 2024.12.0
Updated idna: 2.10 -> 3.10
Updated importlib_resources: 6.4.0 -> 6.5.2
Updated jmespath: 0.10.0 -> 1.0.1
Updated kiwisolver: 1.4.5 -> 1.4.8
Updated libcomps: 0.1.20 -> 0.1.21.post1
Updated matplotlib: 3.9.0 -> 3.10.0
Updated multidict: 6.0.5 -> 6.1.0
Updated numpy: 1.26.4 -> 2.2.1
Updated packaging: 24.1 -> 24.2
Updated pandas: 2.2.2 -> 2.2.3
Updated pillow: 10.4.0 -> 11.1.0
Updated pip: 21.3.1 -> 24.3.1
Updated pip: 22.3.1 -> 24.3.1
Updated plotly: 5.23.0 -> 5.24.1
Updated prompt-toolkit: 3.0.24 -> 3.0.48
Updated pyarrow: 17.0.0 -> 18.1.0
Updated pycparser: 2.20 -> 2.22
Updated Pygments: 2.7.4 -> 2.19.0
Updated pyparsing: 3.1.2 -> 3.2.1
Updated pytz: 2024.1 -> 2024.2
Updated requests: 2.32.2 -> 2.32.3
Updated ruamel.yaml: 0.16.6 -> 0.18.9
Updated ruamel.yaml.clib: 0.1.2 -> 0.2.12
Updated s3fs: 2024.6.1 -> 2024.12.0
Updated s3transfer: 0.10.2 -> 0.10.4
Updated setuptools: 59.6.0 -> 75.7.0
Updated six: 1.16.0 -> 1.17.0
Updated tzdata: 2024.1 -> 2024.2
Updated urllib3: 1.25.10 -> 2.3.0
Updated virtualenv: 20.4.0 -> 20.28.1
Updated wcwidth: 0.2.5 -> 0.2.13
Updated wrapt: 1.16.0 -> 1.17.0
Updated yarl: 1.9.4 -> 1.18.3
Updated zipp: 3.19.2 -> 3.21.0
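
Not every update in a list like this carries the same risk. As a rough aid for spotting likely sources of dependency hell, the sketch below flags lines where the first version component changes, for example numpy 1.x to 2.x. It assumes the "Updated name: old -> new" line format shown above, and the function name is made up. Treat the result as a hint only, since a major bump is not always breaking and a minor bump is not always safe.

```python
import re

UPDATE_LINE = re.compile(r"^Updated (?P<name>\S+): (?P<old>\S+) -> (?P<new>\S+)$")


def major_bumps(report):
    """Return (name, old, new) for updates whose first version component changes."""
    bumps = []
    for line in report.splitlines():
        match = UPDATE_LINE.match(line.strip())
        if not match:
            continue  # ignore lines that are not update lines
        old_major = match["old"].split(".")[0]
        new_major = match["new"].split(".")[0]
        if old_major != new_major:
            bumps.append((match["name"], match["old"], match["new"]))
    return bumps


report = """\
Updated numpy: 1.26.4 -> 2.2.1
Updated pandas: 2.2.2 -> 2.2.3
Updated urllib3: 1.25.10 -> 2.3.0
Updated certifi: 2024.7.4 -> 2024.12.14
"""

for name, old, new in major_bumps(report):
    print(f"{name}: {old} -> {new} (major version change, test carefully)")
```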

Luckily, Glue 5 now supports the use of a requirements.txt file uploaded to S3, which can be parsed by pip. See “Add custom Python modules” in the Glue documentation.

This opens up the possibility of using local checks and tools like GitHub Dependabot to monitor your dependencies for vulnerabilities.
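
As a sketch of what this looks like in practice: to my understanding of the Glue 5.0 documentation, the requirements file is wired in through two job parameters, --additional-python-modules pointing at the S3 object and --python-modules-installer-option set to -r. The helper below only builds that parameter dictionary. The function name and the bucket path are hypothetical, and you should verify the parameter names against the current AWS documentation before relying on them.

```python
def requirements_job_arguments(requirements_s3_uri):
    """Build the job parameters that point Glue at a requirements.txt on S3."""
    if not requirements_s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    return {
        # Assumed per the Glue 5.0 docs: the S3 location of requirements.txt ...
        "--additional-python-modules": requirements_s3_uri,
        # ... and the pip option that makes it be read as a requirements file.
        "--python-modules-installer-option": "-r",
    }


# Hypothetical bucket and key, for illustration only.
args = requirements_job_arguments("s3://my-example-bucket/glue/requirements.txt")
print(args)
```

The resulting dictionary can be supplied as DefaultArguments when creating or updating a job, or as Arguments for a single job run. Keeping requirements.txt in the repository next to the job code is what lets tools like Dependabot pick it up.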

Conclusion

  1. Data pipelines are applications and need to be treated with the same level of scrutiny as any other software. Managing their lifecycle is critical for security.

  2. Be aware of vulnerabilities in default runtimes, whether using AWS Glue, Apache Airflow, or other similar tools.

  3. Use Glue Inspector to scan your Glue jobs and generate an SBOM for better software supply chain management. SBOMs are becoming an industry standard, driven by regulations like DORA and U.S. government standards for critical infrastructure.