Tech, Cloud and Programming

Secrets and AWS GLUE Custom Connectors

|

For a project i had to retrieve data from Teradata using a Glue Job. A quick google gave me this: Teradata Blog This uses the JDBC Connector and it gets the credentials from SecretsManager using some boto3 api-calls.

However Glue is evolving and with the release of AWS Glue Studio, the old style JDBC aren't supported in it. It requires new Custom Connectors or Marketplace Connectors.

The documentation and many blogs still point to the old way, i wrote down here on how to use the new custom connectors in combination with SecretsManager. Hopefully saving other some time.

Custom Connector

First step, create a Custom Connector with a template configuration. Start by uploading your jdbc.jar file to a bucket. Then create the template configuration, the classname you can find in the documentation of the jdbc driver.

Create Connector

Connector Parameters

Custom Connection

Use this connector to create a Custom Connection with the database paramaters like username.

Create Connection

Fill Connection

Secrets

This was how you can define the connection with the parameters you need. However how much i tried i couldn't get the connection using the right fields from the secret in SecretsManager. After some extensive searching with google/github, i found some references to ${} and when re-reading the [documentation[(https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-market) i saw it ...

url – String, required, JDBC URL with placeholders (${}) which are used to build the connection to the data source. The placeholder ${secretKey} is replaced with the secret of the same name in AWS Secrets Manager. Refer to the data store documentation for more information about constructing the URL.

I updated the connector with the following url: jdbc:teradata://myserver/DATABASE=${database},user=${username},password=${password}. When using this url for a custom Connection it shows the data.

With Secrets

Make sure you create a secret before with the right fields: secretes

With the following pyspark code you can query the data source with all the parameters in secretsmanager. You can add any jdbc parameter for the connector in the template.

dynamic_dframe = glueContext.create_dynamic_frame.from_options(
    connection_type="custom.jdbc",
    connection_options={
        "dbTable": 'mytable',
        "connectionName": 'my-td-connection',
        "secretId": 'td-secret'
    },
    transformation_ctx="dynamic_df")

Note: The GlueJob Role needs s3 permissions read the JDBC Jar file and secretmanagers/kms to read the secret

Terraform

The Custom Connector and Connection aren't very well documented in Terraform. With some import and reverse engineering i created the following resources.

First create the connectors. This has a magic match_criteria of "template-connection"

In the URL $${database} is used because ${} is also a escape in terraform. using the double $$ will avoid this.

resource "aws_glue_connection" "connector_teradata_jdbc" {
  connection_type = "CUSTOM"

  name = "connector-teradata-jdbc"

  connection_properties = {
    CONNECTOR_CLASS_NAME = "com.teradata.jdbc.TeraDriver",
    CONNECTOR_TYPE       = "Jdbc",
    CONNECTOR_URL        = "s3://mybucket/jars/terajdbc4.jar",
    JDBC_CONNECTION_URL  = "[[\"default=jdbc:teradata://${var.teradata_hostname}/DATABASE=$${database},user=$${username},password=$${password}\"],\",\"]"
    # base jdbc url with the seperator ,
  }

  match_criteria = ["template-connection"]
}

Then to use the connector as a connection, use the same resource type: aws_glue_connection, but now match it the Connector.

resource "aws_glue_connection" "connection_teradata_jdbc" {
  connection_type = "CUSTOM"

  name = "connection-teradata-jdbc"

  # templates are filled in by the secrets, double $ for the terraform escape
  connection_properties = {
    CONNECTOR_CLASS_NAME = "com.teradata.jdbc.TeraDriver",
    CONNECTOR_TYPE       = "Jdbc",
    CONNECTOR_URL        = "s3://mybucket/jars/terajdbc4.jar",
    JDBC_CONNECTION_URL  = "jdbc:teradata://${var.teradata_hostname}/DATABASE=$${database},user=$${username},password=$${password}",
    SECRET_ID            = aws_secretsmanager_secret.teradata_secret.id
  }

  match_criteria = ["Connection", aws_glue_connection.connector_teradata_jdbc.name]

}

For VPC based access add the physical_connection_requirements to the connection to enable it. The security group needs all ports open for it self for ingress and egress to the database. Not every safe, but that's the only way according the documentation

Note you need to create a connection per az/subnet if you want failover.

  physical_connection_requirements {
    security_group_id_list = [aws_security_group.glue_connection.id]
    availability_zone      = each.key

    subnet_id = var.subnet_id
  }