In my previous post I must have hit some issue with ChatGPT. It's way more impressive than I thought.
can you write a pyspark job that inserts data into an iceberg table
Sure, here is an example of a PySpark job that inserts data into an iceberg table:
ChatGPT is out and everybody is trying it. Blogs and songs are easy, but can it help with Data Engineering?
i have a table named 'people' with fields: id, name, job_type and a table names 'jobs' with fields: id,job_name. Create a join on this two tables in sql
To create a join on two tables in SQL, you can use the JOIN keyword. Here's an example:
SELECT people.name, jobs.job_name
FROM people
INNER JOIN jobs ON people.job_type = jobs.id
This query will return a table with columns name and job_name, and each row will represent a person and the job they have. The JOIN keyword is used to specify the tables to be joined, and the ON keyword is used to specify the criteria for matching rows in the two tables. In this case, the ON clause specifies that rows in the people table should be matched with rows in the jobs table based on the value of job_type in the people table and the value of id in the jobs table.
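To check the behaviour of this query locally, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration, and the column types are an assumption, since the prompt only names the fields.

```python
import sqlite3

# In-memory database; table contents are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, job_name TEXT);
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, job_type INTEGER);
    INSERT INTO jobs VALUES (1, 'Data Engineer'), (2, 'Analyst');
    INSERT INTO people VALUES (1, 'Alice', 1), (2, 'Bob', 2), (3, 'Carol', 99);
    """
)

# The same join as in the answer above, with an ORDER BY for stable output.
rows = conn.execute(
    """
    SELECT people.name, jobs.job_name
    FROM people
    INNER JOIN jobs ON people.job_type = jobs.id
    ORDER BY people.id
    """
).fetchall()

print(rows)
```

Carol has a job_type without a matching row in jobs, so the INNER JOIN leaves her out of the result. A LEFT JOIN would keep her with an empty job_name.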
An overview of all the releases, sorted per group. This re:Invent was very focused on Data & Analytics, which accounted for 42 of the 83 releases this year.
Python is a powerful tool to run ETL jobs on AWS. The 3 main ways are Lambda, Glue (PySpark) and Glue Python shell. Lambda can run for at most 15 minutes. Glue Python shell is a simple environment for running Python scripts, while Glue Spark provides a fully serverless PySpark environment to run complex scripts.
There are many other ways to run scripts on AWS, but those are out of scope for now.
While Python is powerful, you need additional libraries for your code. The most common for ETL processing are NumPy and Pandas, but many others are available for calling APIs, parsing files, etc.
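As a rough illustration of the Lambda option, here is a minimal sketch of a handler that does a tiny transform step using only the standard library. The event shape ({"csv": "..."}) and the field names are made up for this example and are not an AWS event format.

```python
import csv
import io
import json


def handler(event, context):
    """Lambda-style entry point: parse CSV text from the event and clean it.

    The {"csv": "..."} event shape is a hypothetical example.
    """
    reader = csv.DictReader(io.StringIO(event["csv"]))
    rows = [
        {"id": int(row["id"]), "name": row["name"].strip().title()}
        for row in reader
        if row["id"]  # skip rows without an id
    ]
    return {"statusCode": 200, "body": json.dumps(rows)}


# Local invocation with a made-up event; Lambda would pass a real context object.
sample_event = {"csv": "id,name\n1, alice \n,missing id\n2,BOB\n"}
result = handler(sample_event, None)
print(result["body"])
```

Anything heavier than this, for example Pandas or NumPy, has to be packaged with the function or the Glue job, which is where the additional libraries come in.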
I was invited to give a talk at awscommunityday.nl. I gave an overview of the Step Functions features released in the past years.
Great location at the Rijtuigenloods in Amersfoort. Talks were held inside old rail carriages.
The presentation can be found here: Presentation
For a project I had to retrieve data from Teradata using a Glue job. A quick Google search gave me this: Teradata Blog. It uses the JDBC connector and gets the credentials from Secrets Manager using some boto3 API calls.
However, Glue is evolving, and with the release of AWS Glue Studio the old-style JDBC connections aren't supported in it. It requires the new Custom Connectors or Marketplace Connectors.
The documentation and many blogs still point to the old way, so I wrote down here how to use the new custom connectors in combination with Secrets Manager, hopefully saving others some time.
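For context, the older approach of reading the credentials with boto3 might look roughly like the sketch below. It is not the code from the Teradata blog: the helper name, the secret name and the "username"/"password" keys are assumptions about how the secret is stored.

```python
import json


def get_db_credentials(client, secret_id):
    """Return (username, password) from a JSON secret.

    `client` is expected to be boto3.client("secretsmanager"). The
    "username" and "password" keys are an assumption about the secret layout.
    """
    response = client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return secret["username"], secret["password"]


# Inside a Glue job this would be used roughly like this
# ("teradata/credentials" is a hypothetical secret name):
#   import boto3
#   user, password = get_db_credentials(
#       boto3.client("secretsmanager"), "teradata/credentials"
#   )
```

Passing the client in as an argument keeps the helper easy to test without AWS access.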
It was great to give an in-person talk at AWSug.nl. I gave an overview of the EventBridge features released in the past years.
The presentation can be found here: Presentation
If you want to remove SageMaker Canvas after using it, you'll find an unpleasant surprise in the web console. The button is grayed out because you have active users, but users can't be removed in the web console.
However, there is a nice link to the documentation.
Docker is almost an essential tool for developing applications in a DevOps environment. It allows easy context switching and grouping. The standard way of installing it on the Mac via Docker Desktop was free, but since 2021 it's a paid subscription model for companies with more than 250 workers, with a minimum of $5/month.
I got invited to participate in the AWS Community Builders program for the Data category. Looking forward to working together with Jason Dunn and the rest of the builders to create and share content.
Today I saw a tweet from David Boyne announcing the book EventBridgeBook. I can't wait for it.
Events are getting more useful and powerful every day in AWS. Unfortunately the documentation is sometimes lacking. The above book will help with best practices and detailed information. The following website with AWS Lambda events from Michael Brewer aims to collect Lambda events.
It shows the requests and responses for events per service.
CDK Day is on the 26th of May this year. It's a free online event that can be followed on their YouTube channel.
CDK started out as AWS CDK, a cross-programming-language tool to program your AWS cloud infrastructure. Using TypeScript/Python/Java/.NET you can program the infrastructure. The CDK compiler then generates a CloudFormation template to deploy in your cloud. Since then there is also terraform-cdk to generate Terraform code and cdk8s to create Kubernetes configuration. The last project is projen, an advanced scaffolding/template tool for CDK projects.
Soon more CDK projects on this blog.
How to have views and votes on a statically generated website? In a server-side website (php/react/etc) you would pull the counters from the database and inject them into the code while serving. This website is statically generated by Zola every time I publish an article. Also, Cloudflare doesn't give you access logs in the free version.
I could use Google Analytics to track people, but that would include cookies and send a lot of data to Google. As a real engineer I wrote my own small version. This first article is about the client side; the next part will be about the AWS backend.
Sharing interesting websites with AWS Tools and Resources. (Part 1)
Lately I'm doing a lot of Data Engineering in Python. Using Jupyter Notebooks is great for quick development and visualisation. However, the default width is limited on today's widescreen monitors. Here are some ways to use the full width of your browser.
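One commonly used approach, not necessarily one of those from the full article, is to inject a CSS override from a notebook cell. This is a minimal sketch: the ".container" selector is an assumption that matches the classic Notebook layout, and JupyterLab or Notebook 7 use different class names.

```python
# CSS override for the classic Notebook layout; the ".container" selector is
# an assumption and will not match JupyterLab or Notebook 7.
FULL_WIDTH_CSS = "<style>.container { width: 100% !important; }</style>"

try:
    from IPython.display import HTML, display

    # Inside a notebook cell this applies the style to the current page.
    display(HTML(FULL_WIDTH_CSS))
except ImportError:
    # Outside Jupyter/IPython there is nothing to restyle.
    print("IPython is not installed; run this inside a notebook cell.")
```

The override only lasts for the current page, so the cell has to be run again after reloading the notebook.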
To learn a new AWS technology you can read the manual, find some YouTube videos or do a course. But a relatively unknown source is the AWS Workshops. Previously these workshops were mostly exclusive to attendees of re:Invent in Las Vegas, but the workshops team at AWS is collecting and sharing them.
There are over 100 workshops on all kinds of topics and levels.
As an engineer and a tech lead, I see meetings as a necessary evil. I try to minimize them: some can be a Slack message, others an email. But for some cases a meeting is required. It's hard to plan and difficult to get the right engagement. Then I read this article: http://www.paulgraham.com/makersschedule.html and it opened my eyes.
To make a good IAM policy you want it strict. With the ever-growing functionality of AWS, this requires a lot of documentation checking. The AWS documentation is extensive, but not very easy to search.
I saw this extension for VSCode by Sebastian Bille.
It improves working with IAM policies in Terraform, SAM and JSON policies.
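To make "strict" concrete, here is a minimal sketch of a narrowly scoped policy built as a Python dict and printed as JSON. The bucket name, prefix and Sid are hypothetical; the point is a single named action on a single resource path instead of wildcards.

```python
import json

# "example-bucket" and the "reports/" prefix are hypothetical names.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadReportsOnly",
            "Effect": "Allow",
            # One explicit action instead of "s3:*".
            "Action": ["s3:GetObject"],
            # One prefix in one bucket instead of "*".
            "Resource": "arn:aws:s3:::example-bucket/reports/*",
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Finding the exact action names and resource formats for each service is the documentation checking mentioned above.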