Unity Iceberg REST API and PyIceberg
Access Unity tables via the Iceberg REST API
After working with the Glue catalog in the previous article, I wanted to run the same test against Unity Catalog, the open-source catalog from Databricks.
Installing Unity Catalog
I followed the quickstart steps from https://docs.unitycatalog.io/quickstart/. Unity Catalog needs Java 17, while my Mac had Java 23 installed by default, so I used jenv to switch versions quickly.
Install Java 17 from brew with jenv
brew install microsoft-openjdk@17 jenv
jenv add /opt/homebrew/opt/openjdk@17
Clone the repo and set Java to 17
git clone https://github.com/unitycatalog/unitycatalog.git
cd unitycatalog
jenv local 17
jenv rehash
Start the server (the first start also builds it):
bin/start-uc-server
Use Uniform tables
Unity does not support native Iceberg tables yet, but it does support the Delta UniForm format, which can be read as both Delta and Iceberg.
To set up the test environment, follow: https://docs.unitycatalog.io/usage/tables/uniform/
cp -r etc/data/external/unity/default/tables/marksheet_uniform /tmp/marksheet_uniform
REST API
The Iceberg API is available at: http://127.0.0.1:8080/api/2.1/unity-catalog/iceberg/
According to the documentation, tables are addressed in the following format, but I had to change this:
When querying Iceberg REST Catalog for Unity Catalog, tables are identified using the following pattern iceberg.
Code
By default, the local Unity Catalog server doesn't use authentication. I'll test that later.
from pyiceberg.catalog import load_catalog
import logging

# Assumed setup: debug logging in the format seen in the output below
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)


def main():
    # "warehouse" carries the Unity catalog name
    rest_catalog = load_catalog(
        "databricks",
        **{
            "type": "rest",
            "warehouse": "unity",
            "uri": "http://127.0.0.1:8080/api/2.1/unity-catalog/iceberg",
        }
    )
    print(rest_catalog.list_namespaces())
    print(rest_catalog.list_tables("default"))
    print(rest_catalog.load_table("default.marksheet_uniform").scan().to_pandas())


if __name__ == "__main__":
    main()
Output
2024-12-29 12:53:45,806 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): 127.0.0.1:8080
2024-12-29 12:53:45,809 - urllib3.connectionpool - DEBUG - http://127.0.0.1:8080 "GET /api/2.1/unity-catalog/iceberg/v1/config?warehouse=unity HTTP/1.1" 200 55
2024-12-29 12:53:45,810 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): 127.0.0.1:8080
2024-12-29 12:53:45,815 - urllib3.connectionpool - DEBUG - http://127.0.0.1:8080 "GET /api/2.1/unity-catalog/iceberg/v1/catalogs/unity/namespaces HTTP/1.1" 200 28
[('default',)]
2024-12-29 12:53:45,836 - urllib3.connectionpool - DEBUG - http://127.0.0.1:8080 "GET /api/2.1/unity-catalog/iceberg/v1/catalogs/unity/namespaces/default/tables HTTP/1.1" 200 70
[('default', 'marksheet_uniform')]
2024-12-29 12:53:45,853 - urllib3.connectionpool - DEBUG - http://127.0.0.1:8080 "GET /api/2.1/unity-catalog/iceberg/v1/catalogs/unity/namespaces/default/tables/marksheet_uniform HTTP/1.1" 200 2238
2024-12-29 12:53:45,853 - pyiceberg.io - INFO - Defaulting to PyArrow FileIO
   id        name  marks
0   1  nWYHawtqUw    930
1   2  uvOzzthsLV    166
2   3  WIAehuXWkv    170
3   4  wYCSvnJKTo    709
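Since the full scan works, a narrower read could be the next thing to try. The following is a minimal sketch, not something I have verified against Unity Catalog: it assumes that marks is a numeric column (it looks that way in the output) and that the UniForm table handles PyIceberg's row_filter and selected_fields like any other Iceberg table. The helper name load_high_scores and the threshold are made up for illustration.

```python
def load_high_scores(catalog, min_marks=500):
    """Return id and marks for rows whose marks exceed min_marks.

    `catalog` is assumed to be the object returned by load_catalog(...)
    in the script above. The function name and threshold are hypothetical.
    """
    table = catalog.load_table("default.marksheet_uniform")
    scan = table.scan(
        # Filter given as a string expression; assumes marks is numeric
        row_filter=f"marks > {int(min_marks)}",
        # Project only two of the three columns
        selected_fields=("id", "marks"),
    )
    return scan.to_pandas()
```

Inside main() this would be called as print(load_high_scores(rest_catalog)).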
Difference
It seems the implementation is slightly different from the Glue Iceberg catalog and is not yet fully featured.
The only “issue” is that the setup requests a warehouse parameter, which I pass in the config with "warehouse": "unity".
Here the warehouse is the same as the catalog name.
After that, it responds the same as the Glue Iceberg catalog REST API: I'm able to list the namespaces and tables and view the data.
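To make the role of the warehouse value more concrete, here is a small sketch that rebuilds the request URLs seen in the debug output using only the standard library. The path layout (v1/config?warehouse=... and v1/catalogs/<warehouse>/...) is read off those log lines and is not taken from the Unity Catalog documentation. The helper names config_url, catalog_url and get_json are hypothetical.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Base URI of the local Unity Catalog Iceberg endpoint used in this article
BASE_URI = "http://127.0.0.1:8080/api/2.1/unity-catalog/iceberg"


def config_url(warehouse, base_uri=BASE_URI):
    """URL of the config endpoint, with the warehouse as a query parameter."""
    return f"{base_uri}/v1/config?{urlencode({'warehouse': warehouse})}"


def catalog_url(warehouse, *parts, base_uri=BASE_URI):
    """URL below v1/catalogs/<warehouse>/, the layout seen in the log output."""
    segments = ("catalogs", warehouse, *parts)
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"{base_uri}/v1/{path}"


def get_json(url):
    """Fetch a URL and decode the JSON body (needs the local server running)."""
    with urlopen(url) as response:
        return json.load(response)
```

With the server running, get_json(config_url("unity")) and get_json(catalog_url("unity", "namespaces")) should hit the same endpoints PyIceberg called above.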
The source code of the Iceberg REST catalog service in Unity Catalog can be found here: https://github.com/unitycatalog/unitycatalog/blob/main/server/src/main/java/io/unitycatalog/server/service/IcebergRestCatalogService.java
Next
- Writing data
- Trying real iceberg tables
- Comparing the functionality of the REST APIs.