Amundsen at Optum

Integrating Lyft's Amundsen as a data dictionary for internal Optum Oracle Databases

Amundsen is an open-source data discovery platform, maintained by Lyft, that also acts as a metadata engine. When adopting Amundsen at Optum, one of the difficulties in translating our existing pipelines was the unformatted metadata for our Oracle databases. Much of the data dictionary provided by our third-party contractor was stored in an Access database, which offered no easy way to view it publicly.

Amundsen provides an easy-to-read search engine for finding the tables you are working with.

However, because this data dictionary was stored in flat files, we were able to parse the Access databases using Python 3’s pyodbc library. This gave us the official vendor definitions for the fields and columns in our databases. Many user-defined or company-defined tables built from mandated queries still needed to be processed and extracted.
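As a minimal sketch of that parsing step, the snippet below opens an Access file through pyodbc and reads field definitions into plain dictionaries. It assumes the Microsoft Access ODBC driver is installed on the machine. The table name FieldDefinitions and the columns TableName, ColumnName, and Description are hypothetical placeholders, not the actual vendor layout.

```python
def connect_access(path):
    """Open an Access file over ODBC (needs the Microsoft Access ODBC driver)."""
    import pyodbc  # imported lazily so the reader function works without it

    return pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=" + path
    )


def read_vendor_definitions(conn, table="FieldDefinitions"):
    """Return one dict per documented column.

    The table and column names here are hypothetical; substitute the
    names used in the real vendor file.
    """
    cursor = conn.cursor()
    cursor.execute(f"SELECT TableName, ColumnName, Description FROM {table}")
    return [
        {
            "table": table_name,
            "column": column_name,
            # Descriptions may be NULL or padded with whitespace.
            "description": (description or "").strip(),
        }
        for table_name, column_name, description in cursor.fetchall()
    ]
```

Because read_vendor_definitions only relies on the standard DB-API cursor interface, it can be tried against any compatible connection (for example an in-memory SQLite database) before pointing it at the real Access file via connect_access.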

Amundsen allows for easy user management and gives a high-level overview of data ownership, categorization, and column definitions and data types.

We found the solution by pitching the idea to senior management and finding a client who was interested in defining the necessary tables and columns. In addition to having a rich user interface, Amundsen offers many ways to extract data and load it into graph and Elasticsearch indices for the frontend to consume. Its developers can deliver anything from small data dictionaries in CSV format to integrations with SQL databases that hold relational data for all the tables and instances you may want to include. Overall, it's a versatile tool for any company with Big Data as it expands its data science and processing capabilities.
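To illustrate the small CSV data dictionary route, here is a rough sketch that writes column definitions to a CSV file using only the Python standard library. The header layout is an assumption modeled on the sample column CSVs that ship with Amundsen's databuilder, so check it against the version you run. The default database, cluster, and schema values are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Assumed header layout; verify against your databuilder version.
COLUMN_FIELDS = [
    "name", "description", "col_type", "sort_order",
    "database", "cluster", "schema", "table_name",
]


def write_column_csv(definitions, out, database="oracle",
                     cluster="hypothetical_cluster",
                     schema="hypothetical_schema"):
    """Write one CSV row per column definition to the file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=COLUMN_FIELDS)
    writer.writeheader()
    next_position = defaultdict(int)  # running column position per table
    for definition in definitions:
        table = definition["table"]
        writer.writerow({
            "name": definition["column"],
            "description": definition["description"],
            "col_type": definition.get("col_type", ""),
            "sort_order": next_position[table],
            "database": database,
            "cluster": cluster,
            "schema": schema,
            "table_name": table,
        })
        next_position[table] += 1


# Usage: write a single hypothetical definition to an in-memory buffer.
buffer = io.StringIO()
write_column_csv(
    [{"table": "CLAIMS", "column": "CLAIM_ID",
      "description": "Unique claim identifier", "col_type": "NUMBER"}],
    buffer,
)
print(buffer.getvalue())
```

A file produced this way could then be handed to a CSV-based extraction job. The job configuration is not shown here because it depends on the databuilder version and the backing graph store.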