The State of AI in Data Analytics

Pang Jun Rong (Jayden)
7 min read · Apr 19, 2024


In this article, we take a peek at how AI is faring at data analytics tasks and why it is important to use the right tools.

Motivation

I recently started exploring the idea of an analytics dashboard equipped with the latest interactive AI, which would help users streamline their insight discovery by answering data-related questions.

The Analytics Dashboard for HDB Resale Data (WIP)

The use case is simple — when a quarterly analytics deliverable is served in a dashboard format to the senior management of business teams, communication overheads arising from clarifications can lead to delayed data-driven decisions and, consequently, an opportunity cost for the firm.

Hence, by embedding LLMs with contextual knowledge of the data at hand, we enable ad-hoc data analysis to be performed natively within the dashboard. Perhaps more importantly, the user does not need a background in SQL, Python or even Tableau to create new insights; simply ask about the data and receive an answer on the go.
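As a rough illustration of what "embedding contextual knowledge" could look like, the sketch below builds a prompt containing the table's schema and a few sample rows for an LLM to answer against. The prompt template, helper name and sample columns are my own assumptions for illustration, not the dashboard's actual implementation.

```python
import pandas as pd

def build_prompt(df: pd.DataFrame, question: str) -> str:
    # Describe the table's columns and dtypes so the model knows the schema.
    schema = ", ".join(f"{col} ({dtype})" for col, dtype in df.dtypes.items())
    # Include a few sample rows as CSV to ground the model in real values.
    sample = df.head(3).to_csv(index=False)
    return (
        "You are a data analyst. The table has columns: "
        f"{schema}.\nSample rows:\n{sample}\n"
        f"Question: {question}\nAnswer using only the table."
    )

# Toy data mirroring the HDB resale dataset's column names.
df = pd.DataFrame({
    "flat_type": ["3 ROOM", "4 ROOM"],
    "resale_price": [400000, 550000],
})
prompt = build_prompt(df, "What is the average resale price for 3 room flats?")
```

The resulting string would then be sent to whichever chat model backs the dashboard.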

This interest stems from my HDB Resale Flats Dashboard (which I will document in a future article), which uses a mono-language stack written purely in Python. The main technologies used were ScyllaDB, PySpark and Streamlit — all of which were containerized, as the main design consideration was finding a good medium between project manageability and scalability. After everything was set up (including the data pipelines), I felt an immense urge to explore the integration of Generative AI. I justified this obvious act of over-engineering as self-learning and discovered some of the latest AI models capable of tabular data question answering.

The Generative AIs

These models are (largely) open-source and can be found on Hugging Face. I decided to explore four different models (with different intended use cases) — GPT 3.5, C4AI Command R+, TAPAS and Pandalyst. I wanted to ask ChatGPT for a synopsis of these models, but some are so new that it simply does not provide any insightful information. When asked about C4AI Command R+:

“C4AI Command R+” likely refers to a specific command or feature within the context of a software tool, application, or system related to artificial intelligence (AI) and possibly specifically designed by or associated with C4AI. Without further context or information about C4AI and the specific software or platform it refers to, it’s challenging to provide a precise definition or explanation of what “C4AI Command R+” entails. If you have additional context or details about the software or system in question, please provide them, and I’ll do my best to assist you further.

GPT 3.5 is self-explanatory, but what exactly are the other models?

C4AI Command R+

C4AI Command R+ is an open-weights research release of a 104-billion-parameter model with highly advanced capabilities, including Retrieval Augmented Generation (RAG) and tool use to automate sophisticated tasks. This model generation supports multi-step tool use, which allows the model to combine multiple tools over multiple steps to accomplish difficult tasks.

TAPAS (Large Fine-Tuned WTQ)

TAPAS is a BERT-like transformer model pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion. This means it was pretrained on raw tables and associated texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. This way, the model learns an inner representation of the English language used in tables and associated texts, which can then be used to extract features useful for downstream tasks such as answering questions about a table, or determining whether a sentence is entailed or refuted by the contents of a table. Fine-tuning is done by adding a cell selection head and an aggregation head on top of the pretrained model, then jointly training these randomly initialized classification heads with the base model on SQA, WikiSQL and finally WTQ. This model is state-of-the-art in the table question answering space at the time of writing.

Pandalyst-7B-v1.2

Pandalyst is a general large language model specifically trained to process and analyze data using the pandas library. It has strong generalization capabilities for data tables in different fields and for different data analysis needs. The model was only released in October 2023 and is built on Llama2. There is not much information about it in general, but its source code can be found on GitHub.

Analysis

Without diving too deep into the exact capabilities of each model (this is a Medium article, after all), I used the models mentioned above to answer simple analytical questions based on the data found on my HDB Resale Data Dashboard. Specifically, the data comprises the 100 most recent transactions at the time of writing, in the following format:

I asked the following question: What is the average resale price for 3 room flats?

ChatGPT’s Response
GPT 3.5’s Response
C4AI Command R+’s Response
Pandalyst’s Response

You might have noticed that I've omitted TAPAS; at the moment, the model fails to run at the scale of data we are working with (100 rows, 10+ columns). From Hugging Face Spaces built on top of TAPAS, it seems that the model can only handle datasets with <64 rows and <8 columns (about 500 cells in total)… it's really not usable for any practical analysis.
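Those observed limits can be encoded as a quick pre-flight check before handing a table to TAPAS. The thresholds below come from the Spaces behaviour described above and are an assumption, not a documented constraint of the model.

```python
import pandas as pd

def fits_tapas_limits(df: pd.DataFrame, max_rows: int = 64,
                      max_cols: int = 8, max_cells: int = 500) -> bool:
    # Thresholds are assumptions inferred from TAPAS demos on Hugging Face
    # Spaces (<64 rows, <8 columns, roughly 500 cells), not official limits.
    rows, cols = df.shape
    return rows < max_rows and cols < max_cols and rows * cols <= max_cells

# A small table passes the check; our 100-row, 10-column dataset does not.
small = pd.DataFrame({"a": range(10), "b": range(10)})
big = pd.DataFrame({f"col{i}": range(100) for i in range(10)})
```

A dashboard could use such a check to fall back to another model when the table is too large.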

The responses for the remaining models can be summarized as below:

  • Ground Truth: $416,575
  • GPT 3.5: $371,825
  • C4AI Command R+: $437,500
  • TAPAS: N.A.
  • Pandalyst: $416,575

It is evident from this simple question that Pandalyst is the only model returning the correct answer. The other models returned incorrect values that even included workings, which is extremely misleading for anyone looking for a quick insight into the data. This comes as no surprise, however, as GPT 3.5 is known to perform poorly at math. That C4AI Command R+, one of the latest advanced models available publicly, also fails only reaffirms why AI is still not widely used in data analytics, or rather, why it is so dangerous to use these models (even when they claim to be capable of complex tasks).

Pandalyst, on the other hand, provides us with the correct answer. This is because the model is specifically trained to perform data analysis using pandas in Python (hence the name). The model interprets the prompt or question and leverages pandas to write code, which can subsequently be executed by an "infer" command to generate an answer. In the case of our example above, the following code was generated:

def analyze_data(df: pd.DataFrame) -> str:
    df_3room = df[df['flat_type'] == '3 ROOM']
    avg_resale_price = df_3room['resale_price'].mean()
    return f'The average resale price for 3 room flats is {avg_resale_price}'

After inference, we arrive at the correct answer of $416,575, shown on the frontend of my HDB Resale Data Dashboard. This reflects the importance of using the right tools for the task at hand and fits the use case outlined initially. It did, however, take over 2 minutes to run this simple analysis on my Nvidia RTX 3080… so maybe it's just easier if I stick to interactive graphs in Streamlit.
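For illustration, the "infer" step can be imagined as compiling the generated source and calling the resulting function on the DataFrame. This exec-based sketch is my assumption about the mechanism, not Pandalyst's actual implementation, and in practice model-generated code should be sandboxed before execution.

```python
import pandas as pd

# Source code as a model like Pandalyst might emit it (taken from the
# generated function shown above).
generated_code = """
def analyze_data(df: pd.DataFrame) -> str:
    df_3room = df[df['flat_type'] == '3 ROOM']
    avg_resale_price = df_3room['resale_price'].mean()
    return f'The average resale price for 3 room flats is {avg_resale_price}'
"""

# Toy data standing in for the 100-transaction dataset.
df = pd.DataFrame({
    "flat_type": ["3 ROOM", "3 ROOM", "4 ROOM"],
    "resale_price": [400000, 420000, 550000],
})

# Define the generated function in a controlled namespace, then call it.
namespace = {"pd": pd}
exec(generated_code, namespace)
answer = namespace["analyze_data"](df)
```

Running untrusted generated code this way is exactly why such pipelines usually add sandboxing or an allow-list of operations.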

Well, it’s pretty cool that Pandalyst works as expected, but how capable is it at delivering real insights? This time, my persona is a savvy buyer who is interested in finding the recent transaction with the lowest resale price per floor area (sqm), to understand which characteristics of a resale flat provide the best “bang for buck” or value.

I ask the question: Which block and street name has the lowest resale price per floor area sqm? How much is each floor area sqm for this transaction?

Pandalyst’s Response to a Harder Question

Without even manually performing the calculations to verify this answer, we know that it is incorrect. In our question, we asked for the transaction with the lowest resale price per floor area (sqm), with the expectation that the model would divide the resale price of each transaction by its own floor area (sqm). This was evidently not done, as the model returned an average price of $350,000 per sqm for a transaction with a total resale price of the same amount.

In the logs, we observe that the generated function was incorrect and Pandalyst had not fully understood the question. The model simply took the transaction with the minimum resale price and returned it without dividing the value by the floor area (sqm).

def analyze_data(df: pd.DataFrame) -> str:
    df_grouped = df.groupby(['block', 'street_name'])['resale_price'].mean().reset_index()
    min_price_row = df_grouped[df_grouped['resale_price'] == df_grouped['resale_price'].min()]
    result = f"The block and street name with the lowest resale price per floor area sqm is {min_price_row['block'].values[0]} and {min_price_row['street_name'].values[0]} with an average price of {min_price_row['resale_price'].values[0]} per sqm."
    return result
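For comparison, a corrected version of the function would compute the price per sqm for every transaction before taking the minimum. This is my own sketch, assuming the same column names as the dashboard's dataset, and not output from any of the models tested.

```python
import pandas as pd

def analyze_data(df: pd.DataFrame) -> str:
    # Compute price per sqm per transaction, which the generated code skipped.
    per_sqm = df['resale_price'] / df['floor_area_sqm']
    # Locate the single transaction with the lowest price per sqm.
    min_row = df.loc[per_sqm.idxmin()]
    return (
        f"The block and street name with the lowest resale price per floor "
        f"area sqm is {min_row['block']} and {min_row['street_name']} at "
        f"{per_sqm.min():.2f} per sqm."
    )

# Toy data: block 2B is cheaper per sqm despite a higher total price.
df = pd.DataFrame({
    "block": ["1A", "2B"],
    "street_name": ["X ST", "Y AVE"],
    "resale_price": [300000, 500000],
    "floor_area_sqm": [60, 120],
})
result = analyze_data(df)
```

Note how the toy data makes the difference visible: the cheapest transaction overall ($300,000) is not the cheapest per sqm, which is exactly the distinction Pandalyst missed.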

Conclusion

Generative AI has a long way to go before it can perform data analytics reliably. Although specially-built models such as Pandalyst have already shown their capability in performing simple analytical tasks (even with a meagre 7B parameters), it may take a few years before AI starts providing actionable insights directly from data. Nevertheless, the “table question answering” subspace in AI is an interesting one which undoubtedly deserves more attention. You won’t see it on any analytics dashboards just yet, but things may change soon enough. Admittedly, some anxiety will remain for its users, as result correctness is hard to verify in a data-centric task compared to generative text tasks, which are easy to make sense of.

Written by Pang Jun Rong (Jayden)

I'm Jayden, a computer science graduate with a diverse interest in both arts and sciences!
