How to connect your CrateDB data to LLM with LlamaIndex and Azure OpenAI

LlamaIndex is a data framework for Large Language Models (LLMs). It comes with pre-trained models on massive public datasets such as GPT-4 or Llama 2 and provides an interface to external data sources allowing for natural language querying on your private data.

Azure Open AI Service is a fully managed service that runs on the Azure global infrastructure and allows developers to integrate OpenAI models into their applications. Through Azure Open AI API one can easily access a wide range of AI models in a scalable and reliable way.

In this tutorial, we will illustrate how to augment existing LLMs with data stored in CrateDB through the LlamaIndex framework and Azure Open AI Service. By doing this, you will be able to use the power of generative AI models with your own data in just a few lines of code.

Prerequisites

Deploy models in Azure OpenAI

Before we use OpenAI models to generate responses for queries, we need to deploy two models in Azure: for the text generation task and for embeddings. There are several pre-trained models available in Azure OpenAI Studio we can choose from, as well as the capability to customize AI models, fine-tuned with custom data and hyperparameters.

To deploy the models required for this tutorial, follow these steps:

  1. In Azure OpenAI resource choose Model Deployments and then Manage Deployments as illustrated below.

  1. This will open Azure AI Studio. Azure AI Studio enables developers to build, run, and deploy AI applications. Click on the Create new deployment button to deploy the following models:
  2. GPT-35-turbo for text generation tasks
  3. text-embedding-ada-002 for generating embeddings

The basic deployment of each model is straightforward in Azure OpenAI Studio: You need to select the model you want to deploy and specify the unique name:

Finally, you should have an overview of all deployed models under the Deployments tab:

Load time-series data to CrateDB

Let’s now create the time_series_data table in CrateDB that contains time series data, where each row represents a data point with the following information:

  1. timestamp: The timestamp when the data point was recorded.
  2. value: The numerical value associated with the data point.
  3. location: The location or source of the data (either ‘Sensor A’ or ‘Sensor B’).
  4. sensor_id: The ID of the sensor that generated the data.
CREATE TABLE IF NOT EXISTS time_series_data (
    timestamp TIMESTAMP,
    value DOUBLE,
    location STRING,
    sensor_id INT
);

Import a portion of the data we will use for learning and querying:

INSERT INTO time_series_data (timestamp, value, location, sensor_id)
VALUES
    ('2023-09-14T00:00:00', 10.5, 'Sensor A', 1),
    ('2023-09-14T01:00:00', 15.2, 'Sensor A', 1),
    ('2023-09-14T02:00:00', 18.9, 'Sensor A', 1),
    ('2023-09-14T03:00:00', 12.7, 'Sensor B', 2),
    ('2023-09-14T04:00:00', 17.3, 'Sensor B', 2),
    ('2023-09-14T05:00:00', 20.1, 'Sensor B', 2),
    ('2023-09-14T06:00:00', 22.5, 'Sensor A', 1),
    ('2023-09-14T07:00:00', 18.3, 'Sensor A', 1),
    ('2023-09-14T08:00:00', 16.8, 'Sensor A', 1),
    ('2023-09-14T09:00:00', 14.6, 'Sensor B', 2),
    ('2023-09-14T10:00:00', 13.2, 'Sensor B', 2),
    ('2023-09-14T11:00:00', 11.7, 'Sensor B', 2);

Connect Azure OpenAI with LlamaIndex

Azure OpenAI resource differs slightly from the standard OpenAI resource as it requires the use of the embedding model, which we deployed in the previous step. The following code illustrates the setup of OpenAI API:

import openai
import sys
import os
import sqlalchemy as sa

from llama_index.llms import AzureOpenAI
from llama_index import LangchainEmbedding
from llama_index import SQLDatabase
from langchain.embeddings import OpenAIEmbeddings
from llama_index.indices.struct_store.sql_query import NLSQLTableQueryEngine
from llama_index import ServiceContext, set_global_service_context

os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY" 
openai.api_type = "azure"
openai.api_base = "https://cratedb-ai.openai.azure.com/"
openai.api_version = "2023-12-01"
openai.api_key = os.getenv("OPENAI_API_KEY")

To find your api_base and OPEN_API_KEY check the Overview tab inside your OpenAI resource and look for the Endpoint and Manage keys values:

c8861714-f25c-44ea-a4d7-e15e6cabf176

The next step is to initialize LLM and embedding models. The model name is the actual model name deployed in the Azure OpenAI (e.g., gpt-35-turbo), and the engine and deployment_id parameters correspond to the custom deployment names ( e.g., my-gpt-35-turbo and my_embedding-model). To set this configuration globally, we can use a service context as the global default that applies to the entire LlamaIndex pipeline:

llm = AzureOpenAI(engine="my-gpt-35-turbo", model="gpt-35-turbo", temperature=0.0)

embedding_llm = LangchainEmbedding(
    OpenAIEmbeddings(
        deployment_id="my-embedding-model"
    )
)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embedding_llm)
set_global_service_context(service_context)

Connect CrateDB with LlamaIndex

Finally, let’s explore some of the core LlamaIndex SQL capabilities with CrateDB. In the following example, we will use the time_series_data table with the test data points and query it with text-to-SQL capabilities.

We use sqlalchemy, a popular SQL database toolkit, to connect to CrateDB and SQLDatabase wrapper that allows CrateDB data to be used within LlamaIndex.

engine_crate = sa.create_engine("crate://user:password@host:port", echo=True)
sql_database = SQLDatabase(engine_crate, include_tables=["time_series_data"])

To query CrateDB using natural language we make an instance of NLSQLTableQueryEngine and provide a list of tables:

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["time_series_data"],
)

At this point, we are ready to query CrateDB in plain English!

Ask question

When dealing with time-series data we are usually interested in aggregate values. For instance, with our query, we are interested in the average value of sensor 1:

query_str = "What is the average value for sensor 1?"
answer = query_engine.query(query_str)
print(answer.get_formatted_sources())
print("query was:", query_str)
print("answer was:", answer)

# query was: What is the average value for sensor 1?
# answer was: The average value for sensor 1 is 17.03.

Often, we are also interested in the query that produces the output.NLSQLTableQueryEngine has the capability to generate a query on a specific table:

print(answer.metadata)
# {'result': [(17.033333333333335,)], 'sql_query': 'SELECT AVG(value) FROM time_series_data WHERE sensor_id = 1'}

Takeaway

In this tutorial, we’ve embarked on the journey of using a natural language interface to query CrateDB data. We’ve explored how to seamlessly connect your data to the power of LLM using LlamaIndex and the capabilities of Azure OpenAI.

This tutorial is just the beginning. You can expect further resources, documentation, and tutorials related to CrateDB and generative AI from us. Also, stay tuned for the CrateDB 5.5 release: we will soon announce the support for the vector store and search, allowing you to implement similarity-based data retrieval efficiently.

In the meantime, explore the wide range of capabilities that CrateDB has to offer and start your cluster on CrateDB Cloud, including a forever free CRFREE plan. If you have further questions about CrateDB and its use cases, check our documentation or ask our growing community.

2 Likes