The predict function in ThanoSQL performs prediction tasks using pre-trained models. It supports a range of tasks by leveraging the capabilities of different engines.

Syntax

SELECT 
    [sequential_column,] [partition_column,] column_name, ...
    thanosql.predict(
        task := 'task_name',
        engine := 'engine_name',
        input := [column_name | 'input_data'],
        model := 'model_name',
        model_args := 'model_args_in_json',
        pipeline_args := 'pipeline_args_in_json',
        task_args := 'task_args_in_json',
        token := 'auth_token',
        base_url := 'base_url' -- Only applicable when the engine is 'openai'
    ) AS prediction
FROM 
    table_name

Parameters

| Parameter | Type | Default | Description | Options |
|---|---|---|---|---|
| task | string | - | The prediction task to perform. | 'text-classification', 'sentiment-analysis', 'summarization', 'image-classification', 'image-segmentation', 'audio-classification', 'automatic-speech-recognition', 'video-classification' |
| engine | string | 'huggingface' | The engine to use for performing predictions. | 'huggingface': Uses models from Hugging Face. 'openai': Uses models from OpenAI. |
| input | string | - | The input data for the prediction task. It can be text, a URL, an S3 URI, or a path to a local file. | N/A |
| model | string | - | The name or path of the pre-trained model. | Example: 'google-bert/bert-base-uncased' |
| model_args | json | None | JSON string representing additional arguments for the model. | N/A |
| pipeline_args | json | None | JSON string representing additional arguments for the pipeline. | N/A |
| task_args | json | None | JSON string representing additional arguments specific to the task. | N/A |
| token | string | None | Token for authentication if required by the model. | N/A |
| base_url | string | None | Base URL to point the client to an endpoint other than the default OpenAI API endpoint. Only applicable when the engine is 'openai'. | N/A |
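
Only the parameters without a default value are required; the rest are optional. As a quick orientation, here is a minimal sketch of a call that relies solely on the required parameters, assuming a documents table with a text_column (both names are placeholders):

SELECT
    thanosql.predict(
        task := 'text-classification',
        engine := 'huggingface',
        input := text_column,
        model := 'google-bert/bert-base-uncased'
    ) AS prediction
FROM
    documents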

Returns

  • varies: The prediction result based on the task.

Currently Supported Tasks

  • Text Classification
  • Sentiment Analysis
  • Summarization
  • Image Classification
  • Image Segmentation
  • Audio Classification
  • Automatic Speech Recognition
  • Video Classification

We are actively adding more tasks to enhance your experience.

Example Usage

When using text-generating models with the huggingface engine, the default truncation_policy is 'strict'. This can result in an error if the token count exceeds the model's limit. If this happens, reduce the text length or set pipeline_args := '{"truncation_policy": true}'.
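
For instance, a summarization query that overrides the truncation behavior might look like the following sketch; the pipeline_args value mirrors the note above, and the table and column names are placeholders:

SELECT
    thanosql.predict(
        task := 'summarization',
        engine := 'huggingface',
        input := text_column,
        model := 'Falconsai/text_summarization',
        pipeline_args := '{"truncation_policy": true}'
    ) AS prediction
FROM
    documents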

The following examples are provided to help you become familiar with the ThanoSQL syntax. To try out these queries in real scenarios, please visit the Use Cases section for detailed tutorials and practical applications.

To run the models in this tutorial, you will need the following tokens:

  • OpenAI Token: Required for all OpenAI-related tasks when using OpenAI as the engine. This token enables the use of OpenAI’s language models for various natural language processing tasks.
  • Hugging Face Token: Required only to access gated models, such as Mistral, on the Hugging Face platform. Gated models have restricted access due to licensing or usage policies, and a token is necessary to authenticate and use them. For more information, refer to the Hugging Face documentation.

Make sure to have these tokens ready before proceeding with the tutorial to ensure a smooth and uninterrupted workflow.

Text Classification

Here is an example of how to use the predict function for text classification using a Hugging Face model:

SELECT
    thanosql.predict(
        task := 'text-classification',
        engine := 'huggingface',
        input := text_column,
        model := 'google-bert/bert-base-uncased',
        model_args := '{"device_map": "cpu"}'
    ) AS prediction
FROM
    documents

On execution, we get:

| prediction                            |
|---------------------------------------|
| {'label': 'NEGATIVE', 'score': 0.998} |
| {'label': 'POSITIVE', 'score': 0.987} |

Please note that you will need to create an access token from Hugging Face to use certain models, such as the gated models mentioned above.
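
For such models, the token parameter can be added to the same call. The following is a minimal sketch; the model name and token value are placeholders, not a specific supported configuration:

SELECT
    thanosql.predict(
        task := 'text-classification',
        engine := 'huggingface',
        input := text_column,
        model := 'organization/gated-model-name', -- placeholder for a gated model
        token := 'huggingface_access_token'       -- placeholder Hugging Face access token
    ) AS prediction
FROM
    documents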

Sentiment Analysis

Here is an example of how to use the predict function for sentiment analysis using a Hugging Face model:

SELECT 
    thanosql.predict(
        task := 'sentiment-analysis',
        engine := 'huggingface',
        input := text_column,
        model := 'distilbert/distilbert-base-uncased',
        model_args := '{"device_map": "cpu"}'
    ) AS prediction
FROM
    reviews

On execution, we get:

| prediction |
|------------|
| 'POSITIVE' |
| 'NEGATIVE' |

Summarization

Here is an example of how to use the predict function for summarization using a Hugging Face model:

SELECT 
    thanosql.predict(
        task := 'summarization',
        engine := 'huggingface',
        input := text_column,
        model := 'Falconsai/text_summarization'
    ) AS prediction
FROM
    documents

On execution, we get:

| prediction                                                                          |
|-------------------------------------------------------------------------------------|
| 'The product exceeded my expectations, and I would highly recommend it to others.'  |
| 'The experience was disappointing due to poor customer service and delays.'         |
| 'Overall, a great purchase; the features and performance are outstanding.'          |

Image Classification

Here is an example of how to use the predict function for image classification using a Hugging Face model:

SELECT 
    thanosql.predict(
        task := 'image-classification',
        engine := 'huggingface',
        input := image_path,
        model := 'google/vit-base-patch16-224'
    ) AS prediction
FROM
    images

On execution, we get:

| prediction                               |
|------------------------------------------|
| {'label': 'Egyptian cat', 'score': 0.99} |
| {'label': 'Pembroke', 'score': 0.98}     |

Image Segmentation

Here is an example of how to use the predict function for image segmentation using a Hugging Face model:

SELECT 
    thanosql.predict(
        task := 'image-segmentation',
        engine := 'huggingface',
        input := image_column,
        model := 'openmmlab/upernet-swin-small',
        model_args := '{"device_map": "cpu"}'
    ) AS prediction
FROM
    images

On execution, we get:

| prediction                                                                                 |
|--------------------------------------------------------------------------------------------|
| '{"objects": ["cat", "dog"]', "scores": [0.999756, 0.950858]}'                             |
| '{"objects": ["sculpture", "water", "flower"]', "scores": [0.943367, 0.772345, 0.491883]}' |

Audio Classification

Here is an example of how to use the predict function for audio classification using a Hugging Face model:

SELECT 
    thanosql.predict(
        task := 'audio-classification',
        engine := 'huggingface',
        input := audio_column,
        model := 'superb/hubert-base-superb-ks',
        model_args := '{"device_map": "cpu"}'
    ) AS prediction
FROM
    audio

On execution, we get:

| prediction                                        |
|---------------------------------------------------|
| '{"label": "on", "score": 0.5916519165039062}'    |
| '{"label": "music", "score": 0.8824673891067505}' |

Automatic Speech Recognition

Here is an example of how to use the predict function for automatic speech recognition using a Hugging Face model:

SELECT 
    thanosql.predict(
        task := 'automatic-speech-recognition',
        engine := 'huggingface',
        input := audio_column,
        model := 'openai/whisper-tiny'
    ) AS prediction
FROM
    audio

On execution, we get:

| prediction                                       |
|--------------------------------------------------|
| 'Mary had a little lamb Little lamb little lamb' |
| 'Happy birthday to you'                          |

Here is an example of how to use the predict function for automatic speech recognition using an OpenAI model:

SELECT 
    thanosql.predict(
        task := 'automatic-speech-recognition',
        engine := 'openai',
        input := audio_column,
        model := 'whisper-1',
        token := 'openai_api_key'
    ) AS prediction
FROM
    audio

On execution, we get:

| prediction                        |
|-----------------------------------|
| 'Twinkle twinkle little star'     |
| 'Jack and Jill went up the hill'  |
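
If your requests need to go to an OpenAI-compatible endpoint other than the default OpenAI API, the base_url parameter can be supplied alongside the token. This is a sketch only; the URL below is a placeholder:

SELECT
    thanosql.predict(
        task := 'automatic-speech-recognition',
        engine := 'openai',
        input := audio_column,
        model := 'whisper-1',
        token := 'openai_api_key',
        base_url := 'https://example.com/v1' -- placeholder for a custom OpenAI-compatible endpoint
    ) AS prediction
FROM
    audio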

Video Classification

Here is an example of how to use the predict function for video classification using a Hugging Face model:

SELECT 
    thanosql.predict(
        task := 'video-classification',
        engine := 'huggingface',
        input := video_column,
        model := 'facebook/timesformer-base-finetuned-k400',
        model_args := '{"device_map": "cpu"}'
    ) AS prediction
FROM
    video

On execution, we get:

| prediction                                                     |
|----------------------------------------------------------------|
| '{"label": "petting animal", "score": 0.5179958343505859}'     |
| '{"label": "playing basketball", "score": 0.5449063777923584}' |

Model Restrictions

When using the predict function with the huggingface engine, use only models that are compatible with the Hugging Face pipeline. Verify that the selected model is supported by the Hugging Face library to avoid compatibility issues. Even with a compatible task and pipeline, some models may still not work. We are actively working on improving compatibility and functionality to provide a better user experience. For more information, refer to the official Hugging Face documentation.