The embed function in ThanoSQL is designed to generate embeddings for given input data using a pre-trained model. This function utilizes various engines to provide efficient and high-quality embeddings for text, images, and other types of data.

Syntax

SELECT 
    [sequential_column,] [partition_column,] column_name, ...
    thanosql.embed(
        engine := 'engine_name',
        input := [column_name | 'input_text_or_path'],
        model := 'model_name',
        model_args := 'model_args_in_json',
        tokenizer_args := 'tokenizer_args_in_json',
        token := 'auth_token',
        base_url := 'base_url' -- Only applicable when the engine is 'openai'
    ) AS embedding
FROM 
    table_name

Parameters

ParameterTypeDefaultDescriptionOptions
enginestring'huggingface'The engine to use for generating embeddings.'huggingface': Uses models from HuggingFace.'thanosql': Uses ThanoSQL’s native models.'openai': Uses models from OpenAI.
inputstringThe input data based on which the embeddings will be generated. It can be text, a URL, an S3 URI, or a path to a local file.N/A
modelstringThe name or path of the pre-trained text generation model.Example: 'openai/clip-vit-base-patch32'
model_argsjsonNoneJSON string representing additional arguments for the model.N/A
tokenizer_argsjsonNoneJSON string representing additional arguments for the tokenizer.N/A
tokenstringNoneToken for authentication if required by the model.N/A
base_urlstringNoneBase URL to point the client to a different endpoint than the default OpenAI API endpoint. This is only applicable when the engine is openai.N/A

Returns

  • list: The generated embeddings based on the input.

Example Usage

The following examples are provided to help you become familiar with the ThanoSQL syntax. To try out these queries in real scenarios, please visit the Use Cases section for detailed tutorials and practical applications.

To run the models in this tutorial, you will need the following tokens:

  • OpenAI Token: Required to access all the OpenAI-related tasks when using OpenAI as an engine. This token enables the use of OpenAI’s language models for various natural language processing tasks.
  • Huggingface Token: Required only to access gated models such as Mistral on the Huggingface platform. Gated models are those that have restricted access due to licensing or usage policies, and a token is necessary to authenticate and use these models. For more information, check this Huggingface documentation. Make sure to have these tokens ready before proceeding with the tutorial to ensure a smooth and uninterrupted workflow.

Using Hugging Face Embedding Models (Text)

Here is an example of how to use the embed function using a Hugging Face model:

SELECT 
    thanosql.embed(
        input := text_column,
        engine := 'huggingface',
        model := 'openai/clip-vit-base-patch32',
        model_args := '{"device_map": "cpu"}'
    ) AS text_embedding
FROM
    documents

On execution, we get:

| text_embedding                                   |
|--------------------------------------------------|
| [0.0034567890, -0.008123456, ... , -0.001987654] |
| [0.0045678901, -0.010987654, ... , -0.002345678] |
Please note that you would need to create an access token from Hugging Face to use certain models.

Using Hugging Face Embedding Models (Image)

Here is an example of how to use the embed function to generate embeddings for an image using Hugging Face model:

SELECT embedding
FROM unsplash_embed_sample
ORDER BY embedding <-> (
    SELECT thanosql.embed(
        engine := 'huggingface',
        input := '/home/jovyan/use_case_search/coastal_cliff.jpg',
        model := 'openai/clip-vit-base-patch32',
        model_args := '{"device_map": "cpu"}'
        ) 
        AS embedding
)

On execution, we get:

| embedding                                               |
|---------------------------------------------------------|
| [-0.08091072, -0.07656342, -0.046813846, 0.344484, ...] |
| [-0.0023032427, 0.14910948, -0.12709272, 0.072619, ...] |
| [0.18557435, 0.17372614, 0.08067615, 0.28555152, ...]   |

As of right now, only CLIP models are supported for image embedding with the Hugging Face engine.

Using OpenAI Embedding Models

Here is an example of how to use the embed function using an OpenAI model:

SELECT
    thanosql.embed(
        input := text_column,
        engine := 'openai',
        model := 'text-embedding-ada-002',
        token := 'openai_api_key'
    ) AS text_embedding
FROM
    documents

On execution, we get:

| text_embedding                                    |
|---------------------------------------------------|
| [0.0023064255, -0.009327292, ... , -0.0028842222] |
| [0.0034567890, -0.012345678, ... , -0.0012345678] |
Please note that you would need to create an API key from OpenAI.

Using the OpenAI Client

Here is an example of how to use the embed function with the base URL using the OpenAI Client:

SELECT 
    thanosql.embed(
        engine := 'openai',
        input := 'Once upon a time in a faraway land',
        model := 'your_model_name',
        base_url := 'your_base_url', -- Only applicable when the engine is 'openai'
        token := 'your_token' 
    ) AS text_embedding

On execution, we get:

| text_embedding                         |
|----------------------------------------|
| [0.014321, -0.021234, ... , -0.042567] |

Standalone Usage

Here is an example of how to use the embed function as a standalone query:

SELECT 
    thanosql.embed(
        engine := 'huggingface',
        input := 'Once upon a time in a faraway land',
        model := 'openai/clip-vit-base-patch32',
        model_args := '{"device_map": "cpu"}'
    ) AS text_embedding

On execution, we get:

| text_embedding                         |
|----------------------------------------|
| [0.012345, -0.023456, ... , -0.034567] |

Model Restrictions

When using the embed function with the huggingface engine, ensure that only models compatible with the HuggingFace AutoModel and AutoTokenizer are used. Verify that the selected model is supported by the HuggingFace library to avoid compatibility issues. Even with compatible models, some models might still not work. We are actively working on improving compatibility and functionality to provide a better user experience. For more information, refer to the official Hugging Face documentation.