Generate datasets

Export logs for fine-tuning or batch jobs.

Create a dataset

OpenAI announced the ability to fine-tune gpt-4o-mini. Use our AI SQL editor to turn your logs into a dataset formatted for OpenAI fine-tuning. The same process works for generating batch datasets for evaluations, classification, or embeddings.

Example use cases:

  • Export datasets for fine-tuning models
  • Select records for evaluations of different models
  • Identify example sets for any batch process

Query your dataset

Navigate to the AI SQL editor in your Velvet app. Write SQL to gather the relevant dataset.

Here's an example SQL query that collects the messages from each log into a single transformed_log column.

-- CTE to gather all messages with appropriate roles and contents
WITH messages AS (
    SELECT
        id,
        created_at,
        1 AS message_order, -- system prompt comes first
        jsonb_build_object('role', 'system', 'content', request->'body'->'messages'->0->>'content') AS message
    FROM llm_logs
    UNION ALL
    SELECT
        id,
        created_at,
        2 AS message_order, -- user message second
        jsonb_build_object('role', 'user', 'content', request->'body'->'messages'->1->>'content') AS message
    FROM llm_logs
    UNION ALL
    SELECT
        id,
        created_at,
        3 AS message_order, -- assistant response last
        jsonb_build_object('role', 'assistant', 'content', response->'body'->'choices'->0->'message'->>'content') AS message
    FROM llm_logs
)
-- Aggregate messages for each log entry, preserving system/user/assistant order
SELECT
    jsonb_build_object(
        'messages', jsonb_agg(message ORDER BY message_order)
    ) AS transformed_log
FROM messages
GROUP BY id, created_at
ORDER BY created_at DESC
LIMIT 10; -- modify as needed

To narrow the dataset before transforming it, add a filtering CTE. For example, this version only includes /chat/completions requests:

-- CTE to create a common filter
WITH filtered_logs AS (
    SELECT *
    FROM llm_logs
    WHERE request->>'url' = '/chat/completions'
),
-- CTE to gather all messages with appropriate roles and contents
messages AS (
    SELECT
        id,
        created_at,
        1 AS message_order, -- system prompt comes first
        jsonb_build_object('role', 'system', 'content', request->'body'->'messages'->0->>'content') AS message
    FROM filtered_logs
    UNION ALL
    SELECT
        id,
        created_at,
        2 AS message_order, -- user message second
        jsonb_build_object('role', 'user', 'content', request->'body'->'messages'->1->>'content') AS message
    FROM filtered_logs
    UNION ALL
    SELECT
        id,
        created_at,
        3 AS message_order, -- assistant response last
        jsonb_build_object('role', 'assistant', 'content', response->'body'->'choices'->0->'message'->>'content') AS message
    FROM filtered_logs
)
-- Aggregate messages for each log entry, preserving system/user/assistant order
SELECT
    jsonb_build_object(
        'messages', jsonb_agg(message ORDER BY message_order)
    ) AS transformed_log
FROM messages
GROUP BY id, created_at
ORDER BY created_at DESC
LIMIT 10; -- modify as needed

Export as JSONL

OpenAI's fine-tuning and batch endpoints require the dataset to be formatted as JSONL.

Once you've run the SQL query and have a result set you want to use, click the Export button in the data table. Export just the selected rows or the entire result set, and choose JSONL as the export type.

You'll now have a .jsonl file to upload to OpenAI.
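
Each line of the export is one transformed_log object from the query, in the chat-message shape OpenAI expects. If you want to sanity-check the file before uploading, here's a minimal sketch in Python (standard library only); the dataset.jsonl filename is a placeholder for your exported file.

import json

# Placeholder path -- replace with the name of your exported file.
path = "dataset.jsonl"

with open(path, "r", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # raises if the line isn't valid JSON
        messages = record.get("messages")
        assert isinstance(messages, list) and messages, (
            f"line {line_number}: expected a non-empty 'messages' array")
        for message in messages:
            # Each entry should carry a role and content, matching the SQL above.
            assert "role" in message and "content" in message, (
                f"line {line_number}: message missing 'role' or 'content'")

print("JSONL looks well-formed.")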


Summary of the process

Refer to the OpenAI fine-tuning and batch guides for full requirements.

  1. Query logs using the AI SQL editor.
  2. Export the dataset as JSONL (or CSV if you want to modify it further).
  3. To make manual modifications, open the CSV in a spreadsheet application, make your adjustments, and export as JSONL once complete.
  4. Upload the file to OpenAI for fine-tuning or batching, as sketched below.
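
For step 4, a minimal sketch using OpenAI's Python SDK might look like the following. The file name and model snapshot are assumptions, so check OpenAI's fine-tuning guide for the current fine-tunable models.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

# Upload the exported JSONL file (placeholder name) for fine-tuning.
with open("dataset.jsonl", "rb") as f:
    uploaded = client.files.create(file=f, purpose="fine-tune")

# Start a fine-tuning job on the uploaded file.
# "gpt-4o-mini-2024-07-18" is an assumed snapshot name -- confirm against OpenAI's docs.
job = client.fine_tuning.jobs.create(
    training_file=uploaded.id,
    model="gpt-4o-mini-2024-07-18",
)

print(job.id, job.status)

For batch jobs, the upload is the same except the file's purpose is "batch" and the file is submitted to the Batch API instead; see OpenAI's batch guide for the required request format.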