From sagemaker-ai
Generates Jupyter notebooks transforming datasets to ML formats like OpenAI chat, SageMaker SFT/DPO, HuggingFace preference from local files or S3.
npx claudepluginhub awslabs/agent-plugins --plugin sagemaker-ai

This skill uses the workspace's default tool permissions.
Transforms a data set provided by the user into their desired format. All transformation code is delivered as a Jupyter notebook.
If a notebook with the target name already exists, choose an incremented name (e.g., transform_dataset_2.ipynb) before writing. The default output file format is .jsonl (JSON Lines — one JSON object per line). This skill supports two transformation purposes — training data and evaluation data — each with its own format resolution path. The purpose is determined in Step 1 of the workflow.
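For illustration, a minimal sketch of what a JSON Lines payload looks like when built in Python (the field names here are arbitrary, not a required schema):

```python
import json

# Two example records; each serialized object occupies exactly one line.
records = [
    {"prompt": "What is 2 + 2?", "completion": "4"},
    {"prompt": "Name a primary color.", "completion": "Red"},
]
jsonl_text = "\n".join(json.dumps(r) for r in records)
print(jsonl_text)
```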
When the transformation is for model training, resolve the target format using the reference file ../dataset-evaluation/references/strategy_data_requirements.md. The required format depends on both the model type (Open Weights like Llama/Qwen vs Nova) and the finetuning technique (SFT, DPO, RLVR) — make sure to match on both dimensions. If either the model type or technique is not yet known, ask the user before resolving the format.
When the transformation is for model evaluation, resolve the target format using this order:
Fall back to the offline copy at references/sagemaker_dataset_formats.md, and inform the user that the format schemas are from an offline copy and may be outdated.

Use whichever source you successfully access as the source of truth for the target format. Do not rely on memorized schemas.
Your first response should determine whether this transformation is for model training or model evaluation. If the context already makes this clear (e.g., the user said "I need to prep my training data" or "I need to format my eval dataset"), confirm your understanding and move on. Otherwise, ask:
"Is this dataset transformation for model training or model evaluation? This helps me look up the right target format for you."
Remember this choice — it determines how the target format is resolved in Step 3.
⏸ Wait for user.
Acknowledge the user's request and state what this skill can do:
"I can help you transform your dataset's format! Here's my plan: first, I need to understand your dataset's current format and the transformation requirements. Then I will generate a dataset transformation function that we can refine together. Once the function is refined to your liking, I will perform the transformation and upload the result to your desired location. Does this sound good?"
⏸ Wait for user.
For this step, you need to know: what dataset format the user would like to transform their dataset from, and what dataset format they would like to transform it into. If you know this already, skip this step. If not, ask the user:
"What's the dataset format you would like to transform it into?"
Resolve the target format based on the purpose determined in Step 1:
"I've found a SageMaker dataset format: {sagemaker-dataset-format-name} with schema: {sagemaker-dataset-format-schema}. Is this what you were referring to?"
If the user describes a custom format not listed in the reference doc, ask them to provide a sample record of the desired output format.
⏸ Wait for user.
For this step, you need: the location of the user's dataset. If you know this already, skip this step. If not, ask the user:
"Where can I find your dataset? Either a local directory or S3 location works!"
⏸ Wait for user.
Read 1–2 sample records from the user's dataset and show them so the user can confirm the source schema. Do not run format detection — that is handled by the planning skill before this skill is invoked.
Do not show a side-by-side mapping to the target format here — the detailed mapping will be handled in Step 7 when generating the transformation function.
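A minimal way to pull the first couple of records from a local JSON Lines file without loading the whole dataset (the function name and default count are illustrative):

```python
import itertools
import json

def read_sample_records(path: str, n: int = 2) -> list:
    """Read the first n records from a .jsonl file, one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in itertools.islice(f, n)]
```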
⏸ Wait for user.
For this step, you need: to understand where to output the transformed dataset. It could be an S3 URI or a local directory. If you already know where the dataset should be output, skip this step. If not, ask the user:
"Where should I output your transformed dataset to? Either a local directory or S3 location works!"
If the user provides a directory (not a full file path), construct the output filename using the pattern {original_name}_{target_format}.jsonl (e.g., gen_qa_100k_openai.jsonl).
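The naming pattern above can be sketched as a small helper (build_output_path is a hypothetical name, not part of the skill's scripts):

```python
from pathlib import Path

def build_output_path(output_dir: str, input_path: str, target_format: str) -> str:
    """Construct {original_name}_{target_format}.jsonl inside output_dir."""
    stem = Path(input_path).stem  # original filename without extension
    return str(Path(output_dir) / f"{stem}_{target_format}.jsonl")
```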
⏸ Wait for user.
For this step, you need: to generate a Python function that transforms the dataset from the format in Step 5 to the format in Step 3.
Read the reference guide at references/dataset_transformation_code.md and follow its skeleton exactly when generating the transformation function.
The Python function should be in the form of:
def transform_dataset(df: pd.DataFrame) -> pd.DataFrame:
Add a %%writefile <project-dir>/scripts/transform_fn.py code cell to the notebook AND write the file to disk for testing. The <project-dir> is the project directory established by the directory-management skill (e.g., dpo-to-rlvr-conversion). All notebooks go in <project-dir>/notebooks/ and all scripts go in <project-dir>/scripts/.
Continue iterating with the user's feedback — update the notebook cell in place on each revision rather than showing code inline.
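As a sketch of the required form, here is a hypothetical transform mapping source question/answer columns to an OpenAI-style messages column. The column names are assumptions for illustration, not the user's actual schema:

```python
import pandas as pd

def transform_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Map hypothetical "question"/"answer" columns to a chat "messages" column."""
    def to_messages(row: pd.Series) -> list:
        return [
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["answer"]},
        ]
    return pd.DataFrame({"messages": df.apply(to_messages, axis=1)})
```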
If sample data was collected in Step 5, test the function against the sample records:

Write the sample records to a temporary file (e.g., /tmp/test_input.jsonl), then run:

python3 -c "import sys; sys.path.insert(0, '<project-dir>/scripts'); from transform_fn import transform_dataset; import pandas as pd; df = pd.read_json('/tmp/test_input.jsonl', lines=True); result = transform_dataset(df); print(result.to_json(orient='records', lines=True))"

If no sample data was collected, present the function for review and refinement.
⏸ Wait for user.
Before writing the notebook, read:
references/notebook_structure.md (cell order, placeholders, and content)
references/notebook_writing_guide.md (Jupyter notebook JSON formatting)

Generate the execution logic as code cells in the notebook.
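For orientation, a Jupyter notebook on disk is JSON in the nbformat 4 schema; a minimal sketch of that structure (the cell contents here are placeholders):

```python
import json

# Minimal nbformat-4 notebook: one markdown cell and one code cell.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Dataset Transformation"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('placeholder')"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}
notebook_json = json.dumps(notebook, indent=1)
```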
Add a %%writefile <project-dir>/scripts/<script_name>.py code cell to the notebook AND write the file to disk for testing. The execution script should import transform_dataset from transform_fn. Read the reference guide at references/dataset_transformation_code.md and follow its execution script skeleton exactly.
If sample data was collected in Step 5, test the full pipeline:

Write the sample records to a temporary file (e.g., /tmp/test_input.jsonl), then run:

python3 <project-dir>/scripts/<script_name>.py --input /tmp/test_input.jsonl --output /tmp/test_output.jsonl

If no sample data was collected, present the notebook for review and refinement.
⏸ Wait for user.
Check the size of the input dataset:
For a local file, check its size on disk. For an S3 dataset, call head-object (S3 service) with the bucket and key to get ContentLength.

Decision criteria: under 50 MB, recommend local execution; 50 MB or over, recommend a SageMaker Processing Job.
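A sketch of the size check; the availability of boto3 for the S3 branch is an assumption about the environment:

```python
import os

def dataset_size_mb(path: str) -> float:
    """Return dataset size in MB for a local file or an s3:// URI."""
    if path.startswith("s3://"):
        import boto3  # assumed available in the environment
        bucket, key = path[5:].split("/", 1)
        size = boto3.client("s3").head_object(Bucket=bucket, Key=key)["ContentLength"]
    else:
        size = os.path.getsize(path)
    return size / (1024 * 1024)
```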
Inform the user of the recommendation and get their approval:
If local:
"Your dataset is {size} MB — since it's under 50 MB, I'd recommend running the transformation locally. Would you like to proceed with local execution, or would you prefer a SageMaker Processing Job instead?"
If SageMaker Processing Job:
"Your dataset is {size} MB — since it's over 50 MB, I'd recommend running this as a SageMaker Processing Job for better performance. Would you like to proceed with a SageMaker Processing Job, or would you prefer to run it locally instead?"
Do not execute until the user approves. If the user rejects the recommendation, switch to the alternative and get their explicit approval before proceeding.
⏸ Wait for user.
After user confirms, add an execution cell to the notebook. Do NOT run the full transformation — only generate the cell for the user to execute themselves:
If local execution:
Use the .py files already on disk (written by the agent during Steps 7–8): import transform_dataset from transform_fn, load the dataset, transform, and save the output. Scripts are located in <project-dir>/scripts/.

If SageMaker Processing Job:

Use processor.run(wait=True, logs=True) to block the cell and stream logs until the job completes. See scripts/transformation_tools.py for reference implementation details.

Important: The agent must NOT execute the full dataset transformation itself. The notebook cells are generated for the user to review and run. Only sample data (from Steps 7–8) should be transformed by the agent for validation purposes.
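For the local branch, the generated execution cell might reduce to something like the sketch below; run_local and the paths are illustrative, and the real cell would import transform_dataset from transform_fn rather than take it as an argument:

```python
import pandas as pd

def run_local(transform, input_path: str, output_path: str) -> None:
    """Load a JSONL dataset, apply the transform, and save the result as JSONL."""
    df = pd.read_json(input_path, lines=True)
    result = transform(df)
    result.to_json(output_path, orient="records", lines=True)
```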
"I've added the execution cell to the notebook. You can run it to transform the full dataset. Would you like to review the notebook before running it?"
⏸ Wait for user.
For this step, you need: to verify the output looks correct and confirm with the user.
⏸ Wait for user to confirm.