From guixu
Discovers, evaluates, and acquires datasets for AI model training/fine-tuning from Kaggle, HuggingFace, IPFS, arXiv, DBLP. Assesses quality, licensing, provenance; downloads free/paid data.
npx claudepluginhub guixu-project/guixu
How this skill is triggered (by the user, by Claude, or both):
Slash command
/guixu:guixu
Guixu provides dataset discovery, valuation, and acquisition for AI agents. It searches across Kaggle, HuggingFace, IPFS, arXiv, DBLP, and other sources.
✅ USE this skill when:
❌ DON'T use this skill when:
Always start with intent_parse to structure the request:
intent_parse(query="find me a cat image dataset for classification", task_type="classification")
This returns a structured QueryProfile with:
- task_type: classification, detection, segmentation, etc.
- keywords: dataset content terms
- target_entity: main subject
- data_standard: sample_unit, budget, schema expectations
Use dataset_search with keywords from intent_parse:
dataset_search(query="cat image", task_type="classification", limit=10)
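To make the hand-off from intent_parse to dataset_search concrete, here is a minimal Python sketch. The QueryProfile class below is a hypothetical mirror of the four fields listed above; the field types and the to_search_args helper are assumptions for illustration, not part of Guixu itself.

```python
from dataclasses import dataclass, field


@dataclass
class QueryProfile:
    """Hypothetical stand-in for the structure intent_parse is described as returning."""
    task_type: str
    keywords: list[str]
    target_entity: str
    data_standard: dict = field(default_factory=dict)  # assumed to be a plain mapping


def to_search_args(profile: QueryProfile, limit: int = 10) -> dict:
    """Build the keyword arguments for a dataset_search call from a parsed profile."""
    return {
        "query": " ".join(profile.keywords),
        "task_type": profile.task_type,
        "limit": limit,
    }


profile = QueryProfile(
    task_type="classification",
    keywords=["cat", "image"],
    target_entity="cat",
    data_standard={"sample_unit": "image"},
)
search_args = to_search_args(profile)
print(search_args)
```

With these inputs the resulting arguments match the dataset_search call shown above.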
Supported sources (leave empty to search all):
- kaggle, huggingface, ipfs, bittorrent
- arxiv, dblp, semanticscholar
- defillama, rwa_xyz, guixu-hub
- pansearch (cloud drives)
Filter options:
- filters.max_price: maximum price in USD
- filters.free_only: only free datasets
- filters.license: specific license (e.g., "CC-BY-4.0")
For each promising candidate, call dataset_evaluate:
dataset_evaluate(cid="kaggle:owner/dataset-name", task_description="cat image classification", task_type="classification", required_columns=["image_path", "label"])
This returns:
- tcv_score: -100 (harmful) to +100 (highly valuable)
- schema_fit: compatibility with required columns
- community_signal: reviews and ratings
Once a dataset is selected:
dataset_download(cid="kaggle:owner/dataset-name")
# or
dataset_download(cid="hf:owner/dataset-name")
Free sources: uci:, openml:, zenodo:, figshare:, hf: (public), ipfs:, guixu-hub:
Requires login: kaggle:
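The prefix lists above can be checked before attempting a download. The following is a rough sketch only: access_requirement is a hypothetical helper, and treating every hf: CID as free ignores the "(public)" caveat, since a CID string alone does not say whether a HuggingFace dataset is public.

```python
# Prefixes copied from the lists above (without the trailing colon).
FREE_PREFIXES = {"uci", "openml", "zenodo", "figshare", "hf", "ipfs", "guixu-hub"}
LOGIN_PREFIXES = {"kaggle"}


def access_requirement(cid: str) -> str:
    """Return 'free', 'login', or 'unknown' based on the CID's source prefix."""
    prefix, sep, _ = cid.partition(":")
    if not sep:
        raise ValueError(f"CID has no source prefix: {cid!r}")
    if prefix in FREE_PREFIXES:
        return "free"  # hf: is only free for public datasets (not checked here)
    if prefix in LOGIN_PREFIXES:
        return "login"
    return "unknown"


for cid in ("kaggle:owner/dataset-name", "hf:owner/dataset-name"):
    print(cid, "->", access_requirement(cid))
```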
1. intent_parse(query="find me a dog vs cat classification dataset", task_type="classification")
2. dataset_search(query="cat dog classification", task_type="classification", limit=10)
3. dataset_evaluate(cid="kaggle:username/dataset", task_description="dog cat binary classification", task_type="classification", required_columns=["image_path", "label"])
4. dataset_download(cid="kaggle:username/dataset")
1. intent_parse(query="find helmet detection dataset with bounding boxes", task_type="detection")
2. dataset_search(query="helmet detection bounding box", task_type="detection", limit=10)
3. For each candidate: dataset_evaluate(cid, task_description="helmet detection", task_type="detection", required_columns=["image_path", "bbox"])
4. Download best candidate
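Step 4 leaves open how the "best" candidate is chosen. One possible approach, sketched below, is to drop candidates whose tcv_score is not positive and take the highest score, breaking ties on schema_fit. The dictionary shape, the numeric schema_fit, the threshold, and the sample values are all assumptions made up for illustration; the real dataset_evaluate output may look different.

```python
from typing import Optional


def pick_best(evaluations: list[dict], min_tcv: float = 0.0) -> Optional[dict]:
    """Pick the evaluation with the highest tcv_score above min_tcv.

    Each item is assumed to carry 'cid', 'tcv_score' (-100..+100) and a
    numeric 'schema_fit'; ties on tcv_score are broken by schema_fit.
    """
    viable = [e for e in evaluations if e["tcv_score"] > min_tcv]
    if not viable:
        return None
    return max(viable, key=lambda e: (e["tcv_score"], e["schema_fit"]))


# Placeholder CIDs and invented scores, for illustration only.
candidates = [
    {"cid": "kaggle:owner/helmets-a", "tcv_score": 62, "schema_fit": 0.9},
    {"cid": "hf:owner/helmets-b", "tcv_score": 62, "schema_fit": 0.5},
    {"cid": "kaggle:owner/helmets-c", "tcv_score": -40, "schema_fit": 1.0},
]
best = pick_best(candidates)
print(best["cid"] if best else "no viable candidate")
```

Returning None when nothing clears the threshold keeps the caller from downloading a dataset that was scored as harmful.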
1. intent_parse(query="find stock price dataset for time series forecasting", task_type="forecasting")
2. dataset_search(query="stock price time series", task_type="forecasting", filters={source: "defillama"})
3. dataset_evaluate(cid, task_description="stock price prediction", task_type="forecasting", required_columns=["timestamp", "price"])
4. dataset_download(cid)
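Step 2 above passes a filters argument. A small sketch of assembling such a dictionary from the documented filter options (max_price, free_only, license) plus the source key used in this workflow follows. build_filters is a hypothetical helper, and the assumption that unset options should simply be omitted is not confirmed by the skill description.

```python
from typing import Optional


def build_filters(
    max_price: Optional[float] = None,
    free_only: bool = False,
    license_name: Optional[str] = None,
    source: Optional[str] = None,
) -> dict:
    """Assemble a filters dict, leaving out any option that was not set."""
    if max_price is not None and max_price < 0:
        raise ValueError("max_price must be non-negative (USD)")
    filters: dict = {}
    if max_price is not None:
        filters["max_price"] = max_price
    if free_only:
        filters["free_only"] = True
    if license_name:
        filters["license"] = license_name
    if source:
        filters["source"] = source
    return filters


print(build_filters(source="defillama"))
print(build_filters(max_price=20, license_name="CC-BY-4.0"))
```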
- Use intent_parse first: it extracts task_type and keywords that improve search quality
- Set require_license_review: true in evaluation
- Use filters.max_price or budget to limit spending
- Try guixuhub, huggingface, ipfs before paid sources
If dataset_search returns no results:
If dataset_evaluate fails:
If dataset_download fails:
source:owner/dataset
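The source:owner/dataset pattern can be validated before calling dataset_download. The sketch below is one way to do that in Python; the allowed character sets in the regular expression are assumptions, and sources whose identifiers do not follow the owner/dataset shape (IPFS content hashes, for example) would need their own rule.

```python
import re
from typing import NamedTuple


class Cid(NamedTuple):
    source: str
    owner: str
    dataset: str


# Assumed grammar: lowercase source prefix, then owner/dataset with no spaces.
_CID_RE = re.compile(
    r"^(?P<source>[a-z0-9_-]+):(?P<owner>[^/\s:]+)/(?P<dataset>[^/\s]+)$"
)


def parse_cid(cid: str) -> Cid:
    """Split a 'source:owner/dataset' string into its three parts."""
    match = _CID_RE.match(cid.strip())
    if match is None:
        raise ValueError(f"expected source:owner/dataset, got {cid!r}")
    return Cid(match["source"], match["owner"], match["dataset"])


parsed = parse_cid("kaggle:owner/dataset-name")
print(parsed)
```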