From Claude-Data-Wrangler
Build a pipeline that takes the current working dataset, embeds the relevant text fields, and upserts the vectors into a configured vector database backend (Pinecone, Qdrant, Weaviate, Milvus, pgvector, ChromaDB). The pipeline handles embedding model selection, chunking for long text, metadata attachment, namespace/collection management, and idempotent upserts. Use when the user wants to make a dataset semantically searchable.
```shell
npx claudepluginhub danielrosehill/claude-code-plugins --plugin Claude-Data-Wrangler
```

This skill uses the workspace's default tool permissions.
End-to-end pipeline: dataset → embeddings → vector DB upsert.
- Use this skill after `database-guide` recommended a vector backend.
- Local embedding: sentence-transformers (e.g. `all-MiniLM-L6-v2`, `BAAI/bge-small-en-v1.5`) or a task-tuned model.
- API embedding: `text-embedding-3-small` / `-large`; Cohere; Voyage; Jina; Anthropic-compatible via wrappers.
- Chunked records carry `chunk_id` and `parent_row_id` so downstream search can re-aggregate.
- Saved backends live in `$CLAUDE_USER_DATA/Claude-Data-Wrangler/config.json` under `vector_profiles`.
- Point IDs are deterministic (`parent_row_id` + `chunk_id`) so re-runs are idempotent.

Example `$CLAUDE_USER_DATA/Claude-Data-Wrangler/config.json`:
```json
{
  "vector_profiles": {
    "pinecone-prod": {
      "backend": "pinecone",
      "index": "knowledge-base",
      "namespace": "documents",
      "api_key_ref": {"type": "op", "reference": "op://Private/Pinecone/api_key"}
    },
    "local-qdrant": {
      "backend": "qdrant",
      "url": "http://localhost:6333",
      "collection": "documents"
    }
  },
  "embedding_defaults": {
    "model": "BAAI/bge-small-en-v1.5",
    "dimension": 384,
    "metric": "cosine"
  }
}
```
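A minimal sketch of how a pipeline might resolve a named profile against `embedding_defaults` before connecting — the helper name and the merge rule (profile keys override defaults) are assumptions for illustration, not documented behavior of the skill:

```python
def resolve_profile(config: dict, name: str) -> dict:
    """Merge a named vector profile with the global embedding defaults.

    Profile-level keys win over embedding_defaults. The schema follows
    the example config above (assumed, not a documented contract).
    """
    profile = config["vector_profiles"][name]
    return {**config.get("embedding_defaults", {}), **profile}


# Example using the local-qdrant profile from the config above, inlined:
config = {
    "vector_profiles": {
        "local-qdrant": {
            "backend": "qdrant",
            "url": "http://localhost:6333",
            "collection": "documents",
        }
    },
    "embedding_defaults": {
        "model": "BAAI/bge-small-en-v1.5",
        "dimension": 384,
        "metric": "cosine",
    },
}

profile = resolve_profile(config, "local-qdrant")
```

With this merge rule, a profile only needs to state what differs from the defaults (backend and location), while model, dimension, and metric come along automatically.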
```shell
pip install pandas sentence-transformers  # local embedding

# per backend
pip install pinecone-client
pip install qdrant-client
pip install weaviate-client
pip install pymilvus
pip install "psycopg[binary]"  # for pgvector
pip install chromadb
```
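The chunking and idempotency points above can be sketched as follows — the chunk size, overlap, UUID namespace, and function names are illustrative assumptions, not the skill's actual implementation:

```python
import uuid


def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunker with character overlap (illustrative only)."""
    chunks, start = [], 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks


def point_id(parent_row_id: str, chunk_id: int) -> str:
    """Deterministic ID derived from parent_row_id + chunk_id, so re-running
    the pipeline overwrites the same points instead of duplicating them."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{parent_row_id}:{chunk_id}"))


# Build upsert-ready records: each chunk keeps parent_row_id and chunk_id
# in its metadata so downstream search can re-aggregate per source row.
records = [
    {
        "id": point_id("row-42", i),
        "text": chunk,
        "metadata": {"parent_row_id": "row-42", "chunk_id": i},
    }
    for i, chunk in enumerate(chunk_text("some long document text " * 50))
]
```

Because `point_id` is a pure function of the source row and chunk index, a second run produces identical IDs and the backend's upsert semantics replace the old vectors rather than appending duplicates.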
- Record the embedding model in a version field in metadata, so vectors produced by different models aren't mixed.
- For non-English data, use a multilingual model (e.g. `paraphrase-multilingual-mpnet-base-v2`, `multilingual-e5-large`); don't silently embed non-English text with an English-only model.
- Run `pii-flag` first; embeddings can leak training-adjacent info and are hard to redact after the fact.
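One way to carry the versioning and PII notes into each vector's payload is a small metadata builder — all field names here are illustrative assumptions, not fields the skill prescribes:

```python
def build_metadata(parent_row_id: str, chunk_id: int,
                   model: str, pii_checked: bool) -> dict:
    """Metadata payload attached to each vector.

    The version field records which embedding model produced the vector,
    so stale points can be filtered out or re-embedded after a model
    change; pii_checked marks that a PII scan ran before embedding.
    """
    return {
        "parent_row_id": parent_row_id,
        "chunk_id": chunk_id,
        "version": model,            # embedding model identifier
        "pii_checked": pii_checked,  # set after a pii-flag pass
    }


meta = build_metadata("row-7", 0, "BAAI/bge-small-en-v1.5", True)
```

At query time, filtering on `version` lets old and new embeddings coexist in one collection during a model migration.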