Effective dataset curation requires combining multiple search and filter tools in a specific sequence. These recipes provide step-by-step workflows for common platform outcomes, helping you achieve high-quality results efficiently.
Recipe 1: Finding Diverse Examples of a Specific Pattern
Objective: Isolate a diverse set of examples for a specific visual pattern, ensuring you capture rare variations without filling the dataset with repetitive images.
Example scenarios: Manufacturing, Medical Imaging, Retail/Insurance, Defense & Intelligence
Broad Search
- Query: Describe the pattern in natural language (e.g., “surface damage,” “cracked glass,” “skin lesion”)
- Result: This returns a broad set of candidates, likely including some irrelevant images (false positives).
Visual Refinement
- Find a clear, high-quality example of the specific pattern you want in the search results.
- Crop the image to isolate just the pattern (excluding background or irrelevant context).
- Run the visual search to find visually similar patterns.
Ensure Diversity
- Set the Uniqueness Threshold to High.
- Why: This hides repetitive examples of the same common pattern, surfacing visually distinct variations and edge cases.
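The effect of a uniqueness threshold can be illustrated with a minimal sketch. This is not the platform's implementation; it assumes each image has an embedding vector and that a stricter uniqueness setting corresponds to a lower allowed similarity between kept items. The names `select_unique` and `max_similarity` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_unique(embeddings, max_similarity):
    """Greedily keep items that are less similar than max_similarity
    to every item already kept (hypothetical helper, not a platform API)."""
    kept = []
    for idx, vec in enumerate(embeddings):
        if all(cosine_similarity(vec, embeddings[k]) < max_similarity for k in kept):
            kept.append(idx)
    return kept

# Toy 2-D embeddings: items 0 and 1 are near-duplicates, item 2 is distinct.
embeddings = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]

# Assumption: a "High" uniqueness setting maps to a lower max_similarity.
unique_indices = select_unique(embeddings, max_similarity=0.9)
print(unique_indices)
```

With a stricter setting the near-duplicate is hidden while the visually distinct item stays, which is the behavior the Ensure Diversity step relies on.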
Recipe 2: Cleaning Raw Data for Labeling
Objective: Rapidly prepare a messy, raw dataset for labeling by removing low-quality data that would waste annotator time and budget.
Example scenarios: Manufacturing, Medical Imaging, Retail, Defense & Intelligence, Research
Remove Technical Failures
- Set Blurry to IS NOT.
- Set Dark to IS NOT.
- Result: Removes unreadable or low-information images immediately.
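To make the idea of blur and darkness filters concrete, here is a rough sketch using two common heuristics: mean pixel brightness and the variance of a Laplacian response. How the platform actually computes its Blurry and Dark flags is not stated here, so the functions and thresholds below are illustrative assumptions only.

```python
from statistics import mean, pvariance

def mean_brightness(gray):
    """Average pixel value of a grayscale image given as rows of 0-255 ints."""
    return mean(p for row in gray for p in row)

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over interior pixels.
    Low values suggest little edge detail, a common blur heuristic."""
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)

def passes_quality(gray, min_brightness=40, min_sharpness=100):
    """True if the image is neither dark nor blurry (thresholds are assumed)."""
    return mean_brightness(gray) >= min_brightness and laplacian_variance(gray) >= min_sharpness

# Toy 4x4 images: a high-contrast checkerboard, a flat gray patch, a dim checkerboard.
sharp = [[255 if (x + y) % 2 else 0 for x in range(4)] for y in range(4)]
flat = [[128] * 4 for _ in range(4)]
dark = [[20 if (x + y) % 2 else 0 for x in range(4)] for y in range(4)]

for name, image in [("sharp", sharp), ("flat", flat), ("dark", dark)]:
    print(name, passes_quality(image))
```

Only the high-contrast image passes: the flat patch has no edge detail and the dim one falls below the brightness cutoff.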
Remove Annotation Errors
- Set Mislabels to IS NOT.
- Result: Excludes images where existing metadata likely conflicts with visual content, preventing bad ground truth from entering the pipeline.
Reduce Redundancy
- Set the Uniqueness Threshold to Medium.
- Result: If the ingest contains burst-mode photos or video sequences, this keeps only representative frames, significantly reducing the total count sent to labeling.
Recipe 3: Balancing Common Scenarios with Rare Edge Cases
Objective: Curate a dataset that captures both typical scenarios and rare edge cases while managing storage volumes efficiently.
Example scenarios: Autonomous Vehicles, Manufacturing, Medical Imaging, Defense & Intelligence, Retail
Reduce Storage Costs
- Action: Review duplicate clusters from video sequences or burst captures.
- Select: Keep one representative frame per scenario.
- Result: Often reduces dataset size by 30-40% without losing scenario coverage.
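As a sketch of what keeping one representative frame per scenario means for dataset size, the snippet below groups frames by a duplicate cluster and keeps one frame from each. The `cluster` and `sharpness` fields, and the choice to keep the sharpest frame, are assumptions made for illustration; the numbers come from toy data, not from the platform.

```python
def keep_representatives(frames):
    """Keep one frame per duplicate cluster, preferring the highest sharpness
    score (the selection criterion is an assumption)."""
    best = {}
    for frame in frames:
        current = best.get(frame["cluster"])
        if current is None or frame["sharpness"] > current["sharpness"]:
            best[frame["cluster"]] = frame
    return list(best.values())

# Hypothetical frames from two burst captures.
frames = [
    {"id": "a1", "cluster": "intersection", "sharpness": 0.7},
    {"id": "a2", "cluster": "intersection", "sharpness": 0.9},
    {"id": "a3", "cluster": "intersection", "sharpness": 0.8},
    {"id": "b1", "cluster": "tunnel", "sharpness": 0.6},
    {"id": "b2", "cluster": "tunnel", "sharpness": 0.5},
]

kept = keep_representatives(frames)
reduction = 1 - len(kept) / len(frames)
print([f["id"] for f in kept], f"{reduction:.0%} smaller")
```

Every scenario is still represented after the reduction, which is why coverage is preserved even as the frame count drops.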
Surface Rare Cases
- Action: Sort by high-confidence outliers.
- Result: Surfaces rare variations that are critical for model robustness but easy to miss in manual review.
Categorize Challenging Conditions
- Filter: Dark and Bright.
- Action: Instead of deleting these, tag them with a descriptive name (e.g., “Challenging Lighting,” “Low Visibility”).
- Result: Creates specific subsets for testing model performance in adverse conditions.
Recipe 4: Managing Large Visual Catalogs
Objective: Consolidate duplicate assets, enforce quality standards, and organize large collections of visual content.
Example scenarios: E-commerce, Real Estate, Manufacturing, Media/Creative, Digital Asset Management
Consolidate Duplicate Assets
- Scenario: Multiple sources upload the same or nearly identical images.
- Action: Identify duplicate groups and link them to a single master asset.
- Result: Prevents search results from being flooded with identical or near-identical images.
Enforce Quality Standards
- Filter: Blurry OR Dark OR Bright.
- Action: Flag these images for review, replacement, or auto-rejection.
- Result: Ensures only professional-quality images remain in the catalog.
Organize Unlabeled Content
- Filter: Labels IS Unlabeled.
- Action: Isolate unlabeled content and use Semantic Search to bulk-select and categorize items (e.g., “red sneakers,” “two-bedroom apartments,” “hydraulic fittings”).
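The two filter expressions in this recipe (an OR across quality issues, and an "is unlabeled" check) behave like simple predicates over image records. The sketch below assumes a hypothetical record shape with an `issues` set and a `labels` list; it is not an export format or API from the platform.

```python
# Issue names mirror the filters in this recipe; the record layout is assumed.
QUALITY_ISSUES = {"blurry", "dark", "bright"}

def needs_quality_review(image):
    """Equivalent of 'Blurry OR Dark OR Bright': any one issue is enough."""
    return bool(QUALITY_ISSUES & image["issues"])

def is_unlabeled(image):
    """Equivalent of 'Labels IS Unlabeled': the image has no labels at all."""
    return not image["labels"]

catalog = [
    {"id": "img-1", "issues": {"blurry"}, "labels": ["sneakers"]},
    {"id": "img-2", "issues": set(), "labels": []},
    {"id": "img-3", "issues": {"dark", "bright"}, "labels": []},
    {"id": "img-4", "issues": set(), "labels": ["apartment"]},
]

review_queue = [img["id"] for img in catalog if needs_quality_review(img)]
unlabeled = [img["id"] for img in catalog if is_unlabeled(img)]
print(review_queue, unlabeled)
```

Note that an image can land in both groups, so it is worth deciding whether quality review or categorization should happen first for such items.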
Recipe 5: Identifying Annotation Inconsistencies
Objective: Find and fix labeling errors or inconsistencies across your dataset to improve model training quality.
Example scenarios: Manufacturing, Medical Imaging, Retail, Defense & Intelligence, Autonomous Vehicles
Find Visual-Label Mismatches
- Action: Sort by high-confidence mislabels.
- Result: Surfaces images where the visual content doesn’t align with the assigned label.
Review Class Outliers
- Action: Review images flagged as outliers within their assigned class.
- Result: Finds images that are technically correct but visually anomalous for that category (e.g., drawings in a photo dataset).
Validate with Visual Search
- Action: Run Visual Search on a suspected mislabel to see which other images visually match it.
- Result: If all visual matches have a different label, this confirms a likely mislabel.
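The validation logic in the last step can be sketched as a nearest-neighbor check: if every visually closest image carries a different label than the item in question, the item is a likely mislabel. The example assumes embedding vectors stand in for visual similarity; `likely_mislabel` and the choice of `k` are hypothetical, not platform features.

```python
import math

def nearest_neighbors(query_idx, embeddings, k):
    """Indices of the k closest items by Euclidean distance, excluding the query."""
    distances = [
        (math.dist(embeddings[query_idx], vec), idx)
        for idx, vec in enumerate(embeddings)
        if idx != query_idx
    ]
    return [idx for _, idx in sorted(distances)[:k]]

def likely_mislabel(query_idx, embeddings, labels, k=3):
    """True when all k visual matches have a label different from the query's."""
    neighbors = nearest_neighbors(query_idx, embeddings, k)
    return all(labels[i] != labels[query_idx] for i in neighbors)

# Toy data: item 3 is labeled "dog" but sits in the middle of the "cat" group.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.05, 0.05), (5.0, 5.0), (5.1, 5.0)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

suspects = [i for i in range(len(labels)) if likely_mislabel(i, embeddings, labels)]
print(suspects)
```

Requiring all neighbors to disagree is a conservative rule; a majority vote would flag more candidates at the cost of more false alarms.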
Recipe 6: Creating Balanced Training Sets
Objective: Build a training dataset with appropriate class distribution and representation across important variations.
Example scenarios: Manufacturing, Medical Imaging, Retail, Defense & Intelligence, Autonomous Vehicles
Assess Current Distribution
- Action: Review the distribution of images across classes.
- Result: Identify overrepresented and underrepresented categories.
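If labels are available outside the platform (for example, in an exported list), the distribution check can be approximated in a few lines. The cutoffs for "overrepresented" and "underrepresented" below are arbitrary assumptions; choose values that suit your task.

```python
from collections import Counter

def class_distribution(labels):
    """Share of the dataset held by each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def flag_imbalance(labels, low=0.1, high=0.5):
    """Return (overrepresented, underrepresented) class names.
    The low/high share cutoffs are illustrative defaults."""
    shares = class_distribution(labels)
    over = sorted(label for label, share in shares.items() if share > high)
    under = sorted(label for label, share in shares.items() if share < low)
    return over, under

# Toy label list for a defect-detection dataset.
labels = ["scratch"] * 70 + ["dent"] * 25 + ["corrosion"] * 5

over, under = flag_imbalance(labels)
print(class_distribution(labels))
print("over:", over, "under:", under)
```

The overrepresented classes are candidates for the reduction step, and the underrepresented ones for the augmentation step that follows.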
Reduce Overrepresented Classes
- Set the Uniqueness Threshold to High to keep only the most distinctive examples.
- Result: Reduces redundancy while preserving diversity within that class.
Augment Underrepresented Classes
- Query: Describe the underrepresented category in detail.
- Review results and tag valid examples to expand that class.
Validate Diversity
- Use Visual Search from different cluster centers to ensure visual variety.
- Apply Select Uniques to prevent any single visual pattern from dominating.
Additional Tips for Recipe Success
Combine Filters Strategically
Most recipes work best when you apply filters in a specific order:
- Start broad with semantic or visual search to establish scope.
- Remove obvious problems with quality filters early.
- Refine for diversity with uniqueness and outlier filters.
- Final polish with duplicate detection and targeted tagging.
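The ordering above can be thought of as a pipeline of stages, each narrowing the previous result. The sketch below models stages as named predicates and records the image count after each one, which makes it easy to spot a stage that removes too much. The record fields and stage names are hypothetical.

```python
def run_pipeline(images, stages):
    """Apply (name, predicate) stages in order and record the count after each."""
    history = [("start", len(images))]
    for name, keep in stages:
        images = [img for img in images if keep(img)]
        history.append((name, len(images)))
    return images, history

# Hypothetical records with precomputed flags.
images = [
    {"id": 1, "matches_query": True, "blurry": False, "duplicate": False},
    {"id": 2, "matches_query": True, "blurry": True, "duplicate": False},
    {"id": 3, "matches_query": True, "blurry": False, "duplicate": True},
    {"id": 4, "matches_query": False, "blurry": False, "duplicate": False},
]

stages = [
    ("scope", lambda img: img["matches_query"]),     # start broad
    ("quality", lambda img: not img["blurry"]),      # remove obvious problems
    ("uniqueness", lambda img: not img["duplicate"]),  # refine for diversity
]

result, history = run_pipeline(images, stages)
print([img["id"] for img in result])
print(history)
```

Running cheap, high-yield filters early keeps later and more subjective steps focused on a smaller, cleaner candidate set.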
Save Intermediate Steps
Save views at each major step in your recipe:
- Enables you to backtrack if a filter removes too much.
- Creates an audit trail for dataset curation decisions.
- Allows different team members to review at different stages.
Iterate and Adjust
These recipes are starting points, not rigid procedures:
- Adjust thresholds based on your dataset characteristics.
- Add custom metadata filters for domain-specific criteria.
- Combine multiple recipes for complex curation workflows.