OpenAI

CLIP ViT-L/14

Pioneering multimodal embeddings for vision and text.

Cross-modal understanding · Image search · Zero-shot image classification
Today's score
87.0

Best for / Not great for

Best for
  • Image and video search
  • Multimodal RAG
  • Content moderation
  • Visual question answering
Not great for
  • Purely text-based tasks
  • Very high-dimensional text-only data
  • Low-resource environments

Why it ranks here

Although CLIP is not a text-only model, its ability to embed images and text in a shared space makes it well suited to multimodal search and RAG. Its strong performance on cross-modal tasks keeps it relevant for content understanding.
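In practice, linking images and text for search comes down to comparing embedding vectors. The following is a minimal sketch of that ranking step only, not CLIP itself: the three-dimensional vectors are hypothetical stand-ins for real CLIP embeddings, and the file names and the `rank_images` helper are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_vec, image_vecs):
    """Return image ids sorted from most to least similar to the text query."""
    scored = [(cosine(text_vec, vec), image_id) for image_id, vec in image_vecs.items()]
    scored.sort(reverse=True)
    return [image_id for _, image_id in scored]

# Hypothetical toy vectors; a real pipeline would obtain these from the
# model's image encoder and text encoder instead.
image_vecs = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the text "a photo of a dog"

ranking = rank_images(query_vec, image_vecs)
print(ranking)
```

Zero-shot classification follows the same pattern with the roles swapped: one image embedding is compared against the embeddings of several candidate text labels, and the closest label is chosen.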

Score breakdown

Search trends: 88
Benchmarks: 87
Developer buzz: 86
News mentions: 87

Pricing

API: $0.00 in · $0.00 out per 1M tokens · Consumer: $0.00/mo

Pricing plans

Popular
Open Source
Use and adapt the powerful multimodal model.
Free
  • Pre-trained model weights
  • Supports image and text
  • Customizable
  • Extensive research applications
API Access (via third-party)
Convenient API for multimodal tasks.
$0/usage
  • Managed inference
  • Pay-per-use
  • Easy integration
  • Supports image/text search
Snapshot 2026-05-23