Google Research

CoCa (Contrastive Captioners)

Efficiently combines image understanding and language generation.

  • Efficient joint training of vision and language
  • Strong performance on captioning and VQA
  • Scalable architecture
  • Foundation for multimodal models

Best for / Not great for

Best for
  • Image captioning
  • Visual Question Answering (VQA)
  • General image-text understanding tasks
  • Research in efficient multimodal learning
Not great for
  • Audio/video generation
  • Real-time interactive applications
  • Creative text generation beyond descriptions

Why it ranks here

CoCa was a significant step in efficient multimodal training: a single image-text encoder-decoder is trained jointly with a contrastive loss and a captioning loss, so one model covers both image-text understanding and caption generation. It is not a standalone product, but its architecture and benchmark results continue to influence larger, more capable multimodal systems, particularly in research settings.
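
To make the joint training idea concrete, here is a minimal sketch of a combined objective: a symmetric contrastive loss over a batch of image-text similarity scores plus a captioning (next-token) loss, added with two weights. This is an illustration under stated assumptions, not the released training code: the function names, the plain-list inputs, and the default weights and temperature are all hypothetical placeholders.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(sim, temperature=1.0):
    """Symmetric contrastive loss over a batch.

    sim[i][j] is the similarity of image i and text j; matched pairs
    sit on the diagonal. The temperature default is a placeholder.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-image: same idea over the columns.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return i2t + t2i

def captioning_loss(step_logits, target_ids):
    """Negative log-likelihood of the target caption tokens, summed over steps."""
    return -sum(log_softmax(logits)[t]
                for logits, t in zip(step_logits, target_ids))

def joint_loss(sim, step_logits, target_ids, w_con=1.0, w_cap=1.0):
    """Weighted sum of both losses; the weights here are assumed, not sourced."""
    return (w_con * contrastive_loss(sim)
            + w_cap * captioning_loss(step_logits, target_ids))
```

For example, `joint_loss([[5.0, 0.0], [0.0, 5.0]], [[2.0, 0.0]], [0])` scores a two-pair batch whose matched pairs are already the most similar, together with a one-token caption. A real implementation would compute the similarities and token logits from the image encoder and text decoder rather than take them as inputs.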

Score breakdown

Search trends: 84
Benchmarks: 86
Developer buzz: 85
News mentions: 84

Pricing

API: $0.00 in · $0.00 out per 1M tokens · Consumer: $0.00/mo

Pricing plans

Research Code
Open-source implementation for research.
Free
  • Model architecture
  • Training code
  • Dataset integration
  • Academic focus
Snapshot 2026-05-25