Google Research

CoCa (Contrastive Captioners)

Name: CoCa (Contrastive Captioners)
Brand: Google Research
Rating: 8.5 (1 review)

Efficiently combines image understanding and language generation.

Efficient joint training of vision and language
Strong performance on captioning and VQA
Scalable architecture
Foundation for multimodal models

Today's score

85.0

Where it ranks today

Multimodal

#10

Best for / Not great for

Best for

Image captioning
Visual Question Answering (VQA)
General image-text understanding tasks
Research in efficient multimodal learning

Not great for

Audio/video generation
Real-time interactive applications
Creative text generation beyond descriptions

Why it ranks here

CoCa marked a significant step in efficient multimodal training: a single model is trained with both a contrastive image-text objective and a captioning objective. Although it is not a standalone product, its architecture and benchmark results continue to influence larger, more capable multimodal systems, particularly in research settings.

Score breakdown

Search trends: 84

Benchmarks: 86

Developer buzz: 85

News mentions: 84

Pricing

API: $0.00 in · $0.00 out per 1M tokens · Consumer: $0.00/mo

Pricing plans

Popular

Research Code

Open-source implementation for research.

Free

Model architecture
Training code
Dataset integration
Academic focus

Snapshot 2026-05-25