Exposing max image / audio limits? #84

Open
domenic opened this issue Mar 13, 2025 · 0 comments
Labels: enhancement (New feature or request)
domenic commented Mar 13, 2025

With the recent addition of multimodal support, we've noticed that Chrome's model has some limits. We're wondering if, or how, we should expose these through the API.

Our current model resamples audio so that there's a fixed ratio of tokens per second. It also resizes images to a consistent size, so that one image always takes up a specific number of tokens. (From my surface knowledge of other APIs, this is a common strategy.)

This means that, for example, our current model's context window is only big enough to accept 10 images. It doesn't matter how big those images are originally: 1x1 transparent PNGs are treated the same as full-HD photos, as they both get resized and then treated as the same number of tokens. Similarly, silent audio is treated the same as active conversation, and we can accept around 30 seconds of it before exceeding the input quota.

Developers can use the measureInputUsage() and inputQuota APIs to gain some insight into this process. But it's less direct than it could be. Should we consider exposing the maximum number of images supported, or the maximum length of audio supported, more directly?
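For concreteness, here's a rough sketch of the indirect approach available today. It assumes the Prompt API shapes from the explainer (LanguageModel.create(), inputQuota, inputUsage, measureInputUsage()); the exact message/content shape and the imageBlob variable are assumptions for illustration.

```js
// Sketch of the indirect approach: measure one image's token cost and
// divide it into the remaining quota. Assumes explainer-style API shapes.
const session = await LanguageModel.create({
  expectedInputs: [{ type: "image" }],
});

// How many tokens does a single image cost once resized/encoded?
const perImageUsage = await session.measureInputUsage([
  { role: "user", content: [{ type: "image", value: imageBlob }] },
]);

// Estimate how many such images still fit in the context window.
const remaining = session.inputQuota - session.inputUsage;
const maxImages = Math.floor(remaining / perImageUsage);
console.log(`Roughly ${maxImages} more images would fit.`);
```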

Similarly, for images especially, knowing the "native resolution" of the model could be useful: if the model is always going to resize to 1000x1000, then a website might choose to capture input using a lower-fidelity video camera instead of an HD one, or similar.
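To make that proposal concrete, a hypothetical shape for such an API might look like the following. None of these property names (maxImageInputs, nativeImageSize, maxAudioSeconds) exist today; they are only illustrative, hung off the existing LanguageModel.params() for the sake of the sketch.

```js
// Purely hypothetical property names, shown only to make the idea concrete.
const { maxImageInputs, nativeImageSize, maxAudioSeconds } =
  await LanguageModel.params();

// A site could then request a lower-fidelity camera stream if the model
// will downscale to its native resolution anyway.
const stream = await navigator.mediaDevices.getUserMedia({
  video: { width: nativeImageSize?.width, height: nativeImageSize?.height },
});
```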

The argument against doing this is that it only makes sense for certain architectures. It's conceivable that other architectures might have different image resizing strategies, or might for example include a small model in front of the base model that filters out silent audio. Such architectures do not have as clear "max images" or "native resolution" or "max audio time". If we prematurely added such APIs, they might not be applicable to all models, and in the worst case might hinder future innovation.

So I wanted to open this issue to get a more concrete sense of whether these APIs would be worthwhile. In general in computer programming, it's better to try something and see if it fails than to test ahead of time. (See, e.g., the specific case discussed here.) So we'd need a clear idea of what real-world scenarios can only be built with this sort of ahead-of-time-testing API, to weigh the tradeoffs properly.
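The "try it and see" alternative would look roughly like this. It assumes an oversized prompt rejects with a "QuotaExceededError" DOMException; that error name, plus the imageBlobs, showResult(), and showTooManyImagesUI() helpers, are assumptions for illustration.

```js
// Attempt the multimodal prompt and handle failure, rather than testing
// limits ahead of time. The error name is an assumption in this sketch.
try {
  const result = await session.prompt([
    {
      role: "user",
      content: [
        { type: "text", value: "Describe these photos." },
        ...imageBlobs.map((value) => ({ type: "image", value })),
      ],
    },
  ]);
  showResult(result);
} catch (e) {
  if (e.name === "QuotaExceededError") {
    // Fall back, e.g. by sending fewer images or downscaling them first.
    showTooManyImagesUI();
  } else {
    throw e;
  }
}
```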

@domenic domenic added the enhancement New feature or request label Mar 28, 2025