With the recent addition of multimodal support, we've noticed that Chrome's model has some limits. We're wondering if, or how, we should expose these through the API.
Our current model resamples audio so that there's a fixed ratio of tokens per second. It also resizes images to a consistent size, so that one image always takes up a specific number of tokens. (From my surface knowledge of other APIs, this is a common strategy.)
This means that, for example, our current model's context window is only big enough to accept 10 images. It doesn't matter how big those images are originally: a 1x1 transparent PNG is treated the same as a full-HD photo, since both get resized to the same dimensions and thus cost the same number of tokens. Similarly, silent audio is treated the same as active conversation, and we can accept around 30 seconds of it before exceeding the input quota.
Developers can use the measureInputUsage() and inputQuota APIs to gain some insight into this process, but it's less direct than it could be. Should we consider exposing the maximum number of images, or the maximum length of audio, more directly?
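For concreteness, here's a rough sketch of how a developer can probe the limits with today's API. The prompt shape and the `expectedInputs` option follow the current explainer and may change; this is illustrative, not normative:

```js
// Sketch only; assumes the multimodal prompt shape from the current explainer.
const session = await LanguageModel.create({
  expectedInputs: [{ type: "image" }]
});

const imageBlob = await (await fetch("photo.jpg")).blob();
const prompt = [{
  role: "user",
  content: [
    { type: "text", value: "Describe this photo." },
    { type: "image", value: imageBlob }
  ]
}];

// Measure how many tokens the prompt would consume, then compare against the
// space remaining in the context window.
const usage = await session.measureInputUsage(prompt);
if (usage <= session.inputQuota - session.inputUsage) {
  const result = await session.prompt(prompt);
  console.log(result);
} else {
  // Over quota — but the developer can't easily tell *why* (too many images?
  // audio too long?) without more direct limits being exposed.
}
```

This works, but it forces developers to discover the limits empirically, one measurement at a time.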
Similarly, for images especially, knowing the "native resolution" of the model could be useful: if the model is always going to resize to 1000x1000, then a website might choose to capture input using a lower-fidelity video camera instead of an HD one, or similar.
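To make the question concrete, a purely hypothetical shape for such an API might look like the following. None of these names (`inputLimits()`, `maxImageCount`, `nativeImageSize`, `maxAudioSeconds`) exist in any spec; they are invented solely to illustrate the camera-capture use case:

```js
// Hypothetical API — inputLimits(), maxImageCount, nativeImageSize, and
// maxAudioSeconds are invented names for discussion, not part of any spec.
const limits = await LanguageModel.inputLimits();
// e.g. { maxImageCount: 10, nativeImageSize: { width: 1000, height: 1000 },
//        maxAudioSeconds: 30 }

if (limits.nativeImageSize) {
  const { width, height } = limits.nativeImageSize;
  // Capture at the model's native resolution instead of full HD, since
  // anything larger would be downscaled before tokenization anyway.
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { width: { ideal: width }, height: { ideal: height } }
  });
  // ... feed frames from `stream` into the session ...
}
```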
The argument against doing this is that it only makes sense for certain architectures. It's conceivable that other architectures might have different image-resizing strategies, or might, for example, include a small model in front of the base model that filters out silent audio. Such architectures would not have as clear a notion of "max images", "native resolution", or "max audio time". If we prematurely added such APIs, they might not be applicable to all models, and in the worst case might hinder future innovation.
So I wanted to open this issue to get a more concrete sense of whether these APIs would be worthwhile. In general, in programming, it's better to try something and handle the failure than to test ahead of time. (See, e.g., the specific case discussed here.) So we'd need a clear idea of which real-world scenarios can only be built with this sort of ahead-of-time-testing API, to weigh the tradeoffs properly.
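For comparison, the try-and-handle-failure approach looks something like this, reusing `session` and `prompt` from the sketch above. It assumes an oversized prompt rejects with a "QuotaExceededError"; the exact error surface here is an assumption, not something this issue guarantees:

```js
// Sketch of the try-it-and-see approach; assumes an oversized prompt rejects
// with a "QuotaExceededError" (exact error name/shape is an assumption).
try {
  const result = await session.prompt(prompt);
  render(result);           // render() is a hypothetical app helper
} catch (e) {
  if (e.name === "QuotaExceededError") {
    // Recover after the fact: e.g. drop an image or trim the audio and retry.
    askUserToTrimInput();   // hypothetical app helper
  } else {
    throw e;
  }
}
```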