Conversation

@Kevsnz (Contributor) commented Feb 3, 2024

Hi!

I've been experimenting with CodeLlama FIM over the last couple of days. What I discovered is that CodeLlama gives more robust results when the sentinel tokens in the prompt are surrounded by spaces. It's especially noticeable at the beginning of the file, when the 'system' part of the prefix dominates the prompt.

A typical failure is shown below:
[screenshot: broken infill before the fix]

After the fix the model fills in the code properly:
[screenshot: correct infill after the fix]

I also changed the stop tokens: I added <EOT> and <EOD> according to the paper, and removed the <PRE>, <SUF> and <MID> tokens since they stopped showing up in model responses after I fixed the prompt.
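To make the change concrete, here is a minimal sketch of the prompt shape described above; the helper name and exact spacing are my own illustration, not the extension's actual code:

```typescript
// Hypothetical sketch of the FIM prompt this PR moves to:
// sentinel tokens surrounded by spaces, <EOT>/<EOD> as stop tokens.
function buildFimPrompt(prefix: string, suffix: string): string {
  return `<PRE> ${prefix} <SUF> ${suffix} <MID>`;
}

// Stop generation on either end-of-sequence token; the sentinel tokens
// no longer need to be stop tokens once the prompt is fixed.
const stopTokens: string[] = ['<EOT>', '<EOD>'];
```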

@Kevsnz (Contributor, Author) commented Feb 4, 2024

So, after some more digging I think I've found the culprit.

I noticed that this happens only on Windows. Windows uses \r\n as the newline sequence. For some reason CodeLlama (or its tokenizer) fails to recognize the <SUF> token if it's followed by \r.

That's why adding a space after <SUF> stops those infilling failures. Removing the \r after the token also fixes the problem.

Experiments

Failure: <SUF>\r\n
[screenshot: failed infill]

OK: <SUF> \r\n (with a space after the token)
[screenshot: correct infill]

OK: <SUF>\n (funny thing is that the model put the missing \r into the generated completion 🤣):
[screenshot: correct infill without \r]

Another possible way to fix it could be to replace all occurrences of `\r\n` with `\n` in the prompt, put the `\r`s back into the model output, and do this only on Windows. I'm not sure how to check which OS is running, and/or if it's possible in VSCode at all.
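For what it's worth, a rough sketch of that alternative (assuming the extension runs under Node, where process.platform is available; the function names are made up for illustration):

```typescript
// Detect Windows; VSCode extensions run under Node, so process.platform works.
const isWindows = process.platform === 'win32';

// Strip carriage returns before building the prompt...
function normalizeNewlines(text: string): string {
  return isWindows ? text.replace(/\r\n/g, '\n') : text;
}

// ...and put them back into the model output afterwards.
function restoreNewlines(completion: string): string {
  return isWindows ? completion.replace(/\r?\n/g, '\r\n') : completion;
}
```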

@ex3ndr (Owner) commented Feb 4, 2024

Wow, that's crazy! Did they change the FIM tokens? Which paper are you referring to?

@Kevsnz (Contributor, Author) commented Feb 5, 2024

It doesn't look like those tokens changed.

In the Code Llama paper (part 2.3) they refer to Bavarian et al. (2022) (part 3), where the prompt format for fill-in-the-middle is described. The <EOT> token finishes the whole sequence there.

However, I can't find where I saw the <EOD> token; probably in the source files somewhere on Hugging Face. I haven't seen it in the model's output, so maybe it can be removed.
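In case it helps, a small sketch of how those stop tokens are meant to be applied (the helper is hypothetical; only <EOT> is documented in the paper, <EOD> is kept just in case):

```typescript
// Cut the generated middle at the first stop token that appears.
// Per Bavarian et al. (2022), <EOT> terminates the whole FIM sequence.
function trimAtStopTokens(output: string, stops: string[] = ['<EOT>', '<EOD>']): string {
  let cut = output.length;
  for (const stop of stops) {
    const i = output.indexOf(stop);
    if (i !== -1 && i < cut) cut = i;
  }
  return output.slice(0, cut);
}
```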

Also, today I encountered the same FIM failure on my Mac (using the original extension version), where there are no \rs in the prompts. Adding a space after the <SUF> token seems to have fixed the problem there as well.

@ex3ndr merged commit 1e2431a into ex3ndr:main on Feb 8, 2024
@ex3ndr (Owner) commented Feb 8, 2024

Perfect, thanks!

@Kevsnz deleted the prompt-fix branch on February 9, 2024