Conversation

@Kevsnz (Contributor) commented Feb 3, 2024

Hi!

I've been experimenting with CodeLlama FIM over the last couple of days. What I discovered is that CodeLlama gives more robust results when the sentinel tokens in the prompt are surrounded by spaces. It's especially noticeable at the beginning of the file, when the 'system' part of the prefix dominates the prompt.

A typical failure is shown below:
[screenshot: broken infill before the fix]

After the fix the model fills in the code properly:
[screenshot: correct infill after the fix]

I also changed the stop tokens: I added <EOT> and <EOD> according to the paper, and removed the <PRE>, <SUF> and <MID> tokens since they stopped showing up in model responses after I fixed the prompt.
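To make the change concrete, here is a minimal sketch of the prompt shape described above; the helper name and exact spacing are my own illustration, not the extension's actual code:

```typescript
// Hypothetical sketch of the FIM prompt this PR moves to:
// sentinel tokens surrounded by spaces, <EOT>/<EOD> as stop tokens.
function buildFimPrompt(prefix: string, suffix: string): string {
  return `<PRE> ${prefix} <SUF> ${suffix} <MID>`;
}

// Stop generation on either end-of-sequence token; the sentinel tokens
// no longer need to be stop tokens once the prompt is fixed.
const stopTokens: string[] = ['<EOT>', '<EOD>'];
```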

@Kevsnz (Contributor, Author) commented Feb 4, 2024

So, after some more digging I think I've found the culprit.

I noticed that this happens only on Windows. Windows uses \r\n as the newline sequence. For some reason CodeLlama (or its tokenizer) fails to recognize the <SUF> token if it's followed by \r.

That's why adding a space after <SUF> stops those infilling failures. Removing the \r after the token also fixes the problem.

Experiments

Failure: <SUF>\r\n
[screenshot: failed infill]

OK: <SUF> \r\n (with a space after the token)
[screenshot: correct infill]

OK: <SUF>\n (funny thing is that the model put the missing \r into the generated completion 🤣):
[screenshot: correct infill without \r]

Another possible way to fix it could be to replace all occurrences of `\r\n` with `\n` in the prompt, put the `\r`s back into the model output, and do this only on Windows. I'm not sure how to check which OS is running, and/or if it's possible in VSCode at all.
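For what it's worth, a rough sketch of that alternative (assuming the extension runs under Node, where process.platform is available; the function names are made up for illustration):

```typescript
// Detect Windows; VSCode extensions run under Node, so process.platform works.
const isWindows = process.platform === 'win32';

// Strip carriage returns before building the prompt...
function normalizeNewlines(text: string): string {
  return isWindows ? text.replace(/\r\n/g, '\n') : text;
}

// ...and put them back into the model output afterwards.
function restoreNewlines(completion: string): string {
  return isWindows ? completion.replace(/\r?\n/g, '\r\n') : completion;
}
```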

@ex3ndr (Owner) commented Feb 4, 2024

Wow, that's crazy! Did they change the FIM tokens? Which paper are you referring to?

@Kevsnz (Contributor, Author) commented Feb 5, 2024

It doesn't look like those tokens changed.

In the Code Llama paper (part 2.3) they refer to Bavarian et al. (2022) (part 3), where the prompt format for fill-in-the-middle is described. The <EOT> token finishes the whole sequence there.

However, I can't find where I saw the <EOD> token; probably in the source files somewhere on Hugging Face. I haven't seen it in the model's output, so maybe it can be removed.
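In case it helps, a small sketch of how those stop tokens are meant to be applied (the helper is hypothetical; only <EOT> is documented in the paper, <EOD> is kept just in case):

```typescript
// Cut the generated middle at the first stop token that appears.
// Per Bavarian et al. (2022), <EOT> terminates the whole FIM sequence.
function trimAtStopTokens(output: string, stops: string[] = ['<EOT>', '<EOD>']): string {
  let cut = output.length;
  for (const stop of stops) {
    const i = output.indexOf(stop);
    if (i !== -1 && i < cut) cut = i;
  }
  return output.slice(0, cut);
}
```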

Also, today I encountered the same FIM failure on my Mac (using the original extension version), where there are no \rs in the prompts. Adding a space after the <SUF> token seems to have fixed the problem there as well.

@ex3ndr merged commit 1e2431a into ex3ndr:main on Feb 8, 2024
@ex3ndr (Owner) commented Feb 8, 2024

Perfect, thanks!

@Kevsnz deleted the prompt-fix branch on February 9, 2024