Correctly fold unknown-8bit originating from encoded words. #142517

bitdancer · 2025-12-10T14:50:41Z

The unknown-8bit trick was designed to deal with unknown bytes in an
ASCII message, and it works fine for that. However, I also tried to
extend it to handle bytes that can't be decoded using the charset
specified in an encoded word, and there it fails because there can be
other non-ASCII characters that were successfully decoded. The fix is
simple: do the unknown-8bit encoding using the utf-8 codec. This is
especially appropriate since anyone trying to do recovery on an unknown
byte string will probably attempt utf-8 first.
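To illustrate the failure mode described above, here is a minimal sketch that uses only the built-in codecs and the `surrogateescape` error handler. It does not call the email package, and it is not the code changed by this PR. The sample bytes and variable names are made up for the illustration, and the declared charset is assumed to be utf-8.

```python
# Hypothetical payload of an encoded word declared as utf-8:
# valid utf-8 text ("café ") followed by one byte (0xFF) that is not valid utf-8.
raw = "café ".encode("utf-8") + b"\xff"

# Decoding with surrogateescape keeps the good characters and smuggles
# the undecodable byte through as a lone surrogate (U+DCFF).
text = raw.decode("utf-8", errors="surrogateescape")

# Re-encoding as ASCII + surrogateescape restores the smuggled byte but
# cannot represent the successfully decoded non-ASCII character.
try:
    text.encode("ascii", errors="surrogateescape")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Re-encoding as utf-8 + surrogateescape handles both kinds of character.
recovered = text.encode("utf-8", errors="surrogateescape")

print(ascii_failed)
print(recovered == raw)
```

Under these assumptions the ASCII path raises `UnicodeEncodeError`, while the utf-8 path yields the original bytes unchanged.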

bitdancer · 2025-12-16T19:27:27Z

Does anyone want to review this, or shall I just merge it?

miss-islington-app · 2025-12-24T14:15:03Z

Thanks @bitdancer for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖

miss-islington-app · 2025-12-24T14:15:03Z

Thanks @bitdancer for the PR 🌮🎉.. I'm working now to backport this PR to: 3.14.
🐍🍒⛏🤖

…-142517) The unknown-8bit trick was designed to deal with unknown bytes in an ASCII message, and it works fine for that. However, I also tried to extend it to handle bytes that can't be decoded using the charset specified in an encoded word, and there it fails because there can be other non-ASCII characters that were *successfully* decoded. The fix is simple: do the unknown-8bit encoding using the utf-8 codec. This is especially appropriate since anyone trying to do recovery on an unknown byte string will probably attempt utf-8 first. (cherry picked from commit 1e17ccd) Co-authored-by: R. David Murray <rdmurray@bitdance.com>

bedevere-app · 2025-12-24T14:15:13Z

GH-143146 is a backport of this pull request to the 3.14 branch.

bedevere-app · 2025-12-24T14:15:18Z

GH-143147 is a backport of this pull request to the 3.13 branch.

…H-142517) (#143147) The unknown-8bit trick was designed to deal with unknown bytes in an ASCII message, and it works fine for that. However, I also tried to extend it to handle bytes that can't be decoded using the charset specified in an encoded word, and there it fails because there can be other non-ASCII characters that were *successfully* decoded. The fix is simple: do the unknown-8bit encoding using the utf-8 codec. This is especially appropriate since anyone trying to do recovery on an unknown byte string will probably attempt utf-8 first. (cherry picked from commit 1e17ccd) Co-authored-by: R. David Murray <rdmurray@bitdance.com> Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>

…H-142517) (#143146) The unknown-8bit trick was designed to deal with unknown bytes in an ASCII message, and it works fine for that. However, I also tried to extend it to handle bytes that can't be decoded using the charset specified in an encoded word, and there it fails because there can be other non-ASCII characters that were *successfully* decoded. The fix is simple: do the unknown-8bit encoding using the utf-8 codec. This is especially appropriate since anyone trying to do recovery on an unknown byte string will probably attempt utf-8 first. (cherry picked from commit 1e17ccd) Co-authored-by: R. David Murray <rdmurray@bitdance.com> Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>

bitdancer requested a review from a team as a code owner December 10, 2025 14:50

bitdancer self-assigned this Dec 10, 2025

bedevere-app bot added the awaiting core review label Dec 10, 2025

bitdancer added 2 commits December 10, 2025 10:03

News entry.

1bba134

bitdancer force-pushed the undecodable_encoded_words branch from 4ae90b4 to 1bba134 Compare December 10, 2025 15:04

bitdancer added the skip issue label Dec 10, 2025

bitdancer merged commit 1e17ccd into python:main Dec 24, 2025
48 checks passed

bedevere-app bot removed the awaiting core review label Dec 24, 2025

bitdancer added the needs backport to 3.13 (bugs and security fixes) and needs backport to 3.14 (bugs and security fixes) labels Dec 24, 2025

bedevere-app bot removed the needs backport to 3.14 (bugs and security fixes) label Dec 24, 2025

bedevere-app bot removed the needs backport to 3.13 (bugs and security fixes) label Dec 24, 2025

bitdancer added a commit to bitdancer/cpython that referenced this pull request Dec 24, 2025

pythongh-142517: Fix typo in news item.

11d39a6

bitdancer added a commit that referenced this pull request Dec 24, 2025

gh-142517: Fix typo in news item. (#143150)

7342890

bitdancer deleted the undecodable_encoded_words branch December 24, 2025 18:21
