Commit 71278bf

Dandandan and claude committed
Fix u128 bitwise trick to mask out bytes beyond string length
Inlined StringView values only occupy the first `len` bytes of the 12-byte data region. Mask out remaining bytes to avoid counting garbage continuation bytes. Also trimmed verbose comments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b3b7eb8 commit 71278bf

1 file changed

Lines changed: 7 additions & 11 deletions

datafusion/functions/src/unicode/character_length.rs

@@ -186,17 +186,13 @@ where
         T::default_value()
     } else if len <= 12 {
         // Inlined string: count UTF-8 chars directly from the u128 view.
-        // Bytes are at positions 4..4+len in the view (little-endian).
-        // Shift right by 32 bits to get the string bytes in the low bits.
-        let data = *raw_view >> 32;
-        // Create a mask of just the high bit of each byte (0x80)
-        // and the bit below it (0x40) to detect continuation bytes (10xxxxxx).
-        // A continuation byte has bit7=1 and bit6=0.
-        // ~data inverts: continuation bytes get bit7=0, bit6=1
-        // (data >> 6) shifts bit7 into bit1 and bit6 into bit0
-        // OR with ~data: for continuation bytes, bit6 is guaranteed 1
-        // For non-continuation bytes, at least one of these will have bit7=1
-        // We only need to check the high bit of each byte after the OR.
+        // Shift right 32 bits to get string bytes in low bits, then
+        // mask to only the valid `len` bytes (remaining bytes may be garbage).
+        let valid_mask = (1u128 << (len * 8)) - 1;
+        let data = (*raw_view >> 32) & valid_mask;
+        // Count non-continuation bytes: a UTF-8 continuation byte matches
+        // 10xxxxxx, so (byte | (byte >> 1)) & 0x80 is set for all
+        // non-continuation bytes (they have bit7=0 or bit6=1).
         let not_continuation =
             (data | (!data >> 1)) & 0x0080_0080_0080_0080_0080_0080u128;
         T::Native::usize_as(not_continuation.count_ones() as usize)
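
The masking idea described in the commit message can be tried in isolation with the standalone sketch below. It is not the committed code: `inline_char_count`, `make_inline_view`, and `HIGH_BITS` are hypothetical names introduced here, and the sketch assumes the little-endian view layout the removed comments describe (a 4-byte length followed by up to 12 inline data bytes). It also makes two deliberate choices that differ from the expression in the diff: it tests the high bit of every byte (`0x80` repeated 12 times), and it counts continuation bytes and subtracts them from `len`, so that the bytes zeroed by the mask cannot be mistaken for characters.

```rust
/// High bit (0x80) of each of the 12 inline data bytes.
/// Hypothetical constant for this sketch.
const HIGH_BITS: u128 = 0x8080_8080_8080_8080_8080_8080;

/// Hypothetical test helper: builds a u128 view for a short string,
/// assuming a little-endian layout of 4 length bytes followed by 12 data
/// bytes. Unused data bytes are set to `filler` to simulate garbage.
fn make_inline_view(s: &str, filler: u8) -> u128 {
    assert!(s.len() <= 12, "only inline-sized strings are supported");
    let mut bytes = [filler; 16];
    bytes[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    bytes[4..4 + s.len()].copy_from_slice(s.as_bytes());
    u128::from_le_bytes(bytes)
}

/// Counts UTF-8 characters in an inline view (len <= 12) without
/// touching the bytes individually.
fn inline_char_count(raw_view: u128) -> usize {
    // The low 32 bits of the view hold the byte length.
    let len = (raw_view as u32) as usize;
    debug_assert!(len <= 12);

    // Keep only the first `len` data bytes; anything after them is zeroed.
    let valid_mask = (1u128 << (len * 8)) - 1;
    let data = (raw_view >> 32) & valid_mask;

    // A continuation byte matches 10xxxxxx: bit7 = 1 and bit6 = 0.
    // `data << 1` moves bit6 of each byte into that byte's bit7 position,
    // so `data & !(data << 1)` has bit7 set exactly for continuation bytes.
    let continuation = data & !(data << 1) & HIGH_BITS;

    // Every byte that is not a continuation byte starts a character.
    len - continuation.count_ones() as usize
}
```

With the hypothetical builder, `inline_char_count(make_inline_view("héllo", 0x80))` evaluates to 5 even though the padding bytes look like continuation bytes, because the mask removes them before the bit test runs.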
