UTF8
UTF8 encoding/decoding.
Types
oc_utf32
typedef u32 oc_utf32;
A unicode codepoint.
oc_utf8_status
typedef enum oc_utf8_status
{
    OC_UTF8_OK = 0,
    OC_UTF8_OUT_OF_BOUNDS = 1,
    OC_UTF8_UNEXPECTED_CONTINUATION_BYTE = 3,
    OC_UTF8_UNEXPECTED_LEADING_BYTE = 4,
    OC_UTF8_INVALID_BYTE = 5,
    OC_UTF8_INVALID_CODEPOINT = 6,
    OC_UTF8_OVERLONG_ENCODING = 7
} oc_utf8_status;
This enum declares the possible return statuses of UTF8 decoding/encoding operations.
Enum Constants
OC_UTF8_OK
The operation was successful.
OC_UTF8_OUT_OF_BOUNDS
The operation unexpectedly encountered the end of the utf8 sequence.
OC_UTF8_UNEXPECTED_CONTINUATION_BYTE
A continuation byte was encountered where a leading byte was expected.
OC_UTF8_UNEXPECTED_LEADING_BYTE
A leading byte was encountered in the middle of the encoding of a utf8 codepoint.
OC_UTF8_INVALID_BYTE
The utf8 sequence contains an invalid byte.
OC_UTF8_INVALID_CODEPOINT
The operation encountered an invalid unicode codepoint.
OC_UTF8_OVERLONG_ENCODING
The utf8 sequence contains an overlong encoding of a codepoint.
oc_utf8_dec
typedef struct oc_utf8_dec
{
    oc_utf8_status status;
    oc_utf32 codepoint;
    u32 size;
} oc_utf8_dec;
A type representing the result of decoding a utf8-encoded codepoint.
Fields
status
The status of the decoding operation. If not OC_UTF8_OK, it describes the error that was encountered during decoding.
codepoint
The decoded codepoint.
size
The size of the utf8 sequence encoding that codepoint, in bytes.
oc_unicode_range
typedef struct oc_unicode_range
{
    oc_utf32 firstCodePoint;
    u32 count;
} oc_unicode_range;
A type representing a contiguous range of unicode codepoints.
Fields
firstCodePoint
The first codepoint of the range.
count
The number of codepoints in the range.
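To illustrate how firstCodePoint and count together describe a range, the sketch below tests whether a codepoint falls inside one. The struct range_sketch and the helper range_contains are hypothetical stand-ins that mirror the fields above using standard C types; they are not part of the API. The sketch assumes the range covers firstCodePoint up to, but not including, firstCodePoint + count.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical stand-in mirroring the fields of oc_unicode_range.
typedef struct range_sketch
{
    uint32_t firstCodePoint; // first codepoint of the range
    uint32_t count;          // number of codepoints in the range
} range_sketch;

// Assumption: the range is the half-open interval
// [firstCodePoint, firstCodePoint + count).
static bool range_contains(range_sketch range, uint32_t codePoint)
{
    // Comparing the distance from the start avoids overflow in
    // firstCodePoint + count.
    return codePoint >= range.firstCodePoint
        && (codePoint - range.firstCodePoint) < range.count;
}
```

For example, a range with firstCodePoint 0 and count 128 would cover exactly the ASCII codepoints under this reading.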
Functions
oc_utf8_size_from_leading_char
u32 oc_utf8_size_from_leading_char(char leadingChar);
Get the size of a utf8-encoded codepoint from the first byte of its encoded sequence.
Parameters
leadingChar
The first byte of a utf8 sequence.
Return
The size of the utf8 sequence, in bytes.
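The length of a utf8 sequence is determined by the high bits of its first byte. The sketch below shows that mapping in plain C. The name size_from_leading_byte_sketch is hypothetical, and returning 0 for a byte that cannot start a sequence is an assumption of this sketch, not documented behavior of oc_utf8_size_from_leading_char.

```c
#include <stdint.h>

// Hypothetical helper illustrating the utf8 leading-byte patterns.
// Returns 0 (an assumption of this sketch) when the byte cannot
// start a sequence.
static uint32_t size_from_leading_byte_sketch(char leadingChar)
{
    uint8_t b = (uint8_t)leadingChar; // plain char may be signed

    if(b < 0x80)
    {
        return 1; // 0xxxxxxx: single-byte (ASCII) sequence
    }
    if((b & 0xE0) == 0xC0)
    {
        return 2; // 110xxxxx
    }
    if((b & 0xF0) == 0xE0)
    {
        return 3; // 1110xxxx
    }
    if((b & 0xF8) == 0xF0)
    {
        return 4; // 11110xxx
    }
    return 0; // 10xxxxxx (continuation byte) or 11111xxx (never valid)
}
```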
oc_utf8_codepoint_size
u32 oc_utf8_codepoint_size(oc_utf32 codePoint);
Get the size of the utf8 encoding of a codepoint.
Parameters
codePoint
A unicode codepoint.
Return
The size of the encoded codepoint, in bytes.
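The encoded size follows from which value range the codepoint falls in. A minimal sketch of that rule, using a hypothetical helper name and assuming a result of 0 for values beyond the unicode range (the library's handling of such values is not specified here):

```c
#include <stdint.h>

// Hypothetical helper: number of bytes utf8 needs for a codepoint.
// Returns 0 (an assumption of this sketch) above U+10FFFF.
static uint32_t codepoint_size_sketch(uint32_t codePoint)
{
    if(codePoint < 0x80)
    {
        return 1; // U+0000..U+007F
    }
    if(codePoint < 0x800)
    {
        return 2; // U+0080..U+07FF
    }
    if(codePoint < 0x10000)
    {
        return 3; // U+0800..U+FFFF
    }
    if(codePoint <= 0x10FFFF)
    {
        return 4; // U+10000..U+10FFFF
    }
    return 0; // outside the unicode codespace
}
```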
oc_utf8_codepoint_count_for_string
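u64 oc_utf8_codepoint_count_for_string(oc_str8 string);
Get the number of unicode codepoints in a utf8 encoded string.
Parameters
string
A utf8 encoded string.
Return
The number of codepoints in the string.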
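One common way to count codepoints in well-formed utf8 is to count the bytes that are not continuation bytes, since each codepoint contributes exactly one such byte. The sketch below shows that technique on a raw byte buffer. The helper name and its pointer/length parameters are hypothetical, and the approach assumes valid input; it may not match how oc_utf8_codepoint_count_for_string treats malformed sequences.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: counts codepoints in a well-formed utf8 buffer
// by counting every byte that is not a continuation byte (10xxxxxx).
static uint64_t codepoint_count_sketch(const char* bytes, size_t len)
{
    uint64_t count = 0;
    for(size_t i = 0; i < len; i++)
    {
        if(((uint8_t)bytes[i] & 0xC0) != 0x80)
        {
            count++;
        }
    }
    return count;
}
```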
oc_utf8_byte_count_for_codepoints
u64 oc_utf8_byte_count_for_codepoints(oc_str32 codePoints);
Get the length of the utf8 encoding of a sequence of unicode codepoints.
Parameters
codePoints
A sequence of unicode codepoints.
Return
The length required to encode the codepoints, in bytes.
oc_utf8_next_offset
u64 oc_utf8_next_offset(oc_str8 string, u64 byteOffset);
Get the offset of the next codepoint after a given offset, in a utf8 encoded string.
Parameters
string
A utf8 encoded string.
byteOffset
The offset after which to look for the next codepoint, in bytes.
Return
The offset of the start of the encoding of the next codepoint, in bytes.
oc_utf8_prev_offset
u64 oc_utf8_prev_offset(oc_str8 string, u64 byteOffset);
Get the offset of the previous codepoint before a given offset, in a utf8 encoded string.
Parameters
string
A utf8 encoded string.
byteOffset
The offset before which to look for the previous codepoint, in bytes.
Return
The offset of the start of the encoding of the previous codepoint, in bytes.
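Moving between codepoint boundaries in utf8 amounts to skipping continuation bytes in either direction. The sketch below illustrates both directions on a raw byte buffer. All names are hypothetical, and the boundary behavior (clamping to the string length when moving forward, and to 0 when moving backward) is an assumption of this sketch rather than documented behavior of oc_utf8_next_offset or oc_utf8_prev_offset.

```c
#include <stdbool.h>
#include <stdint.h>

// True for bytes of the form 10xxxxxx.
static bool is_continuation_sketch(char c)
{
    return ((uint8_t)c & 0xC0) == 0x80;
}

// Hypothetical helper: offset of the first codepoint start after byteOffset.
// Assumption: returns len when there is no further codepoint.
static uint64_t next_offset_sketch(const char* bytes, uint64_t len, uint64_t byteOffset)
{
    if(byteOffset >= len)
    {
        return len;
    }
    uint64_t offset = byteOffset + 1;
    while(offset < len && is_continuation_sketch(bytes[offset]))
    {
        offset++;
    }
    return offset;
}

// Hypothetical helper: offset of the last codepoint start before byteOffset.
// Assumption: returns 0 when there is no earlier codepoint.
static uint64_t prev_offset_sketch(const char* bytes, uint64_t len, uint64_t byteOffset)
{
    if(byteOffset > len)
    {
        byteOffset = len;
    }
    if(byteOffset == 0)
    {
        return 0;
    }
    uint64_t offset = byteOffset - 1;
    while(offset > 0 && is_continuation_sketch(bytes[offset]))
    {
        offset--;
    }
    return offset;
}
```

Starting from offset 0 and calling the forward helper repeatedly until it returns the length visits the start of every codepoint in a well-formed string.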
oc_utf8_decode
oc_utf8_dec oc_utf8_decode(oc_str8 string);
Decode a utf8 encoded codepoint.
Parameters
string
A utf8-encoded codepoint.
Return
The decoded result.
oc_utf8_decode_at
oc_utf8_dec oc_utf8_decode_at(oc_str8 string, u64 offset);
Decode a codepoint at a given offset in a utf8 encoded string.
Parameters
string
A utf8 encoded string.
offset
The offset at which to decode a codepoint.
Return
The decoded result.
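To make the status values concrete, the sketch below decodes a single codepoint from a raw byte buffer and reports one status per failure mode. It is an illustration of utf8 decoding rules, not the library's implementation: the sk_ types and function are hypothetical stand-ins for oc_utf8_status and oc_utf8_dec, the order in which errors are checked is a choice made here, and leaving codepoint and size at 0 on error is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-ins for the status enum and result struct.
typedef enum sk_status
{
    SK_OK,
    SK_OUT_OF_BOUNDS,
    SK_UNEXPECTED_CONTINUATION_BYTE,
    SK_UNEXPECTED_LEADING_BYTE,
    SK_INVALID_BYTE,
    SK_INVALID_CODEPOINT,
    SK_OVERLONG_ENCODING
} sk_status;

typedef struct sk_dec
{
    sk_status status;
    uint32_t codepoint;
    uint32_t size;
} sk_dec;

static sk_dec sk_fail(sk_status status)
{
    sk_dec res = { status, 0, 0 }; // assumption: zeroed fields on error
    return res;
}

// Decodes the codepoint that starts at bytes[0].
static sk_dec sk_decode(const char* bytes, size_t len)
{
    if(len == 0)
    {
        return sk_fail(SK_OUT_OF_BOUNDS);
    }
    uint8_t lead = (uint8_t)bytes[0];
    uint32_t size = 0;
    uint32_t cp = 0;

    // Classify the leading byte and keep its payload bits.
    if(lead < 0x80) { size = 1; cp = lead; }
    else if((lead & 0xC0) == 0x80) { return sk_fail(SK_UNEXPECTED_CONTINUATION_BYTE); }
    else if((lead & 0xE0) == 0xC0) { size = 2; cp = lead & 0x1F; }
    else if((lead & 0xF0) == 0xE0) { size = 3; cp = lead & 0x0F; }
    else if((lead & 0xF8) == 0xF0) { size = 4; cp = lead & 0x07; }
    else { return sk_fail(SK_INVALID_BYTE); }

    // Each following byte must be a continuation byte carrying 6 bits.
    for(uint32_t i = 1; i < size; i++)
    {
        if(i >= len)
        {
            return sk_fail(SK_OUT_OF_BOUNDS);
        }
        uint8_t b = (uint8_t)bytes[i];
        if((b & 0xC0) != 0x80)
        {
            // 11111xxx never appears in utf8; anything else starts a sequence.
            return sk_fail(b >= 0xF8 ? SK_INVALID_BYTE : SK_UNEXPECTED_LEADING_BYTE);
        }
        cp = (cp << 6) | (b & 0x3F);
    }

    // Reject encodings longer than the shortest form for this value.
    static const uint32_t minForSize[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    if(cp < minForSize[size])
    {
        return sk_fail(SK_OVERLONG_ENCODING);
    }
    // Reject surrogates and values beyond the unicode codespace.
    if((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
    {
        return sk_fail(SK_INVALID_CODEPOINT);
    }

    sk_dec res = { SK_OK, cp, size };
    return res;
}
```

A caller would typically check status first and only then advance by size bytes to reach the next codepoint.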
oc_utf8_encode
oc_str8 oc_utf8_encode(char* dst, oc_utf32 codePoint);
Encode a unicode codepoint into a utf8 sequence.
Parameters
dst
A pointer to the backing memory for the encoded sequence. This must point to a buffer big enough to encode the codepoint.
codePoint
The unicode codepoint to encode.
Return
The utf8 sequence encoding the codepoint.
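A utf8 sequence never exceeds 4 bytes, so a 4-byte buffer is always large enough for one codepoint. The sketch below shows the bit layout used when encoding. The function encode_sketch is hypothetical and returns a byte count instead of an oc_str8; returning 0 for surrogates and out-of-range values is an assumption of this sketch, not documented behavior of oc_utf8_encode.

```c
#include <stdint.h>

// Hypothetical helper: writes the utf8 encoding of codePoint into dst
// (which must hold at least 4 bytes) and returns the number of bytes
// written, or 0 (an assumption of this sketch) if it cannot be encoded.
static uint32_t encode_sketch(char* dst, uint32_t codePoint)
{
    if(codePoint < 0x80)
    {
        dst[0] = (char)codePoint; // 0xxxxxxx
        return 1;
    }
    if(codePoint < 0x800)
    {
        dst[0] = (char)(0xC0 | (codePoint >> 6));   // 110xxxxx
        dst[1] = (char)(0x80 | (codePoint & 0x3F)); // 10xxxxxx
        return 2;
    }
    if(codePoint < 0x10000)
    {
        if(codePoint >= 0xD800 && codePoint <= 0xDFFF)
        {
            return 0; // surrogates are not valid scalar values
        }
        dst[0] = (char)(0xE0 | (codePoint >> 12));         // 1110xxxx
        dst[1] = (char)(0x80 | ((codePoint >> 6) & 0x3F)); // 10xxxxxx
        dst[2] = (char)(0x80 | (codePoint & 0x3F));        // 10xxxxxx
        return 3;
    }
    if(codePoint <= 0x10FFFF)
    {
        dst[0] = (char)(0xF0 | (codePoint >> 18));          // 11110xxx
        dst[1] = (char)(0x80 | ((codePoint >> 12) & 0x3F)); // 10xxxxxx
        dst[2] = (char)(0x80 | ((codePoint >> 6) & 0x3F));  // 10xxxxxx
        dst[3] = (char)(0x80 | (codePoint & 0x3F));         // 10xxxxxx
        return 4;
    }
    return 0; // beyond U+10FFFF
}
```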
oc_utf8_to_codepoints
oc_str32 oc_utf8_to_codepoints(u64 maxCount, oc_utf32* backing, oc_str8 string);
Decode a utf8 string to a string of unicode codepoints using memory passed by the caller.
Parameters
maxCount
The maximum number of codepoints that the backing memory can contain.
backing
A pointer to the backing memory for the result. This must point to a buffer capable of holding maxCount codepoints.
string
A utf8 encoded string.
Return
The decoded codepoints string.
oc_utf8_from_codepoints
oc_str8 oc_utf8_from_codepoints(u64 maxBytes, char* backing, oc_str32 codePoints);
Encode a string of unicode codepoints into a utf8 string using memory passed by the caller.
Parameters
maxBytes
The maximum number of bytes that the backing memory can contain.
backing
A pointer to the backing memory for the result. This must point to a buffer capable of holding maxBytes bytes.
codePoints
A string of unicode codepoints.
Return
The utf8 encoded string.
oc_utf8_push_to_codepoints
oc_str32 oc_utf8_push_to_codepoints(oc_arena* arena, oc_str8 string);
Decode a utf8 encoded string to a string of unicode codepoints using an arena.
Parameters
arena
The arena on which to allocate the codepoints.
string
A utf8 encoded string.
Return
The decoded codepoints. The contents of the string are allocated on arena.
oc_utf8_push_from_codepoints
oc_str8 oc_utf8_push_from_codepoints(oc_arena* arena, oc_str32 codePoints);
Encode a string of unicode codepoints into a utf8 string using an arena.
Parameters
arena
The arena on which to allocate the utf8 encoded string.
codePoints
A string of unicode codepoints.
Return
The encoded utf8 string. The contents of the string are allocated on arena.