UTF8
UTF8 encoding/decoding.
Types
oc_utf32
typedef u32 oc_utf32;
A Unicode codepoint, stored as a 32-bit unsigned integer.
oc_utf8_status
typedef enum oc_utf8_status
{
    OC_UTF8_OK = 0,
    OC_UTF8_OUT_OF_BOUNDS = 1,
    OC_UTF8_UNEXPECTED_CONTINUATION_BYTE = 2,
    OC_UTF8_UNEXPECTED_LEADING_BYTE = 3,
    OC_UTF8_INVALID_BYTE = 4,
    OC_UTF8_INVALID_CODEPOINT = 5,
    OC_UTF8_OVERLONG_ENCODING = 6
} oc_utf8_status;
This enum lists the status values that UTF8 decoding/encoding operations can return.
Enum Constants
OC_UTF8_OK: The operation was successful.
OC_UTF8_OUT_OF_BOUNDS: The operation unexpectedly encountered the end of the utf8 sequence.
OC_UTF8_UNEXPECTED_CONTINUATION_BYTE: A continuation byte was encountered where a leading byte was expected.
OC_UTF8_UNEXPECTED_LEADING_BYTE: A leading byte was encountered in the middle of the encoding of a utf8 codepoint.
OC_UTF8_INVALID_BYTE: The utf8 sequence contains an invalid byte.
OC_UTF8_INVALID_CODEPOINT: The operation encountered an invalid utf8 codepoint.
OC_UTF8_OVERLONG_ENCODING: The utf8 sequence contains an overlong encoding of a utf8 codepoint.
oc_utf8_dec
typedef struct oc_utf8_dec
{
    oc_utf8_status status;
    oc_utf32 codepoint;
    u32 size;
} oc_utf8_dec;
A type representing the result of decoding a utf8-encoded codepoint.
Fields
status: The status of the decoding operation. If not OC_UTF8_OK, it describes the error that was encountered during decoding.
codepoint: The decoded codepoint.
size: The size of the utf8 sequence encoding that codepoint.
oc_unicode_range
typedef struct oc_unicode_range
{
    oc_utf32 firstCodePoint;
    u32 count;
} oc_unicode_range;
A type representing a contiguous range of unicode codepoints.
Fields
firstCodePoint: The first codepoint of the range.
count: The number of codepoints in the range.
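As an illustration of how firstCodePoint and count describe a range, here is a minimal sketch of a membership test. The helper sketch_range_contains and the example function are hypothetical and not part of the documented API; the typedefs are stand-ins that mirror the declarations above using standard C integer types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Stand-ins mirroring the documented typedefs (u32 replaced by uint32_t).
typedef uint32_t oc_utf32;

typedef struct oc_unicode_range
{
    oc_utf32 firstCodePoint;
    uint32_t count;
} oc_unicode_range;

// Hypothetical helper: true if codePoint lies in the half-open interval
// [firstCodePoint, firstCodePoint + count).
bool sketch_range_contains(oc_unicode_range range, oc_utf32 codePoint)
{
    // Subtracting first avoids overflow in firstCodePoint + count.
    return codePoint >= range.firstCodePoint
        && (codePoint - range.firstCodePoint) < range.count;
}

// Usage example: the Basic Latin block covers U+0000 to U+007F.
void sketch_range_example(void)
{
    oc_unicode_range basicLatin = { 0x0000, 128 };
    assert(sketch_range_contains(basicLatin, 'A'));
    assert(!sketch_range_contains(basicLatin, 0x00E9)); // U+00E9 is outside the block
}
```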
Functions
oc_utf8_size_from_leading_char
u32 oc_utf8_size_from_leading_char(char leadingChar);
Get the size of a utf8-encoded codepoint from the first byte of its encoded sequence.
Parameters
leadingChar: The first byte of a utf8 sequence.
Return
The size of the utf8 sequence, in bytes.
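The sequence length is determined by the high bits of the leading byte. The following is a minimal sketch of that classification, not the library's implementation. The name sketch_size_from_leading_char is hypothetical, and returning 0 for bytes that cannot start a sequence is an assumption of this sketch; the documented function may handle such bytes differently.

```c
#include <assert.h>
#include <stdint.h>

// Sketch only: classify a leading byte by its high bits.
// Returning 0 for bytes that cannot start a sequence is an assumption.
uint32_t sketch_size_from_leading_char(char leadingChar)
{
    unsigned char b = (unsigned char)leadingChar;
    if(b < 0x80) return 1;           // 0xxxxxxx: ASCII
    if((b & 0xE0) == 0xC0) return 2; // 110xxxxx
    if((b & 0xF0) == 0xE0) return 3; // 1110xxxx
    if((b & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                        // continuation byte (10xxxxxx) or 0xF8..0xFF
}

// Usage example with the first bytes of some well-known encodings.
void sketch_leading_char_example(void)
{
    assert(sketch_size_from_leading_char('a') == 1);
    assert(sketch_size_from_leading_char((char)0xC3) == 2); // first byte of U+00E9
    assert(sketch_size_from_leading_char((char)0xE2) == 3); // first byte of U+20AC
    assert(sketch_size_from_leading_char((char)0xF0) == 4); // first byte of U+1F600
}
```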
oc_utf8_codepoint_size
u32 oc_utf8_codepoint_size(oc_utf32 codePoint);
Get the size of the utf8 encoding of a codepoint.
Parameters
codePoint: A unicode codepoint.
Return
The size of the encoded codepoint, in bytes.
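The encoded size follows from the numeric range the codepoint falls into. A minimal sketch of those thresholds is given below. The name sketch_codepoint_size is hypothetical, and returning 0 for values above U+10FFFF is an assumption of this sketch rather than documented behavior.

```c
#include <assert.h>
#include <stdint.h>

// Sketch only: size in bytes of the utf8 encoding of a codepoint.
uint32_t sketch_codepoint_size(uint32_t codePoint)
{
    if(codePoint < 0x80) return 1;      // 7 bits fit in one byte
    if(codePoint < 0x800) return 2;     // 11 bits fit in two bytes
    if(codePoint < 0x10000) return 3;   // 16 bits fit in three bytes
    if(codePoint <= 0x10FFFF) return 4; // up to the last Unicode codepoint
    return 0;                           // assumption: not encodable
}

// Usage example.
void sketch_codepoint_size_example(void)
{
    assert(sketch_codepoint_size('A') == 1);
    assert(sketch_codepoint_size(0x00E9) == 2);
    assert(sketch_codepoint_size(0x20AC) == 3);
    assert(sketch_codepoint_size(0x1F600) == 4);
}
```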
oc_utf8_codepoint_count_for_string
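u64 oc_utf8_codepoint_count_for_string(oc_str8 string);
Get the number of codepoints in a utf8 encoded string.
Parameters
string: A utf8 encoded string.
Return
The number of codepoints in the string.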
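For well-formed input, one way to count codepoints is to count every byte that is not a continuation byte. The sketch below illustrates that idea. It takes a pointer and a length instead of an oc_str8, the name sketch_codepoint_count is hypothetical, and it assumes the input is valid utf8; it is not necessarily how the library counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Sketch only: count codepoints in a well-formed utf8 byte sequence.
uint64_t sketch_codepoint_count(const char* bytes, size_t len)
{
    uint64_t count = 0;
    for(size_t i = 0; i < len; i++)
    {
        // Continuation bytes have the form 10xxxxxx; every other byte
        // starts a new codepoint.
        if(((unsigned char)bytes[i] & 0xC0) != 0x80)
        {
            count++;
        }
    }
    return count;
}

// Usage example: "h", U+00E9 (two bytes), "l", "l", "o" is 6 bytes, 5 codepoints.
void sketch_codepoint_count_example(void)
{
    const char text[] = "h\xC3\xA9llo";
    assert(sketch_codepoint_count(text, sizeof(text) - 1) == 5);
}
```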
oc_utf8_byte_count_for_codepoints
u64 oc_utf8_byte_count_for_codepoints(oc_str32 codePoints);
Get the length of the utf8 encoding of a sequence of unicode codepoints.
Parameters
codePointsA sequence of unicode codepoints.
Return
The length required to encode the codepoints, in bytes.
oc_utf8_next_offset
u64 oc_utf8_next_offset(oc_str8 string, u64 byteOffset);
Get the offset of the next codepoint after a given offset, in a utf8 encoded string.
Parameters
string: A utf8 encoded string.
byteOffset: The offset after which to look for the next codepoint, in bytes.
Return
The offset of the start of the encoding of the next codepoint, in bytes.
oc_utf8_prev_offset
u64 oc_utf8_prev_offset(oc_str8 string, u64 byteOffset);
Get the offset of the previous codepoint before a given offset, in a utf8 encoded string.
Parameters
string: A utf8 encoded string.
byteOffset: The offset before which to look for the previous codepoint, in bytes.
Return
The offset of the start of the encoding of the previous codepoint, in bytes.
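Stepping forward or backward by one codepoint can be done by skipping continuation bytes. The sketch below shows both directions on a raw pointer and length. The names are hypothetical, and the boundary behavior (clamping to the string length going forward, returning 0 going backward from the start) is an assumption of this sketch, not documented behavior of oc_utf8_next_offset or oc_utf8_prev_offset.

```c
#include <assert.h>
#include <stdint.h>

static int sketch_is_continuation(char c)
{
    return ((unsigned char)c & 0xC0) == 0x80; // 10xxxxxx
}

// Move past the byte at byteOffset, then skip continuation bytes.
// Assumption: offsets at or past the end clamp to len.
uint64_t sketch_next_offset(const char* bytes, uint64_t len, uint64_t byteOffset)
{
    if(byteOffset >= len) return len;
    uint64_t next = byteOffset + 1;
    while(next < len && sketch_is_continuation(bytes[next])) next++;
    return next;
}

// Step back one byte, then keep stepping back over continuation bytes.
// Assumption: stepping back from offset 0 returns 0.
uint64_t sketch_prev_offset(const char* bytes, uint64_t len, uint64_t byteOffset)
{
    if(byteOffset > len) byteOffset = len;
    if(byteOffset == 0) return 0;
    uint64_t prev = byteOffset - 1;
    while(prev > 0 && sketch_is_continuation(bytes[prev])) prev--;
    return prev;
}

// Usage example: 'a' at offset 0, U+20AC at offsets 1..3, 'z' at offset 4.
void sketch_offset_example(void)
{
    const char text[] = "a\xE2\x82\xACz";
    uint64_t len = sizeof(text) - 1;
    assert(sketch_next_offset(text, len, 1) == 4);
    assert(sketch_prev_offset(text, len, 4) == 1);
}
```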
oc_utf8_decode
oc_utf8_dec oc_utf8_decode(oc_str8 string);
Decode a utf8 encoded codepoint.
Parameters
string: A utf8-encoded codepoint.
Return
The decoded result.
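To make the relationship between the status values and the decoded result concrete, here is a minimal sketch of a decoder that fills an oc_utf8_dec. It is not the library's implementation: the name sketch_utf8_decode is hypothetical, it takes a pointer and length instead of an oc_str8, and the choice of which status is reported for which malformed input, along with the size reported on error, is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Stand-ins mirroring the documented types.
typedef uint32_t oc_utf32;
typedef enum oc_utf8_status
{
    OC_UTF8_OK = 0,
    OC_UTF8_OUT_OF_BOUNDS = 1,
    OC_UTF8_UNEXPECTED_CONTINUATION_BYTE = 2,
    OC_UTF8_UNEXPECTED_LEADING_BYTE = 3,
    OC_UTF8_INVALID_BYTE = 4,
    OC_UTF8_INVALID_CODEPOINT = 5,
    OC_UTF8_OVERLONG_ENCODING = 6
} oc_utf8_status;
typedef struct oc_utf8_dec
{
    oc_utf8_status status;
    oc_utf32 codepoint;
    uint32_t size;
} oc_utf8_dec;

// Sketch only: decode the first codepoint of a byte sequence.
oc_utf8_dec sketch_utf8_decode(const char* bytes, size_t len)
{
    oc_utf8_dec res = { OC_UTF8_OK, 0, 0 };
    if(len == 0) { res.status = OC_UTF8_OUT_OF_BOUNDS; return res; }

    unsigned char lead = (unsigned char)bytes[0];
    uint32_t size = 0;
    oc_utf32 cp = 0;
    if(lead < 0x80) { size = 1; cp = lead; }
    else if((lead & 0xC0) == 0x80)
    {
        res.status = OC_UTF8_UNEXPECTED_CONTINUATION_BYTE;
        res.size = 1;
        return res;
    }
    else if((lead & 0xE0) == 0xC0) { size = 2; cp = lead & 0x1F; }
    else if((lead & 0xF0) == 0xE0) { size = 3; cp = lead & 0x0F; }
    else if((lead & 0xF8) == 0xF0) { size = 4; cp = lead & 0x07; }
    else { res.status = OC_UTF8_INVALID_BYTE; res.size = 1; return res; }

    // The sequence announced by the leading byte runs past the input.
    if(size > len) { res.status = OC_UTF8_OUT_OF_BOUNDS; return res; }

    for(uint32_t i = 1; i < size; i++)
    {
        unsigned char c = (unsigned char)bytes[i];
        if((c & 0xC0) != 0x80)
        {
            // Assumption: any non-continuation byte here is reported
            // as an unexpected leading byte.
            res.status = OC_UTF8_UNEXPECTED_LEADING_BYTE;
            res.size = i;
            return res;
        }
        cp = (cp << 6) | (c & 0x3F); // append 6 payload bits
    }

    // Smallest codepoint that legitimately needs `size` bytes.
    static const oc_utf32 minForSize[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    res.size = size;
    if(cp < minForSize[size]) res.status = OC_UTF8_OVERLONG_ENCODING;
    else if(cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        res.status = OC_UTF8_INVALID_CODEPOINT; // out of range or surrogate
    else res.codepoint = cp;
    return res;
}

// Usage example: check status before reading codepoint and size.
void sketch_decode_example(void)
{
    oc_utf8_dec dec = sketch_utf8_decode("\xE2\x82\xAC", 3);
    assert(dec.status == OC_UTF8_OK);
    assert(dec.codepoint == 0x20AC && dec.size == 3);
}
```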
oc_utf8_decode_at
oc_utf8_dec oc_utf8_decode_at(oc_str8 string, u64 offset);
Decode a codepoint at a given offset in a utf8 encoded string.
Parameters
string: A utf8 encoded string.
offset: The offset at which to decode a codepoint.
Return
The decoded result.
oc_utf8_encode
oc_str8 oc_utf8_encode(char* dst, oc_utf32 codePoint);
Encode a unicode codepoint into a utf8 sequence.
Parameters
dst: A pointer to the backing memory for the encoded sequence. This must point to a buffer big enough to encode the codepoint.
codePoint: The unicode codepoint to encode.
Return
The utf8 sequence encoding the codepoint.
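Since a utf8 sequence is at most four bytes long, a four-byte buffer is always large enough for dst. The sketch below shows how the bits of a codepoint are distributed across the output bytes. It is an illustration, not the library's implementation: the name sketch_utf8_encode is hypothetical, it returns the number of bytes written rather than an oc_str8, and returning 0 for surrogates and out-of-range values is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Sketch only: write the utf8 encoding of codePoint to dst (at least 4 bytes).
// Returns the number of bytes written, or 0 if the value is not encodable.
uint32_t sketch_utf8_encode(char* dst, uint32_t codePoint)
{
    assert(dst != NULL);
    if(codePoint < 0x80)
    {
        dst[0] = (char)codePoint;
        return 1;
    }
    if(codePoint < 0x800)
    {
        dst[0] = (char)(0xC0 | (codePoint >> 6));
        dst[1] = (char)(0x80 | (codePoint & 0x3F));
        return 2;
    }
    if(codePoint < 0x10000)
    {
        // Assumption: surrogate halves are rejected.
        if(codePoint >= 0xD800 && codePoint <= 0xDFFF) return 0;
        dst[0] = (char)(0xE0 | (codePoint >> 12));
        dst[1] = (char)(0x80 | ((codePoint >> 6) & 0x3F));
        dst[2] = (char)(0x80 | (codePoint & 0x3F));
        return 3;
    }
    if(codePoint <= 0x10FFFF)
    {
        dst[0] = (char)(0xF0 | (codePoint >> 18));
        dst[1] = (char)(0x80 | ((codePoint >> 12) & 0x3F));
        dst[2] = (char)(0x80 | ((codePoint >> 6) & 0x3F));
        dst[3] = (char)(0x80 | (codePoint & 0x3F));
        return 4;
    }
    return 0; // assumption: values above U+10FFFF are rejected
}
```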
oc_utf8_to_codepoints
oc_str32 oc_utf8_to_codepoints(u64 maxCount, oc_utf32* backing, oc_str8 string);
Decode a utf8 string to a string of unicode codepoints using memory passed by the caller.
Parameters
maxCount: The maximum number of codepoints that the backing memory can contain.
backing: A pointer to the backing memory for the result. This must point to a buffer capable of holding maxCount codepoints.
string: A utf8 encoded string.
Return
The decoded codepoints string.
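The caller-provided backing memory pattern can be sketched as follows: decoding stops once maxCount codepoints have been written, so the buffer is never overrun. This is an illustration under stated assumptions, not the library's implementation. The name sketch_utf8_to_codepoints is hypothetical, it works on a raw pointer and length and returns a count instead of an oc_str32, it assumes well-formed input, and truncating at maxCount is an assumption about behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Sketch only: decode well-formed utf8 into caller-provided storage.
// Returns the number of codepoints written to backing.
uint64_t sketch_utf8_to_codepoints(uint64_t maxCount,
                                   uint32_t* backing,
                                   const char* bytes,
                                   uint64_t len)
{
    assert(backing != NULL || maxCount == 0);
    uint64_t count = 0;
    uint64_t i = 0;
    while(i < len && count < maxCount)
    {
        unsigned char lead = (unsigned char)bytes[i];
        // Sequence length from the leading byte (input assumed well-formed).
        uint32_t size = (lead < 0x80) ? 1 : (lead < 0xE0) ? 2 : (lead < 0xF0) ? 3 : 4;
        if(i + size > len) break; // truncated final sequence: stop early

        // Payload bits of the leading byte: 7, 5, 4 or 3 bits.
        uint32_t cp = (size == 1) ? lead : (uint32_t)(lead & (0xFF >> (size + 1)));
        for(uint32_t k = 1; k < size; k++)
        {
            cp = (cp << 6) | ((unsigned char)bytes[i + k] & 0x3F);
        }
        backing[count++] = cp;
        i += size;
    }
    return count;
}
```

A caller that does not know the required capacity in advance could first count the codepoints in the string and size the buffer accordingly.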
oc_utf8_from_codepoints
oc_str8 oc_utf8_from_codepoints(u64 maxBytes, char* backing, oc_str32 codePoints);
Encode a string of unicode codepoints into a utf8 string using memory passed by the caller.
Parameters
maxBytes: The maximum number of bytes that the backing memory can contain.
backing: A pointer to the backing memory for the result. This must point to a buffer capable of holding maxBytes bytes.
codePoints: A string of unicode codepoints.
Return
The utf8 encoded string.
oc_utf8_push_to_codepoints
oc_str32 oc_utf8_push_to_codepoints(oc_arena* arena, oc_str8 string);
Decode a utf8 encoded string to a string of unicode codepoints using an arena.
Parameters
arena: The arena on which to allocate the codepoints.
string: A utf8 encoded string.
Return
The decoded codepoints. The contents of the string are allocated on the arena.
oc_utf8_push_from_codepoints
oc_str8 oc_utf8_push_from_codepoints(oc_arena* arena, oc_str32 codePoints);
Encode a string of unicode codepoints into a utf8 string using an arena.
Parameters
arena: The arena on which to allocate the utf8 encoded string.
codePoints: A string of unicode codepoints.
Return
The encoded utf8 string. The contents of the string are allocated on the arena.