UTF8

UTF8 encoding/decoding.

Types

oc_utf32

typedef u32 oc_utf32;

A unicode codepoint.


oc_utf8_status

typedef enum oc_utf8_status
{
    OC_UTF8_OK = 0,
    OC_UTF8_OUT_OF_BOUNDS = 1,
    OC_UTF8_UNEXPECTED_CONTINUATION_BYTE = 3,
    OC_UTF8_UNEXPECTED_LEADING_BYTE = 4,
    OC_UTF8_INVALID_BYTE = 5,
    OC_UTF8_INVALID_CODEPOINT = 6,
    OC_UTF8_OVERLONG_ENCODING = 7
} oc_utf8_status;

This enum declares the possible return statuses of UTF8 decoding/encoding operations.

Enum Constants

  • OC_UTF8_OK The operation was successful.
  • OC_UTF8_OUT_OF_BOUNDS The operation unexpectedly encountered the end of the utf8 sequence.
  • OC_UTF8_UNEXPECTED_CONTINUATION_BYTE A continuation byte was encountered where a leading byte was expected.
  • OC_UTF8_UNEXPECTED_LEADING_BYTE A leading byte was encountered in the middle of the encoding of a utf8 codepoint.
  • OC_UTF8_INVALID_BYTE The utf8 sequence contains an invalid byte.
  • OC_UTF8_INVALID_CODEPOINT The operation encountered an invalid utf8 codepoint.
  • OC_UTF8_OVERLONG_ENCODING The utf8 sequence contains an overlong encoding of a utf8 codepoint.

oc_utf8_dec

typedef struct oc_utf8_dec
{
    oc_utf8_status status;
    oc_utf32 codepoint;
    u32 size;
} oc_utf8_dec;

A type representing the result of decoding a utf8-encoded codepoint.

Fields

  • status The status of the decoding operation. If not OC_UTF8_OK, it describes the error that was encountered during decoding.
  • codepoint The decoded codepoint.
  • size The size of the utf8 sequence encoding that codepoint.

oc_unicode_range

typedef struct oc_unicode_range
{
    oc_utf32 firstCodePoint;
    u32 count;
} oc_unicode_range;

A type representing a contiguous range of unicode codepoints.

Fields

  • firstCodePoint The first codepoint of the range.
  • count The number of codepoints in the range.
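
A range given as a first codepoint plus a count is half-open: it covers firstCodePoint up to, but not including, firstCodePoint + count. The sketch below shows one way a caller might test membership. The names sketch_unicode_range and sketch_range_contains are hypothetical stand-ins introduced here so the example compiles without the Orca headers; they are not part of the API.

```c
#include <stdbool.h>
#include <stdint.h>

// Stand-in for oc_unicode_range, declared locally so the sketch is self-contained.
typedef struct sketch_unicode_range
{
    uint32_t firstCodePoint; // first codepoint of the range
    uint32_t count;          // number of codepoints in the range
} sketch_unicode_range;

// Hypothetical helper: true if codePoint lies in
// [firstCodePoint, firstCodePoint + count).
// Comparing the difference against count avoids overflow when
// firstCodePoint + count does not fit in 32 bits.
bool sketch_range_contains(sketch_unicode_range range, uint32_t codePoint)
{
    return codePoint >= range.firstCodePoint
        && (codePoint - range.firstCodePoint) < range.count;
}
```

With this convention a range whose count is 0 contains no codepoint at all.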

Functions

oc_utf8_size_from_leading_char

u32 oc_utf8_size_from_leading_char(char leadingChar);

Get the size of a utf8-encoded codepoint from the first byte of its encoded sequence.

Parameters

  • leadingChar The first byte of a utf8 sequence.

Return

The size of the utf8 sequence, in bytes.
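
In UTF-8 the high bits of the first byte announce how long the sequence is. The sketch below illustrates that mapping. It is a stand-in written for this page, not Orca's implementation: the name sketch_utf8_size_from_leading_byte is hypothetical, and returning 0 for a byte that cannot start a sequence is an assumption of the sketch (this page does not say what oc_utf8_size_from_leading_char returns in that case).

```c
#include <stdint.h>

// Hypothetical stand-in: maps the first byte of a UTF-8 sequence to the
// length of that sequence, in bytes.
// Returns 0 if the byte cannot start a sequence (assumption of this sketch).
uint32_t sketch_utf8_size_from_leading_byte(unsigned char lead)
{
    if(lead < 0x80)
    {
        return 1; // 0xxxxxxx: single-byte (ASCII) sequence
    }
    else if((lead & 0xE0) == 0xC0)
    {
        return 2; // 110xxxxx
    }
    else if((lead & 0xF0) == 0xE0)
    {
        return 3; // 1110xxxx
    }
    else if((lead & 0xF8) == 0xF0)
    {
        return 4; // 11110xxx
    }
    return 0; // 10xxxxxx is a continuation byte, 11111xxx is never valid
}
```

The sketch takes an unsigned char because plain char may be signed, in which case bytes of 0x80 and above compare as negative; converting before masking sidesteps that.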


oc_utf8_codepoint_size

u32 oc_utf8_codepoint_size(oc_utf32 codePoint);

Get the size of the utf8 encoding of a codepoint.

Parameters

  • codePoint A unicode codepoint.

Return

The size of the encoded codepoint, in bytes.
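
The encoded size depends only on how many significant bits the codepoint has. The following sketch shows the standard UTF-8 thresholds. The function name is hypothetical, and returning 0 for values above U+10FFFF is an assumption of the sketch rather than documented behavior of oc_utf8_codepoint_size.

```c
#include <stdint.h>

// Hypothetical stand-in: number of bytes needed to encode codePoint in UTF-8.
// Returns 0 for values beyond U+10FFFF (assumption of this sketch).
uint32_t sketch_utf8_codepoint_size(uint32_t codePoint)
{
    if(codePoint < 0x80)
    {
        return 1; // fits in 7 bits
    }
    else if(codePoint < 0x800)
    {
        return 2; // fits in 11 bits
    }
    else if(codePoint < 0x10000)
    {
        return 3; // fits in 16 bits
    }
    else if(codePoint <= 0x10FFFF)
    {
        return 4; // up to the last Unicode codepoint
    }
    return 0;
}
```

For instance, U+0041 needs 1 byte, U+00E9 needs 2, U+20AC needs 3 and U+1F600 needs 4.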


oc_utf8_codepoint_count_for_string

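u64 oc_utf8_codepoint_count_for_string(oc_str8 string);

Get the number of codepoints in a utf8 encoded string.

Parameters

  • string A utf8 encoded string.

Return

The number of codepoints in the string.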
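
The number of codepoints in a utf8 string is generally smaller than its length in bytes. For well-formed input, one common way to count codepoints is to count every byte that is not a continuation byte, as sketched below. This is an illustration only: the name sketch_utf8_count_codepoints and its pointer-plus-length signature are hypothetical, and oc_utf8_codepoint_count_for_string may well count differently, for example by decoding each sequence.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in: counts codepoints in a well-formed UTF-8 buffer by
// counting the bytes that start a sequence. Continuation bytes have the bit
// pattern 10xxxxxx and are skipped.
uint64_t sketch_utf8_count_codepoints(const char* bytes, size_t len)
{
    uint64_t count = 0;
    for(size_t i = 0; i < len; i++)
    {
        unsigned char b = (unsigned char)bytes[i];
        if((b & 0xC0) != 0x80)
        {
            count++;
        }
    }
    return count;
}
```

Applied to the 6 bytes of "h\xC3\xA9llo" this yields 5, since the two bytes 0xC3 0xA9 encode the single codepoint U+00E9.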

oc_utf8_byte_count_for_codepoints

u64 oc_utf8_byte_count_for_codepoints(oc_str32 codePoints);

Get the length of the utf8 encoding of a sequence of unicode codepoints.

Parameters

  • codePoints A sequence of unicode codepoints.

Return

The length required to encode the codepoints, in bytes.


oc_utf8_next_offset

u64 oc_utf8_next_offset(oc_str8 string, u64 byteOffset);

Get the offset of the next codepoint after a given offset, in a utf8 encoded string.

Parameters

  • string A utf8 encoded string.
  • byteOffset The offset after which to look for the next codepoint, in bytes.

Return

The offset of the start of the encoding of the next codepoint, in bytes.


oc_utf8_prev_offset

u64 oc_utf8_prev_offset(oc_str8 string, u64 byteOffset);

Get the offset of the previous codepoint before a given offset, in a utf8 encoded string.

Parameters

  • string A utf8 encoded string.
  • byteOffset The offset before which to look for the previous codepoint, in bytes.

Return

The offset of the start of the encoding of the previous codepoint, in bytes.
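
Moving to the next or previous codepoint boundary amounts to stepping one byte and then skipping continuation bytes. The sketch below illustrates both directions on a raw buffer. All names are hypothetical, and clamping to the string length (forward) or to 0 (backward) is an assumption of the sketch; this page does not state how oc_utf8_next_offset and oc_utf8_prev_offset behave at the ends of the string or on malformed input.

```c
#include <assert.h>
#include <stddef.h>

// True for bytes of the form 10xxxxxx.
int sketch_is_continuation(char c)
{
    return ((unsigned char)c & 0xC0) == 0x80;
}

// Hypothetical stand-in: offset of the first codepoint start strictly after
// byteOffset, or len if there is none.
size_t sketch_utf8_next_offset(const char* bytes, size_t len, size_t byteOffset)
{
    assert(byteOffset <= len);
    size_t i = byteOffset;
    if(i < len)
    {
        i++; // step off the current byte
    }
    while(i < len && sketch_is_continuation(bytes[i]))
    {
        i++;
    }
    return i;
}

// Hypothetical stand-in: offset of the closest codepoint start strictly
// before byteOffset, or 0 if there is none.
size_t sketch_utf8_prev_offset(const char* bytes, size_t len, size_t byteOffset)
{
    assert(byteOffset <= len);
    size_t i = byteOffset;
    if(i > 0)
    {
        i--; // step back onto the previous byte
    }
    while(i > 0 && sketch_is_continuation(bytes[i]))
    {
        i--;
    }
    return i;
}
```

On the 4 bytes of "a\xC3\xA9z", stepping forward from offset 1 lands on offset 3, and stepping backward from offset 3 lands on offset 1, because the byte at offset 2 is a continuation byte.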


oc_utf8_decode

oc_utf8_dec oc_utf8_decode(oc_str8 string);

Decode a utf8 encoded codepoint.

Parameters

  • string A utf8-encoded codepoint.

Return

The decoded result.


oc_utf8_decode_at

oc_utf8_dec oc_utf8_decode_at(oc_str8 string, u64 offset);

Decode a codepoint at a given offset in a utf8 encoded string.

Parameters

  • string A utf8 encoded string.
  • offset The offset at which to decode a codepoint.

Return

The decoded result.
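
To make the status values more concrete, here is a minimal sketch of a decoder that reports the same kinds of errors. It is not Orca's implementation. The sketch_ types are local stand-ins for oc_utf8_status and oc_utf8_dec, and the choices made on error (U+FFFD as the codepoint, and what size is set to) are assumptions of the sketch, since this page does not specify them.

```c
#include <stddef.h>
#include <stdint.h>

// Local stand-ins for oc_utf8_status and oc_utf8_dec so the sketch compiles
// without Orca headers. The numeric values are not meant to match Orca's.
typedef enum sketch_status
{
    SKETCH_OK = 0,
    SKETCH_OUT_OF_BOUNDS,
    SKETCH_UNEXPECTED_CONTINUATION_BYTE,
    SKETCH_UNEXPECTED_LEADING_BYTE,
    SKETCH_INVALID_BYTE,
    SKETCH_INVALID_CODEPOINT,
    SKETCH_OVERLONG_ENCODING
} sketch_status;

typedef struct sketch_dec
{
    sketch_status status;
    uint32_t codepoint;
    uint32_t size;
} sketch_dec;

// Builds an error result. Using U+FFFD as the codepoint is a choice of this sketch.
sketch_dec sketch_fail(sketch_status status, uint32_t size)
{
    sketch_dec res = { status, 0xFFFD, size };
    return res;
}

// Hypothetical decoder: decodes one codepoint starting at bytes[offset].
sketch_dec sketch_utf8_decode_at(const char* bytes, size_t len, size_t offset)
{
    if(offset >= len)
    {
        return sketch_fail(SKETCH_OUT_OF_BOUNDS, 0);
    }
    const unsigned char* p = (const unsigned char*)bytes + offset;
    size_t avail = len - offset;
    uint32_t need = 0; // sequence length announced by the leading byte
    uint32_t cp = 0;   // payload bits accumulated so far

    if(p[0] < 0x80)
    {
        need = 1;
        cp = p[0];
    }
    else if((p[0] & 0xC0) == 0x80)
    {
        return sketch_fail(SKETCH_UNEXPECTED_CONTINUATION_BYTE, 1);
    }
    else if((p[0] & 0xE0) == 0xC0)
    {
        need = 2;
        cp = p[0] & 0x1F;
    }
    else if((p[0] & 0xF0) == 0xE0)
    {
        need = 3;
        cp = p[0] & 0x0F;
    }
    else if((p[0] & 0xF8) == 0xF0)
    {
        need = 4;
        cp = p[0] & 0x07;
    }
    else
    {
        return sketch_fail(SKETCH_INVALID_BYTE, 1);
    }

    for(uint32_t i = 1; i < need; i++)
    {
        if(i >= avail)
        {
            return sketch_fail(SKETCH_OUT_OF_BOUNDS, i); // sequence is truncated
        }
        if((p[i] & 0xC0) != 0x80)
        {
            // 0xF8..0xFF never appear in UTF-8; any other non-continuation
            // byte has the bit pattern of a leading byte.
            return sketch_fail(p[i] >= 0xF8 ? SKETCH_INVALID_BYTE
                                            : SKETCH_UNEXPECTED_LEADING_BYTE, i);
        }
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    // Smallest codepoint that legitimately needs `need` bytes.
    static const uint32_t minForSize[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    if(cp < minForSize[need])
    {
        return sketch_fail(SKETCH_OVERLONG_ENCODING, need);
    }
    if(cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    {
        return sketch_fail(SKETCH_INVALID_CODEPOINT, need);
    }
    sketch_dec ok = { SKETCH_OK, cp, need };
    return ok;
}
```

A caller iterating over a string would typically check status after each call and, on success, advance its offset by size to reach the next codepoint.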


oc_utf8_encode

oc_str8 oc_utf8_encode(char* dst, oc_utf32 codePoint);

Encode a unicode codepoint into a utf8 sequence.

Parameters

  • dst A pointer to the backing memory for the encoded sequence. This must point to a buffer big enough to encode the codepoint.
  • codePoint The unicode codepoint to encode.

Return

The utf8 sequence encoding the codepoint.
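
Standard UTF-8 never needs more than 4 bytes per codepoint, so a 4-byte buffer should be big enough for any single codepoint. The sketch below shows how the bits of a codepoint are spread over the bytes of the sequence. It is an illustration, not Orca's implementation: the name sketch_utf8_encode is hypothetical, it returns a byte count instead of an oc_str8, and returning 0 for surrogates and for values above U+10FFFF is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in: writes the UTF-8 encoding of codePoint to dst and
// returns the number of bytes written (1 to 4), or 0 if codePoint is a
// surrogate or lies beyond U+10FFFF. dst must have room for 4 bytes.
size_t sketch_utf8_encode(char* dst, uint32_t codePoint)
{
    assert(dst != NULL);
    if(codePoint < 0x80)
    {
        dst[0] = (char)codePoint;
        return 1;
    }
    else if(codePoint < 0x800)
    {
        dst[0] = (char)(0xC0 | (codePoint >> 6));
        dst[1] = (char)(0x80 | (codePoint & 0x3F));
        return 2;
    }
    else if(codePoint < 0x10000)
    {
        if(codePoint >= 0xD800 && codePoint <= 0xDFFF)
        {
            return 0; // surrogates are not valid scalar values
        }
        dst[0] = (char)(0xE0 | (codePoint >> 12));
        dst[1] = (char)(0x80 | ((codePoint >> 6) & 0x3F));
        dst[2] = (char)(0x80 | (codePoint & 0x3F));
        return 3;
    }
    else if(codePoint <= 0x10FFFF)
    {
        dst[0] = (char)(0xF0 | (codePoint >> 18));
        dst[1] = (char)(0x80 | ((codePoint >> 12) & 0x3F));
        dst[2] = (char)(0x80 | ((codePoint >> 6) & 0x3F));
        dst[3] = (char)(0x80 | (codePoint & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding U+20AC this way writes the three bytes 0xE2 0x82 0xAC. When the exact size matters, oc_utf8_codepoint_size can be used to size the buffer before calling oc_utf8_encode.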


oc_utf8_to_codepoints

oc_str32 oc_utf8_to_codepoints(u64 maxCount, oc_utf32* backing, oc_str8 string);

Decode a utf8 string to a string of unicode codepoints using memory passed by the caller.

Parameters

  • maxCount The maximum number of codepoints that the backing memory can contain.
  • backing A pointer to the backing memory for the result. This must point to a buffer capable of holding maxCount codepoints.
  • string A utf8 encoded string.

Return

The decoded codepoints string.


oc_utf8_from_codepoints

oc_str8 oc_utf8_from_codepoints(u64 maxBytes, char* backing, oc_str32 codePoints);

Encode a string of unicode codepoints into a utf8 string using memory passed by the caller.

Parameters

  • maxBytes The maximum number of bytes that the backing memory can contain.
  • backing A pointer to the backing memory for the result. This must point to a buffer capable of holding maxBytes bytes.
  • codePoints A string of unicode codepoints.

Return

The utf8 encoded string.


oc_utf8_push_to_codepoints

oc_str32 oc_utf8_push_to_codepoints(oc_arena* arena, oc_str8 string);

Decode a utf8 encoded string to a string of unicode codepoints using an arena.

Parameters

  • arena The arena on which to allocate the codepoints.
  • string A utf8 encoded string.

Return

The decoded codepoints. The contents of the string are allocated on arena.


oc_utf8_push_from_codepoints

oc_str8 oc_utf8_push_from_codepoints(oc_arena* arena, oc_str32 codePoints);

Encode a string of unicode codepoints into a utf8 string using an arena.

Parameters

  • arena The arena on which to allocate the utf8 encoded string.
  • codePoints A string of unicode codepoints.

Return

The encoded utf8 string. The contents of the string are allocated on arena.