Types¶
The set of all types comprises dtypes and arrays.
The rest of this document assumes that the ndtypes module has been imported:
from ndtypes import ndt
Dtypes¶
An important notion in datashape is the dtype, which roughly translates to the element type of an array. In datashape, the dtype can be of arbitrary complexity and can contain e.g. tuples, records and functions.
Scalars¶
Scalars are the primitive C/C++ types. Most scalars are fixed-size and platform independent.
Fixed size¶
Datashape offers a number of fixed-size scalars. Here’s how to construct a simple int64_t type:
>>> ndt('int64')
ndt("int64")
All fixed-size scalars:

void   boolean    signed int   unsigned int   float [2]   complex
void   bool [1]   int8         uint8          float16     complex32
                  int16        uint16         float32     complex64 [3]
                  int32        uint32         float64     complex128 [4]
                  int64        uint64         bfloat16    bcomplex32

[1] implemented as char
[2] IEEE 754-2008 binary floating point types
[3] implemented as complex<float32>
[4] implemented as complex<float64>
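To make "fixed-size and platform independent" concrete, the following sketch looks up byte widths with Python's standard struct module rather than with ndtypes. The mapping from datashape scalar names to struct format codes is an assumption made for illustration; struct has no codes for the complex types or for bfloat16, so those are left out.

```python
import struct

# Hypothetical mapping from datashape scalar names to struct format codes.
# The "<" prefix selects struct's standard mode, in which sizes do not
# depend on the platform.
STRUCT_CODES = {
    "bool": "?", "int8": "b", "uint8": "B", "int16": "h", "uint16": "H",
    "int32": "i", "uint32": "I", "int64": "q", "uint64": "Q",
    "float16": "e", "float32": "f", "float64": "d",
}


def size_in_bytes(name):
    """Return the fixed width in bytes of the named scalar."""
    return struct.calcsize("<" + STRUCT_CODES[name])
```

For example, size_in_bytes("int64") is 8 on every platform, which is what distinguishes these scalars from the machine dependent aliases.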
Aliases¶
Datashape has a number of aliases for scalars, which are internally mapped to their corresponding platform-specific fixed-size types. This is how to construct an intptr_t:
>>> ndt('intptr')
ndt("int64")
Machine dependent aliases:

intptr    intptr_t
uintptr   uintptr_t
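The idea of resolving an alias to a fixed-size type can be sketched without ndtypes. The snippet below derives the pointer width from the standard struct module and assumes, as holds on mainstream platforms, that intptr_t is as wide as a pointer. The MACHINE_ALIASES name is hypothetical.

```python
import struct

# Native pointer width in bits; "P" is the struct format code for void *.
POINTER_BITS = struct.calcsize("P") * 8

# Hypothetical alias table: each alias resolves to a fixed-size scalar
# whose width matches the native pointer width (an assumption).
MACHINE_ALIASES = {
    "intptr": f"int{POINTER_BITS}",
    "uintptr": f"uint{POINTER_BITS}",
}
```

On a 64-bit machine MACHINE_ALIASES["intptr"] is "int64", in line with the ndt('intptr') output shown above.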
Chars, strings, bytes¶
Encodings¶
Datashape defines the following encodings for strings and characters. Each encoding has several aliases:
canonical form   aliases
'ascii'          'A', 'us-ascii'
'utf8'           'U8', 'utf-8'
'utf16'          'U16', 'utf-16'
'utf32'          'U32', 'utf-32'
'ucs2'           'ucs_2', 'ucs2'
As seen in the table, encodings must be given in string form:
>>> ndt("char('utf16')")
ndt("char('utf16')")
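The relationship between canonical forms and aliases can be sketched as a plain lookup table. The table contents are transcribed from the encodings table above; the canonical_encoding helper is hypothetical, and treating names as case-sensitive is an assumption of this sketch, not a documented property of ndtypes.

```python
# Alias table transcribed from the encodings table above.
ENCODING_ALIASES = {
    "ascii": ("A", "us-ascii"),
    "utf8": ("U8", "utf-8"),
    "utf16": ("U16", "utf-16"),
    "utf32": ("U32", "utf-32"),
    "ucs2": ("ucs_2", "ucs2"),
}

# Reverse lookup: every accepted spelling maps to its canonical form.
CANONICAL = {
    spelling: canonical
    for canonical, aliases in ENCODING_ALIASES.items()
    for spelling in (canonical, *aliases)
}


def canonical_encoding(name):
    """Return the canonical form of an encoding name (hypothetical helper)."""
    try:
        return CANONICAL[name]
    except KeyError:
        raise ValueError(f"unknown encoding: {name!r}") from None
```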
Chars¶
The char constructor accepts 'ascii', 'ucs2' and 'utf32' encoding arguments. char without arguments is equivalent to char('utf32').
>>> ndt("char('ascii')")
ndt("char('ascii')")
>>> ndt("char('utf32')")
ndt("char('utf32')")
>>> ndt("char")
ndt("char('utf32')")
UTF-8 strings¶
The string type is a variable length NUL-terminated UTF-8 string:
>>> ndt("string")
ndt("string")
Fixed size strings¶
The fixed_string type takes a length and an optional encoding argument:
>>> ndt("fixed_string(1729)")
ndt("fixed_string(1729)")
>>> ndt("fixed_string(1729, 'utf16')")
ndt("fixed_string(1729, 'utf16')")
Bytes¶
The bytes type is variable length and takes an optional alignment argument. Valid values are powers of two in the range [1, 16].
>>> ndt("bytes")
ndt("bytes")
>>> ndt("bytes(align=2)")
ndt("bytes(align=2)")
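The alignment rule stated above (powers of two in the range [1, 16]) can be sketched as a small check. The is_valid_align helper is hypothetical and only mirrors the rule as written; it does not call into ndtypes.

```python
def is_valid_align(align):
    """Return True if align is a power of two in the range [1, 16]."""
    if not isinstance(align, int) or not 1 <= align <= 16:
        return False
    # A power of two has exactly one bit set, so clearing the lowest
    # set bit must leave zero.
    return (align & (align - 1)) == 0
```

Under this rule the only accepted values are 1, 2, 4, 8 and 16.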
Fixed size bytes¶
The fixed_bytes type takes a length and an optional alignment argument. The latter is a keyword-only argument in order to prevent accidental swapping of the two integer arguments:
>>> ndt("fixed_bytes(size=32)")
ndt("fixed_bytes(size=32)")
>>> ndt("fixed_bytes(size=128, align=8)")
ndt("fixed_bytes(size=128, align=8)")
References¶
Datashape references are fully general and can point to types of arbitrary complexity:
>>> ndt("ref(int64)")
ndt("ref(int64)")
>>> ndt("ref(10 * {a: int64, b: 10 * float64})")
ndt("ref(10 * {a : int64, b : 10 * float64})")
Categorical type¶
The categorical type allows specifying subsets of types. This is implemented as a set of typed values. Types are inferred and interpreted as int64, float64 or strings. The NA keyword creates a category for missing values.
>>> ndt("categorical(1, 10)")
ndt("categorical(1, 10)")
>>> ndt("categorical(1.2, 100.0)")
ndt("categorical(1.2, 100)")
>>> ndt("categorical('January', 'August')")
ndt("categorical('January', 'August')")
>>> ndt("categorical('January', 'August', NA)")
ndt("categorical('January', 'August', NA)")
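The inference rule for categorical values can be illustrated with ordinary Python literals. This is only a sketch of the rule as described: the infer_category_type helper is hypothetical, None stands in for the NA keyword, and rejecting booleans is an assumption.

```python
NA = None  # stand-in for the NA keyword (assumption of this sketch)


def infer_category_type(value):
    """Map a Python literal to the type its category would be given."""
    if value is NA:
        return "NA"
    if isinstance(value, bool):
        # bool is a subclass of int in Python; exclude it explicitly.
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported category value: {value!r}")
```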
Option type¶
The option type provides safe handling of values that may or may not be present. The concept is well-known from languages like ML or SQL.
>>> ndt("?complex64")
ndt("?complex64")
Dtype variables¶
Dtype variables are used in quantifier-free type schemes and pattern matching. The scope of a variable extends over the entire type term, so every occurrence of the same variable stands for the same dtype.
>>> ndt("T")
ndt("T")
>>> ndt("10 * 16 * T")
ndt("10 * 16 * T")
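What it means for a variable to range over the entire type term can be sketched with a toy matcher. Terms are modelled here as strings and nested tuples, and any name starting with an uppercase letter is treated as a variable. Both choices are assumptions for illustration, and the match function is not part of ndtypes.

```python
def match(pattern, candidate, bindings=None):
    """Return the variable bindings if candidate matches pattern, else None."""
    bindings = {} if bindings is None else bindings
    if isinstance(pattern, str):
        if pattern[:1].isupper():
            # A variable binds on first use and must agree with that
            # binding everywhere else in the term.
            bound = bindings.setdefault(pattern, candidate)
            return bindings if bound == candidate else None
        return bindings if pattern == candidate else None
    if not isinstance(candidate, tuple) or len(pattern) != len(candidate):
        return None
    for sub_pattern, sub_candidate in zip(pattern, candidate):
        if match(sub_pattern, sub_candidate, bindings) is None:
            return None
    return bindings
```

With this sketch, the pattern ("T", "T") matches ("int64", "int64") with T bound to "int64", but it does not match ("int64", "float32"), because T cannot stand for two different dtypes in one term.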
Symbolic constructors¶
Symbolic constructors stand for any constructor that takes the given datashape argument. They are used in pattern matching.
>>> ndt("Coulomb(float64)")
ndt("Coulomb(float64)")
Type kinds¶
Type kinds denote specific subsets of dtypes, types or dimension types. Type kinds are in the dtype section because of the way the grammar is organized. Currently available are:

type kind     set                        specific subset
Any           datashape                  datashape
Scalar        dtypes                     scalars
Categorical   dtypes                     categoricals
FixedString   dtypes                     fixed_strings
FixedBytes    dtypes                     fixed_bytes
Fixed         dimension kind instances   fixed dimensions
Type kinds are used in pattern matching.
Composite types¶
Datashape has container and function dtypes.
Tuples¶
As usual, the tuple type is the product type of a fixed number of types:
>>> ndt("(int64, float32, string)")
ndt("(int64, float32, string)")
Tuples can be nested:
>>> ndt("(bytes, (int8, fixed_string(10)))")
ndt("(bytes, (int8, fixed_string(10)))")
Records¶
Records are equivalent to tuples with named fields:
>>> ndt("{a: float32, b: float64}")
ndt("{a : float32, b : float64}")
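For readers coming from Python, the statement that records are tuples with named fields has a close analogue in typing.NamedTuple. The sketch below is an analogy only: the Record class is hypothetical, and Python's float is a double, so the float32 field of the example above is not modelled exactly.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """Python analogue of the record {a: float32, b: float64}."""
    a: float
    b: float


rec = Record(a=1.5, b=2.5)
```

The value rec compares equal to the plain tuple (1.5, 2.5), while its fields remain accessible by name as rec.a and rec.b.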
Functions¶
In datashape, function types can have positional and keyword arguments. Internally, positional arguments are represented by a tuple and keyword arguments by a record. Both kinds of arguments can be variadic.
Positional-only¶
This is a function type with a single positional int32 argument, returning an int32:
>>> ndt("(int32) -> int32")
ndt("(int32) -> int32")
This is a function type with three positional arguments:
>>> ndt("(int32, complex128, string) -> float64")
ndt("(int32, complex128, string) -> float64")
Positional-variadic¶
This is a function type with a single required positional argument, followed by any number of additional positional arguments:
>>> ndt("(int32, ...) -> int32")
ndt("(int32, ...) -> int32")
Arrays¶
In datashape, dimension kinds [5] are part of array type declarations. Datashape supports the following dimension kinds:
Fixed Dimension¶
A fixed dimension denotes an array type with a fixed number of elements of a specific type. The type can be written in two ways:
>>> ndt("fixed(shape=10) * uint64")
ndt("10 * uint64")
>>> ndt("10 * uint64")
ndt("10 * uint64")
Formally, fixed(shape=10) is a dimension constructor, not a type constructor. The * is the array type constructor in infix notation, taking as arguments a dimension and an element type.
The second form is equivalent to the first one. For users of other languages, it may be helpful to view this type as array[10] of uint64.
Multidimensional arrays are constructed in the same manner; the * is right associative:
>>> ndt("10 * 25 * float64")
ndt("10 * 25 * float64")
Again, it may help to view this type as array[10] of (array[25] of float64). In this case, float64 is the dtype of the multidimensional array.
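Right associativity of * can be made explicit by nesting the type from the right. The nest helper below is a hypothetical sketch that handles only integer fixed dimensions followed by a simple dtype name; it is not the datashape parser.

```python
def nest(type_string):
    """Turn 'd1 * d2 * ... * dtype' into nested (dimension, element) pairs."""
    *dims, dtype = [part.strip() for part in type_string.split("*")]
    result = dtype
    # Wrap from the innermost dimension outwards, which is what right
    # associativity of * means.
    for dim in reversed(dims):
        result = (int(dim), result)
    return result
```

Here nest("10 * 25 * float64") evaluates to (10, (25, "float64")), mirroring the reading array[10] of (array[25] of float64).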
Dtypes can be arbitrarily complex. Here is an array with a dtype of a record that contains another array:
>>> ndt("120 * {size: int32, items: 10 * int8}")
ndt("120 * {size : int32, items : 10 * int8}")
Variable Dimension¶
The variable dimension kind describes an array type with a variable number of elements of a specific type:
>>> ndt("var * float32")
ndt("var * float32")
In this case, var is the dimension constructor and the * fulfils the same role as above. Many managed languages have variable-sized arrays, so this type could be viewed as array of float32. In a sense, fixed-size arrays are just a special case of variable-sized arrays.
Symbolic Dimension¶
Datashape supports symbolic dimensions, which are used in pattern matching. A symbolic dimension is an uppercase variable that stands for a fixed dimension.
In this manner entire sets of array types can be specified. The following type describes the set of all M * N matrices with a float32 dtype:
>>> ndt("M * N * float32")
ndt("M * N * float32")
The next type describes a function that performs matrix multiplication on any permissible pair of input matrices with dtype T:
>>> ndt("(M * N * T, N * P * T) -> M * P * T")
ndt("(M * N * T, N * P * T) -> M * P * T")
In this case, we have used both symbolic dimensions and the type variable T.
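The way the symbolic dimensions in the matrix multiplication signature constrain the inputs can be sketched on plain shape tuples. The matmul_shape helper is hypothetical; it only shows that N must be bound to the same value in both arguments and that the result shape is built from M and P.

```python
def matmul_shape(a_shape, b_shape):
    """Bind M, N, P from (M * N, N * P) and return the result shape M * P."""
    (m, n), (n_again, p) = a_shape, b_shape
    if n != n_again:
        # N occurs twice in the signature, so both occurrences must agree.
        raise TypeError(f"N is bound to {n} and cannot also be {n_again}")
    return (m, p)
```

For instance, matmul_shape((2, 3), (3, 4)) gives (2, 4), whereas matmul_shape((2, 3), (4, 5)) raises TypeError because N would have to be both 3 and 4.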
Symbolic dimensions can be mixed with fixed dimensions:
>>> ndt("10 * N * float64")
ndt("10 * N * float64")
Ellipsis Dimension¶
The ellipsis, used in pattern matching, stands for any number of dimensions. Datashape supports both named and unnamed ellipses:
>>> ndt("... * float32")
ndt("... * float32")
Named form:
>>> ndt("Dim... * float32")
ndt("Dim... * float32")
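A rough sketch of how an ellipsis absorbs dimensions: the hypothetical match_ellipsis helper below compares a concrete type string against the pattern "... * dtype" and reports which dimensions the ellipsis would stand for. It splits on * naively, so it only handles simple dtype names, and accepting zero dimensions is an assumption of the sketch.

```python
def match_ellipsis(type_string, dtype="float32"):
    """Return the dimensions absorbed by '...', or None if the dtype differs."""
    *dims, found = [part.strip() for part in type_string.split("*")]
    return tuple(dims) if found == dtype else None
```

For example, match_ellipsis("10 * 25 * float32") returns ("10", "25"), and match_ellipsis("10 * int64") returns None because the dtype does not match.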
Ellipsis dimensions play an important role in broadcasting. More on the topic can be found in the section on pattern matching.
[5] In the whole text, dimension kind and dimension are synonymous.