-
Hi @jpivarski, I thought I'd place this here for a wider discussion. Something I am encountering quite a lot at the moment is the need to decide upon (and require) a particular array structure for a routine. With NumPy, I'd often be able to assume that an operation will vectorize over the n-th dimension, and leading dimensions just modify the output structure. There doesn't seem to be a way to do this in Awkward unless the operation is a ufunc. In my particular case, I have a lookup function that maps from a record to another record, which I want to perform at the innermost dimension. Where the RecordArray has only one dimension, this is simply LOOKUP_TABLE = ak.from_numpy(np.random.uniform(size=(10, 10, 10, 3)))
def transform(addr, highlevel=True):
transformed = LOOKUP_TABLE[
addr["x"],
addr["y"],
addr["z"],
]
return ak.zip(
{
"u": transformed[:, 0],
"v": transformed[:, 1],
"w": transformed[:, 2],
},
highlevel=highlevel,
) With additional structure, e.g. several of these addresses per-event (in uproot terminology), this mapping no longer works, because jagged indices are treated differently to regular ones. data = ak.zip({
'x': [[0,1,2], [3], [4,5,6,7]],
'y': [[8,9,0], [8], [7,8,5,3]],
'z': [[5,4,3], [9], [1,1,2,5]],
}, with_name="coord") >>> data.type
3 * var * {"x": int64, "y": int64, "z": int64}
>>> transform(data) ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-682-4ad728482488> in <module>
----> 1 transform(data)
<ipython-input-678-5aa8c791c389> in transform(addr, highlevel)
1 def transform(addr, highlevel=True):
----> 2 transformed = LOOKUP_TABLE[
3 addr["x"],
4 addr["y"],
5 addr["z"],
/opt/texat-venv/lib/python3.9/site-packages/awkward/highlevel.py in __getitem__(self, where)
972 have the same dimension as the array being indexed.
973 """
--> 974 return ak._util.wrap(self.layout[where], self._behavior)
975
976 def __setitem__(self, where, what):
ValueError: cannot fit jagged slice with length 3 into RegularArray of size 10
(https://github.com/scikit-hep/awkward-1.0/blob/1.2.2/src/libawkward/array/RegularArray.cpp#L1537) Flattening the ArrayBecause this operation does not care about the structure at all, we can completely flatten the address, and then completely unflatten the result. One way to do this is to actually invoke def apply_flat(array, func):
sizes = []
# Flatten array
flattened = array
for i in range(array.ndim - 1):
sizes.append(ak.num(flattened))
flattened = ak.flatten(flattened)
# Compute result on flattened array
result = func(flattened)
# Unflatten the result
for s in reversed(sizes):
result = ak.unflatten(result, s)
return result This approach is not elegant, and may involve several copies of the data (due to Operating on the ContentsAn alternative solution is to operate directly on the underlying buffers (contents). In this example, we can require that any array must have records with scalar fields, such that the def apply_to_record(array, record, func):
def getfunction(layout, depth):
if not isinstance(layout, ak.layout.RecordArray):
return
if layout.parameter("__record__") != record:
return
return lambda: func(layout)
content = ak._util.recursively_apply(
ak.operations.convert.to_layout(array), getfunction
)
return ak._util.wrap(content, ak._util.behaviorof(array)) where we call the transform upon the layout which has the appropriate record name. This works by reusing the parent/outer layouts, but replacing the underlying record itself. >>> apply_to_record(data, "coord", lambda x: transform(x, highlevel=False))
<Array [[{u: 0.227, v: 0.0513, ... w: 0.689}]] type='3 * var * {"u": float64, "v...'> It seems like this is done in several places in coffea. There might be a nicer way to do this, I'd appreciate any discussion on the topic! |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 5 replies
-
Maybe the appropriate jagged array to use as a slice could be built with ak.local_index? |
Beta Was this translation helpful? Give feedback.
-
Here's another approach (you mentioned Numba in Gitter). You can write a Numba function that doesn't care how many levels deep an array is, as long as it can resolve that depth during compilation. I just got help on this on the numba/numba-dev Gitter (thanks, @seibert!), since my attempt to do it using nb.generated_jit ran into an unimplemented feature. The reason for using The >>> import numba as nb
>>> import numpy as np
>>> import awkward as ak
>>>
>>> def dummy(x):
... pass
...
>>> @nb.extending.overload(dummy)
... def _add_one(data):
... if data.type.ndim == 1:
... print(f"Typing THEN branch: ndim={data.type.ndim}")
... def impl(data):
... tmp = np.asarray(data) # zero-copy wrap as NumPy, so that it can be changed in-place
... for i in range(len(tmp)):
... tmp[i] += 1
... return impl
... else:
... print(f"Typing ELSE branch: ndim={data.type.ndim}")
... def impl(data):
... for x in data:
... dummy(x)
... return impl
...
>>> @nb.njit
... def add_one(data):
... print("Actually running")
... return dummy(data)
...
>>> array = ak.Array([[[[[1], []]], [[], [[2, 3]]]], []])
>>> add_one(array)
Typing ELSE branch: ndim=5
Typing ELSE branch: ndim=4
Typing ELSE branch: ndim=3
Typing ELSE branch: ndim=2
Typing THEN branch: ndim=1
Actually running
>>> array
<Array [[[[[2], []]], [[], [[3, 4]]]], []] type='2 * var * var * var * var * int64'> The Normally, we don't update Awkward Arrays in place because it's hard to know which arrays share buffers: an update might have long-distance effects. So here's an example that doesn't change the array in-place. Unfortunately, ak.ArrayBuilder is dynamically typed (even in Numba) and it will slow down the process. (TypedArrayBuilder should fix this, once it's implemented as a Numba extension.) >>> def dummy(data, builder):
... pass
...
>>> @nb.extending.overload(dummy)
... def _add_one(data, builder):
... if data.type.ndim == 1:
... def impl(data, builder):
... for x in data:
... builder.integer(x + 1)
... return builder
... return impl
... else:
... def impl(data, builder):
... for x in data:
... builder.begin_list()
... dummy(x, builder)
... builder.end_list()
... return builder
... return impl
...
>>> @nb.njit
... def add_one(data, builder):
... return dummy(data, builder)
...
>>> builder = add_one(
... ak.Array([[[[[1], []]], [[], [[2, 3]]]], []]),
... ak.ArrayBuilder(),
... )
>>> builder.snapshot()
<Array [[[[[2], []]], [[], [[3, 4]]]], []] type='2 * var * var * var * var * int64'> |
Beta Was this translation helpful? Give feedback.
Here's another approach (you mentioned Numba in Gitter). You can write a Numba function that doesn't care how many levels deep an array is, as long as it can resolve that depth during compilation. I just got help on this on the numba/numba-dev Gitter (thanks, @seibert!), since my attempt to do it using nb.generated_jit ran into an unimplemented feature. The reason for using
nb.generated_jit
instead ofnb.jit
is that you get to write type-dependent Python code that only runs at compile-time—basically, like template metaprogramming in C++, but with a much easier language to do that in (Python).The
nb.generated_jit
approach raises NotImplementedError when you construct a recursion on types,…