Hi there, I am new to Python and am really excited by Awkward Array: it is exactly what I switched programming languages for! I have a question that might be ridiculously easy, but I couldn't find a good answer in the API docs or previous discussions: I have two Awkward Arrays (dumping partial records in JSON form):
They are different sizes. I want to merge the first with the second based on the RandomID. How do I do that most efficiently? Thanks!
Hi! If I'm reading your question right, this is an important database/data analysis operation known as "joining." Awkward Array doesn't have a built-in operation for joining because the scope of the project is to be "NumPy with data structures," not "Pandas with data structures." Obviously, the latter would also be useful, but one step at a time. (We're thinking about ways of doing this; see, for example, #350 (comment) and search for discussions about Awkward Xarray.)

By "NumPy" vs "Pandas," I'm referring to a level of abstraction. Pandas has this join operation (which they call "merging") as a primary high-level feature. With NumPy, you can do it, but it takes more steps. Likewise, you can do it with Awkward Array as a multi-step process, too. In the example below, I'll be using only vectorized operations (no Python for loops), though joining can't be done in O(n) time (where n is the length of the input arrays); I'm pretty sure the following is O(n log n).

Let me start with a highly simplified example because I'll be walking through the logic of it, and I want as few distractions as possible. It may be possible to wrap this up into a generic function, and maybe a function like that should become one of the built-in Awkward operations.

>>> import awkward as ak
>>> import numpy as np
>>> one = ak.Array([{"x": 3}, {"x": 1}, {"x": 5}, {"x": 1}, {"x": 2}])
>>> two = ak.Array([{"y": 4}, {"y": 3}, {"y": 3}, {"y": 1}, {"y": 6}])
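
To make the "more steps with NumPy" point concrete, here is a minimal sketch of the key-matching idea on bare NumPy key arrays, before any record structure is involved. The function name `match_keys` and the `-1` sentinel for "no match" are assumptions made for illustration; they are not part of NumPy or Awkward Array.

```python
import numpy as np


def match_keys(left_keys, right_keys):
    """For each right key, return the position of its first exact match
    in left_keys, or -1 if there is none (hypothetical helper)."""
    left_keys = np.asarray(left_keys)
    right_keys = np.asarray(right_keys)

    # Sorted unique left keys, plus the position of each one's first occurrence.
    dictionary, index = np.unique(left_keys, return_index=True)
    if len(dictionary) == 0:
        # Nothing to match against: every right key is unmatched.
        return np.full(len(right_keys), -1)

    # Binary search for the insertion point of each right key.
    closest = np.searchsorted(dictionary, right_keys, side="left")

    # searchsorted returns len(dictionary) for keys larger than every entry,
    # so clip before looking up; the exactness test rejects those anyway.
    clipped = np.minimum(closest, len(dictionary) - 1)
    exact = dictionary[clipped] == right_keys

    return np.where(exact, index[clipped], -1)


positions = match_keys([3, 1, 5, 1, 2], [4, 3, 3, 1, 6])
print(positions)
```

The sort inside `np.unique` and the binary searches in `np.searchsorted` are why this kind of approach is O(n log n) rather than O(n).
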

Needless to say, the join will not be symmetric. Any problems with

We're going to use NumPy functions because the corresponding Awkward functions don't exist. The

The relevant NumPy functions are np.unique, which will create a sorted, no-duplicates copy of

Step 1: create a dictionary of

>>> # Using the fact that we can zero-copy project out one's index:
>>> one.x
<Array [3, 1, 5, 1, 2] type='5 * int64'>
>>> np.asarray(one.x)
array([3, 1, 5, 1, 2])
>>> dictionary, index = np.unique(np.asarray(one.x), return_index=True)
>>> dictionary
array([1, 2, 3, 5])
>>> index
array([1, 4, 0, 2])

The

>>> one[index]
<Array [{x: 1}, {x: 2}, {x: 3}, {x: 5}] type='4 * {"x": int64}'>

but we won't need that until the final step.

Step 2: find the values in

>>> closest = np.searchsorted(dictionary, np.asarray(two.y), side="left")
>>> closest
array([3, 2, 2, 0, 4])

This is a vectorized "search sorted," which finds the closest match of all values in

The return values of this function are the index positions in

>>> len(closest), len(two.y), len(dictionary)
(5, 5, 4)

Unfortunately, due to the way that

>>> dictionary[closest]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 4 is out of bounds for axis 0 with size 4

that one value gives us troubles. In the next step, we'll be applying a filter requiring each closest match to be an exact match (we want to exactly match

Step 3: filter out close matches that aren't exact matches.

Because of the issue described above, we'll have to filter out non-exact matches in two steps:

You can filter an array with an array of booleans, but doing two filters is problematic because the first changes the length, such that the second array of booleans doesn't fit:

>>> example = np.array(["a", "b", "c", "d", "e"])
>>> drop_vowels1 = np.array([False, True, True, True, True])
>>> drop_vowels2 = np.array([True, True, True, True, False])
>>> example[drop_vowels1]
array(['b', 'c', 'd', 'e'], dtype='<U1')
>>> example[drop_vowels1][drop_vowels2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: boolean index did not match indexed array along dimension 0; dimension is 4 but corresponding boolean dimension is 5
>>> # What you'd have to do instead (yuck):
>>> example[drop_vowels1][drop_vowels2[drop_vowels1]]
array(['b', 'c', 'd'], dtype='<U1')

Since Awkward Array has missing values (more generally than NumPy's masked arrays), we've introduced another kind of filtering operation,

>>> is_within_range = ak.Array(closest).mask[closest < len(dictionary)]
>>> is_within_range
<Array [3, 2, 2, 0, None] type='5 * ?int64'>
>>> is_good_match = ak.Array(dictionary)[is_within_range] == two.y
>>> is_good_match
<Array [False, True, True, True, None] type='5 * ?bool'>

Note that these need to be Awkward Arrays, not NumPy arrays, for this "masked filter" feature to work.

Now we can apply this filter to the closest matches to get an indexer "

>>> reordering = ak.Array(closest).mask[is_good_match]
>>> reordering
<Array [None, 2, 2, 0, None] type='5 * ?int64'>

To see how we can use this

>>> ak.Array(dictionary)[reordering]
<Array [None, 3, 3, 1, None] type='5 * ?int64'>
>>> two.y
<Array [4, 3, 3, 1, 6] type='5 * int64'>

Everywhere that a value in the

Step 4: apply this

We'd like to apply

>>> dictionary
array([1, 2, 3, 5])
>>> one[index]
<Array [{x: 1}, {x: 2}, {x: 3}, {x: 5}] type='4 * {"x": int64}'>
>>> one[index][reordering]
<Array [None, {x: 3}, {x: 3}, {x: 1}, None] type='5 * ?{"x": int64}'>
>>> two.y
<Array [4, 3, 3, 1, 6] type='5 * int64'>

(This is function composition, in which the

So now that we can put full records from

>>> joined = ak.zip({"one": one[index][reordering], "two": two})
>>> joined.type
5 * {"one": ?{"x": int64}, "two": {"y": int64}}
>>> joined.tolist()
[{'one': None, 'two': {'y': 4}},
{'one': {'x': 3}, 'two': {'y': 3}},
{'one': {'x': 3}, 'two': {'y': 3}},
{'one': {'x': 1}, 'two': {'y': 1}},
 {'one': None, 'two': {'y': 6}}]

(If you want to merge them into one level of record structure, which has potential issues if any field names are the same, then you'll have to build that manually with

As promised, values in

Your case may be simpler: you might not have any duplicates or mismatches, but the above procedure would still work. Your case might be more complex: you might need a full outer join, rather than a left/right outer join. I think you could get that with a few extra steps after this one, but not knowing whether you need that, I'm going to just stop here. Good luck!
Out of curiosity, which programming language did you switch from? It looks like you're doing something with health care data?