Merging arrays based on common fields (i.e. "joining" tables) #633

nongiga · 2021-01-03T13:48:51Z

Hi there,

I am new to python and am really excited by AwkwardArray: it is exactly what I switched programming languages for!

I have a question that might be ridiculously easy, but I couldn't find a good answer in the API documentation or in previous discussions:

I have two AwkwardArrays (dumping partial records in JSON form):

{
    "CaseNum": 3119905,
    "Isolates": [
      {
        "IsoNum": 5567545,
        "SeqsDir": "Sequencing_analysis/Seqplate1to13_17to25/",
        "SiteString": "Urine"
      }
    ]
  },

{
    "CaseNum": 1905256,
    "Seqplates": [
      "1_E5",
      "1_D5"
    ]
  },
  {
    "CaseNum": 1907609,
    "Seqplates": [
      "22_H10",
      "23_H4"
    ]
  },

They have different lengths. I want to merge the first with the second based on the RandomID. How do I do that most efficiently?

Thanks!

jpivarski · 2021-01-03T19:26:36Z

jpivarski
Jan 3, 2021
Maintainer

Hi! If I'm reading your question right, this is an important database/data analysis operation known as "joining." Awkward Array doesn't have a built-in operation for joining because the scope of the project is to be "NumPy with data structures," not "Pandas with data structures." Obviously, the latter would also be useful, but one step at a time. (We're thinking about ways of doing this; see, for example, #350 (comment) and search for discussions about Awkward Xarray.)

By "NumPy" vs "Pandas," I'm referring to a level of abstraction. Pandas has this join operation (which they call "merging") as a primary high-level feature. With NumPy, you can do it, but it takes more steps. Likewise, you can do it with Awkward Array as a multi-step process, too. In the example below, I'll be using only vectorized operations (no Python for loops), though joining can't be done in O(n) time (where n is the length of the input arrays); I'm pretty sure the following is O(n log n).

Let me start with a highly simplified example because I'll be walking through the logic of it, and I want as few distractions as possible. It may be possible to wrap this up into a generic function, and maybe a function like that should become one of the built-in Awkward operations.

>>> import awkward as ak
>>> import numpy as np
>>> one = ak.Array([{"x": 3}, {"x": 1}, {"x": 5}, {"x": 1}, {"x": 2}])
>>> two = ak.Array([{"y": 4}, {"y": 3}, {"y": 3}, {"y": 1}, {"y": 6}])

one and two are both arrays of records and they each have a field that will be used to relate one array to the other. If the records had more fields, as yours do, they would "go along for the ride" when joining one to two, so for simplicity, I've left them out entirely. I've added a lot of complications to the indexes so that this example is general enough:

There are repeated indexes in one: in two cases, the x field is 1. Only the first of these will be matched to {"y": 1} in two (not the last, due to an implementation detail of np.unique).
There are repeated indexes in two: in two cases, the y field is 3. The matching {"x": 3} from one will be duplicated to combine with each of these.
There are indexes in one that don't correspond to any in two, namely {"x": 5} and {"x": 2}. These will be dropped.
There are indexes in two that don't correspond to any in one, namely {"y": 4} and {"y": 6}. These will be combined with a missing value (None) where a value from one is expected.

Needless to say, the join will not be symmetric. Any problems with one's index result in skipping the record (it does not appear in the output, even as a placeholder) but any problems with two's index result in duplicating a value from one or introducing a placeholder. That makes this either a "left outer join" or a "right outer join," depending on whether you consider one to be to the left of two or not. (If this gets wrapped up in a general-purpose function, care would need to be taken to get the order right to match the formal definition.) If you have an asymmetry between your two arrays, be sure to associate the right ones with one and two!
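For reference, the matching rules just described can be written as a slow, plain-Python loop. This is only a sketch to pin down the expected result, not the vectorized method built up below; the names first_position and matches are made up for this illustration.

```python
one = [{"x": 3}, {"x": 1}, {"x": 5}, {"x": 1}, {"x": 2}]
two = [{"y": 4}, {"y": 3}, {"y": 3}, {"y": 1}, {"y": 6}]

# Remember only the first position at which each x value appears in `one`.
first_position = {}
for position, record in enumerate(one):
    first_position.setdefault(record["x"], position)

# For each record in `two`, look up the matching position in `one` (None if absent).
matches = [first_position.get(record["y"]) for record in two]

# Pair each record of `two` with its match from `one`, or with None.
joined = [
    {"one": None if m is None else one[m], "two": record}
    for m, record in zip(matches, two)
]

print(matches)  # [None, 0, 0, 1, None]
```

The output has one entry per record of two: {"x": 3} is used twice, the {"x": 1} at position 1 (not position 3) is used once, and {"x": 5} and {"x": 2} never appear.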

We're going to use NumPy functions because the corresponding Awkward functions don't exist. The np.asarray(·) and ak.Array(·) functions convert Awkward Arrays into NumPy arrays and back (they behave like ak.to_numpy and ak.from_numpy) without copying data (yay!) if the Awkward Array is an n-dimensional, rectangular array. If it is not, the conversion to NumPy raises an exception. The NumPy → Awkward direction is more forgiving than the other because Awkward Array types are more general than NumPy shape + dtypes.

The relevant NumPy functions are np.unique, which will create a sorted, no-duplicates copy of one's index, and np.searchsorted, which will find values in two's index that match this dictionary.

Step 1: create a dictionary of one's index.

>>> # Using the fact that we can zero-copy project out one's index:
>>> one.x
<Array [3, 1, 5, 1, 2] type='5 * int64'>
>>> np.asarray(one.x)
array([3, 1, 5, 1, 2])
>>> dictionary, index = np.unique(np.asarray(one.x), return_index=True)
>>> dictionary
array([1, 2, 3, 5])
>>> index
array([1, 4, 0, 2])

The dictionary is the sorted, no-duplicates copy of one.x and index is something we can use to reorder and de-duplicate actual records from one:

>>> one[index]
<Array [{x: 1}, {x: 2}, {x: 3}, {x: 5}] type='4 * {"x": int64}'>

but we won't need that until the final step.
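The relationship between dictionary and index can be checked with NumPy alone. The sketch below uses a plain array keys in place of one.x (an assumption made so that it runs without Awkward Array):

```python
import numpy as np

keys = np.array([3, 1, 5, 1, 2])  # stands in for np.asarray(one.x)
dictionary, index = np.unique(keys, return_index=True)

# Indexing the original keys with `index` reproduces the sorted, de-duplicated values.
assert np.array_equal(keys[index], dictionary)

# The value 1 appears at positions 1 and 3; `index` keeps the first of them.
position_of_one = index[dictionary == 1][0]
print(position_of_one)  # 1
```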

Step 2: find the values in two's index that are as close as possible to those in the dictionary.

>>> closest = np.searchsorted(dictionary, np.asarray(two.y), side="left")
>>> closest
array([3, 2, 2, 0, 4])

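This is a vectorized "search sorted": in a single Python function call, it finds for every value in two.y the position of the closest dictionary entry that is not smaller than it (with side="left", the first position at which dictionary is greater than or equal to the value). The implicit loop over all elements in two.y is done in compiled code, so there is no per-element Python overhead, however large your arrays are.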

The return values of this function are the index positions in dictionary that are closest to each value from two.y. The result therefore has the same length as two.y (not the same length as dictionary).

>>> len(closest), len(two.y), len(dictionary)
(5, 5, 4)

Unfortunately, due to the way that np.searchsorted works, it maps values in two.y that are larger than any in dictionary to len(dictionary). (Setting side="right" doesn't fix this and only makes the next step more complicated.) Therefore, if we use this index on the dictionary:

>>> dictionary[closest]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 4 is out of bounds for axis 0 with size 4

that one value gives us trouble. In the next step, we'll be applying a filter requiring each closest match to be an exact match (we want to exactly match one.x to two.y), so values like this are failures that we'd want to filter out anyway. The unfortunate thing is that they're a special-case failure.
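To see the three kinds of result side by side, here is a small NumPy sketch with a made-up probes array that contains values below, equal to, between, and above the dictionary entries:

```python
import numpy as np

dictionary = np.array([1, 2, 3, 5])
probes = np.array([0, 1, 4, 5, 6])  # hypothetical values, chosen to cover every case

# side="left" returns, for each probe, the first position i with dictionary[i] >= probe,
# or len(dictionary) if no such entry exists.
positions = np.searchsorted(dictionary, probes, side="left")
print(positions.tolist())  # [0, 0, 3, 3, 4]
```

The probes 0 and 1 share position 0, and 4 and 5 share position 3, so a position alone does not say whether the match is exact. The probe 6 lands at position 4, one past the end of dictionary.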

Step 3: filter out close matches that aren't exact matches.

Because of the issue described above, we'll have to filter out non-exact matches in two steps:

values in closest that are beyond the length of dictionary (i.e. 4 in the example above, which is greater than all good matches)
values in closest that don't exactly map to the original two.y (because they're in between ones that do, or less than all of them).

You can filter an array with an array of booleans, but doing two filters is problematic because the first changes the length, such that the second array of booleans doesn't fit:

>>> example = np.array(["a", "b", "c", "d", "e"])
>>> drop_vowels1 = np.array([False, True, True, True, True])
>>> drop_vowels2 = np.array([True, True, True, True, False])
>>> example[drop_vowels1]
array(['b', 'c', 'd', 'e'], dtype='<U1')
>>> example[drop_vowels1][drop_vowels2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: boolean index did not match indexed array along dimension 0;
            dimension is 4 but corresponding boolean dimension is 5

>>> # What you'd have to do instead (yuck):
>>> example[drop_vowels1][drop_vowels2[drop_vowels1]]
array(['b', 'c', 'd'], dtype='<U1')

Since Awkward Array has missing values (more generally than NumPy's masked arrays), we've introduced another kind of filtering operation, array.mask[filter], that does not change the length of the array; it masks out values by replacing them with a placeholder None. Subsequent operations with indexes containing None pass the None values through as a placeholder. Thus, we can do the two steps of our filter without having to filter our filters, as we would in the NumPy example above.

>>> is_within_range = ak.Array(closest).mask[closest < len(dictionary)]
>>> is_within_range
<Array [3, 2, 2, 0, None] type='5 * ?int64'>
>>> is_good_match = ak.Array(dictionary)[is_within_range] == two.y
>>> is_good_match
<Array [False, True, True, True, None] type='5 * ?bool'>

Note that these need to be Awkward Arrays, not NumPy arrays, for this "masked filter" feature to work.

Now we can apply this filter to the closest matches to get an indexer "reordering" that can be applied to the dictionary to get something that lines up with two. Since we're going with masked filters anyway, let's continue with that. (Otherwise, if you want the None values to become False, as a normal filtering array, use ak.fill_none to convert them.)

>>> reordering = ak.Array(closest).mask[is_good_match]
>>> reordering
<Array [None, 2, 2, 0, None] type='5 * ?int64'>

To see how we can use this reordering, apply it to dictionary:

>>> ak.Array(dictionary)[reordering]
<Array [None, 3, 3, 1, None] type='5 * ?int64'>
>>> two.y
<Array [4, 3, 3, 1, 6] type='5 * int64'>

Everywhere that a value in the dictionary can be matched to a value in two.y, it is matched (duplicated if necessary), and everywhere else, it's None.

Step 4: apply this reordering to one.

We'd like to apply reordering to one, but it indexes dictionary to match two.y. Remember how we asked np.unique for return_index=True? Now we'll be using that index to map one to dictionary and then use reordering to map dictionary's order to two's.

>>> dictionary
array([1, 2, 3, 5])
>>> one[index]
<Array [{x: 1}, {x: 2}, {x: 3}, {x: 5}] type='4 * {"x": int64}'>
>>> one[index][reordering]
<Array [None, {x: 3}, {x: 3}, {x: 1}, None] type='5 * ?{"x": int64}'>
>>> two.y
<Array [4, 3, 3, 1, 6] type='5 * int64'>

(This is function composition, in which the index and reordering are both int → int functions.)
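The composition can be illustrated with NumPy integer indexing. In this sketch, keys stands in for one.x and reordering contains only the non-missing entries, since plain NumPy indexers cannot hold None:

```python
import numpy as np

keys = np.array([3, 1, 5, 1, 2])   # plays the role of one.x
index = np.array([1, 4, 0, 2])     # from np.unique(..., return_index=True)
reordering = np.array([2, 2, 0])   # the non-missing entries of the reordering

# Applying the two indexers one after the other...
two_steps = keys[index][reordering]
# ...gives the same result as applying their composition once.
composed = index[reordering]
one_step = keys[composed]

print(two_steps.tolist(), composed.tolist())  # [3, 3, 1] [0, 0, 1]
```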

So now that we can put full records from one in the same order as full records from two, we can use them in the same operations, including ak.zip, which creates records of records.

>>> joined = ak.zip({"one": one[index][reordering], "two": two})
>>> joined.type
5 * {"one": ?{"x": int64}, "two": {"y": int64}}
>>> joined.tolist()
[{'one': None, 'two': {'y': 4}},
 {'one': {'x': 3}, 'two': {'y': 3}},
 {'one': {'x': 3}, 'two': {'y': 3}},
 {'one': {'x': 1}, 'two': {'y': 1}},
 {'one': None, 'two': {'y': 6}}]

(If you want to merge them into one level of record structure, which has potential issues if any field names are the same, then you'll have to build that manually with ak.zip({"x": joined.one.x, "y": joined.two.y}) and so on for all fields you want to merge. There's no high-level function for that yet because we'd have to figure out an interface for deciding which side to get each field from and how to name them. SQL and Pandas have troubles with that, too.)

As promised, values in one with no corresponding value in two are dropped, values in one with duplicate indexes are dropped, values in two with no corresponding value in one are given a placeholder None, and values in two with duplicate indexes copy the whole one record. (Note: I'm using the word "copy" in a high-level sense: the actual array is built with a lazy indexer that doesn't literally copy values from one until you access them. Look at joined.layout to see this internal structure.) If one and two had any more fields than the ones we're using for indexing, those fields would go along for the ride: they'd be reordered and/or duplicated as needed.

Your case may be simpler: you might not have any duplicates or mismatches, but the above procedure would still work. Your case might be more complex: you might need a full outer join, rather than a left/right outer join. I think you could get that with a few extra steps after this one, but not knowing whether you need that, I'm going to just stop here.
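Steps 1 through 3 could be wrapped into a helper along the following lines. This is a sketch, not a built-in: the name match_first is hypothetical, it works on plain NumPy key arrays, and it swaps the Awkward .mask approach for clamping the positions with np.minimum and marking non-matches with -1 instead of None.

```python
import numpy as np

def match_first(left_keys, right_keys):
    """For each right key, return the position of the first equal key in
    left_keys, or -1 if there is none. (Hypothetical helper for illustration.)"""
    left_keys = np.asarray(left_keys)
    right_keys = np.asarray(right_keys)
    if len(left_keys) == 0:
        return np.full(len(right_keys), -1, dtype=np.int64)

    # Step 1: sorted, de-duplicated keys and the position of each first occurrence.
    dictionary, index = np.unique(left_keys, return_index=True)

    # Step 2: insertion positions of the right keys in the dictionary.
    closest = np.searchsorted(dictionary, right_keys, side="left")

    # Step 3: clamp out-of-range positions so they can be used as indexes,
    # then keep only positions whose dictionary entry equals the key.
    safe = np.minimum(closest, len(dictionary) - 1)
    is_exact = dictionary[safe] == right_keys
    return np.where(is_exact, index[safe], -1)

positions = match_first([3, 1, 5, 1, 2], [4, 3, 3, 1, 6])
print(positions.tolist())  # [-1, 0, 0, 1, -1]
```

The -1 entries mark rows of two without a partner; they would have to be masked out (for example with .mask, as in step 3) before indexing one, because NumPy and Awkward both treat -1 as "the last element."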

Good luck!

7 replies

nongiga Jan 4, 2021
Author

Alright! I followed your instructions and only made minor changes for the last command that's particular to my case:

To add a field from two to one:

dictionary, index = np.unique(np.asarray(two.RandomID), return_index=True)
closest = np.searchsorted(dictionary, np.asarray(one.RandomID), side="left")
is_within_range = ak.Array(closest).mask[closest < len(dictionary)]
is_good_match = ak.Array(dictionary)[is_within_range] == one.RandomID
reordering = ak.Array(closest).mask[is_good_match]
one["Seqplates"] = two.Seqplates[index][reordering]
one.tolist()

Here I am only adding the Seqplates field to my existing array (a right outer join, I believe).

In my case I will be adding smaller datasets to a large dataset so this is valuable.

nongiga Jan 4, 2021
Author

I have a follow-up question (maybe it's more suitable for a different post):

Following merging the information, I want to compare lists within my AwkwardArray:

So let's say I have these lists:
One:

{
    "CaseNum": 1905256,
    "Isolates": [
      {
        "IsoNum": 5567545,
        "SeqsPlate": "1_E5"
      }
    ]
  },

Two:

{
    "CaseNum": 1905256,
    "Seqplates": [
      "1_E5",
      "1_D5"
    ]
  },
  {
    "CaseNum": 1907609,
    "Seqplates": [
      "22_H10",
      "23_H4"
    ]
  },

I already merged them into a single array, joined:

{
    "CaseNum": 1905256,
    "Isolates": [
      {
        "IsoNum": 5567545,
        "SeqsPlate": "1_E5"
      }
    ],
    "Seqplates": [
      "1_E5",
      "1_D5"
    ]
  },

Now I want to determine, for each element of Isolates, whether its SeqsPlate is in the Seqplates list of the same record.

In pseudocode:

for record in joined:
    for isolate in record.Isolates:
        if isolate.SeqsPlate in record.Seqplates:
            isolate.IsPresent = True
        else:
            isolate.IsPresent = False

I want to save this information in a new field under Isolates, so that I get:

{
   "CaseNum": 1905256,
   "Isolates": [
     {
       "IsoNum": 5567545,
       "SeqsPlate": "1_E5",
       "IsPresent": true
     }
   ],
   "Seqplates": [
     "1_E5",
     "1_D5"
   ]
 },

Is there a more efficient way to do it in AK than to loop over the elements?

jpivarski Jan 4, 2021
Maintainer

You can avoid a nested for loop in Python by computing the Cartesian product of the two arrays of lists with ak.cartesian.

Specifically, use nested=True so that the output is doubly nested but maintains the first-level lists' lengths, like

[[[(1, "a"), (1, "b")], [(2, "a"), (2, "b")], [(3, "a"), (3, "b")]], ...]

and then use == to check for equality between slot0 and slot1 of the tuple, and finally the ak.any reducer with axis=-1 to replace the innermost lists with True if they have a match and False if they do not.

Or you might want to pass the array into Numba and use ak.ArrayBuilder to construct the boolean lists.

nongiga Jan 4, 2021
Author

I truly appreciate your response, since I'm beginning to realize my questions are more pythonian than particularly AwkwardArrayian. I'll try to post my solution here for future reference, in case it's of any help.

Edit: that was quicker than I thought, thank you for the instructions!

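# Pair each isolate's plate with every plate of the same record (one inner list per isolate).
cart = ak.cartesian([joined.Isolates.Seqplates, joined.Seqplates], nested=True)
# True where an isolate's plate equals at least one of the record's plates.
is_present = ak.any(cart.slot0 == cart.slot1, axis=-1)
is_present = ak.fill_none(is_present, False)
# Attach the result to each isolate as a new field.
joined["Isolates"] = ak.with_field(joined.Isolates, is_present, where="IsPresent")
display(joined.tolist())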

jpivarski Jan 4, 2021
Maintainer

my questions are more pythonian than particularly AwkwardArrayian

I wouldn't say so: you're asking which functions from the Awkward Array library would allow you to avoid Python for loops. A Python expert without knowledge of Awkward Array would have the same questions, so this is the right place to ask about that.

In some cases, like the above, I can't give explicit examples because I'm answering on my phone. I'm glad you could fill in the gaps and get a working solution!

jpivarski · 2021-01-03T19:28:07Z

jpivarski
Jan 3, 2021
Maintainer

I am new to python and am really excited by AwkwardArray: it is exactly what I switched programming languages for!

Out of curiosity, which programming language did you switch from? It looks like you're doing something with health care data?

1 reply

nongiga Jan 3, 2021
Author

Interestingly, I was working with MATLAB. The matrix-based programming was very useful, but I got frustrated with the limited availability of bioinformatics tools. I didn't show it, but in my project I am merging both health records and sequencing data, and most of the sequencing analysis tools are written in Python. Given that both this and 'big data' work are usually done in Python, I thought it'd be wise to switch in the long term.
