Better API for specifying return data of signals #376

cx1111 · 2022-05-23T19:48:08Z

The current rdrecord API is rather confusing when dealing with more complex records/signals. Namely: multi-frequency signals.

We should create a better API to allow users to:

Specify/know the exact return types/dimensions of the signals being read
Deal with single and multi-frequency records consistently.

In addition, the current API is also limiting when reading single-frequency records, as there is no way for users to explicitly specify whether they want 1d or 2d numpy arrays.

We current have:

wfdb-python/wfdb/io/record.py

Line 1715 in 0d42dfb

def rdrecord(

where the only parameter that influences the return type is: smooth_frames

We've chatted about the issue here, where I suggested a new set of params:
#313 (comment)

What do people think of the suggestion of having two params: return_dims and smooth_frames? return_dims will apply for both single and multi-frequency signals, and smooth_frames will apply only to multi-frequency signals. Given the combination of the params, the return value of the signals will be:

(1, True): List of 1d arrays. Guaranteed to be same length.
(1, False): List of 1d arrays. Not guaranteed to be same length for multi-frequency signals.
(2, True): Single 2d array.
(2, False): Illegal combination.

The text was updated successfully, but these errors were encountered:

cx1111 · 2022-05-23T19:50:14Z

@tompollard @bemoody @alistairewj thoughts?

bemoody · 2022-05-24T14:21:08Z

As an application programmer, I want the data. I actually don't care what data structure is used to hold it, as long as the structure is consistent for all input records, and as long as it isn't grossly inefficient. If you give me an NxM array, and what I really want is an M-element list of N-element arrays, I can very easily call list(x.transpose()) myself. As a library programmer, I think it's better to put that sort of trivial burden onto applications, rather than making the library interface more complicated. In any case, there are several more combinations of arguments you didn't mention: - return_dims=1 and smooth_frames is unspecified - return_dims=2 and smooth_frames is unspecified - smooth_frames=True and return_dims is unspecified - smooth_frames=False and return_dims is unspecified - neither return_dims nor smooth_frames is specified

cx1111 · 2022-05-24T15:05:51Z

Thanks. Some more thoughts:

Many users will want the 2d array to pass into convolutional neural networks or general multi-channel filter functions.
For multi-frequency records, the only way to avoid losing information is to return an array of 1d arrays.

I think it's worth it for our library to support both dimensions, despite the additional complexity. Regarding efficiency/performance, would either form perform a lot better?

I imagine for the ease of writing the functions, but at a large performance cost, we do have the option of always processing the signals as a list of 1d arrays, and then only reshaping into 2d before returning the value.

The function argument values can be required, and have defaults, so that we don't have to deal with the unspecified cases.

bemoody · 2022-05-24T16:16:54Z

One-dimensional CNNs, yeah. A two-dimensional CNN (and a two-dimensional input array) would only make sense if the channels were ordered spatially, right? But I'm not here to argue with people's life choices. :) There surely are many applications where storing each channel in contiguous memory could give slightly better performance than interleaving the channels. I believe numpy allows you to create arrays in "Fortran order" rather than "C order", so this wouldn't require any API change.

The function argument values can be required, and have defaults, so that we don't have to deal with the unspecified cases.

Well, that's the trouble. The way you've outlined it, there isn't any default value for return_dims that makes sense. To be backward compatible, return_dims would have to be "1 if smooth_frames is false, 2 if smooth_frames is true". As a database author, I care a lot about the defaults. I think it's important that the library should expose high-resolution data *by default*. I don't want to have to keep explaining to people that they need to pass a mysterious extra argument to prevent the library from smudging the data. There are a few ways this could be done: - The library could provide *both* multi-frequency (e_p_signal) and low-resolution smooth-frames (p_signal) data by default (and I could tell people "just use e_p_signal and forget p_signal exists.") - The library could provide only multi-frequency data by default. - The library could provide *high-resolution* smooth-frames data (upsampling, not downsampling.) However, I think the first option is the only way to really be backward compatible. There's no way to be backward compatible *and* provide e_p_signal and p_signal simultaneously *and* make p_signal high-res instead of low-res.

cx1111 · 2022-05-25T15:22:34Z

Let's not worry about backwards compatibility here. The current API behavior is not ideal, and does not allow the user to get a predictable output.

I didn't think about the high-res option. I'm not sure how useful it would be, but we could provide it. We would then have combinations of return_dims and smooth_frames (or better name):

(1, None): List of 1d arrays. Not guaranteed to be same length for multi-frequency signals.
(1, 'average'): List of 1d arrays. Guaranteed to be same length.
(1, 'upsample'): List of 1d arrays. Guaranteed to be the same length. (For Record objects, what happens to fs?)
(2, None): Illegal combination.
(2, 'average'): Single 2d array.
(2, 'upsample'): Single 2d array.

We could also add an INFO logging statement that occurs when reading a multi-frequency record, so that it's more immediately obvious.

I realize that all of this only considers the final output array(s) (such as when calling the current rdsamp), not the intermediate steps and the object attributes used to represent the signals. Let me think about that.

bemoody · 2022-05-25T17:38:38Z

I didn't think about the high-res option. I'm not sure how useful it would be, but we could provide it.

I'm honestly not sure how useful it would be either... but as you say, people like having the 2D array, and *if* someone insists on having a 2D array, it makes much more sense to upsample rather than downsampling.

I realize that all of this only considers the final output array(s) (such as when calling the current rdsamp), not the intermediate steps and the object attributes used to represent the signals.

I don't use wfdb.rdsamp, and I'd encourage anyone still using it to use wfdb.rdrecord instead. wfdb.rdsamp always returns a two-dimensional array, and I don't think it would be useful to change that behavior now. rdrecord does not return an array or a list. It returns a Record object. A Record object has numerous attributes and methods. That means there's no reason a Record object cannot simultaneously provide access to many different views of the same data. At the same time, I'm still not clear why it would be beneficial to provide data as a list-of-equal-length-arrays. I see that as being redundant (it's not any more efficient, and it's less useful for some applications, than a 2D array), and possibly confusing (at present if a Record has an e_p_signal then you *know* it hasn't been resampled.)

bemoody · 2022-05-27T17:20:38Z

An idea I mentioned in the meeting today was to allow the p_signal (and e_p_signal, etc.) attributes to be lazily loaded, perhaps if you call rdrecord with stream=True or something like that.

A very rough sketch might look like this:

def rdrecord(record_name, sampfrom=0, sampto=None, stream=False):
    record = rdheader(record_name)
    if sampto is None:
        sampto = record.sig_len
    if stream:
        record.p_signal = Lazy2DSignalArray(record, sampfrom, sampto)
    else:
        record.p_signal = _read_2d_signals(record, sampfrom, sampto)
    return record

class Lazy2DSignalArray:
    def __init__(self, record, sampfrom, sampto):
        self._record = record
        self._base = sampfrom
        self._len = sampto - sampfrom

    def __len__(self):
        return self._len

    def __getitem__(self, key):
        if isinstance(key, slice):
            (start, stop, step) = key.indices(self._len)
            data = _read_2d_signals(self._record,
                                    start + self._base,
                                    stop + self._base)
            return data[::step]
        elif isinstance(key, tuple):
            return self[key[0]][key[1:]]
        elif hasattr(key, '__index__'):
            return self[key:key+1][0]
        else:
            raise TypeError

tompollard · 2022-06-01T16:19:42Z

Just catching up on this!

Let's not worry about backwards compatibility here. The current API behavior is not ideal, and does not allow the user to get a predictable output.

What about deprecating rdrecord and introducing wfdb.read_record() or maybe just wfdb.read()?

What do people think of the suggestion of having two params: return_dims and smooth_frames? return_dims will apply for both single and multi-frequency signals, and smooth_frames will apply only to multi-frequency signals.

Is return_dims necessary at all? The fact that it leads to situations like (2, False): Illegal combination. isn't nice. Could we treat everything as 2d and maybe have some kind of binary flatten flag if necessary?

cx1111 · 2022-06-07T21:35:57Z

What about deprecating rdrecord and introducing wfdb.read_record() or maybe just wfdb.read()?

That's the long term plan. We want a major version where old and new APIs are available, and then another one that removes the old ones.

Is return_dims necessary at all? The fact that it leads to situations like (2, False): Illegal combination. isn't nice. Could we treat everything as 2d and maybe have some kind of binary flatten flag if necessary?

flatten essentially does the same thing as return_dims and the illegal combination issue would still hold. I'd argue that it's a worse keyword to use, since flattening an array may mean different things, thus making the keyword less explicit.

cx1111 · 2022-06-07T21:38:06Z

Hot off the press: #388

See the top level functions at the bottom of the module demonstrating a non-streaming and streaming (needs a lot of work) API. My biggest takeaway while playing around with this: Why does any class/object need to have the array as one of its attribute? Does it benefit the library, or the user? If we just separate the full data array from the class/object, we can remove the p_signal/d_signal/e_p_signal naming garbage.

cx1111 · 2022-06-23T18:16:00Z

^^ Thinking about it more, separating the signal data from the object may bring more confusion than benefits. Imagine reading a JPEG file into an Image object, and not having the bytes as an object field. That would be very strange.

cx1111 mentioned this issue Jun 7, 2022

[Draft]: Add new record #388

Draft

cx1111 mentioned this issue Jun 10, 2022

Better encapsulation of record metadata #371

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better API for specifying return data of signals #376

Better API for specifying return data of signals #376

cx1111 commented May 23, 2022 •

edited

Loading

cx1111 commented May 23, 2022

bemoody commented May 24, 2022 via email

cx1111 commented May 24, 2022 •

edited

Loading

bemoody commented May 24, 2022 via email

cx1111 commented May 25, 2022

bemoody commented May 25, 2022 via email

bemoody commented May 27, 2022

tompollard commented Jun 1, 2022

cx1111 commented Jun 7, 2022

cx1111 commented Jun 7, 2022 •

edited

Loading

cx1111 commented Jun 23, 2022

Better API for specifying return data of signals #376

Better API for specifying return data of signals #376

Comments

cx1111 commented May 23, 2022 • edited Loading

cx1111 commented May 23, 2022

bemoody commented May 24, 2022 via email

cx1111 commented May 24, 2022 • edited Loading

bemoody commented May 24, 2022 via email

cx1111 commented May 25, 2022

bemoody commented May 25, 2022 via email

bemoody commented May 27, 2022

tompollard commented Jun 1, 2022

cx1111 commented Jun 7, 2022

cx1111 commented Jun 7, 2022 • edited Loading

cx1111 commented Jun 23, 2022

cx1111 commented May 23, 2022 •

edited

Loading

cx1111 commented May 24, 2022 •

edited

Loading

cx1111 commented Jun 7, 2022 •

edited

Loading