Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OWCSVImport: Get ready for OWFile merge #5077

Draft
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

irgolic
Copy link
Member

@irgolic irgolic commented Nov 7, 2020

Issue

First step in fixing #3997.

Description of changes

Adds a domain editor widget to CSV File Import. This is the first and probably most complicated step in merging CSV File Import and File.

I ignored context settings for now, @janezd, this is something I'd rather let you implement.

I tried to keep it clean, but I did something hacky in Orange.widgets.data.owcsvimport.Options, to insert the set type in between the Option ranges, and in Orange.widgets.utils.domaineditor.DomainEditor, to store the original column ordering (domain loses this information by converting to attributes/meta/class). @ales-erjavec, feel free to nitpick the implementation, I'll do my best to make it as nice as I can.

Some things to note:

  • editing the domain does not update the pandas DataFrame signal (I'd give Pandas forward compat #1828 another shot instead of fixing this in an ugly way)
  • importing duplicate columns adds a ".1" postfix to the column name, but editing the domain such that the columns are duplicated appends " (1)" and " (2)" to the names. This behavior should probably be unified.
  • importing a text column that could be numerical doesn't let you set it to numerical in domain editor (fixing this would require either not setting types in import dialogue, or changes in Variable)

The logic is that you use the Import dialogue for importing, and afterward, Domain Editor for domain editing. Following this line of thought, it seems like setting column types shouldn't be a feature of the import dialogue. I'm quite partial to the interface though, so I'd like to keep it. Domain editor would need a way to edit several columns types at a time before I'm willing to give it up. I suppose the 'not being able to cast text to numerical' issue is a wontfix, and users likely wouldn't stumble upon it, or would find the fix for it by themselves (reimporting with the correct vartype).

Is there a reason OWFile's domain editor has an Apply button? Would this be necessary in OWCSVImport, given the comprehensive import options?

Future work in merging CSVFileImport and File includes:

  • URLs support
  • better Info message (taking inspiration from File)
  • defaulting to datasets directory, given scratch workflow
  • filling combobox with loadable files in same directory, given saved workflow
  • changing default combobox text (perhaps a file should always be chosen?)
  • fixing sizehint (I'm first going to work on Rework left side for better scrollbars and static buttonsArea orange-widget-base#102, which touches on resizable widgets)
Includes
  • Code changes
  • Tests (TODO after feedback)
  • Documentation

@irgolic irgolic marked this pull request as draft November 7, 2020 15:30
@irgolic irgolic force-pushed the csvfileimport-domain-editor branch from 4d50c06 to 13a54b3 Compare November 7, 2020 15:33
@codecov
Copy link

codecov bot commented Nov 7, 2020

Codecov Report

Merging #5077 (38e9d7c) into master (22f5f91) will decrease coverage by 0.02%.
The diff coverage is 74.35%.

❗ Current head 38e9d7c differs from pull request most recent head 797ab7b. Consider uploading reports for the commit 797ab7b to get more accurate results

@@            Coverage Diff             @@
##           master    #5077      +/-   ##
==========================================
- Coverage   86.38%   86.35%   -0.03%     
==========================================
  Files         304      304              
  Lines       61753    61870     +117     
==========================================
+ Hits        53343    53428      +85     
- Misses       8410     8442      +32     

@ales-erjavec
Copy link
Contributor

it seems like setting column types shouldn't be a feature of the import dialogue.

No it has to be a part of import dialog. The type casting has to be done on actual text values as read from the file. You cannot for instance read text column of codes '000', '007', '008', (them being converted to numbers) and then convert to text or categorical.

@irgolic
Copy link
Member Author

irgolic commented Nov 9, 2020

You cannot for instance read text column of codes '000', '007', '008', (them being converted to numbers) and then convert to text or categorical.

You are absolutely right. This would warrant a warning when changing a variable that was imported as numeric to string in the domain editor.

@irgolic
Copy link
Member Author

irgolic commented Nov 10, 2020

You cannot for instance read text column of codes '000', '007', '008', (them being converted to numbers) and then convert to text or categorical.

You are absolutely right. This would warrant a warning when changing a variable that was imported as numeric to string in the domain editor.

@ales-erjavec On second thought, what if everything were imported as text initially, and cast to whatever attribute type was selected/automatically determined afterward?

@ales-erjavec
Copy link
Contributor

... what if everything were imported as text initially, and cast to whatever attribute type was selected/automatically determined afterward?

Then memory consumption (and probably overall performance) would be on par with File.

@irgolic
Copy link
Member Author

irgolic commented Nov 10, 2020

... what if everything were imported as text initially, and cast to whatever attribute type was selected/automatically determined afterward?

Then memory consumption (and probably overall performance) would be on par with File.

I see, warning it is then.

@janezd janezd self-assigned this Nov 13, 2020
@janezd
Copy link
Contributor

janezd commented Nov 20, 2020

From the user perspective, my only objection is that the options button is now rather hidden. I loaded some data where I needed to chose the decimal number format, and had to look around for the button. In terms of the flow, you first select the file, then set delimiters etc, and the you edit domain. Perhaps the layout should follow this.

@janezd janezd removed their assignment Nov 20, 2020
@irgolic irgolic force-pushed the csvfileimport-domain-editor branch from 13a54b3 to 5973430 Compare January 7, 2021 15:12
@irgolic irgolic changed the title [RFC] OWCSVImport: Add a domain editor OWCSVImport: Add a domain editor Jan 7, 2021
@irgolic irgolic marked this pull request as ready for review January 7, 2021 15:12
@irgolic irgolic force-pushed the csvfileimport-domain-editor branch 2 times, most recently from 59aa70a to 1e8269e Compare January 7, 2021 20:29
@ajdapretnar
Copy link
Contributor

I have a couple of comments.

  1. Clicking on "Browse files" seems counter-intuitive to me. At least it goes against Orange convention in all data loading files. I am in favour of having a folder icon next to the file list. Speaking of Orange convention, I would place Reload button at the top too (I didn't even see it as the bottom).
  2. Import Options can be to the right or below the file loading options (preferably below).
  3. Info box should be smaller, more compact. Now it is the same size as domain editor, even though the latter is way more important.
  4. Size Hint: both the widget and the import options dialogue could be smaller upon opening.
  5. Changing variable type is now slightly better, but it still shifts the dialogue a bit.
  6. I managed to crash the widget with a certain data set. Can't share it publicly, ping me for a data set.
Traceback (most recent call last):
  File "/Users/ajda/orange/orange3/Orange/widgets/data/owcsvimport.py", line 1284, in __handle_result
    table = pandas_to_table(df)
  File "/Users/ajda/orange/orange3/Orange/widgets/data/owcsvimport.py", line 1918, in pandas_to_table
    coldata = pd.Categorical(
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/pandas/core/arrays/categorical.py", line 316, in __init__
    dtype = CategoricalDtype._from_values_or_dtype(
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py", line 330, in _from_values_or_dtype
    dtype = CategoricalDtype(categories, ordered)
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py", line 222, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py", line 369, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py", line 546, in validate_categories
    raise ValueError("Categorical categories must be unique")
ValueError: Categorical categories must be unique

@irgolic
Copy link
Member Author

irgolic commented Feb 2, 2021

Clicking on "Browse files" seems counter-intuitive to me. At least it goes against Orange convention in all data loading files. I am in favour of having a folder icon next to the file list. Speaking of Orange convention, I would place Reload button at the top too (I didn't even see it as the bottom).

Agreed, I was just playing around/experimenting.

How's this?

Screenshot 2021-02-02 at 00 52 51

Changing variable type is now slightly better, but it still shifts the dialogue a bit.

Can you attach a video of this?

I managed to crash the widget with a certain data set. Can't share it publicly, ping me for a data set.

Reproduced on master.

@ajdapretnar
Copy link
Contributor

To be completely honest, I was quite confused when I first saw the last proposal. Don't know why, as I have suggested it. 😮

That said, looking at it for a few minutes, it makes sense. I just don't understand the separation between the folder icon and Import options. Any reason for that?

Also, the URL option is gone. Where'd it go? Or is this just a mock-up? Also, Browse documentation data sets is missing.

I guess we still want to keep the info box, even though I just can't get myself to like it.

Clicking issue. I have slowed down important sections. See how the dropdown gets shifted?
select-issue

'may result in altered values.\n'
'For example, 001 turns into 1.\n'
'Change the variable type to String in '
'the Import Options to avoid this.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning is too long (but I'll admit, I have no better alternative either). In the widget it looks like it's just one line and it leaves the user confused. I didn't see there's more text until I checked it here. Also, when hovering over the warning, only the missing part of the warning showed in the tooltip? That is weird - is it intentional? 🤔 If we have long warnings, then there must be eliding (...) so that the user knows to look for more information, IMO.

Also, newline character is missing at the end of the first line (or a space, your choice).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a general problem, maybe we should put a little down arrow at the end of warnings with more details

@irgolic
Copy link
Member Author

irgolic commented Feb 2, 2021

I just don't understand the separation between the folder icon and Import options. Any reason for that?

I could move it closer.

Also, the URL option is gone. Where'd it go? Or is this just a mock-up? Also, Browse documentation data sets is missing.

This isn't ready to replace the File widget, just adds a domain editor and rearranges the gui.

I guess we still want to keep the info box, even though I just can't get myself to like it.

Some info is good, but CSV doesn't make good use of it. I'll use File's info box later on.

@ajdapretnar
Copy link
Contributor

Here are my latest comments:

  • I would expect the Import Options to open immediately upon loading the file, as they used to. Otherwise my semicolon separated file loads strangely and I have to browse how to change the options, etc. It's a hassle. If Import Options is opened immediately, then the user can set the right parameters right away. Same as in Libre Office, which I really like.
  • Warnings are not fixed. Space is missing and for longer messages it is not clear there is more to read. Elide or otherwise indicate there's more to the warning.
  • The widget need Reset button to return to the previous state. I am in favour of apply button, too.
  • Size Hint is still not solved - widget opens too big, too wide, info box should be smaller. Info box should not stretch more than there's text (see File for example).
  • Editor doesn't remember choices. Set an attribute in one file, change the file, change it back, the attribute setting is not remembered.
  • Editor's buggy. There's a lag in coloring the attribute row fully (see image).

Screen Shot 2021-03-05 at 10 32 25

Something that is a problem on master as well:

  • Open dropdown next to the attribute, say for its role. Then click outside (nothing selected/changed). The field remains blue with dropdown arrows for some reason. If I click outside again, the dropdown briefly opens, then is gone. This is weird.
    editor-bug

@ajdapretnar ajdapretnar removed their assignment Mar 5, 2021
@irgolic irgolic marked this pull request as draft April 15, 2021 18:25
@irgolic
Copy link
Member Author

irgolic commented Aug 4, 2021

Screen Shot 2021-08-04 at 2 22 49 PM

This is what I'm working with now. I've realized that users need a 'tangible' view of data to intuit that data exists there. E.g., when I asked users to describe a basic workflow, they described 'A file is loaded in the File widget, then it is transformed into a table in the Data Table widget, ...'. Hopefully with a table view in the file loading widget, this confusion will go away.

irgolic added 3 commits August 4, 2021 22:41
Some work still required in maintaining consistency
between import options and domain editor
- establishing a singular deduplication mechanism
@irgolic irgolic force-pushed the csvfileimport-domain-editor branch 4 times, most recently from 5f89fb9 to 38e9d7c Compare August 5, 2021 01:15
@irgolic irgolic force-pushed the csvfileimport-domain-editor branch from 38e9d7c to 797ab7b Compare August 6, 2021 22:13
@irgolic irgolic changed the title OWCSVImport: Add a domain editor OWCSVImport: Get ready for OWFile merge Aug 6, 2021
@irgolic
Copy link
Member Author

irgolic commented Aug 6, 2021

This should be ready to replace OWFile. Lots of bugs which were also present in OWFile were fixed, particularly with loading URLs and spreadsheets. That said, it still needs better coverage, and the lint is pretty bad – I'm calling a lot of private stuff in io/owfile/owtable.

Check it out though, I think it works pretty great :)

@ajdapretnar
Copy link
Contributor

This works very well!

I have tested:

  • importing .xlsx, .tab files from URL
  • setting and resetting roles and types
  • interaction between Import Options and Columns

I propose to disable selection in a data table, because it is misleading (no effect on the output). Perhaps turning of sorting would not be bad either, because the table is just a (pre)view here. But if everyone thinks this is valuable, let's keep it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants