Spellcheck all enabled scripts and improve URL detection #613
base: main
1c5d955: Never mind. If mLocale is null then it is assigned a value. My bad.
Are the changes to URL detection and spell check dependent on each other?
@Helium314 the URL detection changes are minimal and were made solely for the spellchecker so that it won't mark URIs as typos. There might be a branch conflict if I separate them.
Why actually check all enabled scripts? I remember you wrote somewhere that checking the whole text when adding a new word is already inefficient. Checking it in all scripts seems even less efficient than that. Btw it might be good to define what the wanted behavior is, because from your description I don't think I understand it.
I did not check the code yet, but it looks like you are changing
I know this approach is not exactly efficient, but spellchecking is not inherently a light task, and those who enable it should know it requires more processing power. That's why the regex is checked only for those who enable the experimental URL detection in settings. Making it efficient might require a real overhaul, and I admit that I am not knowledgeable enough for that task. The check that determines the locale from the script of the letters is O(n^2) in the worst case, for example when the text is a number or similar. The wanted behavior is as follows: if I type in English and then switch to any subtype of a different script, such as Greek, the words written in English should be checked against the English dictionary and not the Greek one. The current implementation flags far too many legitimate words, and it's not that efficient either. It even checks numbers against the dictionary, which is really a waste.
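The wanted behavior described in this comment could be sketched roughly as follows. This is only an illustration under assumptions, not the code from the PR: the class ScriptLocaleFinder, its enable method, and the explicit locale-to-script table are hypothetical, and only the standard Character.UnicodeScript API is relied on.

```java
import java.lang.Character.UnicodeScript;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the PR: picks a dictionary locale by script.
final class ScriptLocaleFinder {
    // Assumed table of enabled locales and the script each one is written in.
    private final Map<Locale, UnicodeScript> enabledLocales = new LinkedHashMap<>();

    ScriptLocaleFinder enable(Locale locale, UnicodeScript script) {
        enabledLocales.put(locale, script);
        return this;
    }

    /** Returns the first enabled locale whose script matches a letter of the text, or null. */
    Locale findLocaleWithMatchingScript(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (!Character.isLetter(codePoint)) {
                continue; // digits and punctuation carry no script information
            }
            UnicodeScript script = UnicodeScript.of(codePoint);
            for (Map.Entry<Locale, UnicodeScript> entry : enabledLocales.entrySet()) {
                if (entry.getValue() == script) {
                    return entry.getKey();
                }
            }
        }
        // Only non-letters were found, or no enabled locale uses these scripts.
        return null;
    }
}
```

With English (Latin) and Greek enabled, an English word typed while the Greek subtype is active would resolve to the English locale, and a string of digits would resolve to no locale at all, so it would not be spell checked.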
@Helium314 Are there any plans to include this in the upcoming version?
I made another change to the spellchecker, it should now use the cache to avoid checking the dictionary when the word has been checked before.
Could you make changes to spellcheck so that it doesn't check single letters and names after @? For example, when replying to someone on Discord or GitHub, their nick will be underlined and this is a bit irritating. Single letters are also underlined for some reason. Even capital letters of those that are already in the dictionary, e.g. O or Z.
@Uranusek I agree the spellchecker should not underline words starting with @
With the current changes in this PR, words with only 1 letter will not be reported as typos, so this issue should be fixed.
Oh, that's great, thanks! I think spellchecker shouldn't underline email addresses either.
That should also not happen anymore when this PR is merged, but only if you enable the "Advanced URL detection" (disabled by default) setting. An alternative solution which should barely impact performance is to whitelist all words containing an at sign '@', but it would mean that this, for example, would not be considered a typo:
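As a separate illustration (not the example that originally followed this comment), the two cheap pre-checks discussed here, skipping single-letter words and whitelisting anything containing '@', might look like the sketch below. The class SpellCheckPreFilter and its method shouldSkip are hypothetical names chosen for this example.

```java
// Hypothetical pre-filter, assumed to run before any dictionary lookup.
final class SpellCheckPreFilter {
    /** Returns true if the word should be skipped rather than looked up in a dictionary. */
    static boolean shouldSkip(String word) {
        // Single letters such as "O" or "Z" (and empty input) are never reported.
        if (word.codePointCount(0, word.length()) <= 1) {
            return true;
        }
        // Mentions and email addresses: anything containing an at sign is whitelisted.
        return word.indexOf('@') >= 0;
    }
}
```

The trade-off named in the comment is visible in this sketch: shouldSkip returns true for "@Uranusek" and for an email address, but also for any misspelled token that merely happens to contain '@'.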
I finally had a look at this PR, and was reminded why I had postponed this several times. Sorry for this story again, but it really bothers me. This should be at least 4 PRs:
Advantages:
Now for the actual code stuff: Some comments / questions in the code section.
Performance is fine.
}
}

// Finds the enabled locale with a script matching one of the letters in the given text.
public Locale findLikelyLocaleOfText(final String text) {
The name is somewhat misleading; better to rename it to reflect what it actually does: find a locale with a matching script.
Right.. maybe it should be moved out of the AndroidWordLevelSpellCheckerSession class too because it is not specifically related to the spellchecker except for the local variable localesToCheck (which can be passed as an argument)
I think it's not necessary to move it, but I definitely agree with your reasoning. Possibly could be moved to ScriptUtils or maybe LocaleUtils.
@@ -47,6 +48,8 @@ public abstract class AndroidWordLevelSpellCheckerSession extends Session {

public final static String[] EMPTY_STRING_ARRAY = new String[0];

public final static int FLAG_UNCHECKABLE = 0;
This goes into SuggestionsInfo as suggestionsAttributes, but in the documentation I can't find anything referring to uncheckable. 0 just seems to be when none of the RESULT_ATTR_[...] applies.
ah I may have overlooked that, although a 0 flag is already used elsewhere in the code:
https://github.com/Helium314/HeliBoard/blame/136b45880e6c3aadb6d9e1b68f90851b1fbb94e7/app/src/main/java/helium314/keyboard/latin/spellcheck/AndroidSpellCheckerSession.java#L72
so it might need to be changed as well.
if we don't know what is the fallback behavior in the absence of flag then it would probably be wrong to name it like that with a constant
The comment above says "Neither RESULT_ATTR_IN_THE_DICTIONARY nor RESULT_ATTR_LOOKS_LIKE_TYPO", and this also fits with documentation, which implies that 0 means none of the flags.
There is no information regarding 0 meaning "uncheckable".
If you had spent more time to understand how the code works, you could have inferred that cached suggestions with FLAG_UNCHECKABLE just return AndroidSpellCheckerService.getNotInDictEmptySuggestions(false), so this flag of 0 is not returned to/handled by the native code/another class.
Essentially this constant could be named however we want as long as its purpose is clear.
The only change I would make is visibility-related: turn it to a private constant.
@@ -171,58 +205,31 @@ public void onClose() {
}

private static final int CHECKABILITY_CHECKABLE = 0;
private static final int CHECKABILITY_TOO_MANY_NON_LETTERS = 1;
Why is this removed?
I would like to be notified if I wrote "My phone number is0123456789" <- missed a space after "is"
The problem I see here is that this might not be wanted by some people (it might have been introduced for a reason), and with the current style of PR I can't just simply revert this change if it turns out clearly unpopular.
@Helium314 thanks for the review, you've raised some good points. Admittedly I'm somewhat lazy and don't like fixing merge conflicts, so that's why it's all wound up in a single PR. As you know from my other PRs, I kind of lost interest in the project and have less time currently, so I don't know when or if I will rework this PR. Sorry about that. If you would still like to merge this PR (you could also split it as you had suggested), I'll do my best to answer (hopefully not too many) questions here (within reasonable time).
Such an overhaul can be done, but needs to be in a better discussed and documented way. Things like removing
I will merge it with some smaller changes as mentioned above.
What I suggested is also the behavior of some spellcheckers like the one present in Microsoft Word. The reason why your fork of OpenBoard interested me is because you added some neat features to it. It seems that (perhaps due to time constraints) you are no longer focused on renovating the keyboard, only on maintaining it. That's understandable, but you did not convince me that my ideas are not desired by others. I had spent considerable time debugging the spellchecker and I think I have more than a basic understanding of how it works. I assure you that the person who originally added this functionality did not get it completely right (not trying to claim that I know much better). I won't be participating in this PR anymore because I do not think that your review is open-minded, and you will likely heavily change things you do not really understand or are not able to test.
(My comment below is off-topic so I'm hiding it by default.) Read my comment: It's so sad to read this comment, even though I can totally understand it. @Helium314 I hope this will be taken into consideration.
I made some changes to the spellcheck so that it checks enabled locales other than the active one if they are of different scripts. Because the scripts differ, there is no concern that the spellcheck would validate typos written in other locales.
If no locale could be determined, then the text consists of non-letters only (numbers, special characters, etc.) so it should not be spell checked.
The result is saved in the cache for quick retrieval since the spellcheck service may be called multiple times for the same word.
This change should also make it seem like the spellchecker can "remember which language a word was typed" (#93) even though the TextInfo has no attribute for the locale.
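The caching step can be illustrated with a small sketch. It assumes a cache keyed by the word only and a plain LinkedHashMap used as an LRU map; the class CachedSpellChecker, the MAX_ENTRIES value, and the dictionaryLookup predicate are hypothetical and stand in for whatever the spellchecker session actually uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical wrapper: remembers earlier results so repeated calls skip the dictionary.
final class CachedSpellChecker {
    private static final int MAX_ENTRIES = 256; // assumed cache size

    private final Predicate<String> dictionaryLookup;
    int lookups = 0; // counts real dictionary lookups, for demonstration only

    // Access-ordered LinkedHashMap that evicts the least recently used entry.
    private final Map<String, Boolean> cache =
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    CachedSpellChecker(Predicate<String> dictionaryLookup) {
        this.dictionaryLookup = dictionaryLookup;
    }

    boolean isValid(String word) {
        Boolean cached = cache.get(word);
        if (cached != null) {
            return cached; // served from the cache, no dictionary access
        }
        lookups++;
        boolean valid = dictionaryLookup.test(word);
        cache.put(word, valid);
        return valid;
    }
}
```

Since the cached result depends on which dictionary validated the word, a later check of the same word returns the same verdict even after the active subtype changes, which is what gives the impression of remembering the language.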
If the experimental setting "URL detection" is enabled, URLs are checked using a regex pattern Matcher to prevent them from being flagged as typos.
For some reason though, URLs containing '?', '!', or ';' are still truncated before they are sent to the spellchecker, so it marks the text after any of those symbols as a typo if it's not a valid dictionary word.
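A regex-based URL check of the kind described above could look like the following sketch. The pattern is a simplified assumption made for this illustration, not the one used in the PR, and UrlDetector and looksLikeUrl are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical URL check, assumed to run only when the URL detection setting is enabled.
final class UrlDetector {
    // Simplified pattern: a scheme followed by "://", or a leading "www.",
    // then any run of non-whitespace characters.
    private static final Pattern URL =
            Pattern.compile("(?:[a-zA-Z][a-zA-Z0-9+.-]*://|www\\.)\\S+");

    /** Returns true if the whole text looks like a URL and should not be flagged as a typo. */
    static boolean looksLikeUrl(String text) {
        return URL.matcher(text).matches();
    }
}
```

Note that matches() requires the entire text to match, so this check only helps when the spellchecker receives the full URL; if the text was already truncated at '?', '!', or ';', the remainder would be checked as an ordinary word.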