fix: word count #118

gentlementlegen · 2024-09-12T17:06:32Z

Resolves #82
QA: https://github.com/ubiquibot/conversation-rewards/actions/runs/10845258477/job/30095719413

Changes

Added a log level so that log verbosity can be changed through the configuration.
Only the content of the node itself is now evaluated, ignoring the content of its children.
HTML comments are stripped from the result.

# Conflicts:
#   tests/__mocks__/results/content-evaluator-results.json
#   tests/__mocks__/results/formatting-evaluator-results.json
#   tests/__mocks__/results/github-comment-results.json
#   tests/__mocks__/results/output-reward-split.html
#   tests/__mocks__/results/output.html
#   tests/__mocks__/results/permit-generation-results.json
#   tests/__mocks__/results/reward-split.json

0x4007 · 2024-09-13T08:05:27Z

src/configuration/incentives.ts

@@ -10,6 +11,7 @@ import { userExtractorConfigurationType } from "./user-extractor-config";

 export const incentivesConfigurationSchema = T.Object(
  {
+    logLevel: T.Enum(LOG_LEVEL, { default: LOG_LEVEL.INFO }),


Shouldn't the default be errors?

I think it is nice to get info within the logs by default because some interesting info is there during the process.
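
To make the trade-off in this thread concrete, here is a minimal sketch of how a configured log level could gate output. The `LEVELS` ordering and the `shouldLog` helper are hypothetical and are not taken from the project's `LOG_LEVEL` enum or logger; only the idea of an `INFO` default comes from the diff above.

```typescript
// Hypothetical severity ordering: earlier in the array means more severe.
// The real LOG_LEVEL enum in the project may have different members.
const LEVELS = ["error", "info", "debug"] as const;
type LogLevel = (typeof LEVELS)[number];

// Returns true when a message at `messageLevel` should be printed,
// given the level chosen in the configuration.
function shouldLog(configured: LogLevel, messageLevel: LogLevel): boolean {
  return LEVELS.indexOf(messageLevel) <= LEVELS.indexOf(configured);
}

// With an "info" default, error and info messages pass, debug does not.
const defaultLevel: LogLevel = "info";
console.log(shouldLog(defaultLevel, "info"));
console.log(shouldLog(defaultLevel, "debug"));
```

Under this sketch, switching the default to `"error"` would hide the informational messages the author wants to keep visible during a run.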

0x4007 · 2024-09-13T08:06:17Z

src/parser/data-purge-module.ts

@@ -26,6 +26,8 @@ export class DataPurgeModule implements Module {
          .replace(/^>.*$/gm, "")
          // Remove commands such as /start
          .replace(/^\/.+/g, "")
+          // Remove HTML comments
+          .replace(/<!--[\s\S]*?-->/g, "")


If we use a virtual DOM creator like jsdom or mdast, we should be able to query element.textContent and it should handle this and other situations in a robust manner. Since this is already finished, it's fine. But if there are any problems with the implementation, or more operations you need to perform, consider the virtual DOM approach.

Body comes in a text form. Then it gets transformed into MD -> HTML. So it is rendered as a text form. So typically when fetched from GitHub the body looks like
Resolves #23  so I don't see a way to skip it other than removing it.

HTML comments shouldn't be included in element.textContent is my point

Yes, but there is no way to know it is a comment before converting it to HTML, since the MD renderer is run first
https://github.com/gentlementlegen/conversation-rewards/blob/cf5ecb6a9f1bb551fa01c866d6f8ccaefa7e1804/src/parser/formatting-evaluator-module.ts#L112

(it is actually the same for v1) so it is first converted to a p that contains the comment.
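
As a rough illustration of the approach taken in the diff, the sketch below applies the same regular expression to a raw comment body before any Markdown rendering. The `stripHtmlComments` helper and the sample body are hypothetical; only the regex itself comes from the change under review.

```typescript
// Hypothetical helper wrapping the regex added in this PR.
// [\s\S]*? matches any character, including newlines, non-greedily,
// so each comment is removed on its own even when several are present.
function stripHtmlComments(body: string): string {
  return body.replace(/<!--[\s\S]*?-->/g, "");
}

// Sample body (an assumption, not taken from a real issue).
const body = "Resolves #23 <!-- template hint -->\nSecond line";
console.log(stripHtmlComments(body));
```

Because the pattern is non-greedy, text between two separate comments is preserved rather than swallowed along with them.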

0x4007 · 2024-09-13T08:07:36Z

src/parser/formatting-evaluator-module.ts

@@ -139,17 +141,36 @@ export class FormattingEvaluatorModule implements Module {

    for (const element of elements) {
      const tagName = element.tagName.toLowerCase();
-      const wordCount = this._countWords(this._multipliers[commentType].regex, element.textContent || "");
+      // We cannot use textContent otherwise we would duplicate counts, so instead we extract text nodes


textContent of the top level parent element will do the right thing.

Correct, will be part of #92
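
The duplication discussed in this thread can be shown with a small sketch. It uses a hand-rolled tree type instead of a real DOM, and a whitespace-based `countWords` instead of the project's configurable regex; all names here are hypothetical.

```typescript
// Minimal stand-in for a DOM tree; the real module works on parsed HTML elements.
type TextNode = { kind: "text"; value: string };
type ElementNode = { kind: "element"; tagName: string; children: Array<TextNode | ElementNode> };

// Simplified word counter (assumption: words are whitespace-separated).
function countWords(text: string): number {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

// Equivalent of element.textContent: text of the element and all its descendants.
function textContent(el: ElementNode): string {
  return el.children.map((c) => (c.kind === "text" ? c.value : textContent(c))).join("");
}

// Text held directly by the element, ignoring child elements.
function ownText(el: ElementNode): string {
  return el.children.map((c) => (c.kind === "text" ? c.value : "")).join("");
}

// Every element in the tree, parent first.
function allElements(el: ElementNode): ElementNode[] {
  const result: ElementNode[] = [el];
  for (const child of el.children) {
    if (child.kind === "element") result.push(...allElements(child));
  }
  return result;
}

// Represents: <p>hello <strong>bold world</strong></p>
const tree: ElementNode = {
  kind: "element",
  tagName: "p",
  children: [
    { kind: "text", value: "hello " },
    { kind: "element", tagName: "strong", children: [{ kind: "text", value: "bold world" }] },
  ],
};

const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);
const perElementTextContent = sum(allElements(tree).map((e) => countWords(textContent(e))));
const perElementOwnText = sum(allElements(tree).map((e) => countWords(ownText(e))));
const topLevelOnly = countWords(textContent(tree));

console.log(perElementTextContent, perElementOwnText, topLevelOnly); // 5 3 3
```

Summing `textContent` over every element counts "bold world" twice, once for `strong` and once for its parent `p`. Counting only each element's own text nodes, or reading `textContent` once from the top-level element, both give the expected three words.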

0x4007 · 2024-09-13T08:08:11Z

src/parser/formatting-evaluator-module.ts

-        score,
-      };
+      logger.debug("Tag content results", { tagName, symbols, text: element.textContent });
+      // If we already had that tag included in the result, merge them and update total count


I suppose that for the statistics it might be interesting to count words per element, but honestly it's out of scope and doesn't add business value while complicating the code.

Will be changed in #92

gentlementlegen added 7 commits September 13, 2024 02:05

chore(WIP): debug logs

eed7998

chore: update symbol count

95d0838

feat: added log level

f41393a

chore: fixed tests

3ac1081

chore: fixed tests

83b507c

chore: fixed logs

2d89d4e

gentlementlegen marked this pull request as ready for review September 13, 2024 07:55

gentlementlegen requested review from 0x4007, Keyrxng and whilefoo September 13, 2024 07:55

0x4007 reviewed Sep 13, 2024

gentlementlegen requested a review from 0x4007 September 13, 2024 08:34

0x4007 approved these changes Sep 13, 2024

0x4007 merged commit 3120201 into ubiquity-os-marketplace:development Sep 13, 2024
6 checks passed

ubiquity-os bot mentioned this pull request Sep 13, 2024

Unexpected Word Count #82

Closed

gentlementlegen deleted the fix/token-count branch September 13, 2024 08:38
