The specific rules for each quality metric are as follows:
Function Name | Type | Description | Reference |
---|---|---|---|
RuleAlphaWords | EFFECTIVENESS | check whether the ratio of words that contain at least one alphabetic character > 0.6 | Redpajama MAP-en Gopher Dolma |
RuleCapitalWords | UNDERSTANDABILITY | check whether capital words ratio > 0.2 | Redpajama MAP-en |
RuleCharNumber | EFFECTIVENESS | check whether the number of characters > 100 | MAP-en |
RuleColonEnd | COMPLETENESS | check whether the last char is ':' | |
RuleContentNull | EFFECTIVENESS | check whether content is null | |
RuleCurlyBracket | UNDERSTANDABILITY | check whether the ratio of curly brackets ({ and }) to total characters < 0.025 | Redpajama C4 |
RuleDocRepeat | SIMILARITY | check whether content repeats | Redpajama MAP-en FineWeb Gopher |
RuleHtmlEntity | RELEVANCE | check whether content has HTML entities | |
RuleIDCard | SECURITY | check whether content contains ID card information | |
RuleLineEndWithEllipsis | COMPLETENESS | check whether the ratio of lines ending with an ellipsis < 0.3 | Redpajama MAP-en Gopher Dolma |
RuleLineEndWithTerminal | COMPLETENESS | check whether the ratio of lines ending with a terminal punctuation mark > 0.6 | Redpajama FineWeb C4 |
RuleLineStartWithBulletpoint | UNDERSTANDABILITY | check whether the ratio of lines starting with a bullet point < 0.9 | Redpajama MAP-en Gopher Dolma |
RuleLineJavascriptCount | EFFECTIVENESS | check for lines that contain the word Javascript | Redpajama FineWeb C4 |
RuleLoremIpsum | EFFECTIVENESS | check whether the ratio of lorem ipsum < 3e-08 | Redpajama MAP-en FineWeb C4 |
RuleMeanWordLength | EFFECTIVENESS | check whether the mean word length is in [3, 10] | Redpajama MAP-en Gopher Dolma |
RuleNoPunc | FLUENCY | check whether paragraph has no punctuation. | |
RuleSentenceNumber | COMPLETENESS | check whether the number of sentences is in [3, 7500] | Redpajama MAP-en FineWeb C4 |
RuleSpecialCharacter | RELEVANCE | check whether content has special characters. | |
RuleStopWord | EFFECTIVENESS | check whether the ratio of stop words > 0.06 | Redpajama MAP-en Gopher Dolma |
RuleSymbolWordRatio | EFFECTIVENESS | check whether the ratio of symbols to words > 0.4 | Redpajama Gopher Dolma |
RuleUniqueWords | UNDERSTANDABILITY | check whether the ratio of unique words > 0.1 | Redpajama MAP-en |
RuleWatermark | RELEVANCE | check whether content has watermarks. | |
RuleWordNumber | EFFECTIVENESS | check whether the number of words is in [20, 100000] | Redpajama MAP-en Gopher Dolma |
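
To make the threshold conditions in the table concrete, here is a minimal Python sketch of a few of them. It is not the project's actual implementation: the function names, the whitespace-based word splitting, and the stripping of trailing whitespace are all assumptions made for illustration. Each function simply returns `True` when the condition written in the Description column holds.

```python
# Illustrative sketch only; names and tokenization are hypothetical,
# not taken from the rule classes listed in the table.

def word_number_in_range(text: str, low: int = 20, high: int = 100000) -> bool:
    """RuleWordNumber condition: word count lies in [low, high]."""
    return low <= len(text.split()) <= high


def mean_word_length_in_range(text: str, low: float = 3, high: float = 10) -> bool:
    """RuleMeanWordLength condition: mean word length lies in [low, high]."""
    words = text.split()
    if not words:
        return False  # assumption: empty content does not satisfy the condition
    mean = sum(len(w) for w in words) / len(words)
    return low <= mean <= high


def alpha_word_ratio_above(text: str, threshold: float = 0.6) -> bool:
    """RuleAlphaWords condition: share of words with a letter > threshold."""
    words = text.split()
    if not words:
        return False
    alpha = sum(1 for w in words if any(c.isalpha() for c in w))
    return alpha / len(words) > threshold


def curly_bracket_ratio_below(text: str, threshold: float = 0.025) -> bool:
    """RuleCurlyBracket condition: ({ + }) / characters < threshold."""
    if not text:
        return True  # assumption: empty content has a ratio of zero
    return (text.count("{") + text.count("}")) / len(text) < threshold


def last_char_is_colon(text: str) -> bool:
    """RuleColonEnd condition: last character (ignoring trailing whitespace) is ':'."""
    return text.rstrip().endswith(":")


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog. " * 3
    print(word_number_in_range(sample))
    print(mean_word_length_in_range(sample))
    print(alpha_word_ratio_above(sample))
    print(curly_bracket_ratio_below(sample))
    print(last_char_is_colon(sample))
```

Splitting on whitespace keeps the sketch dependency-free. A real pipeline would likely use a proper tokenizer, which can shift the ratios near the thresholds.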