Feature/multipart match #231

Draft
wants to merge 10 commits into
base: main
3 changes: 2 additions & 1 deletion go.mod
@@ -13,11 +13,12 @@ require (
github.com/mjl-/sherpats v0.0.6
github.com/prometheus/client_golang v1.18.0
github.com/russross/blackfriday/v2 v2.1.0
github.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d
Owner

I don't see where this is used. Was it for testing?

Author

This may be a mistake. I will check.

go.etcd.io/bbolt v1.3.11
golang.org/x/crypto v0.27.0
golang.org/x/exp v0.0.0-20240416160154-fe59bbe5cc7f
golang.org/x/net v0.29.0
golang.org/x/text v0.18.0
golang.org/x/text v0.19.0
rsc.io/qr v0.2.0
)

6 changes: 4 additions & 2 deletions go.sum
@@ -63,6 +63,8 @@ github.com/prometheus/procfs v0.12.0 h1:jluTpSng7V9hY0O2R9DzzJHYb2xULk9VTR1V1R/k
github.com/prometheus/procfs v0.12.0/go.mod h1:pcuDEFsWDnvcgNzo4EEweacyhjeA9Zk3cnaOZAZEfOo=
github.com/russross/blackfriday/v2 v2.1.0 h1:JIOH55/0cWyOuilr9/qlrm0BSXldqnqwMsf35Ld67mk=
github.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
github.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d h1:hrujxIzL1woJ7AwssoOcM/tq5JjjG2yYOc8odClEiXA=
github.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d/go.mod h1:uugorj2VCxiV1x+LzaIdVa9b4S4qGAcH6cbhh4qVxOU=
github.com/sirupsen/logrus v1.2.0/go.mod h1:LxeOpSwHxABJmUn/MG1IvRgCAasNZTLOkJPxbbu5VWo=
github.com/stretchr/objx v0.1.1/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXfy6kDkUVs=
@@ -97,8 +99,8 @@ golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7w
golang.org/x/sys v0.25.0 h1:r+8e+loiHxRqhXVl6ML1nO3l1+oFoWbnlu2Ehimmi34=
golang.org/x/sys v0.25.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.18.0 h1:XvMDiNzPAl0jr17s6W9lcaIhGUfUORdGCNsuLmPG224=
golang.org/x/text v0.18.0/go.mod h1:BuEKDfySbSR4drPmRPG/7iBdf8hvFMuRexcpahXilzY=
golang.org/x/text v0.19.0 h1:kTxAhCbGbxhK0IwgSKiMO5awPoDQ0RpfiVYBfK860YM=
golang.org/x/text v0.19.0/go.mod h1:BuEKDfySbSR4drPmRPG/7iBdf8hvFMuRexcpahXilzY=
golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.25.0 h1:oFU9pkj/iJgs+0DT+VMHrx+oBKs/LJMV+Uvg78sl+fE=
golang.org/x/tools v0.25.0/go.mod h1:/vtpO8WL1N9cQC3FN5zPqb//fRXskFHbLKk4OW1Q7rg=
4 changes: 4 additions & 0 deletions message/part.go
@@ -245,6 +245,10 @@ func (p *Part) String() string {
return fmt.Sprintf("&Part{%s/%s offsets %d/%d/%d/%d lines %d decodedsize %d next %d last %d bound %q parts %v}", p.MediaType, p.MediaSubType, p.BoundaryOffset, p.HeaderOffset, p.BodyOffset, p.EndOffset, p.RawLineCount, p.DecodedSize, p.nextBoundOffset, p.lastBoundOffset, p.bound, p.Parts)
}

func (p *Part) GetBound() string {
return string(p.bound)
}

// newPart parses a new part, which can be the top-level message.
// offset is the bound offset for parts, and the start of message for top-level messages. parent indicates if this is a top-level message or sub-part.
// If an error occurs, p's exported values can still be relevant. EnsurePart uses these values.
19 changes: 19 additions & 0 deletions store/account.go
@@ -1855,12 +1855,31 @@ ruleset:

header:
for _, t := range rs.HeadersRegexpCompiled {
isSubjectMatch := t[0].MatchString("subject")
for k, vl := range header {
k = strings.ToLower(k)
if t[0].MatchString("body") { // message body match
Owner

You mentioned elsewhere that it may be good to separate the body-matching from header matching. And indeed that seems better, at the minimum to avoid confusion between potential headers called "body" and the actual body. I was thinking we could maybe use an empty header key to indicate matching the body, but HeadersRegexp is a map, and it will probably look weird in the config file, if it even works at all.
A new config option in Ruleset would indeed require changing the web interface, and with the current approach (one big table) it would become so big that we need to refactor it. I can tackle that UI change. I think we would need a new BodyRegexps []string field in config.Ruleset?

Btw, for this code, shouldn't the "if" statement be before its for-loop ("range header")? It's not executed for each header key/value in the message.
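
A rough sketch of how that separation could look, assuming a hypothetical body-regexp list next to the header regexps. The `ruleset` type, its `matches` method, `demoRuleset`, and the plain-string body are stand-ins for illustration only, not mox's actual `config.Ruleset` or message API. The point is that the body check runs once per message, outside the loop over header keys:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ruleset is a hypothetical, simplified stand-in for config.Ruleset.
type ruleset struct {
	// Each pair is (header key regexp, header value regexp).
	headers [][2]*regexp.Regexp
	// Hypothetical compiled form of a BodyRegexps option.
	body []*regexp.Regexp
}

// matches reports whether every header pair and every body regexp matches.
func (rs ruleset) matches(header map[string][]string, body string) bool {
nextPair:
	for _, t := range rs.headers {
		for k, vl := range header {
			if !t[0].MatchString(strings.ToLower(k)) {
				continue
			}
			for _, v := range vl {
				if t[1].MatchString(strings.ToLower(strings.TrimSpace(v))) {
					continue nextPair
				}
			}
		}
		return false // No header satisfied this pair.
	}
	// Body matching happens once per message, not once per header.
	for _, re := range rs.body {
		if !re.MatchString(body) {
			return false
		}
	}
	return true
}

// demoRuleset builds an example ruleset (illustrative values only).
func demoRuleset() ruleset {
	return ruleset{
		headers: [][2]*regexp.Regexp{
			{regexp.MustCompile("^subject$"), regexp.MustCompile("invoice")},
		},
		body: []*regexp.Regexp{regexp.MustCompile(`(?i)total due`)},
	}
}

func main() {
	rs := demoRuleset()
	hdr := map[string][]string{"Subject": {" Invoice 42 "}}
	fmt.Println(rs.matches(hdr, "Total due: 10 EUR"))
	fmt.Println(rs.matches(hdr, "nothing to pay"))
}
```

Keeping the body patterns in their own list also avoids any clash with a real header that happens to be named "body".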

Author

OK. I agree that separating header matching from body matching is the right approach.
Changing the web interface and the config data structure looks complicated to me. Could you take those on?

> shouldn't the "if" statement be before its for-loop ("range header")?

Yes. Looking at it again, the code I wrote was naive.

ws := PrepareWordSearch([]string{t[1].String()}, []string{})
Owner

I'm no longer sure that PrepareWordSearch is the best way to do the matching. It is used by IMAP search and webmail search, and it can require the presence or absence of certain words. That isn't needed for these matches, and we want to match on regular expressions, at least for now. In the future we could perhaps add more elaborate matching mechanisms, including "not"-matches.

I think we can use https://pkg.go.dev/regexp#Regexp.MatchReader. The RuneReader interface is implemented by bufio.Reader: https://pkg.go.dev/bufio#Reader.ReadRune. So I think we can wrap the io.Reader returned by https://pkg.go.dev/github.com/mjl-/mox/message#Part.Reader in a bufio.Reader, and call MatchReader (or a similar method) on it. We would also do that for each Part.Parts (multipart messages) recursively (see https://pkg.go.dev/github.com/mjl-/mox/message#Part), until we have a match.
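
A minimal sketch of that idea, using only the standard library. The `part` type, its `reader` method, `demoMessage`, and `orderRe` are hypothetical stand-ins for `message.Part` and `Part.Reader`, which are not imported here. `regexp.Regexp.MatchReader` takes an `io.RuneReader`, and `bufio.Reader` satisfies that interface:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// part is a hypothetical stand-in for message.Part: a leaf has a body,
// a multipart has sub-parts.
type part struct {
	body  string
	parts []*part
}

// reader stands in for Part.Reader, which would return the decoded body.
func (p *part) reader() io.Reader {
	return strings.NewReader(p.body)
}

// matchPart reports whether re matches the body of p or of any sub-part,
// descending recursively and stopping at the first match.
func matchPart(re *regexp.Regexp, p *part) bool {
	if len(p.parts) == 0 {
		// Leaf part: wrap the reader so it provides ReadRune.
		return re.MatchReader(bufio.NewReader(p.reader()))
	}
	for _, sub := range p.parts {
		if matchPart(re, sub) {
			return true
		}
	}
	return false
}

// orderRe is an example pattern (illustrative only).
var orderRe = regexp.MustCompile(`(?i)order number \d+`)

// demoMessage builds a nested multipart message for the example.
func demoMessage() *part {
	return &part{parts: []*part{
		{body: "plain text alternative"},
		{parts: []*part{{body: "<p>Order number 12345</p>"}}},
	}}
}

func main() {
	fmt.Println(matchPart(orderRe, demoMessage()))
	fmt.Println(matchPart(orderRe, &part{body: "no match here"}))
}
```

Note that MatchReader may read arbitrarily far into the input, so a real implementation would probably want to bound how much of each part it scans.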

Author

Yes, I based this on the code used for webmail search. I will look into MatchReader.

// todo: regexp match
ok, err := ws.MatchPart(log, &p, true)
if err != nil {
log.Errorx("matching body", err)
}
if ok {
continue header
}
}
if !t[0].MatchString(k) {
continue
}
for _, v := range vl {
if isSubjectMatch {
Owner

Decoding RFC2047-encoded words is a good idea.
We should probably attempt decoding it for all headers.
https://www.xmox.nl/xr/dev/rfc/2047.html#L343 specifies quite elaborate rules for where in a header encoded words are allowed. I think it's too much to follow those requirements explicitly, at least for the purpose of matching text against a header. Hopefully it works well enough to do a quick scan for the magic "=?" and "?=" markers in the header value, and to try parsing it if both occur.

Decoding should probably be done with mime.WordDecoder, as is done at https://www.xmox.nl/xr/v0.0.12/message/part.go.html#L480. The code at https://www.xmox.nl/xr/v0.0.12/message/part.go.html#L448 also handles the various character encodings (though perhaps more need to be added explicitly: I think "ianaindex" misses a few character sets, not sure about the Japanese ones).

I think rfc2047-decoding headers could be a separate PR, it isn't tied to matching words in the body.
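
A small sketch of the quick-scan approach with the standard library's `mime.WordDecoder`. The helper name `decodeHeaderValue` is hypothetical. With `CharsetReader` left unset, the decoder only handles UTF-8, ISO-8859-1 and US-ASCII, so other character sets (such as ISO-2022-JP) would need a `CharsetReader` hook like the one in the linked part.go code. This sketch assumes that falling back to the raw value on any decode error is acceptable for matching purposes:

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// decodeHeaderValue is a hypothetical helper. It returns v with RFC 2047
// encoded words decoded, or v unchanged if it contains no encoded-word
// markers or cannot be decoded.
func decodeHeaderValue(v string) string {
	// Quick scan: skip the decoder when the markers are absent.
	if !strings.Contains(v, "=?") || !strings.Contains(v, "?=") {
		return v
	}
	// Zero value: CharsetReader is nil, so only the built-in charsets work.
	var dec mime.WordDecoder
	s, err := dec.DecodeHeader(v)
	if err != nil {
		// For example an unhandled charset: match against the raw value.
		return v
	}
	return s
}

func main() {
	fmt.Println(decodeHeaderValue("=?UTF-8?Q?caf=C3=A9?="))
	fmt.Println(decodeHeaderValue("plain subject"))
	fmt.Println(decodeHeaderValue("=?iso-2022-jp?B?GyRC?="))
}
```

Because the helper does not depend on the header name, it could be applied to every header value before the regular expression match, not only to the subject.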

Author

Thank you. I'll check them.

// todo: memorize decoded text
v, err = decodeRFC2047(v)
if err != nil {
log.Errorx("decoding subject", err, slog.String("v", v))
}
}
v = strings.ToLower(strings.TrimSpace(v))
if t[1].MatchString(v) {
continue header
175 changes: 172 additions & 3 deletions store/search.go
@@ -2,13 +2,22 @@ package store

import (
"bytes"
"encoding/base64"
"fmt"
"io"
"mime/quotedprintable"
"regexp"
"strings"
"unicode"
"unicode/utf8"

"github.com/mjl-/mox/message"
"github.com/mjl-/mox/mlog"

"golang.org/x/text/encoding"
"golang.org/x/text/encoding/japanese"
encUnicode "golang.org/x/text/encoding/unicode"
"golang.org/x/text/transform"
)

// WordSearch holds context for a search, with scratch buffers to prevent
@@ -82,11 +91,26 @@ func (ws WordSearch) matchPart(log mlog.Log, p *message.Part, headerToo bool, se
}

if len(p.Parts) == 0 {
var tp io.Reader
if p.MediaType != "TEXT" {
// todo: for other types we could try to find a library for parsing and search in there too.
return false, nil
if p.MediaType == "MULTIPART" {
Owner

This looks suspicious: The "if" above, for "len(p.Parts) == 0" should cause this if-branch to only be taken if this is not a multipart (i.e. it is a leaf part). The multipart-matching should be handled by "for _, pp := range p.Parts {" below (called recursively).
If p.Parts is empty for multiparts, perhaps the Part wasn't fully initialized/parsed ("walked") yet.

Author

I thought the same when I looked at this code. With my multipart sample mail, len(p.Parts) == 0 evaluates to true, but I may be misunderstanding something. I'll check.

// Decode and make io.Reader
// todo: avoid loading the entire content into memory
content, err := io.ReadAll(p.RawReader())
Owner

This would have to use p.Reader() (https://pkg.go.dev/github.com/mjl-/mox/message#Part.Reader), which should already decode the character set. If decoding doesn't work yet for the Japanese encoding, it may require changing the "wordDecoder" as mentioned earlier.

Author

OK. Thanks.

if err != nil {
return false, err
}
tp, err = decodeMultiPart(string(content), p.GetBound())
if err != nil {
return false, err
}
} else {
// todo: for other types we could try to find a library for parsing and search in there too.
return false, nil
}
} else {
tp = p.ReaderUTF8OrBinary()
}
tp := p.ReaderUTF8OrBinary()
// todo: for html and perhaps other types, we could try to parse as text and filter on the text.
miss, err := ws.searchReader(log, tp, seen)
if miss || err != nil || ws.isQuickHit(seen) {
@@ -193,3 +217,148 @@ func toLower(buf []byte) []byte {
}
return r
}

func decodeRFC2047(encoded string) (string, error) {
// match e.g. =?(iso-2022-jp)?(B)?(Rnc6...)?=
r := regexp.MustCompile(`(?i)=\?([^?]+)\?([BQ])\?([^?]+)\?=`)
matches := r.FindAllStringSubmatch(encoded, -1)

if len(matches) == 0 { // no match. Looks ASCII.
return encoded, nil
}

var decodedStrings []string
for _, match := range matches {
charset := match[1]
encodingName := match[2]
encodedText := match[3]

reader, err := decodeTransferEncodeAndCharset(encodingName, charset, encodedText)
if err != nil {
return encoded, err
}

decodedText, err := io.ReadAll(reader)
if err != nil {
return encoded, err
}

decodedStrings = append(decodedStrings, string(decodedText))
}

// Concat multiple strings
return strings.Join(decodedStrings, ""), nil
}

func decodeTransferEncodeAndCharset(encodingName string, charset string, encodedText string) (io.Reader, error) {
decodedString, err := decodeTransferEncode(encodingName, encodedText)
if len(decodedString) == 0 && err != nil {
return nil, err
}

// try to decode even if unknown encoding
reader, err := decodeCharset(charset, decodedString)
if err != nil {
return nil, err
}
return reader, nil
}

// decodeTransferEncode decodes Base64 ("B") or Quoted-Printable ("Q") encoded text.
func decodeTransferEncode(encodingName string, encodedText string) (string, error) {
var decodedBytes []byte
var err error
switch strings.ToUpper(encodingName) {
case "B": // Base64
decodedBytes, err = base64.StdEncoding.DecodeString(encodedText)
if err != nil {
return string(decodedBytes), fmt.Errorf("Base64 decode error: %w", err)
}
case "Q": // Quoted-Printable
decodedBytes, err = io.ReadAll(quotedprintable.NewReader(strings.NewReader(encodedText)))
if err != nil {
return string(decodedBytes), fmt.Errorf("Quoted-Printable decode error: %w", err)
}
default:
return encodedText, fmt.Errorf("not supported encoding: %s", encodingName)
}
return string(decodedBytes), nil
}

func decodeCharset(charset string, decodedString string) (io.Reader, error) {
// Select charset
var enc encoding.Encoding
switch strings.ToLower(charset) {
case "iso-2022-jp":
enc = japanese.ISO2022JP
case "utf-8":
enc = encUnicode.UTF8
case "us-ascii":
return strings.NewReader(decodedString), nil
default:
return nil, fmt.Errorf("not supported charset: %s", charset)
}

// Decode with charset
reader := transform.NewReader(strings.NewReader(decodedString), enc.NewDecoder())
return reader, nil
}

func decodeMultiPart(body string, boundary string) (io.Reader, error) {
encPattern := `Content-Transfer-Encoding:\s+(\w+)`
charsetPattern := `charset="((?:\w|-)+)"`

// Regexp for MIME encode type & Charset match
encRe, err := regexp.Compile(encPattern)
if err != nil {
return nil, fmt.Errorf("error compiling regex:%v", err)
}
charsetRe, err := regexp.Compile(charsetPattern)
if err != nil {
return nil, fmt.Errorf("error compiling regex:%v", err)
}

// Split by boundary
parts := strings.Split(body, boundary)
var readers []io.Reader

// Make decoded io.Readers for each part
for _, part := range parts {
part = strings.TrimSpace(part)
if len(part) == 0 {
continue
}

// Extract MIME header and body
headerBody := strings.SplitN(part, "\r\n\r\n", 2)
if len(headerBody) < 2 {
// retry
headerBody = strings.SplitN(part, "\n\n", 2)
if len(headerBody) < 2 {
continue
}
}

mimeHeader := headerBody[0]
encodedBody := headerBody[1]

// Find encode types
encMatches := encRe.FindStringSubmatch(mimeHeader)
charsetMatches := charsetRe.FindStringSubmatch(mimeHeader)

// Decode
if len(encMatches) > 1 && len(charsetMatches) > 1 {
reader, err := decodeTransferEncodeAndCharset(encMatches[1][0:1], charsetMatches[1], encodedBody)
if err != nil {
return nil, err
}
readers = append(readers, reader)

} else {
return nil, fmt.Errorf("failed to match encoding and charset in:\n%s", mimeHeader)
}
}

return io.MultiReader(readers...), nil
}
86 changes: 86 additions & 0 deletions store/search_test.go
@@ -0,0 +1,86 @@
package store

import (
"fmt"
"io"
"log/slog"
"os"
"strings"
"testing"

"github.com/mjl-/mox/message"
"github.com/mjl-/mox/mlog"
)

func TestSubjectMatch(t *testing.T) {
// Auto detect subject text encoding and decode

//log := mlog.New("search", nil)

originalSubject := `テストテキスト Abc 123...`
asciiSubject := "test text Abc 123..."

encodedSubjectUTF8 := `=?UTF-8?b?44OG44K544OI44OG44Kt44K544OIIEFiYyAxMjMuLi4=?=`
encodedSubjectISO2022 := `=?iso-2022-jp?B?GyRCJUYlOSVIJUYlLSU5JUgbKEIgQWJjIDEyMy4uLg==?=`
encodedSubjectUTF8 = encodedSubjectUTF8 + " \n " + encodedSubjectUTF8
encodedSubjectISO2022 = encodedSubjectISO2022 + " \n " + encodedSubjectISO2022
originalSubject = originalSubject + originalSubject

encodedTexts := map[string]string{encodedSubjectUTF8: originalSubject, encodedSubjectISO2022: originalSubject, asciiSubject: asciiSubject}

for encodedSubject, originalSubject := range encodedTexts {

// Autodetect & decode
decodedSubject, err := decodeRFC2047(encodedSubject)

fmt.Printf("decoded text:%s\n", decodedSubject)
if err != nil {
t.Fatalf("Decode error: %v", err)
}

if originalSubject != decodedSubject {
t.Fatalf("Decode mismatch %s != %s", originalSubject, decodedSubject)
}
}
}

func TestMultipartMailDecode(t *testing.T) {
log := mlog.New("search", nil)

// Load raw mail file
filePath := "../../data/mail_raw.txt" // multipart mail raw data
wordFilePath := "../../data/word.txt"

msgFile, err := os.Open(filePath)
if err != nil {
t.Fatalf("Failed to open file: %v", err)
}
defer msgFile.Close()

// load word
wordFile, err := os.Open(wordFilePath)
if err != nil {
t.Fatalf("Failed to open file: %v", err)
}
defer wordFile.Close()
tmp, err := io.ReadAll(wordFile)
if err != nil {
t.Fatalf("Failed to load search word: %v", err)
}
searchWord := strings.TrimSpace(string(tmp))

// Parse mail
mr := FileMsgReader([]byte{}, msgFile)
p, err := message.Parse(log.Logger, false, mr)
if err != nil {
t.Fatalf("Parsing message: %v", err)
}

// Match
ws := PrepareWordSearch([]string{searchWord}, []string{})
ok, err := ws.MatchPart(log, &p, true)
if err != nil {
t.Fatalf("Matching part: %v", err)
}
if !ok {
t.Fatalf("Match failed %s", ws.words)
}
log.Debug("Check match", slog.String("word", string(searchWord)), slog.Bool("ok", ok))
}
5 changes: 4 additions & 1 deletion vendor/modules.txt
@@ -61,6 +61,9 @@ github.com/prometheus/procfs/internal/util
# github.com/russross/blackfriday/v2 v2.1.0
## explicit
github.com/russross/blackfriday/v2
# github.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d
## explicit
github.com/saintfish/chardet
# go.etcd.io/bbolt v1.3.11
## explicit; go 1.22
go.etcd.io/bbolt
@@ -97,7 +100,7 @@ golang.org/x/sync/errgroup
golang.org/x/sys/cpu
golang.org/x/sys/unix
golang.org/x/sys/windows
# golang.org/x/text v0.18.0
# golang.org/x/text v0.19.0
## explicit; go 1.18
golang.org/x/text/cases
golang.org/x/text/encoding