Data not being processed correctly #5

Open
maxpowel opened this issue Feb 13, 2024 · 10 comments

Comments

@maxpowel

Hi, I found another image that is not being processed correctly.

The attachment contains the original data, the PBM created by the fax crate, and a TIFF to preview it:
stream_6.zip

It looks similar to #2: the image is partially processed and then, at some point, boom!

Using this Go snippet you can get the output below (it is similar to the one in the other ticket, but with the newlines calculated instead of hardcoded). The library source code is at https://github.com/golang/image/blob/master/ccitt/reader.go
image

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"os"
	"strconv"
	"strings"

	"golang.org/x/image/ccitt"
)

func byteToBits(b byte) string {
	// Convert the byte to a string of bits
	bits := strconv.FormatUint(uint64(b), 2)

	// Left-pad with zeros if the bit string is shorter than 8 characters
	for len(bits) < 8 {
		bits = "0" + bits
	}
	return bits
}

func main() {
	file, err := os.Open("/media/storage/ocr/elemento")
	if err != nil {
		fmt.Println("Error opening the CCITT image:", err)
		return
	}
	defer file.Close()
	fmt.Println("File open")
	fileReader := bufio.NewReader(file)

	cols := 264

	rows := 100

	blackIs1 := false
	encodedByteAlign := false

	opts := &ccitt.Options{Invert: blackIs1, Align: encodedByteAlign}
	mode := ccitt.Group4

	rd := ccitt.NewReader(fileReader, ccitt.MSB, mode, cols, rows, opts)

	var b bytes.Buffer
	written, err := io.Copy(&b, rd)
	if err != nil {
		fmt.Println("Error while decoding:", err)
	} else {
		fmt.Println("OK written", written)
		index := 0
		for v, e := b.ReadByte(); e == nil; v, e = b.ReadByte() {
			bits := byteToBits(v)
			fmt.Print(strings.Replace(strings.Replace(bits, "1", " ", -1), "0", "*", -1))
			//fmt.Print(bits)
			index += 1
			if index%(cols/8) == 0 {
				fmt.Println("")
				index = 0
			}
		}

	}
}
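
The row wrapping in the snippet relies on `cols/8`, which only lines up when the width is a multiple of 8 (264 is). As a minimal sketch, assuming the decoded stream is one bit per pixel, MSB first, with each row padded to a whole byte and a set bit meaning white, the rendering step could be isolated like this. The helper name `renderRows` is hypothetical and not part of `golang.org/x/image/ccitt`.

```go
package main

import (
	"fmt"
	"strings"
)

// renderRows (hypothetical helper) turns packed 1-bit rows into ASCII art.
// Assumptions: rows are MSB-first, each row is padded to a whole byte,
// a set bit means white (' ') and a clear bit means black ('*').
func renderRows(data []byte, cols int) string {
	stride := (cols + 7) / 8 // bytes per row, rounded up
	if stride == 0 {
		return ""
	}
	var sb strings.Builder
	// Only complete rows are rendered; a trailing partial row is ignored.
	for start := 0; start+stride <= len(data); start += stride {
		row := data[start : start+stride]
		for x := 0; x < cols; x++ {
			bit := (row[x/8] >> (7 - uint(x%8))) & 1
			if bit == 1 {
				sb.WriteByte(' ')
			} else {
				sb.WriteByte('*')
			}
		}
		sb.WriteByte('\n')
	}
	return sb.String()
}

func main() {
	// Two rows of 10 pixels: 2 bytes per row, the last 6 bits are padding.
	data := []byte{0x0F, 0xFF, 0xF0, 0x3F}
	fmt.Print(renderRows(data, 10))
}
```

Rounding the stride up with `(cols + 7) / 8` keeps the rows aligned for widths that are not a multiple of 8, and the padding bits are never printed.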

Thank you so much!

@s3bk s3bk closed this as completed in 44005aa Feb 14, 2024
@s3bk
Collaborator

s3bk commented Feb 14, 2024

If you find more broken images, please reopen the issue.
I should add a test data folder...

@maxpowel
Author

Thank you.

I found more files:
files.zip
It contains these files:
33_1832 -> width of 1832
44_1984 -> width of 1984
65_1840 -> width of 1840
71_1880 -> width of 1880

If I can help with anything else, please tell me.

@s3bk s3bk reopened this Feb 14, 2024
@s3bk
Collaborator

s3bk commented Feb 15, 2024

I fixed it mostly.

There is still something funny with the height of the images.

@maxpowel
Author

Is there something I can do to help?

@s3bk
Collaborator

s3bk commented Mar 25, 2024

The bug is a missing vertical line at the end. Nothing terrible.
If you find more broken files, let me know.

@maxpowel
Author

I have a new batch:
errors.zip
The file names follow the format {id}-w{width}.raw. For example, in 1066_0-w1192.raw the id is 1066_0 (not needed for parsing) and the width is 1192 (needed for parsing).
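
Since the width has to be recovered from the file name before decoding, a small parser for the `{id}-w{width}.raw` convention may be useful in a test harness. The following is only a sketch; `widthFromName` is a hypothetical helper name, and the directory in the example path is made up for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

// widthFromName (hypothetical helper) extracts the width from a file name of
// the form {id}-w{width}.raw, for example "1066_0-w1192.raw".
func widthFromName(path string) (int, error) {
	name := strings.TrimSuffix(filepath.Base(path), ".raw")
	// The id is ignored; only the part after the last "-w" matters.
	i := strings.LastIndex(name, "-w")
	if i < 0 {
		return 0, fmt.Errorf("no -w{width} part in %q", path)
	}
	width, err := strconv.Atoi(name[i+2:])
	if err != nil || width <= 0 {
		return 0, fmt.Errorf("invalid width in %q", path)
	}
	return width, nil
}

func main() {
	// The "errors/" directory is an assumption for this example.
	w, err := widthFromName("errors/1066_0-w1192.raw")
	fmt.Println(w, err)
}
```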

Thanks

@s3bk
Collaborator

s3bk commented Apr 5, 2024

Most are fixed, but a few still appear broken. I need to convert them to PBM to have something to check against.
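
For reference, producing a binary PBM (P4) from already decoded rows takes little code. This is a rough sketch, not how the fax crate does it: it assumes the decoded rows are packed MSB-first and padded to a whole byte, and `encodePBM` is a hypothetical helper name. In PBM a set bit is black, so data that uses 1 for white needs to be inverted first.

```go
package main

import "fmt"

// encodePBM (hypothetical helper) wraps packed 1-bit rows in a binary PBM (P4)
// container. PBM stores rows MSB-first, padded to a whole byte, with 1 = black.
// Set invert when the input uses 1 = white.
func encodePBM(data []byte, width, height int, invert bool) ([]byte, error) {
	if width <= 0 || height <= 0 {
		return nil, fmt.Errorf("invalid size %dx%d", width, height)
	}
	stride := (width + 7) / 8 // bytes per row, rounded up
	if len(data) < stride*height {
		return nil, fmt.Errorf("need %d bytes, got %d", stride*height, len(data))
	}
	out := []byte(fmt.Sprintf("P4\n%d %d\n", width, height))
	for _, b := range data[:stride*height] {
		if invert {
			b = ^b // padding bits flip too; PBM readers ignore them
		}
		out = append(out, b)
	}
	return out, nil
}

func main() {
	// One row of 10 pixels where 1 = white, so it is inverted for PBM.
	out, err := encodePBM([]byte{0x0F, 0xFF}, 10, 1, true)
	fmt.Printf("%q %v\n", out, err)
}
```

The resulting bytes can be written to a `.pbm` file and compared against a reference image.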

@s3bk
Copy link
Collaborator

s3bk commented Apr 5, 2024

@maxpowel can you check the files 1106, 888, 786 and 746?
I wrapped them in a TIFF file, and my image viewer then shows strange lines. Is their width correct?

@maxpowel
Author

maxpowel commented Apr 6, 2024

Thank you @s3bk!

These images are correct; they really are just strange lines. They come from a PDF, specifically from this page:
image

As you can see, it is a table and these strange lines correspond to the table borders and inner lines. The scanner somehow separated the table lines from the text. These images are the text:
ok-1103_0-w1240
ok-1104_0-w1240

The missing parts of the left column (some kind of ids) are also split off into other images. It looks like Frankenstein's monster.
It even separated the odd lines from the others. Weird, but it is probably some kind of optimization.

All files are now processed properly (it is a big scanned PDF with hundreds of images). Next I will test with a few thousand PDFs I have, and I will let you know the results.

Again, thanks for your effort.

@maxpowel
Author

Hello, I have some new files but now the issue is with decode_g3. My testing code is this:

let g3_success = decoder::decode_g3(input.iter().cloned(), |transitions| {
    for c in pels(transitions, width) {
        let bit = match c {
            Color::Black => Bits { data: 1, len: 1 },
            Color::White => Bits { data: 0, len: 1 },
        };
        writer.write(bit).unwrap();
    }
    writer.pad();
    //height += 1;
}).is_some();

It is very similar to g4.
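
To make the per-line step in that callback easier to reason about, here is a rough Go sketch of what expanding transitions into a packed row might look like. It is not the fax crate's implementation. It assumes that the transitions are sorted column positions where the color flips, that every line starts white (the usual CCITT convention), that black is stored as 1 as in the snippet above, and that the row is zero-padded to a whole byte. `packLine` is a hypothetical name.

```go
package main

import "fmt"

// packLine (hypothetical helper) expands color-change positions into one
// packed row. Assumptions: transitions are sorted column indices where the
// color flips, the line starts white, black is stored as 1, bits are
// MSB-first, and the row is zero-padded to a whole byte.
func packLine(transitions []int, width int) []byte {
	row := make([]byte, (width+7)/8)
	black := false
	next := 0 // index of the next unconsumed transition
	for x := 0; x < width; x++ {
		// Flip the color once for every transition at or before column x.
		for next < len(transitions) && transitions[next] <= x {
			black = !black
			next++
		}
		if black {
			row[x/8] |= byte(0x80) >> uint(x%8)
		}
	}
	return row
}

func main() {
	// Width 10 with flips at columns 2 and 5: columns 2..4 are black.
	fmt.Printf("%08b\n", packLine([]int{2, 5}, 10))
}
```

Comparing rows built this way against the PBM produced from the `fax2tiff` output could help narrow down on which line the G3 decoder first diverges.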

Here are some samples:
test.zip
And the command to convert them to TIFF:
fax2tiff -3 -X 1656 -M test/14_0_1656.raw -o ee.tif

Thank you
