Data not being processed correctly #5

maxpowel · 2024-02-13T11:44:20Z

Hi, I found another image that is not being processed correctly.

This attachment contains the original data, the PBM created by the fax crate, and a TIFF to preview it:
stream_6.zip

It looks similar to #2: the image is partially processed and then, at some point, boom!

Using this Go snippet you can get this output (it is similar to the one in the other ticket, but with newlines calculated instead of hardcoded). The library source code is at https://github.com/golang/image/blob/master/ccitt/reader.go

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"os"
	"strconv"
	"strings"

	"golang.org/x/image/ccitt"
)

func byteToBits(b byte) string {
	// Convert the byte to a string of bits
	bits := strconv.FormatUint(uint64(b), 2)

	// Left-pad with zeros if the bit string is shorter than 8 characters
	for len(bits) < 8 {
		bits = "0" + bits
	}
	return bits
}

func main() {
	file, err := os.Open("/media/storage/ocr/elemento")
	if err != nil {
		fmt.Println("Error opening the CCITT image:", err)
		return
	}
	defer file.Close()
	fmt.Println("File open")
	fileReader := bufio.NewReader(file)

	cols := 264

	rows := 100

	blackIs1 := false
	encodedByteAlign := false

	opts := &ccitt.Options{Invert: blackIs1, Align: encodedByteAlign}
	mode := ccitt.Group4

	rd := ccitt.NewReader(fileReader, ccitt.MSB, mode, cols, rows, opts)

	var b bytes.Buffer
	written, err := io.Copy(&b, rd)
	if err != nil {
		fmt.Println("Error while decoding:", err)
	} else {
		fmt.Println("OK written", written)
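		index := 0
		bytesPerRow := (cols + 7) / 8 // each decoded row is padded to a whole byte
		// Read until the buffer is exhausted. Comparing v against 'e' would
		// stop early at any data byte equal to 0x65.
		for v, e := b.ReadByte(); e == nil; v, e = b.ReadByte() {
			bits := byteToBits(v)
			// Print 0 bits as '*' and 1 bits as ' '
			fmt.Print(strings.Replace(strings.Replace(bits, "1", " ", -1), "0", "*", -1))
			//fmt.Print(bits)
			index++
			if index%bytesPerRow == 0 {
				fmt.Println("")
				index = 0
			}
		}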

	}
}
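
For comparing against the Rust side, the same text preview can be sketched in Rust. This is only a minimal sketch using the standard library: `ascii_preview` and its `black_is_one` parameter are hypothetical names, not part of the fax crate, and it assumes the decoded rows are packed most-significant-bit first and padded to a whole byte per row. The Go snippet above draws 0 bits as `*`, which corresponds to `black_is_one = false` here.

```rust
/// Renders packed 1-bit rows as text, one output line per image row.
/// `black_is_one` selects which bit value is drawn as `*`.
/// (Hypothetical helper, not part of the fax crate.)
fn ascii_preview(data: &[u8], width: usize, black_is_one: bool) -> String {
    let stride = (width + 7) / 8; // bytes per row, padding included
    let mut out = String::new();
    if stride == 0 {
        return out; // zero width: nothing to draw
    }
    for row in data.chunks(stride) {
        for x in 0..width {
            // A truncated final row simply ends early.
            let Some(byte) = row.get(x / 8) else { break };
            // Pixels are stored most significant bit first.
            let bit = ((*byte >> (7 - x % 8)) & 1) == 1;
            out.push(if bit == black_is_one { '*' } else { ' ' });
        }
        out.push('\n');
    }
    out
}

fn main() {
    // Two rows of a 10-pixel-wide image, 2 bytes per row.
    let data = [0b1111_0000, 0b1100_0000, 0b0000_1111, 0b0000_0000];
    print!("{}", ascii_preview(&data, 10, true));
}
```

Using `(width + 7) / 8` as the row stride keeps the line breaks correct for widths that are not a multiple of 8.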

Thank you so much!


s3bk · 2024-02-14T13:35:21Z

If you find more broken images, please reopen the issue.
I should add a test data folder...

maxpowel · 2024-02-14T16:58:38Z

Thank you.

I found more files:
files.zip
It contains the files:
33_1832 -> width of 1832
44_1984 -> width of 1984
65_1840 -> width of 1840
71_1880 -> width of 1880

If I can help with anything else, please tell me.

s3bk · 2024-02-15T12:52:38Z

I fixed it mostly.

There is still something funny with the height of the images.

maxpowel · 2024-03-20T10:45:59Z

Is there something I can do to help?

s3bk · 2024-03-25T20:59:24Z

The bug is a missing vertical line at the end. Nothing terrible.
If you find more broken files, let me know.

maxpowel · 2024-03-29T12:52:44Z

I have a new batch:
errors.zip
It contains files with this name format:
{id}-w{width}.raw. For example, in the filename 1066_0-w1192.raw the id is 1066_0 (not needed for parsing) and the width is 1192 (needed for parsing).
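
A minimal sketch of how that name format could be split into id and width, using only the Rust standard library. `parse_raw_name` is a hypothetical helper, and returning the width as `u16` is an assumption made for this example.

```rust
use std::path::Path;

/// Splits a name such as `1066_0-w1192.raw` into its id and width.
/// Returns `None` if the name does not follow `{id}-w{width}.raw`.
/// (Hypothetical helper; the `u16` width is an assumption.)
fn parse_raw_name(path: &str) -> Option<(String, u16)> {
    let p = Path::new(path);
    // Only accept files with the `.raw` extension.
    if p.extension()?.to_str()? != "raw" {
        return None;
    }
    // File name without directory and extension, e.g. `1066_0-w1192`.
    let stem = p.file_stem()?.to_str()?;
    // Split at the last "-w" so that an id containing "-w" still parses.
    let (id, width) = stem.rsplit_once("-w")?;
    if id.is_empty() {
        return None;
    }
    let width = width.parse::<u16>().ok()?;
    Some((id.to_string(), width))
}

fn main() {
    match parse_raw_name("1066_0-w1192.raw") {
        Some((id, width)) => println!("id = {id}, width = {width}"),
        None => eprintln!("unexpected file name"),
    }
}
```

The width obtained this way is the value the decoder needs; the id is only useful for reporting which file failed.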

Thanks

s3bk · 2024-04-05T00:48:10Z

Most are fixed, but a few still appear to be broken. I need to convert them to PBM to check against.
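
For reference, wrapping decoded rows in a binary PBM (P4) file needs only a short text header in front of the raster. The sketch below is an illustration, not the crate's code: `to_pbm` is a hypothetical helper, and it assumes the decoded rows are packed most-significant-bit first, with 1 meaning black and each row padded to a whole byte, which is the layout P4 expects.

```rust
/// Wraps packed 1-bit rows in a binary PBM ("P4") file image.
/// `rows` must hold exactly `height` rows of `(width + 7) / 8` bytes.
/// (Hypothetical helper, not part of the fax crate.)
fn to_pbm(width: usize, height: usize, rows: &[u8]) -> Result<Vec<u8>, String> {
    let stride = (width + 7) / 8; // bytes per padded row
    if rows.len() != stride * height {
        return Err(format!(
            "expected {} bytes, got {}",
            stride * height,
            rows.len()
        ));
    }
    // Header: magic number, width and height, then one whitespace byte.
    let mut out = format!("P4\n{} {}\n", width, height).into_bytes();
    out.extend_from_slice(rows);
    Ok(out)
}

fn main() {
    // Two rows of a 10-pixel-wide image, 2 bytes per row.
    let rows = [0b1111_0000u8, 0b1100_0000, 0b0000_1111, 0b0000_0000];
    match to_pbm(10, 2, &rows) {
        Ok(pbm) => println!("PBM image is {} bytes", pbm.len()),
        Err(e) => eprintln!("cannot build PBM: {e}"),
    }
}
```

The length check also catches a row count that does not match the data, for example when the last line is missing.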

s3bk · 2024-04-05T09:45:25Z

@maxpowel can you check the files 1106, 888, 786 and 746?
I wrapped them in a TIFF file, and my image viewer shows strange lines. Is their width correct?

maxpowel · 2024-04-06T10:15:41Z

Thank you @s3bk!

These images are correct. They are just strange lines. These images come from a PDF, specifically from this page:

As you can see, it is a table, and these strange lines correspond to the table borders and inner lines. The scanner somehow separated the table lines from the text. These images are the text:

The missing parts of the left column (some kind of ids) are also split out into other images. It looks like Frankenstein's monster.
It even separated the odd lines from the others. Weird, but it is probably some kind of optimization.

Now all files are being processed properly (it is a big scanned PDF file with hundreds of images). Next I will test with a few thousand PDFs I have. I will let you know the results.

Again, thanks for your effort

maxpowel · 2024-04-22T22:07:32Z

Hello, I have some new files but now the issue is with decode_g3. My testing code is this:

let g3_success = decoder::decode_g3(input.iter().cloned(), |transitions| {
    for c in pels(transitions, width) {
        let bit = match c {
            Color::Black => Bits { data: 1, len: 1 },
            Color::White => Bits { data: 0, len: 1 },
        };
        writer.write(bit).unwrap();
    }
    writer.pad();
    //height += 1;
}).is_some();

It is very similar to the G4 code.

Here are some samples:
test.zip
And this is the command to convert them to TIFF:
fax2tiff -3 -X 1656 -M test/14_0_1656.raw -o ee.tif

Thank you

s3bk closed this as completed in 44005aa Feb 14, 2024

s3bk reopened this Feb 14, 2024
