Feature request: Facilitate collaborative project work with ScanTailor #102

Open
derycck opened this issue Jan 18, 2022 · 3 comments

Comments

@derycck

derycck commented Jan 18, 2022

CONTEXT
A large book-scanning project can take tens of hours of work when the goal is to maximize quality.

So a single person digitizing a book in their spare time with ScanTailor Universal could take several days or weeks to complete the project.

There are also groups of people interested in creating an ebook who are willing to collaborate to finish it faster.

THE PROBLEM
Today ScanTailor Universal saves a project in an XML file containing various information, including the full path of the folder holding the original images and the full path of the folder holding the edited images.

Thus, collaborative work using version control tools such as Git is very arduous, because each team member's saved project XML file contains different full paths for the folders that store the images.

SUGGESTION
Facilitate collaborative work with ScanTailor by keeping only relative file paths in the XML file and moving the absolute folder paths to a separate file, to allow project versioning without merge conflicts.

PURELY ILLUSTRATIVE EXAMPLE
When saving a ScanTailor project:

  • Create the '{project_name}.ScanTailor' XML without the 'outputDirectory' and 'directory path' variables.
  • And create a second 'project file' named '{project_name}.local' containing the mentioned variables.
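
A minimal Python sketch of this split, purely as an illustration. It assumes the project XML stores the output folder in an `outputDirectory` attribute on the root element and the source folders in `<directory>` elements with `id` and `path` attributes; the sample XML, the JSON layout of the '.local' file, and both function names are hypothetical, not ScanTailor's actual format or API.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified project file (assumed layout).
SAMPLE_PROJECT = """<project outputDirectory="/home/alice/book/out">
  <directories>
    <directory id="1" path="/home/alice/book/scans"/>
  </directories>
</project>"""


def split_local_paths(project_xml):
    """Return (shared_xml, local_json) with absolute paths moved out of the XML."""
    root = ET.fromstring(project_xml)
    local = {
        "outputDirectory": root.attrib.pop("outputDirectory", None),
        "directories": {},
    }
    for directory in root.iter("directory"):
        # Remember each folder path by its id, then drop it from the shared XML.
        local["directories"][directory.get("id")] = directory.attrib.pop("path", None)
    return ET.tostring(root, encoding="unicode"), json.dumps(local, indent=2)


def merge_local_paths(shared_xml, local_json):
    """Rebuild a full project XML from the shared file and one member's local paths."""
    root = ET.fromstring(shared_xml)
    local = json.loads(local_json)
    if local["outputDirectory"] is not None:
        root.set("outputDirectory", local["outputDirectory"])
    for directory in root.iter("directory"):
        path = local["directories"].get(directory.get("id"))
        if path is not None:
            directory.set("path", path)
    return ET.tostring(root, encoding="unicode")
```

With something like this, only the shared XML would be committed, while each member keeps their own local file out of version control.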

RESULT
In this way, team members working through the entire editorial process of the same book with ScanTailor Universal could add the '.local' file to their '.gitignore' and easily version the '.scantailor' file with Git without merge conflicts.
Each member would be able to work on a set of book pages in parallel and take advantage of the full potential of version control systems.

@trufanov-nok
Owner

Hi,
An interesting thought! I've never heard of any attempts to use any ST forks in a collaborative way... Especially with version control systems like git.

I suspect that the filenames problem is only the very first problem one will face when trying to use ST in such a way. By design, ST can't run the output step until all pages in the project are processed. And it can't define the final page sizes at the Page Layout step if the "Align with other page sizes" option is on (it is by default) and not all pages have been processed in the previous steps. I mean that a project's pages can't always be processed in parallel, as at some processing steps the default processing values depend on the parameters of all pages at the previous step.

Also, I suspect one can't easily merge changes to a partially processed project file even if the filenames are the same. You may experiment with that. Create a project with, for example, 10 pages. Make an initial git commit. Create a new branch. Process the first 3 pages, commit. Switch to the initial branch. Process 3 other pages, commit. Then try to merge these branches. The file paths will be the same, but I doubt that these two branches could be merged automatically.
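
The experiment above can be written down as an ordered checklist, sketched here in Python. The project file name, the branch name, and the commit messages are hypothetical placeholders; the page processing itself has to be done by hand in ScanTailor, so those steps are marked as manual. It assumes `git` is installed and a commit identity is configured.

```python
import subprocess


def merge_experiment_steps(project_file="book.ScanTailor", branch="pages-1-3"):
    """Ordered steps: ("git", argv) runs a command, ("manual", text) needs ScanTailor."""
    return [
        ("git", ["git", "init"]),
        ("git", ["git", "add", project_file]),
        ("git", ["git", "commit", "-m", "Initial 10-page project"]),
        ("git", ["git", "checkout", "-b", branch]),
        ("manual", "Process the first 3 pages in ScanTailor and save the project"),
        ("git", ["git", "commit", "-am", "Process first 3 pages"]),
        # "git checkout -" returns to the previously checked-out (initial) branch.
        ("git", ["git", "checkout", "-"]),
        ("manual", "Process 3 other pages in ScanTailor and save the project"),
        ("git", ["git", "commit", "-am", "Process 3 other pages"]),
        ("git", ["git", "merge", branch]),
    ]


def run(steps, repo_dir):
    """Walk through the steps inside repo_dir, pausing at each manual step."""
    for kind, step in steps:
        if kind == "manual":
            input(f"{step}, then press Enter: ")
            continue
        # No check=True: the final merge may legitimately stop with conflicts.
        result = subprocess.run(step, cwd=repo_dir)
        if result.returncode != 0:
            print("Command failed or reported conflicts:", " ".join(step))
            break
```

Calling `run(merge_experiment_steps(), "path/to/project/folder")` would walk through the whole test; whether the last step merges cleanly is exactly what the experiment is meant to find out.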

@derycck
Author

derycck commented Jan 18, 2022

You're right : /
I did the test you suggested, and it seems the change in the XML, even when it is only at the dewarping level, is much bigger than I imagined... It would take a lot of time to resolve the conflicts.

What I want in practice is to process a 500-page book completely through automatic batch processing (deskew, dewarping...), then save the project and assign responsibility for 'review and fine tuning' of dewarping and deskew to 10 different people.

Since the project file is XML, I imagined that versioning it would make it possible to cooperate with people in different places who are interested in accelerating the production of the same ebook.

Perhaps this would be possible if the editing metadata of each image were stored in an independent metadata file, ensuring that if a person edits the dewarping or deskew of one page, only one metadata file (that page's) would change.

By the way, something similar to this 'page-independent metadata files' design exists in ABBYY FineReader, judging by how its internal files change when small changes are made during the review process.

But unfortunately, applying this would require a giant change in ScanTailor's software design.

Would you be interested in opening a crowdfunding campaign for this feature to make it viable?

Thanks for responding, and I apologize for not testing the concept before posting the suggestion here.

@derycck
Author

derycck commented Jan 20, 2022

Hi, Alex.
I looked more closely at the XML structure and noticed that all the transformation steps are associated with the page id.

I have an idea that I personally plan to develop involving Scantailor and would appreciate your thoughts and advice if it's not too much trouble.

The project XML is structured with the following keys:

  • [files, images, pages, file-name-disambiguation, filters]

Where 'filters' contains the keys:

  • [fix-orientation, page-split, deskew, select-content, page-layout, output]

I noticed that each of these keys contains a collection of per-page data, keyed by page id.

In order to enable collaborative work with ScanTailor, I'm thinking about creating a 'plugin' approach to allow versioning (with tools like Git).

I'm considering building a converter from a ScanTailor project to a collection of JSON files, one JSON file for each page of the project.

That way, tweaking 10 pages in the dewarp and deskew steps would only affect 10 JSON files.

I considered the possibility of creating:

  • A metadata exporter that parses the original XML and reorganizes the data under the filters keys into JSON files, one per page id. Each JSON file contains that page's filter data: fix-orientation, page-split, deskew, select-content, page-layout, output

  • An importer with the inverse effect, which reads the JSON files and updates the base XML.
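
A rough Python sketch of what such an exporter and importer pair could look like. It assumes each section under `filters` holds child elements that carry an `id` attribute, with at most one entry per id per section; the sample XML, the `page_<id>.json` naming, and the function names are hypothetical and much simpler than a real project file.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path

FILTERS = ["fix-orientation", "page-split", "deskew",
           "select-content", "page-layout", "output"]

# Hypothetical, heavily simplified project file (assumed layout).
SAMPLE = """<project>
  <filters>
    <deskew>
      <page id="1"><params angle="0.5" mode="auto"/></page>
      <page id="2"><params angle="-1.2" mode="auto"/></page>
    </deskew>
    <select-content>
      <page id="1"><params mode="auto"/></page>
    </select-content>
  </filters>
</project>"""


def element_to_dict(el):
    """Turn an XML element into a JSON-serializable dict."""
    return {"tag": el.tag, "attrib": dict(el.attrib),
            "text": (el.text or "").strip(),
            "children": [element_to_dict(child) for child in el]}


def dict_to_element(data):
    """Inverse of element_to_dict."""
    el = ET.Element(data["tag"], data["attrib"])
    el.text = data["text"] or None
    for child in data["children"]:
        el.append(dict_to_element(child))
    return el


def export_pages(project_xml, out_dir):
    """Write one JSON file per id, holding that page's entry from each filter."""
    root = ET.fromstring(project_xml)
    pages = {}
    for name in FILTERS:
        section = root.find(f"filters/{name}")
        if section is None:
            continue
        for entry in section:
            page_id = entry.get("id")
            if page_id is not None:  # entries without an id stay only in the XML
                pages.setdefault(page_id, {})[name] = element_to_dict(entry)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for page_id, filters in pages.items():
        text = json.dumps(filters, indent=2, sort_keys=True) + "\n"
        (out_dir / f"page_{page_id}.json").write_text(text)
    return sorted(pages)


def import_pages(project_xml, in_dir):
    """Replace matching per-page entries in the base XML with the JSON versions."""
    root = ET.fromstring(project_xml)
    for path in sorted(Path(in_dir).glob("page_*.json")):
        page_id = path.stem[len("page_"):]
        for name, data in json.loads(path.read_text()).items():
            section = root.find(f"filters/{name}")
            if section is None:
                continue
            for i, entry in enumerate(section):
                if entry.get("id") == page_id:
                    new_entry = dict_to_element(data)
                    new_entry.tail = entry.tail  # keep surrounding whitespace
                    section[i] = new_entry
                    break
    return ET.tostring(root, encoding="unicode")
```

Sorted keys and fixed indentation keep the JSON output stable between exports, so a diff should only show the pages that were actually edited. JSON entries with no matching entry in the base XML are ignored in this sketch.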

Thus, versioning could be applied to the folder with the JSON collection, and users would run the import and export on each repository update.

I am keeping in mind the issue with the 'Page layout' step and its need to render all the pages of the project. I will consider this in the plan.

I program in Python and am motivated to develop this.
I would like to know your opinion on this: whether you think it's promising, and whether you have any advice to keep my development on a good path.
