Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add script to move files to the framework #145

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

davidhorstmann-arm
Copy link
Contributor

Pushing a small script that moves files to the framework repo from Mbed TLS, while retaining history.

This script works but may not be in a fully-polished state.

This is useful for building file-moves on top of other not-yet-merged
file moves.

Signed-off-by: David Horstmann <[email protected]>
When there are no dirs to make we try anyway.

This is bad and causes an error. Fix this.

Signed-off-by: David Horstmann <[email protected]>
To allow branches to be built on other branches, resolve conflicts
'manually' when the 2 branches delete each others' files.

Signed-off-by: David Horstmann <[email protected]>
Allow passing the --except flag, which stops a file from being moved
(and perform the necessary path manipulation acrobatics to make that
work).

Created to be able to move tests/include except for
tests/include/alt-dummy, like so:

--path tests/include:include --except tests/include/alt-dummy

Signed-off-by: David Horstmann <[email protected]>
This was failing if no exceptions were supplied.

Signed-off-by: David Horstmann <[email protected]>
This causes the script to not perform the operation, but simply print
the self._file_map dictionary of files to rename (this is useful for
debugging the file-moving logic, especially with combinations of
--path and --except options).

Signed-off-by: David Horstmann <[email protected]>
When files are moved to the exact same location in the framework repo
as they were in Mbed TLS, there is a problem.

Since no git change has taken place on the Mbed TLS side, the most
recent change to the file is its deletion when a previous set of files
were moved to the framework (since all other files are deleted in this
kind of move).

As a result the moved files are deleted and do not appear in the final
branch. Deal with this edge case explicitly by taking files which are
not renamed from the file map and checking them out to the state in
the temporary branch, before amending the resulting commit.

Signed-off-by: David Horstmann <[email protected]>
Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a mostly full code review, more than we usually do in mbedtls-docs. Arguably this script could live in the framework repository; for that level of quality, I'd want most of my comments to be handled. For the level of quality in mbedtls-docs, all my comments are optional except one: this script seems to mess up if the destination repo is the framework subdirectory of the source.

def add_file_move(self, src_path: str, dst_path: str) -> None:
"""Move file at relative path src_path in the source repo to dst_path
in the destination repo"""
self._file_map[src_path] = dst_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no normalization like there is for the initialization of _file_map?

cmd = [self.GIT_EXE] + git_cmd
return subprocess.check_output(cmd, **kwargs)

def add_file_move(self, src_path: str, dst_path: str) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method isn't used anywhere. __init__ should probably call it instead of setting _file_map on its own.

def __init__(self, source_repo: str, dest_repo: str,
source_branch_name: str, dest_branch_name: str,
file_map: Dict[str, str],
dest_base_branch_name: Optional[str] = None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to document the parameters. It's not obvious what you can put in file_map, or the difference between dest_branch_name and dest_base_branch_name.

def _direct_children(self, dir_path: str) -> List[str]:
"""Get the direct children of a subdirectory (at least the ones
that matter to git)"""
leaf_files = self.run_git(['ls-files', dir_path]).split()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we ever have a file name containing a space, this will split on the space. (Control characters in file names would also not work properly.) The robust way would be

Suggested change
leaf_files = self.run_git(['ls-files', dir_path]).split()
leaf_files = self.run_git(['ls-files', '-z', '--', dir_path]).rstrip('\0').split('\0')

This applies to the other uses of ls-files.

It would be good to make an auxiliary method that calls ls-files and returns a list.

Comment on lines +73 to +80
leaf_files = self.run_git(['ls-files', dir_path]).split()
path_len = len(self._path_parts(dir_path)) + 1
direct_children = set([
os.path.join(
*self._path_parts(f)[:path_len]
)
for f in leaf_files
])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be equivalent but simpler: run git -C dir_path ls-files and strip anything after a slash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though looking at how this method is used, it seems that you're stripping the paths down to base names here, only to reconstruct the full path later in _expand_dir_move?

def main() -> int:
"""Command line entry point."""
parser = argparse.ArgumentParser()
parser.add_argument('--path', action='append', nargs='?',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than require --path to be repeated for each path, I think this should be the positional arguments. In particular it would allow calls like mbedtls-move-to-framework … subdir/*.py.

GIT_EXE = 'git'
FRAMEWORK_DEFAULT_BASE_BRANCH = 'main'

def __init__(self, source_repo: str, dest_repo: str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be consistent when abbreviating “destination”: either dst or dest, not both.

Comment on lines +295 to +296
parser.add_argument('--dst-base-branch', default=None, required=False,
help='Base branch in the framework repo to make changes from')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't actually need to be a branch name, it can be any revision. In practice I'd want whatever is the current head of the framework submodule, and I think that should be the default, rather than main.

In fact main is likely not to work in practice, because it'll typically be some old reference made when the working directory was originally cloned I delete the default branch name in my local clones to avoid keeping that old reference around, so when I didn't pass --dst-branch-name, I got the error

fatal: 'main' matched multiple (4) remote tracking branches

self.run_git(['checkout', self._framework_base_branch_name])

# Create a new branch with checkout -b
self.run_git(['checkout', '-b', self._framework_branch_name])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add checks for common errors before starting to do anything. At least check that the branches don't already exist, which is likely to be the case the second time the same person uses the script.

# Delete all files except the ones we are moving
for f in repo_files:
if os.path.normpath(f) not in files_to_retain:
self.run_git(['rm', f])
Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BLOCKER: As far as I understand it (more info in Slack thread), this will in particular remove framework. So from that point, when checking out the original head again, the framework submodule is no longer initialized. Then later the framework directory itself disappears, which I don't understand.

I don't fully understand what's going on, but as far as I can tell, you can't have the destination be the framework directory of the source, it has to be a separate worktree. This is a very unintuitive limitation that needs to be at least documented.

@gilles-peskine-arm
Copy link
Contributor

I'm marking this as priority-high because we need this to move files to the framework with the history preserved. Arguably this should be in the framework repository as an essential maintainer tool (even if it'll only be essential for a few months).

self.run_git(['mv', f, self._file_map[f]])

# Commit the result
commit_message = "Move some files to framework repository"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like the ability to specify a commit message. I can do it afterwards by rebasing, and that's not really a problem in the source branch, but in the destination branch, the commit message is behind a merge and git is reluctant to edit history beyond a merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, if I want to edit the commit message, I need to redo the conflict resolution which is implemented here as resolve_deletion_conflicts. That's not easy!

@gilles-peskine-arm
Copy link
Contributor

I've now used the script successfully. As far as I can't tell it just can't work if the destination is the framework subdirectory of an mbedtls worktree. If the source is an mbedtls worktree, the script can work, but the source worktree is disrupted because the framework submodule gets deinitialized, and the resulting history may be wrong because any file that is neither tracked nor ignored gets added.

So this script should come at least with a warning, if not enforcement, that both the source and destination should be independent worktrees, freshly checked out, that are not used for doing anything else. Preferably, the script should create these worktrees itself (as mbedtls-rewrite-branch-style does, we can reuse some of its code).

@mpg
Copy link
Contributor

mpg commented Oct 10, 2024

I've not yet used this script myself, but looking at its result there seems to be extra commits that I wouldn't expect to be present in the history, see Mbed-TLS/mbedtls-framework#55 (comment)

@mpg
Copy link
Contributor

mpg commented Oct 10, 2024

I'm marking this as priority-high because we need this to move files to the framework with the history preserved. Arguably this should be in the framework repository as an essential maintainer tool (even if it'll only be essential for a few months).

Agreed. I've created Mbed-TLS/mbedtls-framework#57 and added it to the first framework epic. (I don't care much about where the script lives, as long as it meets quality requirements.)

@gilles-peskine-arm
Copy link
Contributor

I've not yet used this script myself, but looking at its result there seems to be extra commits that I wouldn't expect to be present in the history

We're preserving the commit history, not just the patch history. So each import branch must have the full history of the project up to the last commit that modifies one of the imported files.

The first time we did this in a pull request, GitHub showed basically all the mbedtls history as part of the pull request, because these were all new commits in the main branch of the framework. In each subsequent pull request that imports files into the framework, GitHub just shows recent commits that aren't yet in one of the imports.

There are tools that can rewrite a git branch to only retain commits that modify specific files, and construct a commit history with the same messages and the same patches for this subset of files. But you lose commit IDs (obviously), renamed files, content moved between files, etc. And these tools are harder to use. So I think we've made the right choice.

@mpg
Copy link
Contributor

mpg commented Oct 10, 2024

Ah OK, I had misunderstood what the script does.

There are tools that can rewrite a git branch to only retain commits that modify specific files, and construct a commit history with the same messages and the same patches for this subset of files. But you lose commit IDs (obviously), renamed files, content moved between files, etc. And these tools are harder to use. So I think we've made the right choice.

Yeah, that's what I thought the script was doing, and indeed now that you mention it that wouldn't be the right thing to do. Thanks for clearing that up.

This allows the user to write a message that they prefer. While we're at
it, we can supply a helpful template so that it's a bit clearer what
each commit message is for.

Signed-off-by: David Horstmann <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-work priority-high size-m Estimated task size: medium (~1w)
Projects
Development

Successfully merging this pull request may close these issues.

3 participants