Git Merge Duplication after Ineffective BFG Use

Posted 2019-02-05 03:45

Question:

I have somehow deeply borked my entire repository (used only by me) and could use some assistance in sorting it out.

Here is what I did. I realized that in my commit history, there were some files containing credentials that I did not want just lying around. So, I decided to be legit and try to use the BFG Repo-Cleaner to fix these issues. I threw all the credentials in .gitignores, and moved on to trying to scrub them out of the history. As per the documentation instructions, I executed these commands:

git clone --mirror myrepo.git
java -jar bfg.jar --delete-files stuffthatshouldbedeleted.txt  myrepo.git

At this point, BFG told me that x number of files had been found and removed. Sweet.

cd myrepo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push

According to the terminal logs, it updated the repo. So far so good, right? I pop into my github account, and after a few clicks, find the credentials still there, file and all, in my history. I go back and try the same set of commands, but using this line instead of the file remover:

java -jar bfg.jar --replace-text passwords.txt  myrepo.git

where passwords.txt is a file containing string instances of all the credentials I would like gone. Again, BFG logs indicate that there are several instances that it has fixed. I push up, check, and the credentials are still there, sitting in Github. I notice that the SHA-1 keys for all of my commits have been altered, so presumably BFG did something, just not the thing I want it to do.

At this point, I give up and try to get back to work, figure I'll sort it out later. I do some work, try to push up, get a weird merge conflict (you are 50 ahead and 50 behind on commits). What? I try to pull and merge, and suddenly, every single commit in my git history is duplicated in name, and some of them are just blank. I check my Github network graph, and it looks like there is a second branch starting from my initial commit that exactly mirrors all of my commits that has been zippered in with my last commit (I have never branched, just been linearly chugging along).

I can't revert to a previous commit, because they are all chronologically duplicated. My credentials are still in there, with twice as many instances now, and my history is doubled and very confusing to try to understand. When I try to run BFG from the beginning now, cloning and mirroring the repo anew, it tells me that there are no credentials in it, despite the fact that I can see them in Github. I could really use some help in understanding what happened, and how, if at all, I can get back to a state of things again.

I am considering just deleting the entire repo and starting anew. I really don't want to do that.

tl;dr: Tried using BFG, somehow duplicated half-baked versions of all commits in my repo, can't untangle, and to add insult to injury, BFG did nothing and claims it's done its job.

Answer 1:

I'm the author of the BFG. I'll try to describe what I think happened, step by step, based on your account:

The pre-BFG manual cleaning...

First you:

threw all the credentials in .gitignores, and moved on to trying to scrub them out of the history.

This description of your actions omits two essential steps:

  1. Manually deleting the credentials from your current file-tree, and committing that change to your repo. If you didn't do this, the BFG would have eradicated the content from your old commits, but protected the dirt in your current commits. This behaviour is covered in the BFG documentation under the section titled 'Your current files are sacred...', and if you forget to do it, the BFG prints a warning message when you run it ("WARNING: The dirty content above may be removed from other commits, but as the protected commits still use it, it will STILL exist in your repository..." etc.). Did you see that message when you ran the BFG?

  2. That commit needs to be pushed up to your GitHub repository before you clone the full mirror of your repository. Did you forget that step?

If you didn't do those things, that would account for your credentials not being fully scrubbed from your repository.
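As a rough illustration of those two preparatory steps, here is a minimal sketch against a throwaway repository. The file name credentials.txt and its content are hypothetical, and the push is only shown as a comment because this sketch has no remote:

```shell
set -eu
# Fixed identity so the sketch runs without any global git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master

# An old commit that accidentally contains credentials (hypothetical file).
echo "password=hunter2" > credentials.txt
git add credentials.txt
git commit -q -m "Add config"

# Step 1: remove the file from the current tree and commit the removal.
# --cached stops tracking the file but leaves the local copy on disk.
git rm -q --cached credentials.txt
echo "credentials.txt" >> .gitignore
git add .gitignore
git commit -q -m "Stop tracking credentials"

# Step 2 (not run here, this sketch has no remote):
#   git push origin master
# Only after that push would you make the mirror clone and run the BFG.

# HEAD is clean now, but the history still carries the file.
tracked_now=$(git ls-files credentials.txt)
commits_touching=$(git log --oneline -- credentials.txt | wc -l | tr -d ' ')
echo "tracked in HEAD: '${tracked_now}'"
echo "commits in history touching the file: ${commits_touching}"
```

Adding a file to .gitignore alone does not untrack it; the removal has to be committed, which is why the sketch uses git rm --cached before the commit.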

Running BFG for the first time...

Moving on, then you:

  • made a fresh mirror clone of your repo from GitHub
  • ran the BFG, filtering using the --delete-files option (did you see a protected-content warning?)
  • pushed the updated repository to GitHub

...at which point:

According to the terminal logs, it updated the repo. So far so good, right? I pop into my github account, and after a few clicks, find the credentials still there, file and all, in my history

So, assuming you did correctly manually remove your bad content from your latest commits before running the BFG, what you saw is fairly weird. Some possible causes:

a) The repository wasn't cloned with the --mirror flag, so not all branches on GitHub were overwritten, leaving dirty history around in non-master branches. However, you've explicitly stated that you used the --mirror flag.

b) Even with a mirror push to GitHub, old commits are still available there when referenced by explicit commit-id (i.e. a GitHub URL that has the commit-id in it), up until the point GitHub runs its automatic garbage-collection on your repository. Pull-requests and forks can also preserve commits from the old history. That would be another possible explanation for the dirty commits you saw.
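Explanation b) has a local analogue that can be tried safely. The sketch below is an assumption-laden stand-in: it rewrites a single commit with git commit --amend instead of running the BFG, and the file app.cfg is hypothetical. It shows that a rewritten-away commit is no longer reachable from any ref, yet stays readable by its id until the reflog is expired and garbage collection runs:

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master

echo "password=hunter2" > app.cfg
git add app.cfg
git commit -q -m "Add config"
old_id=$(git rev-parse HEAD)

# Simulated rewrite (standing in for the BFG): same commit, secret replaced.
echo "password=***REMOVED***" > app.cfg
git add app.cfg
git commit -q --amend -m "Add config"

# No branch or tag reaches the old commit any more...
refs_containing=$(git for-each-ref --contains "$old_id")

# ...yet the object is still stored and can be looked up by its id.
if git cat-file -e "$old_id" 2>/dev/null; then before_gc=present; else before_gc=gone; fi

# The same clean-up commands used in the question.
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$old_id" 2>/dev/null; then after_gc=present; else after_gc=gone; fi
echo "before gc: ${before_gc}, after gc: ${after_gc}"
```

On a hosting service you cannot run the garbage collection yourself, so a URL containing an old commit-id may keep working for a while after a correct rewrite.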

Running BFG for the second time...

In any case, at that point you were concerned, and:

  • ran the BFG again, this time with --replace-text passwords.txt, which updates file contents rather than deleting the entire file.

Again, BFG logs indicate that there are several instances that it has fixed. I push up, check, and the credentials are still there, sitting in Github.

It's a little curious that the BFG said that there was more content to clean away (possibly your credentials were in more places than you thought), but in any case, whatever caused you to still see them after the first run is the same reason you saw them after the second run.

Going back to work

At this point, I give up and try to get back to work, figure I'll sort it out later.

So, at this point you've rewritten your Git repository history (twice!) and pushed it up to GitHub. But your account does not mention you deleting all your local old copies of the repo, as specified in the BFG instructions:

"At this point, you're ready for everyone to ditch their old copies of the repo and do fresh clones of the nice, new pristine data."

So, did you delete your old working copy of the Git repo on your work machine, and re-clone with the new Git repository history? The history in your old repo would have been different to the 'cleaned' history which would have been present in GitHub at that point (even if the 'cleaned' history was not as 'cleaned' as you would have liked it!).

I do some work, try to push up, get a weird merge conflict (you are 50 ahead and 50 behind on commits).

If you were doing the work in an old local copy of your Git repo (rather than a fresh re-clone from GitHub), then this is what you would see. You are essentially pushing up 50 commits of old, dirty history to GitHub, seemingly blissfully unaware that there are already 50 completely different commits on that branch (different to Git, that is, which cares only about commit-ids here). Git thinks what you're doing is a bit weird ('50 ahead and 50 behind') and is trying to tell you that.
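The 'ahead and behind' situation can be reproduced on a small scale. This is a sketch under stated assumptions: the BFG is not run, the cleaned history is simulated by rebuilding the same three commits without a hypothetical secret.txt, and remote.git is a local bare repository standing in for GitHub:

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/master

# build_history DIR yes|no: three commits; with "yes" the first one also
# adds secret.txt, so the two histories get different commit-ids.
build_history() {
  git init -q "$1"
  git -C "$1" symbolic-ref HEAD refs/heads/master
  git -C "$1" remote add origin "$demo/remote.git"
  for i in 1 2 3; do
    echo "step $i" >> "$1/app.cfg"
    if [ "$i" = 1 ] && [ "$2" = yes ]; then
      echo "password=hunter2" > "$1/secret.txt"
    fi
    git -C "$1" add .
    git -C "$1" commit -q -m "commit $i"
  done
}

# The old working copy pushes the dirty history first...
build_history old yes
git -C old push -q origin master

# ...then the rewritten history replaces it on the remote.
build_history cleaned no
git -C cleaned push -q --force origin master

# Back in the old working copy: compare local master with the remote.
git -C old fetch -q origin
counts=$(git -C old rev-list --left-right --count master...origin/master)
ahead=$(echo "$counts" | cut -f1)
behind=$(echo "$counts" | cut -f2)
echo "ahead=$ahead behind=$behind"
```

In this sketch the old copy reports 3 ahead and 3 behind: every commit exists twice, once per history, even though the commit messages match.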

Making things worse...

What? I try to pull and merge, and suddenly, every single commit in my git history is duplicated in name, and some of them are just blank. I check my Github network graph, and it looks like there is a second branch starting from my initial commit that exactly mirrors all of my commits that has been zippered in with my last commit

So, by doing the pull and merge, you've joined together the cleaned history and the dirty history, unifying them with a merge commit. In terms of sorting your history out, this is a bad idea. A better idea would have been to rebase your new work on top of the cleaned history, push it, delete your old working repo, and do a fresh clone.
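The rebase alternative could look roughly like the following. It is a sketch, not a recipe for your exact repository: the cleaned history is simulated rather than produced by the BFG, the file names are hypothetical, and master~1 works here only because exactly one new commit was made on top of the old history:

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/master

# build_history DIR yes|no: three commits, optionally with a secret file.
build_history() {
  git init -q "$1"
  git -C "$1" symbolic-ref HEAD refs/heads/master
  git -C "$1" remote add origin "$demo/remote.git"
  for i in 1 2 3; do
    echo "step $i" >> "$1/app.cfg"
    if [ "$i" = 1 ] && [ "$2" = yes ]; then
      echo "password=hunter2" > "$1/secret.txt"
    fi
    git -C "$1" add .
    git -C "$1" commit -q -m "commit $i"
  done
}

build_history old yes
git -C old push -q origin master
build_history cleaned no
git -C cleaned push -q --force origin master

# New work done in the stale, dirty working copy.
cd old
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "new work"

# Replay only the commits after the old tip (master~1) onto the cleaned
# history, instead of merging the two histories together.
git fetch -q origin
git rebase -q --onto origin/master master~1 master

# The push is now a plain fast-forward; no merge commit is created.
git push -q origin master

cleaned_tip=$(git -C "$demo/cleaned" rev-parse master)
new_parent=$(git rev-parse master~1)
total=$(git rev-list --count master)
echo "commits on master: $total"
```

After this, master consists of the cleaned commits plus the replayed new work, and the dirty commits are no longer part of the branch.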

The aftermath

When I try to run BFG from the beginning now, cloning and mirroring the repo anew, it tells me that there are no credentials in it, despite the fact that I can see them in Github.

This is pretty weird, but I don't really have any explanation for it other than operator error, beyond the 'GitHub gc' explanation already given above. You can share the repository with me (if you like) so I can perform a more detailed inspection, or just send me a zipped copy of the '.bfg-report' directory so I can see what diagnostics the BFG captured on its execution.

Recovery

I could really use some help in understanding what happened, and how, if at all, I can get back to a state of things again.

I hope I've managed to explain some of what's happened.

In terms of sorting out your history (i.e. getting rid of these two duplicate strands), you need to reset your Git history back to the (cleaned) point before you added in that merge commit. Look at the merge commit, and identify which parent history you prefer. What's the last commit (xxxx) in that history before you did the merge?

git checkout master
git reset --hard xxxx

This may well lose the last bit of work you did on your old, dirty, history. Identify that commit (yyyy), and rebase it on top of your history, or just cherry-pick it:

git cherry-pick yyyy

Finally, push your recovered history up to GitHub with the 'force' flag:

git push origin master -f

...zip an archive of your old repo, and then delete all old local copies of your repo to prevent further confusion. Do a fresh clone.
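The recovery steps above can be rehearsed end to end on a throwaway repository before touching the real one. This sketch makes several assumptions: the cleaned history is simulated rather than produced by the BFG, the file names are hypothetical, the merge uses --allow-unrelated-histories (available in newer Git versions) because the two simulated histories share no commit, and the cleaned history happens to be the second parent of the merge:

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/master

# build_history DIR yes|no: three commits, optionally with a secret file.
build_history() {
  git init -q "$1"
  git -C "$1" symbolic-ref HEAD refs/heads/master
  git -C "$1" remote add origin "$demo/remote.git"
  for i in 1 2 3; do
    echo "step $i" >> "$1/app.cfg"
    if [ "$i" = 1 ] && [ "$2" = yes ]; then
      echo "password=hunter2" > "$1/secret.txt"
    fi
    git -C "$1" add .
    git -C "$1" commit -q -m "commit $i"
  done
}

# The remote holds the cleaned history; the old working copy is dirty.
build_history cleaned no
git -C cleaned push -q origin master
build_history old yes
cd old

# Reproduce the mistake: new work, then merge both histories and push.
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "new work"
yyyy=$(git rev-parse HEAD)
git fetch -q origin
git merge -q --allow-unrelated-histories -m "Merge histories" origin/master
git push -q origin master

# Recovery. Parent 1 of the merge is the dirty tip, parent 2 the cleaned one.
xxxx=$(git rev-parse HEAD^2)
git reset -q --hard "$xxxx"
git cherry-pick "$yyyy" >/dev/null
git push -q --force origin master

total=$(git rev-list --count HEAD)
remote_tip=$(git --git-dir="$demo/remote.git" rev-parse master)
echo "commits on master after recovery: $total"
```

In a real repository, check which parent of the merge is the cleaned one (for example with git log on each parent) rather than assuming it is the second.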