CVS2SVN

Intro

Here I'll describe how we converted our ScummVM CVS repository to SVN. It may be helpful for other big projects facing the same challenge.

Of course, SF.net has a nice "import CVS repository" feature, but it just runs an automatic cvs2svn script, and we wanted more:

Restore connections between moved files
That includes merging the scummvm-old and scummvm modules. The scummvm module was created when we performed a major restructuring of the project directory tree
Keep a subtree with tags and branches for each of our subprojects

Overall idea

Generally, the connection between moved files in SVN can be restored as shown in the following chunk of a real ScummVM repository dump:

 Revision-number: 89
 ...
 Node-path: trunk/scummvm-old/wince/missing/dirent.h
 Node-kind: file
 Node-action: add
 +Node-copyfrom-rev: 53
 +Node-copyfrom-path: trunk/scummvm-old/PocketSCUMM/missing/dirent.h
 ...
 Revision-number: 93
 Node-path: trunk/scummvm-old/PocketSCUMM/PocketSCUMM.vco
 Node-action: delete
 Node-path: trunk/scummvm-old/PocketSCUMM/missing/dirent.h
 Node-action: delete

Note the lines marked with '+'. Those are the lines that were added, and they restore the SVN history link. 53 is the last revision in which the dirent.h file was altered.

The process

To accomplish the task I wrote several simple scripts. It had to be scripted rather than done manually because restoring the connections is a fairly long process, and I wanted to minimize the time the repository was frozen. So I first modified a dump file locally and later reapplied my changes to a fresh dump. The scripts are intentionally kept simple and perform no error checking.

The whole dump takes over 1.4GB, so there is no way to edit the file directly. Hence I came up with the simple idea of extracting just the Revision-number and Node-* lines, editing those, and merging the changes back into the original dump later.

automation-pass1.pl

 wget http://.../scummvm-cvsroot.tar.bz2
 rm -rf scummvm scummvm.cvs
 tar xjf scummvm-cvsroot.tar.bz2
 mv scummvm scummvm.cvs
 mkdir scummvm
 
 # We need to combine scummvm-old and scummvm modules
 mv scummvm.cvs/scummvm scummvm.cvs/scummvm-old scummvm
 
 # Due to a bug in our CVS repository, branch-0-5-0 is both a tag and a branch,
 # so we have to force it to be a branch here
 cvs2svn --dump-only --force-branch=branch-0-5-0 --dumpfile scummvm.dump scummvm
 grep -E "^Node-|^Revision-number" scummvm.dump > scummvm.dump.nodes.in
 
 cvs2svn --dump-only --dumpfile tools.dump scummvm.cvs/tools
 grep -E "^Node-|^Revision-number" tools.dump > tools.dump.nodes.in
 
 # No attempt was made to restore links in the following modules
 cvs2svn --dump-only --dumpfile web.dump scummvm.cvs/web
 cvs2svn --dump-only --dumpfile scummex.dump scummvm.cvs/scummex
 cvs2svn --dump-only --dumpfile residual.dump scummvm.cvs/residual
 cvs2svn --dump-only --dumpfile docs.dump scummvm.cvs/docs
 cvs2svn --dump-only --dumpfile engine-data.dump scummvm.cvs/engine-data
 
 # I did all the work offsite, so I had to transfer these files over a slow dial-up line
 bzip2 -9 scummvm.dump.nodes.in tools.dump.nodes.in

This stage took about 1.5 hours; the bottleneck is the disk subsystem. The overall size of the produced data is over 1.4GB.

Manual editing

Next I opened the dump.nodes.in files in XEmacs and started adding the links. First, I searched for the word 'delete' and studied each case. I had to consult the file layout and the CVS log messages to see whether the files had simply been removed or had actually been moved or renamed. Because some files really were renamed, and because file names clash between directories, the task could not be fully automated; a big chunk of it could be scripted, however.
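The search for 'delete' can be turned into a review list. The following is only a sketch, not one of the scripts actually used: the function name list_deletes is made up for illustration, and it assumes the .nodes.in format produced by the grep in pass1 (Revision-number, Node-path and Node-action lines in dump order).

```shell
# Print "revision<TAB>path" for every deleted node in a .nodes.in file.
# Hypothetical helper; reads the named files, or stdin if none are given.
list_deletes() {
  awk '
    /^Revision-number: /    { rev = $2 }               # remember the current revision
    /^Node-path: /          { path = substr($0, 12) }  # text after "Node-path: "
    /^Node-action: delete$/ { print rev "\t" path }
  ' "$@"
}
```

Running list_deletes scummvm.dump.nodes.in gives one line per deletion, and each of those is a candidate for a move or rename that still has to be checked against the CVS log by hand.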

So I filled in each Node-copyfrom-path: by hand and left Node-copyfrom-rev blank. Then I recorded a simple macro in XEmacs, as that was quicker than writing yet another script. The macro went something like this:

Start on the Node-copyfrom-path line
Put the path into the yank buffer
Kill other windows
Split the window (so there are two views of the buffer at the same place)
Switch to the other window
Search backwards for the contents of the yank buffer

With this I saw the revision number in the other window. I double-checked that it was the correct place and put that revision number into the Node-copyfrom-rev field. I never saw any inconsistencies, though, so I guess the numbers could have been inserted fully automatically.

The resulting .nodes file has the inserted lines marked with a leading +, as in the example at the top of this page.
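The revision lookup done by the macro could probably be scripted. The following is a minimal sketch and not what was actually used. The function name fill_copyfrom_rev is hypothetical, and it assumes the edited file contains only the hand-written +Node-copyfrom-path: lines, with no blank +Node-copyfrom-rev: placeholders. It takes the last earlier revision in which the source path was added or changed, which mirrors the backwards search.

```shell
# Emit a "+Node-copyfrom-rev:" line before each hand-written
# "+Node-copyfrom-path:" line. Hypothetical helper, assumptions as above.
fill_copyfrom_rev() {
  awk '
    /^Revision-number: / { rev = $2 }
    /^Node-path: /       { path = substr($0, 12) }
    # Only adds and changes count; a deleted path no longer exists in that revision
    /^Node-action: /     { if ($2 != "delete") last[path] = rev }
    /^\+Node-copyfrom-path: / {
      src = substr($0, 22)                   # text after "+Node-copyfrom-path: "
      if (src in last) print "+Node-copyfrom-rev: " last[src]
      else print "no earlier revision for " src > "/dev/stderr"
    }
    { print }
  ' "$@"
}
```

For example, fill_copyfrom_rev scummvm.dump.nodes.in >scummvm.dump.nodes would produce the file that pass2 expects. Any path reported on stderr, and any source that was changed in the same revision as the copy, would still need a manual look.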

automation-pass2.pl

This one is simple. It merges the inserted lines back into the dump and modifies all internal paths so that each module ends up in a separate directory in the SVN repository:

 perl merge-dump.pl scummvm.dump.nodes scummvm <scummvm.dump >scummvm.dump.new
 perl merge-dump.pl tools.dump.nodes tools <tools.dump >tools.dump.new
 
 for i in web scummex residual docs engine-data
 do
   perl prepare-dump.pl $i <$i.dump >$i.dump.new
 done

merge-dump.pl

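 use strict;
 use warnings;
 
 # Usage: perl merge-dump.pl EDITED.nodes MODULE <in.dump >out.dump
 my $logfile = shift;
 my $module = shift;
 
 open(my $log, '<', $logfile) or die "Can't open file $logfile: $!";
 
 # Prefix paths with the module directory, keeping an optional leading slash
 sub add_prefix {
   my ($line) = @_;
   $line =~ s/^Node-path: /Node-path: $module\//;
   $line =~ s#^Node-copyfrom-path: (/?)#Node-copyfrom-path: $1$module/#;
   return $line;
 }
 
 my $logline = <$log>;
 
 while (my $lineorig = <>) {
   print add_prefix($lineorig);
 
   # When the dump line matches the current .nodes line, advance in the
   # .nodes file and emit any '+' lines that were inserted after it
   if (defined $logline and $lineorig eq $logline) {
     $logline = <$log>;
     while (defined $logline and $logline =~ /^\+(.*)/) {
       # Inserted paths are written without the module prefix, like the
       # rest of the .nodes file, so they get the same prefix
       print add_prefix("$1\n");
       $logline = <$log>;
     }
   }
 }
 
 close $log;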

prepare-dump.pl

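 use strict;
 use warnings;
 
 my $module = shift;
 
 while (my $line = <>) {
   # Prefix paths with the module directory, keeping an optional leading slash
   $line =~ s/^Node-path: /Node-path: $module\//;
   $line =~ s#^Node-copyfrom-path: (/?)#Node-copyfrom-path: $1$module/#;
 
   print $line;
 }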

The second regexp here is tricky: Node-copyfrom-path can contain either /trunk/scummvm/blah or trunk/scummvm/blah, and the leading slash has to be kept when it is present.
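To see the two cases side by side, the same substitutions can be tried on single lines from the shell. This is only an illustrative sketch: prefix_module is a made-up name, and it assumes a sed that supports -E and a module name that contains neither # nor &.

```shell
# Apply the two path substitutions from prepare-dump.pl to stdin.
# Hypothetical helper; $1 is the module directory name.
prefix_module() {
  sed -E \
    -e "s#^Node-path: #Node-path: $1/#" \
    -e "s#^Node-copyfrom-path: (/?)#Node-copyfrom-path: \\1$1/#"
}
```

Feeding it "Node-copyfrom-path: /trunk/x" with the module tools yields "Node-copyfrom-path: /tools/trunk/x", while "Node-copyfrom-path: trunk/x" yields "Node-copyfrom-path: tools/trunk/x". The optional group captures the slash if there is one, and the replacement puts it back in front of the module name.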

After this stage the amount of data on disk doubles, since we have both the merged and the non-merged dumps. I kept the non-merged dumps so that pass2 could be redone without repeating the lengthy pass1.

automation-pass3.sh

At this straightforward stage I create a local SVN repository:

 rm -rf svn
 svnadmin create svn
 svnadmin load svn < init.dump
 
 for i in scummvm tools web scummex residual docs engine-data
 do
   svnadmin load svn < $i.dump.new
 done

It takes another hour, and then I can dump the resulting repository with

 svnadmin dump svn >scummvmrepo.dump

The init.dump file contains the skeleton of our new repository layout

init.dump

 SVN-fs-dump-format-version: 2
 
 Revision-number: 1
 Prop-content-length: 112
 Content-length: 112
 
 K 8
 svn:date
 V 27
 2001-10-09T14:30:12.000000Z
 K 7
 svn:log
 V 38
 New repository initialized by cvs2svn.
 PROPS-END
 
 Node-path: scummvm
 Node-kind: dir
 Node-action: add
 
 
 Node-path: tools
 Node-kind: dir
 Node-action: add
 
 
 Node-path: web
 Node-kind: dir
 Node-action: add
 
 
 Node-path: docs
 Node-kind: dir
 Node-action: add
 
 
 Node-path: scummex
 Node-kind: dir
 Node-action: add
 
 
 Node-path: engine-data
 Node-kind: dir
 Node-action: add
 
 
 Node-path: residual
 Node-kind: dir
 Node-action: add

Final step

Finally, the repository was dumped, compressed with bzip2 and uploaded to SF.net. SF.net now lets projects import an existing dump themselves, but at the time we did it, we had to submit a PR.