{"id":48,"date":"2020-09-27T10:00:49","date_gmt":"2020-09-27T00:00:49","guid":{"rendered":"http:\/\/localhost:8000\/?p=48"},"modified":"2020-09-27T10:00:49","modified_gmt":"2020-09-27T00:00:49","slug":"java8-path-filevisit-intro","status":"publish","type":"post","link":"http:\/\/www.cheerfulprogramming.com\/?p=48","title":{"rendered":"Java8 Path Filevisit Intro"},"content":{"rendered":"<h2>Goal of this Exercise<\/h2>\n<p>In this article we will look at Java&#8217;s<br \/>\n<a href=\"https:\/\/docs.oracle.com\/javase\/8\/docs\/api\/java\/nio\/file\/SimpleFileVisitor.html\">SimpleFileVisitor<\/a><br \/>\nas a tool to solve this problem: how to consolidate multiple backups of<br \/>\ndata, photos, and videos, from multiple devices such as phones and<br \/>\nlaptops. Suppose you (or your not-so-tech-savvy friends) use multiple<br \/>\ndevices and copy files between them, and edit different versions of<br \/>\nthose files but don&#8217;t maintain a single &quot;source of truth&quot;. How do you<br \/>\nget a set of unique files without manually checking each one? You could<br \/>\nwrite a shell script, but for this example we will work in Java, and<br \/>\nwrite a small program that achieves this and shows off SimpleFileVisitor<br \/>\nand a little of two excellent Java libraries: Google<br \/>\n<a href=\"https:\/\/github.com\/google\/guava\">Guava<\/a>, and<br \/>\n<a href=\"https:\/\/picocli.info\">picocli<\/a> (i.e. &quot;pico CLI&quot;).<\/p>\n<p>We will use Guava&#8217;s file hashing method to calculate a hash of every<br \/>\nfile the program encounters as it traverses the file system, and use<br \/>\nthat as an approximation for uniqueness, assuming that any two files<br \/>\nwith the same hash are identical files. We will use picocli to make an<br \/>\nelegant command-line front-end for our code. 
The program will read in<br \/>\nfrom the command line a list of directories to traverse, and it will<br \/>\ntraverse those directories in the supplied order. For each hash code it<br \/>\nwill copy only the first file it encounters to the user-supplied output<br \/>\ndirectory, ignoring subsequent files with the same hash code (make sure<br \/>\nthis directory is empty or non-existent before executing the program).<br \/>\nOnce the program is written you will be able to call it like this (in a<br \/>\nBash shell):<\/p>\n<pre><code class=\"\" data-line=\"\">$ java -jar unique-merge-1.0.0-SNAPSHOT.jar \\\n&gt; -o output_dir \\\n&gt; first_src_dir \\\n&gt; second_src_dir \\\n&gt; third_src_dir<\/code><\/pre>\n<p>and the resulting output_dir could look like this:<\/p>\n<pre><code class=\"\" data-line=\"\">output_dir\n|-first_src_dir\n|-second_src_dir\n|-third_src_dir<\/code><\/pre>\n<h2>Dependencies<\/h2>\n<p>If you use a build tool such as Maven or Gradle you can easily download<br \/>\nthe dependencies from MVN Repository:<\/p>\n<ul>\n<li>Guava: <a href=\"https:\/\/mvnrepository.com\/artifact\/com.google.guava\/guava\">https:\/\/mvnrepository.com\/artifact\/com.google.guava\/guava<\/a><\/li>\n<li>picocli: <a href=\"https:\/\/mvnrepository.com\/artifact\/info.picocli\/picocli\">https:\/\/mvnrepository.com\/artifact\/info.picocli\/picocli<\/a><\/li>\n<\/ul>\n<p>For this example I used Gradle 6.6.1, Guava 29, and picocli 4.5.1, so my<br \/>\nGradle dependencies section looks as follows:<\/p>\n<pre><code class=\"\" data-line=\"\">dependencies {\n    compile &#039;com.google.guava:guava:29.0-jre&#039;\n    compile &#039;info.picocli:picocli:4.5.1&#039;\n}<\/code><\/pre>\n<h2>Walking the File System with SimpleFileVisitor<\/h2>\n<p>Java&#8217;s new file I\/O package,<br \/>\n<a href=\"https:\/\/docs.oracle.com\/javase\/8\/docs\/api\/java\/nio\/file\/package-frame.html\">java.nio.file<\/a>,<br \/>\nprovides new classes and interfaces for dealing with files and<br \/>\ndirectories, 
including<br \/>\n<a href=\"https:\/\/docs.oracle.com\/javase\/8\/docs\/api\/java\/nio\/file\/Paths.html\">Paths<\/a>,<br \/>\n<a href=\"https:\/\/docs.oracle.com\/javase\/8\/docs\/api\/java\/nio\/file\/Path.html\">Path<\/a>,<br \/>\n<a href=\"https:\/\/docs.oracle.com\/javase\/8\/docs\/api\/java\/nio\/file\/Files.html\">Files<\/a><br \/>\n(but not<br \/>\n<a href=\"https:\/\/docs.oracle.com\/javase\/8\/docs\/api\/java\/io\/File.html\">File<\/a> &#8211;<br \/>\nthat belongs to the old I\/O package), and<br \/>\n<a href=\"https:\/\/docs.oracle.com\/javase\/8\/docs\/api\/java\/nio\/file\/SimpleFileVisitor.html\">SimpleFileVisitor<\/a>.<br \/>\nThe SimpleFileVisitor provides a way to recurse through a directory and<br \/>\nits subdirectories without needing to implement a recursive method<br \/>\nyourself. In our example, we will walk through each directory that the<br \/>\ncaller supplies to our program, and use Guava to calculate the hash of<br \/>\neach file under that directory, including sub-directories. To use the<br \/>\nSimpleFileVisitor, extend it, and override those methods which represent<br \/>\nthe events you wish to intercept:<\/p>\n<pre><code class=\"\" data-line=\"\">public class MySimpleFileVisitor extends SimpleFileVisitor&lt;Path&gt; {\n    @Override\n    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException { ... }\n\n    @Override\n    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException { ... }\n\n    @Override\n    public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException { ... }\n\n    @Override\n    public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException { ... 
}\n}<\/code><\/pre>\n<p>Note that the first two methods above have as parameters a Path and a<br \/>\n<a href=\"https:\/\/docs.oracle.com\/javase\/8\/docs\/api\/java\/nio\/file\/attribute\/BasicFileAttributes.html\">BasicFileAttributes<\/a>,<br \/>\nwhile the last two accept a Path and an IOException, in case there was<br \/>\nan error while traversing the file system tree that you want to handle<br \/>\nbefore continuing. BasicFileAttributes exposes basic metadata such as the<br \/>\nfile&#8217;s size, timestamps, and type (owner and permissions are available<br \/>\nthrough other classes such as PosixFileAttributes), but a proper treatment<br \/>\nof that class is outside of the scope of this article. Each of the methods<br \/>\nabove returns an instance of the<br \/>\n<a href=\"https:\/\/docs.oracle.com\/javase\/8\/docs\/api\/java\/nio\/file\/FileVisitResult.html\">FileVisitResult<\/a><br \/>\nenum, which determines what happens next during file system traversal.<br \/>\nYour options are:<\/p>\n<table>\n<thead>\n<tr>\n<th>Enum<\/th>\n<th>Action<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code class=\"\" data-line=\"\">FileVisitResult.CONTINUE<\/code><\/td>\n<td>keep traversing the file system.<\/td>\n<\/tr>\n<tr>\n<td><code class=\"\" data-line=\"\">FileVisitResult.SKIP_SIBLINGS<\/code><\/td>\n<td>don&#8217;t examine any other files or directories within the parent of the current Path object.<\/td>\n<\/tr>\n<tr>\n<td><code class=\"\" data-line=\"\">FileVisitResult.SKIP_SUBTREE<\/code><\/td>\n<td>don&#8217;t examine any files or directories under the current Path object (only meaningful when returned from preVisitDirectory).<\/td>\n<\/tr>\n<tr>\n<td><code class=\"\" data-line=\"\">FileVisitResult.TERMINATE<\/code><\/td>\n<td>stop traversing the file system.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>So with these things in mind, here is an example of a SimpleFileVisitor<br \/>\nthat uses Guava to calculate the hash code of every file it traverses,<br \/>\nand stores the hash and the file&#8217;s path in a map. 
If the hash code<br \/>\nis already in the map, then the file is assumed to be a duplicate, and the<br \/>\nfile&#8217;s path is appended to the list of paths for that hash code,<br \/>\nalthough at present we will not use it. Later, as an exercise, you might<br \/>\nwant to make the program print out a list of duplicate files to the<br \/>\nconsole. Using the hash code as an approximation for uniqueness is valid<br \/>\nfor most personal use cases; whether it is suitable for industrial-scale<br \/>\ndata mining I am not qualified to comment.<\/p>\n<pre><code class=\"\" data-line=\"\">package com.cheerfulprogramming.edwhiting.uniquemerge;\n\nimport com.google.common.hash.HashCode;\nimport com.google.common.hash.Hashing;\n\nimport java.io.IOException;\nimport java.nio.file.FileVisitResult;\nimport java.nio.file.Path;\nimport java.nio.file.SimpleFileVisitor;\nimport java.nio.file.attribute.BasicFileAttributes;\nimport java.util.Deque;\nimport java.util.LinkedList;\nimport java.util.Map;\nimport java.util.Optional;\n\npublic class FileHasherVisitor extends SimpleFileVisitor&lt;Path&gt; {\n    \/* Maps each content hash to the paths of all files that have it, in visit order. *\/\n    private final Map&lt;HashCode, Deque&lt;Path&gt;&gt; fileHashes;\n\n    public FileHasherVisitor(Map&lt;HashCode, Deque&lt;Path&gt;&gt; fileHashes) {\n        this.fileHashes = fileHashes;\n    }\n\n    @Override\n    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {\n        \/* Hash the file contents with SHA-512 using Guava. *\/\n        HashCode hash = com.google.common.io.Files.asByteSource(file.toFile()).hash(Hashing.sha512());\n        \/* Reuse the existing list for this hash, or start a new one. *\/\n        Deque&lt;Path&gt; fileHash = Optional.ofNullable(fileHashes.get(hash)).orElseGet(() -&gt; new LinkedList&lt;&gt;());\n        fileHash.addLast(file);\n        fileHashes.put(hash, fileHash);\n        return FileVisitResult.CONTINUE;\n    }\n}<\/code><\/pre>\n<p>By itself, this class does nothing. We need to instantiate it and supply<br \/>\nit to a method that can traverse the file system. 
Fortunately, Java<br \/>\nprovides such a method, Files.walkFileTree(Path, FileVisitor), which<br \/>\naccepts our visitor because SimpleFileVisitor implements the FileVisitor<br \/>\ninterface, as shown here:<\/p>\n<pre><code class=\"\" data-line=\"\">final Map&lt;HashCode, Deque&lt;Path&gt;&gt; fileHashes = new HashMap&lt;&gt;();\nfinal FileHasherVisitor visitor = new FileHasherVisitor(fileHashes);\nfinal Path path = Paths.get(&quot;path\/to\/your\/directory&quot;);\ntry {\n    Files.walkFileTree(path, visitor);\n} catch (IOException e) {\n    e.printStackTrace();\n}<\/code><\/pre>\n<h2>Resolving Paths<\/h2>\n<p>Here is a more complete example, which wraps up the FileHasherVisitor<br \/>\ninvocation in a private method called walkFileTrees, resolves relative<br \/>\nfile system paths into absolute ones, copies only the first file for<br \/>\neach hash code from its source to the output directory, and exposes the<br \/>\npublic method mergePaths, whose parameters resemble the command line<br \/>\ninterface we are planning to build.<\/p>\n<pre><code class=\"\" data-line=\"\">package com.cheerfulprogramming.edwhiting.uniquemerge;\n\nimport com.google.common.hash.HashCode;\n\nimport java.io.IOException;\nimport java.nio.file.Files;\nimport java.nio.file.Path;\nimport java.nio.file.Paths;\nimport java.nio.file.StandardCopyOption;\nimport java.util.*;\nimport java.util.stream.Collectors;\n\npublic class UniqueMerger {\n\n    public void mergePaths(Path outputDir, Path... 
srcDirs) {\n        \/* Get the directory from where the user called the program.*\/\n        final Path currentDir = Paths.get(System.getProperty(&quot;user.dir&quot;));\n        \/* \n          * Resolve the output directory and source directories to absolute \n          * file system paths.\n          *\/\n        final Path absOutputDir = currentDir.resolve(outputDir);\n        final List&lt;Path&gt; absSrcDrs = Arrays.stream(srcDirs)\n                .map(srcDir -&gt; currentDir.resolve(srcDir))\n                .collect(Collectors.toList());\n        mergeAbsPaths(absOutputDir, absSrcDrs);\n    }\n\n    private void mergeAbsPaths(final Path absOutputDir, final List&lt;Path&gt; absSrcDrs) {\n        \/* Calculate hash codes for all files under the supplied source directories. *\/\n        final Map&lt;HashCode, Deque&lt;Path&gt;&gt; fileHashes = this.walkFileTrees(absSrcDrs);\n        \/* \n          * For each hash code, copy only the first file that has it, \n          * ignoring the duplicates.\n          *\/\n        fileHashes.values().stream().forEach(fileHash -&gt; {\n            final Path src = fileHash.peekFirst();\n            absSrcDrs.stream()\n                    .filter(path -&gt; src.startsWith(path))\n                    .forEach(path -&gt; {\n                        final Path dest = absOutputDir.resolve(path.getParent().relativize(src));\n                        try {\n                            Files.createDirectories(dest.getParent());\n                            Files.copy(src, dest, StandardCopyOption.COPY_ATTRIBUTES);\n                        } catch (IOException e) {\n                            e.printStackTrace();\n                        }\n                    });\n        });\n    }\n\n    private Map&lt;HashCode, Deque&lt;Path&gt;&gt; walkFileTrees(List&lt;Path&gt; absSrcDrs) {\n        final Map&lt;HashCode, Deque&lt;Path&gt;&gt; fileHashes = new HashMap&lt;&gt;();\n        final FileHasherVisitor visitor = new 
FileHasherVisitor(fileHashes);\n        \/* \n          * Look at all the files under each of the supplied\n          * source directories, and calculate their hash codes.\n          *\/\n        absSrcDrs.stream().forEach(path -&gt; {\n            try {\n                Files.walkFileTree(path, visitor);\n            } catch (IOException e) {\n                \/* Don&#039;t stop for any bad files! *\/\n                e.printStackTrace();\n            }\n        });\n        return fileHashes;\n    }\n}<\/code><\/pre>\n<p>The example above could be bundled up into a JAR with its dependencies<br \/>\nand made into a library, although it might not be very useful in that<br \/>\nform. In the next section, we will use picocli to add a command-line<br \/>\ninterface to our program so that we can call it from a shell.<\/p>\n<h2>Make it Executable: Add a Command-Line Interface<\/h2>\n<p>Picocli is an elegant and well-documented little library that uses Java<br \/>\nannotations to make writing standards-compliant command line Java<br \/>\nprograms easy.<\/p>\n<p>There are many examples on the picocli website, so we&#8217;ll only give a<br \/>\nbrief introduction here. To start, create a class that will hold your<br \/>\nmain method, and make it implement Runnable. Annotate it with picocli&#8217;s<br \/>\n@CommandLine.Command. Add non-static fields to your class to hold<br \/>\nthe values of the arguments to the application. In this example, we need<br \/>\na Path[] array to hold the names of the source directories, and a Path<br \/>\nobject to hold the output directory. Picocli takes care of converting and<br \/>\nassigning the values for us, provided we annotate these class members correctly. 
See the full<br \/>\nexample below, followed with some remarks:<\/p>\n<pre><code class=\"\" data-line=\"\">package com.cheerfulprogramming.edwhiting.uniquemerge;\n\nimport picocli.CommandLine;\n\nimport java.nio.file.Path;\n\n@CommandLine.Command(\n        name = &quot;unique-merge&quot;,\n        mixinStandardHelpOptions = true,\n        version = &quot;Unique Merge v1.0&quot;,\n        description = &quot;Recursively copies supplied directories into [outputDir] &quot; +\n                &quot;directory so that only unique files get transferred, and identical duplicates &quot; +\n                &quot;do not get transferred, even if they have different names.&quot;\n)\npublic class UniqueMergeApplication implements Runnable {\n\n    @CommandLine.Parameters(\n            index = &quot;0..*&quot;,\n            arity = &quot;1..*&quot;,\n            description = &quot;Directories from which to copy files recursively.&quot;)\n    private Path[] directories;\n\n    @CommandLine.Option(\n            names = { &quot;-o&quot;, &quot;--output-dir&quot;},\n            required = true,\n            description = &quot;Output directory where copied directories will be placed, with only unique files.&quot;)\n    private Path outputDir;\n\n    public static void main(String[] args) {\n        final int exitCode = new CommandLine(new UniqueMergeApplication()).execute(args);\n        System.exit(exitCode);\n    }\n\n    @Override\n    public void run() {\n        UniqueMerger u = new UniqueMerger();\n        u.mergePaths(this.outputDir, this.directories);\n    }\n}<\/code><\/pre>\n<p>The <code class=\"\" data-line=\"\">@CommandLine.Parameters<\/code> picocli annotation accepts many arguments<br \/>\nbut I have only listed in the example the ones needed for this<br \/>\napplication to work. The index argument tells picocli what position it<br \/>\nshould expect the relevant command line arguments to be. 
The first position<br \/>\nis zero, and the argument also accepts a range, e.g. <em>m..n<\/em> or <em>m..*<\/em> for<br \/>\nposition <em>m<\/em> and every position after it. The arity argument tells picocli how many values<br \/>\nto expect in the argument, which allows the user to pass in a list up to<br \/>\na size specified by you, or of an unlimited size if you specify an open<br \/>\nrange such as 1..*. In our example, the source directories are an array<br \/>\nargument of unlimited size, starting at the 0<sup>th<\/sup> position.<\/p>\n<p>The <code class=\"\" data-line=\"\">@CommandLine.Option<\/code> picocli annotation allows you to specify<br \/>\nswitches that modify the behaviour of your program, or let the user pass<br \/>\nin named arguments. In our example, the output directory is given as a<br \/>\nmandatory switch, since the source directories are an unlimited array.<br \/>\nAlthough we could have designed our application to treat the last<br \/>\ndirectory in the array of directories as the output directory, much like<br \/>\nthe way the Unix cp program works, this way makes handling the arguments<br \/>\nsimpler, and makes it harder for your users to do something regrettable<br \/>\naccidentally.<\/p>\n<h2>Building a JAR<\/h2>\n<p>There we have it, a working application! Now to build it. Below is the<br \/>\nGradle build file. 
Make sure you have OpenJDK11 or higher installed, and<br \/>\nGradle 6 with the Gradle wrapper, gradlew.<\/p>\n<pre><code class=\"\" data-line=\"\">plugins {\n    id &#039;java&#039;\n}\n\ngroup = &#039;com.cheerfulprogramming.edwhiting&#039;\nversion = &#039;1.0.0-SNAPSHOT&#039;\nsourceCompatibility = &#039;11&#039;\n\nrepositories {\n    mavenCentral()\n}\n\ndependencies {\n    \/\/ https:\/\/mvnrepository.com\/artifact\/com.google.guava\/guava\n    compile &#039;com.google.guava:guava:29.0-jre&#039;\n    \/\/ https:\/\/mvnrepository.com\/artifact\/info.picocli\/picocli\n    compile &#039;info.picocli:picocli:4.5.1&#039;\n    testImplementation(platform(&#039;org.junit:junit-bom:5.7.0&#039;))\n    testImplementation(&#039;org.junit.jupiter:junit-jupiter&#039;)\n}\n\ncompileJava {\n    options.compilerArgs &lt;&lt; &quot;-Xlint:unchecked&quot;\n}\n\njar {\n    manifest {\n        attributes &quot;Main-class&quot;: &quot;com.cheerfulprogramming.edwhiting.uniquemerge.UniqueMergeApplication&quot;\n    }\n    from {\n        configurations.compile.collect { it.isDirectory() ? it : zipTree(it) }\n    }\n}\n\ntest {\n    useJUnitPlatform()\n}<\/code><\/pre>\n<p>Build it with Gradle by running:<\/p>\n<pre><code class=\"\" data-line=\"\">$ .\/gradlew clean jar<\/code><\/pre>\n<p>Note that in your project directory, Gradle will put the JAR in<br \/>\nbuild\/libs. 
Execute your application like this:<\/p>\n<pre><code class=\"\" data-line=\"\">$ java -jar build\/libs\/unique-merge-1.0.0-SNAPSHOT.jar<\/code><\/pre>\n<p>The program should instruct you on how to supply the correct options and<br \/>\narguments, so you should then run it on the folders that you want to<br \/>\nmerge, like this:<\/p>\n<pre><code class=\"\" data-line=\"\">$ java -jar build\/libs\/unique-merge-1.0.0-SNAPSHOT.jar -o output_dir first_src_dir second_src_dir third_src_dir<\/code><\/pre>\n<h2>Further Reading<\/h2>\n<p>Sierra, Bates, and Robson, <em>OCP Java SE 8 Programmer II Exam Guide<\/em>,<br \/>\nOracle Press, 2018, pp. 268, 271, 277-279, 289-292<\/p>\n<p>Oracle Java API: <a href=\"https:\/\/docs.oracle.com\/javase\/8\/docs\/api\/\">https:\/\/docs.oracle.com\/javase\/8\/docs\/api\/<\/a><\/p>\n<p>Google Guava: <a href=\"https:\/\/github.com\/google\/guava\">https:\/\/github.com\/google\/guava<\/a><\/p>\n<p>picocli: <a href=\"https:\/\/picocli.info\/\">https:\/\/picocli.info\/<\/a><\/p>\n<h2>Acknowledgements<\/h2>\n<p>The author acknowledges the traditional custodians of the Daruk and the<br \/>\nEora People and pays respect to the Elders past and present.<\/p>\n<p>Oracle\u00ae and Java are registered trademarks of Oracle and\/or its<br \/>\naffiliates.<\/p>\n<p>Google and Guava are registered trademarks of Alphabet and\/or its<br \/>\naffiliates.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Goal of this Exercise In this article we will look at Java&#8217;s SimpleFileVisitor as a tool to solve this problem: how to consolidate multiple backups of data, photos, and videos, from multiple devices such as phones and laptops. 
Suppose you (or your not-so-tech-savvy friends) use multiple devices and copy files between them, and edit different [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":49,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[11],"class_list":["post-48","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-java","tag-java"],"_links":{"self":[{"href":"http:\/\/www.cheerfulprogramming.com\/index.php?rest_route=\/wp\/v2\/posts\/48","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.cheerfulprogramming.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.cheerfulprogramming.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.cheerfulprogramming.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.cheerfulprogramming.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=48"}],"version-history":[{"count":0,"href":"http:\/\/www.cheerfulprogramming.com\/index.php?rest_route=\/wp\/v2\/posts\/48\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/www.cheerfulprogramming.com\/index.php?rest_route=\/wp\/v2\/media\/49"}],"wp:attachment":[{"href":"http:\/\/www.cheerfulprogramming.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=48"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.cheerfulprogramming.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=48"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.cheerfulprogramming.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=48"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}