This is a bit tricky because multiple ways of doing it are documented. This is the way that eventually worked for me.

The top-level SConstruct is as normal for an out-of-source build; it reads:

SConscript('src/SConscript', variant_dir='build')

You need a header so that your program can refer to the version number. In C++ this is as follows, in src/version.hh:

extern char const* const version_string;

You can define the version that you want to update in a file named version in the root of the repository. It should contain nothing but the version number, perhaps followed by a newline.

0.0.1

Now the src/SConscript file should look like this:

env = Environment()

# The version file is located in the file called 'version' in the very root
# of the repository.
VERSION_FILE_PATH = '#version'

# Note: You absolutely need to have the #include below, or you're going to get
# an 'undefined reference' message due to the use of const.  (it's the second
# const in the type declaration that causes this.)
#
# Both the user of the version and this template itself need to include the
# extern declaration first.

def version_action(target, source, env):
    source_path = source[0].path
    target_path = target[0].path

    # read version from plaintext file
    with open(source_path, 'r') as f:
        version = f.read().rstrip()

    version_c_text = """
    #include "version.hh"

    const char* const version_string = "%s";
    """ % version

    with open(target_path, 'w') as f:
        f.write(version_c_text)

    return 0

env.Command(
    target='version.cc',
    source=VERSION_FILE_PATH,
    action=version_action
)

main_binary = env.Program(
    'main', source=['main.cc', 'version.cc']
)

The basic strategy here is to designate the version file as the source file for version.cc, while hardcoding the template for the actual C++ definition inside the SConscript itself. Note that the include within the template is crucial: a const-qualified global has internal linkage in C++ by default, so without the extern declaration in scope the definition isn't visible to other translation units, and you get the 'undefined reference' error mentioned in the comment.
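
For completeness, a minimal src/main.cc that consumes the generated version might look like this (a sketch; any file that includes the header can use the string):

// src/main.cc -- a minimal consumer of the generated version string
#include <iostream>

#include "version.hh"

int main() {
    std::cout << "version " << version_string << "\n";
    return 0;
}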

Posted 2018-06-08

This is really tricky. There are several hurdles you face.

First hurdle: importing the Leaflet CSS files from your node_modules folder and incorporating this into your Webpack build.

The canonical form for this is as follows:

@import url("~leaflet/dist/leaflet.css");

The tilde is a documented but obscure shortcut for a vendored module found under node_modules. There's no way to avoid hardcoding the path dist/leaflet.css.

Once you've done this, you'll have a non-broken map view, but you still won't be able to see the marker images: the CSS attempts to load them but fails. Then you'll try to apply file-loader, but due to an issue similar to one reported against React, you'll find that file-loader and url-loader generate broken paths with strange hash symbols in them.
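
"Applying file-loader" here means a rule along these lines in webpack.config.js (a sketch, assuming file-loader is installed; adjust to taste):

// webpack.config.js (fragment) -- hypothetical rule for the marker PNGs
module.exports = {
  // ...
  module: {
    rules: [
      { test: /\.png$/, use: ['file-loader'] }
    ]
  }
};

Even with such a rule in place, the URLs that the Leaflet CSS requests come out broken, which is where the next fix comes in.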

Luckily, there's a fix for this! You'll find this solution in the relevant GitHub thread, from user PThomir:

import L from 'leaflet';

L.Icon.Default.imagePath = '.';
// OR
delete L.Icon.Default.prototype._getIconUrl;

L.Icon.Default.mergeOptions({
  iconRetinaUrl: require('leaflet/dist/images/marker-icon-2x.png'),
  iconUrl: require('leaflet/dist/images/marker-icon.png'),
  shadowUrl: require('leaflet/dist/images/marker-shadow.png'),
});

This is now getting very close. However, you'll try to adapt this to use import instead of require, because TypeScript doesn't know about require.

You'll get errors like this:

Cannot find module 'leaflet/dist/images/marker-icon-2x.png'

But you'll look for the file and it'll clearly be there. Puzzling, until you realize you've missed a key point: Webpack's require and TypeScript's import are completely different animals. More specifically, only Webpack's require knows about Webpack's loaders. So when you try to import the PNG like this,

import iconRetinaUrl from 'leaflet/dist/images/marker-icon-2x.png';

This is actually intercepted by the TypeScript compiler and causes a compile error. We need to find some way to use Webpack's require from TypeScript. Luckily this isn't too difficult: you need to create a type signature for this call, as follows.

// This is required to use Webpack loaders, cf https://stackoverflow.com/a/36151803/257169

declare function require(path: string): any;

Put this somewhere in your module search path as webpack-require.d.ts. Remember that you don't explicitly import a .d.ts file. So now just use require in your entry.ts file as before.

My eventual snippet looked as follows:

const leaflet = require('leaflet');

delete leaflet.Icon.Default.prototype._getIconUrl;

const iconRetinaUrl = require('leaflet/dist/images/marker-icon-2x.png');
const iconUrl = require('leaflet/dist/images/marker-icon.png');
const shadowUrl = require('leaflet/dist/images/marker-shadow.png');

leaflet.Icon.Default.mergeOptions({ iconRetinaUrl, iconUrl, shadowUrl })

But remember, none of this will work without that .d.ts file, otherwise tsc is just going to wonder what the hell you mean by require.

Posted 2018-05-22

The basic question is: how do we read an entire graph from a Neo4j store into a NetworkX graph? A second question: how do we extract subgraphs via Cypher and recreate them in NetworkX, to potentially save memory?

Using a naive query to read all relationships

This is based on the cypher-ipython module, which uses a simple query like the following to obtain all the data:

MATCH (n) OPTIONAL MATCH (n)-[r]->() RETURN n, r

This can be read into a graph using the following code. Note that the rows may duplicate both relationships and nodes, but this is harmless: nodes and edges are keyed on their Neo4j IDs, so adding them again is idempotent.

def rs2graph(rs):
    graph = networkx.MultiDiGraph()

    for record in rs:
        node = record['n']
        if node:
            print("adding node")
            nx_properties = {}
            nx_properties.update(node.properties)
            nx_properties['labels'] = node.labels
            graph.add_node(node.id, **nx_properties)

        relationship = record['r']
        if relationship is not None:   # essential because relationships use hash val
            print("adding edge")
            graph.add_edge(
                relationship.start, relationship.end, key=relationship.type,
                **relationship.properties
            )

    return graph
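
For context, here's a sketch of how you might obtain rs and call this, assuming the 1.x Python neo4j-driver, whose node.id / node.properties accessors the code above relies on:

# sketch: run the naive query and build the NetworkX graph
from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    rs = session.run("MATCH (n) OPTIONAL MATCH (n)-[r]->() RETURN n, r")
    graph = rs2graph(rs)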

There's something rather inelegant about this query: the result set is essentially 'denormalized'.

Using aggregation functions

Luckily there's another, more SQL-ish way to do it, which is to COLLECT the relationships of each node into an array. Each returned row then holds a distinct node together with the complete set of relationships for that node, similar to the ARRAY_AGG() and GROUP BY combination in PostgreSQL. This seems much cleaner to me.
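
A query of roughly this shape produces that result set (a sketch, not necessarily the exact query; COLLECT skips the NULLs that OPTIONAL MATCH produces for isolated nodes):

MATCH (n2)
OPTIONAL MATCH (n2)-[r]->()
RETURN n2, COLLECT(r) AS rels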

# this version expects a collection of rels in the variable 'rels'
# But, this version doesn't handle dangling references
def rs2graph_v2(rs):
    graph = networkx.MultiDiGraph()

    for record in rs:
        node = record['n2']
        if not node:
            raise Exception('every row should have a node')

        print("adding node")
        nx_properties = {}
        nx_properties.update(node.properties)
        nx_properties['labels'] = list(node.labels)
        graph.add_node(node.id, **nx_properties)

        relationship_list = record['rels']

        for relationship in relationship_list:
            print("adding edge")
            graph.add_edge(
                relationship.start, relationship.end, key=relationship.type,
                **relationship.properties
            )

    return graph

Trying to extend to handle subgraphs

When we have a relationship type that defines subtrees, :PRECEDES in this case, we can attempt to materialize, in memory, the subgraph selected from a given root. In the query below, the Token node with content nonesuch is taken as the root.

This version can be used with a Cypher query like the following:

MATCH (a:Token {content: "nonesuch"})-[:PRECEDES*]->(t:Token)
WITH COLLECT(a) + COLLECT(DISTINCT t) AS nodes_
UNWIND nodes_ AS n
OPTIONAL MATCH p = (n)-[r]-()
WITH n AS n2, COLLECT(DISTINCT RELATIONSHIPS(p)) AS nestedrel
RETURN n2, REDUCE(output = [], rel in nestedrel | output + rel) AS rels

And the Python code to read the result of this query is as follows:

# This version has to materialize the entire node set up front in order
# to check for dangling references.  This may induce memory problems in large
# result sets
def rs2graph_v3(rs):
    graph = networkx.MultiDiGraph()

    materialized_result_set = list(rs)
    node_id_set = set([
        record['n2'].id for record in materialized_result_set
    ])

    for record in materialized_result_set:
        node = record['n2']
        if not node:
            raise Exception('every row should have a node')

        print("adding node")
        nx_properties = {}
        nx_properties.update(node.properties)
        nx_properties['labels'] = list(node.labels)
        graph.add_node(node.id, **nx_properties)

        relationship_list = record['rels']

        for relationship in relationship_list:
            print("adding edge")

            # Bear in mind that when we ask for all relationships on a node,
            # we may find a node that PRECEDES the current node -- i.e. a node
            # whose relationship starts outside the current subgraph returned
            # by this query.
            if relationship.start in node_id_set:
                graph.add_edge(
                    relationship.start, relationship.end, key=relationship.type,
                    **relationship.properties
                )
            else:
                print("ignoring dangling relationship [no need to worry]")

    return graph

Posted 2018-05-09

This is something of a pain in the arse; there are several points to remember. These points apply to version 8.5.8+dfsg-5, from Ubuntu universe.

Note that the Debian-derived gitlab package has been REMOVED from Ubuntu bionic.

Install the packages from universe

Work around bug 1574349

You'll come across a bug: https://bugs.launchpad.net/ubuntu/+source/gitlab/+bug/1574349

The tell-tale sign of this bug is a message about a gem named devise-two-factor. As far as I can tell, there's no way to work around this and stay within the package system.

You have to work around this, but first:

Install bundler build dependencies

apt install cmake libmysqlclient-dev automake autoconf autogen libicu-dev pkg-config

Run bundler

Yes, you're going to have to install gems outside of the package system.

# cd /usr/share/gitlab
# bundler

And yes, this is a bad situation.

Unmask all gitlab services

[Masking] one or more units... [links] these unit files to /dev/null, making it impossible to start them.

For some reason the apt installation process installs all the gitlab services masked, so you'll need to unmask them:

systemctl unmask gitlab-unicorn.service
systemctl unmask gitlab-workhorse.service
systemctl unmask gitlab-sidekiq.service
systemctl unmask gitlab-mailroom.service

Interactive authentication required

You're going to face this error, too. You need to create an override so that gitlab gets started as the correct user. You can do that with systemctl edit gitlab, which will create a local override.

Insert this in the text buffer:

[Service]
User=gitlab

Save and quit; now you need to reload and restart:

systemctl daemon-reload
systemctl start gitlab

Purging debconf answers

Since gitlab is an interactively configured package, stale information can sometimes get stored in the debconf database, which will hinder you. To clear this out and reset the answers, do the following:

debconf-show gitlab
echo PURGE | debconf-communicate gitlab

This is the first time I've had to learn about this in a good 10 years of using and developing Debian-derived distributions. That's how successful an abstraction debconf is.

Update 2018-05-04: Also pin ruby-mail package to artful version 2.6.4+dfsg1-1

Posted 2018-03-09

[Originally written 2017-09-22. I don't have time to finish this post now, so I might as well just publish it while it's still not rotted.]

While coding large backend applications in Clojure I noticed a pattern that kept popping up.

When learning FP, you initially learn the basics: your function should not rely on outside state. It should not mutate it, nor observe it, unless that state is explicitly passed in as an argument to the function. This rule generally covers mutable resources in the same namespace, e.g. an atom, although constant values are still allowed. Any atom that you want to access must be passed in to the function.

Now, this makes total sense at first, and it allows us to easily implement the pattern described in Gary Bernhardt's talk "Boundaries": "Functional Core, Imperative Shell" (FCIS). This means that we do all I/O at the boundaries.

(defn sweep [db-spec]
  (let [all-users (get-all-users db-spec)
        expired-users (get-expired-users all-users)]
    (doseq [user expired-users]
      (send-billing-problem-email! user))))

This is a translation of Gary's example. A few notes on this implementation.

  1. sweep as a whole is considered part of the imperative shell.
  2. get-all-users and send-billing-problem-email! are what we'll loosely refer to as "boundary functions".
  3. get-expired-users is the "functional core".

The difference that Gary stresses is that the get-expired-users function contains all the decisions and no dependencies. That is, all the conditionals are in the get-expired-users function. That function purely operates on a data in, data out basis: it knows nothing about I/O.

This is a small-scale paradigm shift for most hackers, who are used to interspersing their conditionals with output; consider your typical older-school bespoke PHP system, bursting with DB queries whose results are spliced directly into pages. But this works very well for this simple example. It accomplishes the goal of making everything testable pretty well, and you'd be surprised how far this method can take you.

It formalizes as this: whenever you have a function that intersperses I/O with logic, separate out the logic and the I/O, and apply them separately. This is usually harder for output than for input, but it's usually possible to construct some kind of data representation of the output operation that should be effected -- what I'll call an "output command" -- and pipe that data to a "dumb" driver that just executes the command.
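
To make that concrete, here's a sketch; expired?, send-email!, and the shape of the command map are all invented for illustration:

;; functional core: makes the decisions, emits only data
(defn billing-commands [users]
  (for [user (filter expired? users)]
    {:op :send-email, :template :billing-problem, :to (:email user)}))

;; dumb driver: executes one output command, makes no decisions
(defn run-command! [{:keys [op template to]}]
  (case op
    :send-email (send-email! template to)))

;; imperative shell: wires the two together
(defn sweep [db-spec]
  (doseq [command (billing-commands (get-all-users db-spec))]
    (run-command! command)))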

You can reconstruct most procedures in this way. The majority of problems, particularly in a backend REST system, break down to "do some input operation", "run some logic", "do some output operation". Here I'm referring to the database as the source and target of IO. This is the 3-tier architecture described by Fowler in PoEAA.

However, you probably noticed an inefficiency in the code above: we get all users and then decide within the language runtime whether a given user is expired or not. We've given up the database's ability to answer this question for us. Now we're reading the entire set of users into memory, and mapping them to objects, before we make any decision about whether they're expired.

Realistically, this isn't likely to be a problem, depending on the number of users. Obviously Gmail is going to have a problem with this approach. But surely you're fine until perhaps 10,000 users, assuming that your mapping code is relatively efficient.

Anyway, this isn't the problem that led me to discover this. The problem happened when I was implementing the basics of the REST API and, attempting to be as RESTfully correct as possible, I wanted to use linking. This seems easy when you only need to produce internal links, right? In JSON, we chose a certain representation (the Stormpath representation).

GET /users/1

{
   "name": "Dave",
   "pet": {"href": <url>},
   "age": 31
}

Now, assume we also have a separate resource for a user's pet. In REST, that's represented by the URL /pets/1 for a pet with identifier 1. We have the ability to indicate this pet through either relative or absolute URLs. Assume that our base URL for the API is https://cool-pet-tracker.solasistim.net/api.

  • The relative URL is /pets/1.
  • The absolute URL is https://cool-pet-tracker.solasistim.net/api/pets/1.

If you search around a bit, you'll find that, from what little consensus exists, returned REST URLs are always required to be absolute. This pretty much makes sense, given that a link represents a concrete resource that is available at a certain point in time, in the sense of "Cool URLs Don't Change".

Now the problem becomes: say we have a function that attempts to implement the /users/:n API. We'll write this specifically NOT in the FCIS style, so we'll entangle the I/O. (The syntax is specific to Rook.)

 (defn show [id ^:injection db-spec]
   (let [result (get-user db-spec {:user (parse-int-strict id)})]
     {:name (:name result)
      :pet nil
      :age (:age result)}))

You'll notice that I left out the formation of the link. Let's add the link.

 (defn show [id request ^:injection db-spec]
   (let [result (get-user db-spec {:user (parse-int-strict id)})]
     {:name (:name result)
      :pet (make-rest-link request "/pet" (:pet_id result))
      :age (:age result)}))

Now, we define make-rest-link naively as something like this.

 (defn make-rest-link [request endpoint id]
    (format "%s/%s/%s" (get-in request [:headers "host"])
                       endpoint
                       id))

Yeah, there are some edge cases missed here, but that's the gist of it. The point is that we use whatever Host was requested to form the linked result. [This has some issues with reverse proxy servers that sometimes call for a more complicated solution, but that's outside the scope of this document.]

Now did you notice the issue? We had to add the request to the signature of the function. That's a fairly small deal in this case: the use of the request is a key part of the function's purpose, and it makes sense for the function to have knowledge of it. But just imagine that we were dealing with a deeply nested hierarchy.

(defn form-branch []
  {:something-else 44})

(defn form-tree []
  {:something-else 43
   :branch (form-branch)})

(defn form-hole [id]
   {:something 42
    :tree (form-tree)})

(defn show [id ^:injection db-spec]
  (form-hole id))

As you can see, this is a nested structure: a hole has a tree and that tree itself has a branch. That's fine so far, but we don't really want to go any deeper than 3 layers. Now, the branch gets a "limb" (this is a synonym for "bough", a large branch). But we only want to put a link to it.

(defn form-limb [request]
  (make-rest-link request "/limbs" 1))

(defn form-branch [request]
  {:something-else 44
   :limb (form-limb request)})

(defn form-tree [request]
  {:something-else 43
   :branch (form-branch request)})

(defn form-hole [id request]
   {:something 42
    :tree (form-tree request)})

(defn show [id request ^:injection db-spec]
  (form-hole id request))

Now we have a refactoring nightmare. All of the intermediate functions, that mirror the structure of the entity, had to be updated to know about the request. Even though they themselves did not examine the request at all. This isn't bad just because of the manual work involved: it's bad because it clouds the intent of the function.

Now anyone worth their salt will be thinking of ways to improve this. We could cleverly invert control and represent links as functions.

(defn form-branch []
   {:something-else 44
    :limb #(make-rest-link % "/limbs" 1)})

Then, though, we need to run over the entire structure before coercing it to REST, treating any function values specially. This could be accomplished using clojure.walk, and it would probably work OK.
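
A sketch of that approach (realize-links is a made-up helper): postwalk replaces every function value in the structure with the result of applying it to the request.

(require '[clojure.walk :as walk])

(defn realize-links [representation request]
  (walk/postwalk
   (fn [x] (if (fn? x) (x request) x))
   representation))

;; e.g. (realize-links (form-branch) request)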

What's actually being required here? A function deep in the call stack needs context that's only available at the outside of the stack, yet that information is really only peripheral to its purpose. As you can see, we were able to form an adequate representation of the link as a function, which by no means obscures its purpose from the reader. If anything the purpose is clearer.

This problem can also pop up in other circumstances that seem less egregious. In general, any circumstance where you need to use I/O for a small part of the result at a deep level in the stack will result in a refactoring cascade as all intervening functions end up with added parameters. There are several ways to ameliorate this.

1. The "class" method

This method bundles up the context with the functionality as a record. The context then becomes referable from any function implementing the protocol.

(defprotocol HoleShower
  (show [this id] "Create a JSON-able representation of the given hole.")
  (form-tree [this id]))

(defrecord SQLHoleShower [request db-spec]
  HoleShower
  (show [this id]
    {:something 42
     :tree (form-tree this id)})
  (form-tree [this id]
    {:something-else 44
     :branch (make-rest-link request "/branches" 1)}))

As you can see, we don't need to explicitly pass request because every instance of an SQLHoleShower automatically has access to the request that was used to construct it. However, it has the very large downside that these functions then become untestable outside of the context of an SQLHoleShower. They're defined, but not that useful.

2. The maker method

maker is a library by Tamas Jung that implements a kind of implicit dependency-resolution algorithm. Presumably it does a topological sort, equivalent to the system logic in Component.

(ns clojure-playground.maker-demo
  (:require [maker.core :as maker]))

(def stop-fns (atom (list)))

(def stop-fn
  (partial swap! stop-fns conj))

(maker/defgoal config []
  (stop-fn #(println "stop the config"))
  "the config")

;; has the more basic 'config' as a dependency
;; You can see that 'defgoal' actually transparently manufactures the dependencies.
;; After calling (make db-conn), (count @stop-fns) = 2:
;; that means that both db-conn AND its dependency config were constructed.

(maker/defgoal db-conn [config]
  (stop-fn #(println "stop the db-conn"))
  (str "the db-conn"))

;; This will fail at runtime with 'Unknown goal', until we also defgoal `foo`
(maker/defgoal my-other-goal [foo]
  (str "somthing else"))

The macro defgoal defines a kind of second-class 'goal' which is known about only by the maker machinery. When a single item anywhere in the graph is "made" using the make function, the library knows how to resolve all the intermediaries. It's kind of isomorphic to the approach taken by Claro, although it relies on more magical macrology.

https://www.niwi.nz/2016/03/05/fetching-and-aggregating-remote-data-with-urania/
https://github.com/kachayev/muse
https://github.com/facebook/Haxl
https://www.youtube.com/watch?v=VVpmMfT8aYw

See this: Retaking Rules for developers: https://www.youtube.com/watch?v=Z6oVuYmRgkk&feature=youtu.be&t=9m54s

And of course, the "Out of the Tar Pit" paper.

Posted 2018-02-27

The most interesting thing about this piece is the butter/saffron/milk mixture, pictured below. This turns out delicately spiced and exotic, with a kind of floral note from the cardamom. I don't know how you're supposed to eat it: for me, I may have failed immediately by not using sufficient rice.

I don't really buy the pastry rim, as pictured here. It's supposed to create a tighter seal on the dish while it's in the oven, but it seems like a bit of a waste; the visual presentation is wonderful, though.

I found it interesting to read, in the history of the biryani, that beef biryani is a favourite in Kerala. This would be a nice next try.

I deviated from the recipe by leaving out the cauliflower; this was unintentional. I'd say the overall result is dominated by the chana dal. It has that kind of 'grainy' taste associated with a dal.

I can't quite agree that this is 'perfect': it's too subtle for me; the flavours don't punch through enough. I think an addition of deep-fried onions, another of Cloake's additions that I couldn't include, would improve it. Ironically this is kind of the opposite of Cloake's chilli, which I made previously and which was, if anything, too pungent.

Actually, I'd say the flavour issue is the sour balance: it's just a touch too sour in a way that isn't mitigated by the other flavours. You have to remember that yoghurt is sour, and I think I overdid it. The recipe calls for 200ml of yoghurt to ~700g of main ingredient, but I probably used about 500ml for 500g, and then added lime juice on top of that. Cloake actually anticipates that the yoghurt will be insufficient and recommends diluting it. So the lesson here is to aim for roughly a 1:4 yoghurt to main ingredient ratio.

This means you'd normally want to buy smaller portions of yoghurt, around 250ml, and thin them with water or milk when you want to marinate. And perhaps leave the addition of other souring agents until later, when you're already using yoghurt in a curry.

Posted 2018-02-25

Deploying desktop applications on a Mac, for us Linux guys, can be strange. I didn't even know what the final deliverable file was.

Well, that file is the DMG, or Apple Disk Image.

To create the DMG, you first need to create an intermediate result, the App Bundle.

I assume you're using SCons as your build system. If you're not, which admittedly is quite likely, then go and read another post.

To create the App Bundle you can use an old SCons tool which doesn't seem to have a real home at the moment. Rehabilitating it would be a good project.

In the meantime, I've created a Gist that contains the code for that. Download it and put it in the same directory as your SConstruct.

To use it, bear in mind that it's going to overwrite your environment quite heavily, so I suggest using a totally fresh environment for it.

Your final SConstruct is going to look something like this:

# SConstruct

import os
from osxbundle import TOOL_BUNDLE

def configure_qt():
    qt5_dir = os.environ.get('QT5_DIR', "/usr")

    env = Environment(
        tools=['default', 'qt5'],
        QT5DIR=qt5_dir
    )
    env['QT5_DEBUG'] = 1
    maybe_pkg_config_path = os.environ.get('PKG_CONFIG_PATH')
    if maybe_pkg_config_path:
        env['ENV']['PKG_CONFIG_PATH'] = maybe_pkg_config_path

    env.Append(CCFLAGS=['-fPIC', '-std=c++11'])
    env.EnableQt5Modules(['QtCore', 'QtWidgets', 'QtNetwork'])

    return env

# A Qt5 env is used to build the program...
env = configure_qt()
env.Program('application', source=['application.cc'])

# ... but a different env is needed in order to bundle it.
bundle_env = Environment()
TOOL_BUNDLE(bundle_env)

bundledir = "the_bundle.app"
app = "application"   # The output object?
key = "foobar"
info_plist = "info_plist.xml"
typecode = 'APPL'

bundle_env.MakeBundle(bundledir, app, key, info_plist, typecode=typecode)

As an explanation of these arguments: bundledir is the output directory, which must always end in .app. app is the name of your executable program (the result of compiling the C++ main function). key is unclear; some other context suggests that it's used for a Java-style reversed-domain organization identifier, such as net.solasistim.myapplication.

You can also provide icon_file (a path) and resources (a list) which are then folded into the /Contents/Resources path inside the .app.

Once you've got your .app, you need to create a DMG file, like this:

$ macdeployqt the_bundle.app -dmg

You should now find the_bundle.dmg floating in your current directory. Nice.

Posted 2018-02-19

How simple can a spinner be, and still be devoid of hacks? Let's see:

Markup in a Vue component, using v-if to hide and show:

<svg v-if="inProgressCount > 0" height="3em" width="3em" class="spinner">
  <circle cx="50%"
          cy="50%"
          r="1em"
          stroke="black"
          stroke-width="0.1em"
          fill="#001f3f" />
</svg>
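
Here inProgressCount is assumed to be component state counting in-flight requests, maintained with something like the following (a sketch; the fetch call stands in for whatever async work you do):

// sketch: bump the counter around every async operation
export default {
  data() {
    return { inProgressCount: 0 };
  },
  methods: {
    async load() {
      this.inProgressCount += 1;
      try {
        await fetch('/api/data');
      } finally {
        this.inProgressCount -= 1;
      }
    }
  }
};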

The CSS to create the animation, and "pin" it so that it's always visible:

svg.spinner {
    position: fixed;
    left: 0px;
    top: 0px;
}

svg.spinner circle {
    animation: pulse 1s infinite;
}

@keyframes pulse {
    0% {
        fill: #001f3f;
    }

    50% {
        fill: #ff4136;
    }

    100% {
        fill: #001f3f;
    }
}

The only thing that puzzled me here is why it's necessary to list duplicate fill values for 0% and 100%. The reason: keyframes interpolate between consecutive stops, so the final frame must match the first one, or the colour would jump when the loop restarts.

Posted 2018-02-05

To record my FFXII builds pre Act 8.

Vaan - Time Battlemage / Monk

Penetrator Crossbow, Lead Shot, Giant's Helmet, Carabineer Mail, Hermes Sandals

Balthier - White Mage / Machinist

Spica, Celebrant's Miter, Cleric's Robes, Sage's Ring

Fran - Archer / Uhlan

Yoichi Bow, Parallel Arrows, Giant's Helmet, Carabineer Mail, Sash

Basch - Knight / Foebreaker

Save the Queen, Dragon Helm, Dragon Mail, Power Armlet

Ashe - Red Battlemage / Bushi

Ame-no-Murakumo, Celebrant's Miter, Cleric's Robes, Nishijin Belt

Penelo - Black Mage / Shikari

Platinum Dagger, Aegis Shield, Celebrant's Miter, Cleric's Robes, Sash

Posted 2018-02-02

Sometimes you may need to extract content from a Word document. To do so, you need to be aware of its structure. Extremely simplified, a Word document has the following structure:

  1. At the top level is a list of "parts".
  2. One part is the "main document part", m.
  3. The part m contains some w:p elements, represented in Docx4j as org.docx4j.wml.P objects. Semantically this represents a paragraph.
  4. Each paragraph consists of "runs" of text. These are w:r elements. I think the purpose of these is to allow groups within a paragraph to have individual styling, roughly like span in HTML.
  5. Each run contains w:t elements, or org.docx4j.wml.Text. This contains the meat of the text.
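
In raw WordprocessingML, that nesting looks roughly like this:

<w:p>
  <w:r>
    <w:t>The meat of the text</w:t>
  </w:r>
</w:p>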

Here's how you define a traversal against a Docx file:

public class TraversalCallback extends TraversalUtil.CallbackImpl {
    @Override
    public List<Object> apply(Object o) {
        if (o instanceof org.docx4j.wml.Text) {
            org.docx4j.wml.Text textNode = (org.docx4j.wml.Text) o;

            String textContent = textNode.getValue();

            // assumes a logger field, e.g. via Lombok's @Slf4j
            log.debug("Found a string: " + textContent);

            // ... do something with textContent here ...
        } else if (o instanceof org.docx4j.wml.Drawing) {
            log.warn("FOUND A DRAWING");
        }
        return null;
    }

    @Override
    public boolean shouldTraverse(Object o) {
        return true;
    }
}

Note that we inherit from TraversalUtil.CallbackImpl. This allows us to avoid implementing the walkJAXBElements() method ourselves -- although you still might need to, if your algorithm can't be defined within the scope of the apply method. The return value of apply is ignored by the superclass implementation of walkJAXBElements, so you can just return null.

To bootstrap it from a file, you just do the following:

URL theURL = Resources.getResource("classified/lusty.docx");

WordprocessingMLPackage opcPackage = WordprocessingMLPackage.load(theURL.openStream());
MainDocumentPart mainDocumentPart = opcPackage.getMainDocumentPart();

TraversalCallback callback = new TraversalCallback();
callback.walkJAXBElements(mainDocumentPart);

By modifying the apply method, you can special-case each type of possible element from Docx4j: paragraphs, rows, etc.

Posted 2018-01-04

This blog is powered by coffee and ikiwiki.