Traversing Word documents with docx4java

Sometimes you may need to extract content from a word document. You will need to be aware of the structure. Extremely simplified, a Word document has the following structure:

At the top level is a list of "parts".
One part is the "main document part", m.
The part m contains some w:p elements, represented in Docx4j as org.docx4j.wml.P objects. Semantically this represents a paragraph.
Each paragraph consists of "runs" of text. These are w:r elements. I think that the purpose of these is to allows groups within paragraphs to have individual stylings, roughly like span in HTML.
Each run contains w:t elements, or org.docx4j.wml.Text. This contains the meat of the text.

Here's how you define a traversal against a Docx file:

public class TraversalCallback extends TraversalUtil.CallbackImpl {
    @Override
    public List<Object> apply(Object o) {
        if (o instanceof org.docx4j.wml.Text) {
            org.docx4j.wml.Text textNode = (org.docx4j.wml.Text) o;

            String textContent = textNode.getValue();

            log.debug("Found a string: " + textContent);

            root.appendChild(element);
        } else if (o instanceof org.docx4j.wml.Drawing) {
            log.warn("FOUND A DRAWING");
        }
        return null;
    }

    @Override
    public boolean shouldTraverse(Object o) {
        return true;
    }
}

Note that we inherit from TraversalUtil.CallbackImpl. This allows us to avoid implementing walkJAXBElements() method ourselves -- although you still might need to, if your algorithm can't be defined in the scope of the apply method. It seems like the return value of apply is actually ignored by the superclass implementation of walkJAXBElements, so you can just return NULL.

To bootstrap it from a file, you just do the following:

URL theURL = Resources.getResource("classified/lusty.docx");

WordprocessingMLPackage opcPackage = WordprocessingMLPackage.load(theURL.openStream());
MainDocumentPart mainDocumentPart = opcPackage.getMainDocumentPart();

TraversalCallback callback = new TraversalCallback();
callback.walkJAXBElements(mainDocumentPart);

By modifying the apply method, you can special-case each type of possible element from Docx4j: paragraphs, rows, etc.