Sometimes you may need to extract content from a Word document. To do that, you need to be aware of its structure. Extremely simplified, a Word document has the following structure:
- At the top level is a list of "parts".
- One part is the "main document part", m.
- The part m contains some w:p elements, represented in Docx4j as org.docx4j.wml.P objects. Semantically, each one represents a paragraph.
- Each paragraph consists of "runs" of text. These are w:r elements. I think that the purpose of these is to allow groups within paragraphs to have individual stylings, roughly like span in HTML.
- Each run contains w:t elements, or org.docx4j.wml.Text. These contain the meat of the text.
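To make the nesting above concrete, here is a minimal sketch that reads a hand-written WordprocessingML fragment with the JDK's built-in DOM parser instead of Docx4j. The class name WordStructureSketch and the SAMPLE fragment are made up for illustration, and a real document.xml carries far more markup than this.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch only: uses the JDK's DOM parser on a tiny fragment, not Docx4j.
final class WordStructureSketch {
    // The WordprocessingML namespace that the "w:" prefix is bound to.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample: one paragraph split into two runs, then a second paragraph.
    static final String SAMPLE =
        "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
        + "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        + "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        + "</w:body></w:document>";

    // Returns one string per w:p, built by joining the w:t text of all its runs.
    static List<String> paragraphTexts(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> result = new ArrayList<>();
            NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element paragraph = (Element) paragraphs.item(i);
                StringBuilder text = new StringBuilder();
                NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
                for (int j = 0; j < texts.getLength(); j++) {
                    text.append(texts.item(j).getTextContent());
                }
                result.add(text.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        paragraphTexts(SAMPLE).forEach(System.out::println);
    }
}
```

The first paragraph's text is spread over two runs, which is why extracting a paragraph means concatenating every w:t beneath it rather than reading a single node.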
Here's how you define a traversal against a Docx file:
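import java.util.ArrayList;
import java.util.List;
import org.docx4j.TraversalUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TraversalCallback extends TraversalUtil.CallbackImpl {
    // Assumption: SLF4J is on the classpath; any logger with debug() and warn() works.
    private static final Logger log = LoggerFactory.getLogger(TraversalCallback.class);
    // Text of every w:t element visited so far, in document order.
    private final List<String> texts = new ArrayList<>();

    @Override
    public List<Object> apply(Object o) {
        if (o instanceof org.docx4j.wml.Text) {
            org.docx4j.wml.Text textNode = (org.docx4j.wml.Text) o;
            String textContent = textNode.getValue();
            log.debug("Found a string: " + textContent);
            // Collect the text; swap in your own handling, such as appending to an output DOM.
            texts.add(textContent);
        } else if (o instanceof org.docx4j.wml.Drawing) {
            log.warn("FOUND A DRAWING");
        }
        return null;
    }

    @Override
    public boolean shouldTraverse(Object o) {
        return true; // descend into every element
    }

    public List<String> getTexts() { return texts; }
}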
Note that we inherit from TraversalUtil.CallbackImpl. This lets us avoid implementing the walkJAXBElements() method ourselves, although you still might need to if your algorithm can't be expressed within the scope of the apply method.
It seems like the return value of apply is actually ignored by the superclass implementation of walkJAXBElements, so you can just return null.
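The interplay between apply, shouldTraverse and the walk can be sketched in plain Java. This is a simplified stand-in written for illustration, not Docx4j's actual source: Node, Callback, walk and visitedNames are all hypothetical names, and the real implementation also unwraps JAXB elements and looks up children differently.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the callback pattern; this is not Docx4j's source code.
final class WalkSketch {
    // Hypothetical tree node standing in for a JAXB object with children.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            this.children.addAll(List.of(kids));
        }
    }

    abstract static class Callback {
        abstract List<Object> apply(Node node);

        boolean shouldTraverse(Node node) {
            return true;
        }

        // Depth-first walk: apply() is called for its side effects only,
        // and its return value is discarded.
        void walk(Node parent) {
            for (Node child : parent.children) {
                apply(child);
                if (shouldTraverse(child)) {
                    walk(child);
                }
            }
        }
    }

    // Visits every descendant of root, but does not descend into nodes named "skip".
    static List<String> visitedNames(Node root) {
        List<String> seen = new ArrayList<>();
        new Callback() {
            @Override
            List<Object> apply(Node node) {
                seen.add(node.name);
                return null; // ignored by walk()
            }

            @Override
            boolean shouldTraverse(Node node) {
                return !node.name.equals("skip");
            }
        }.walk(root);
        return seen;
    }
}
```

In this sketch the starting node itself is never passed to apply, only its descendants, and returning false from shouldTraverse prunes a whole subtree while still visiting the node that triggered the decision.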
To bootstrap it from a file, you just do the following:
URL theURL = Resources.getResource("classified/lusty.docx");
WordprocessingMLPackage opcPackage = WordprocessingMLPackage.load(theURL.openStream());
MainDocumentPart mainDocumentPart = opcPackage.getMainDocumentPart();
TraversalCallback callback = new TraversalCallback();
callback.walkJAXBElements(mainDocumentPart);
By modifying the apply method, you can special-case each type of element that Docx4j exposes: paragraphs, table rows, etc.