I have developed a component in Java that requires an HTML parser. The component goes through around 2000 webpages and extracts some data from each.
It was quite easy to implement using HTML Parser (org.htmlparser, http://htmlparser.sourceforge.net/). But even allowing for the fact that some of the webpages are quite big (up to a few hundred MB), the memory use of the component grew inexplicably, eventually ending in a java.lang.OutOfMemoryError: Java heap space. I spent a good deal of time trying to find the source of the leak, assuming it was in my own code. After a few failed attempts to identify the problem, I used the IBM Support Assistant workbench and took a heap dump using the command:
jmap -dump:format=b,file=heap.bin processID
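(As an aside, and in hindsight: jmap can also print a live-object class histogram directly, which would have surfaced the same suspects without loading a full dump into a workbench. Standard JDK jmap option:

jmap -histo:live processID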
Analyzing the dump, I was able to identify a lot of java.lang.ref.Finalizer objects referencing instances from the org.htmlparser.lexer package. This looked like a memory leak: objects the garbage collector couldn't reclaim.
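The Finalizer entries are the telltale sign here: any object whose class overrides finalize() is tracked by a java.lang.ref.Finalizer, and the garbage collector cannot reclaim it until the JVM's single finalizer thread has run its finalize() method, which takes at least two collection cycles. If such objects are created faster than the finalizer thread drains its queue, the heap fills up even though nothing is "leaked" in the classic sense. A self-contained sketch of the effect (purely illustrative, unrelated to org.htmlparser's internals):

// Allocates finalizable 1 MB objects faster than the finalizer thread
// can process them; eventually throws OutOfMemoryError: Java heap space.
public class FinalizerBacklog {

    private final byte[] payload = new byte[1024 * 1024];

    @Override
    protected void finalize() throws Throwable {
        try {
            Thread.sleep(10); // a slow finalize() throttles reclamation
        } finally {
            super.finalize();
        }
    }

    public static void main(String[] args) {
        while (true) {
            new FinalizerBacklog(); // instantly garbage, but not collectable yet
        }
    }
}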
Well... the fact of the matter is that I hadn't spent an enormous amount of time reading the documentation and/or source code of the project. It turns out there is a close() method that can be called on the Page reference held by the lexer, and I hadn't been calling it. So, at the end of my method that does the parsing, I added:
parser.getLexer().getPage().close();
parser.setInputHTML("");
The first statement closes the Page object. I added the second statement just to be on the safe side, even though it’s probably redundant.
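For context, here is roughly what the parsing method looks like with the cleanup in place. This is a minimal sketch, assuming org.htmlparser's Parser API; the URL handling and the actual extraction are placeholders. The cleanup sits in a finally block so it runs even when parsing throws:

import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class PageScraper {

    // Parse one page and return its nodes; stand-in for the real extraction.
    public NodeList parsePage(String url) throws ParserException {
        Parser parser = new Parser(url);
        try {
            return parser.parse(null); // null filter: keep every node
        } finally {
            try {
                parser.getLexer().getPage().close(); // release the underlying stream
                parser.setInputHTML("");             // drop the buffered page text
            } catch (Exception e) {
                // best-effort cleanup; nothing sensible to do if it fails
            }
        }
    }
}

Closing the page explicitly means reclamation no longer depends on the finalizer thread keeping pace, which is exactly what was going wrong before.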
And the “Memory Leak” seems to have vanished…