I have developed a component in Java that requires an HTML parser. The component goes through around 2000 webpages and extracts some data from each.

It was quite easy to implement using org.htmlparser (http://htmlparser.sourceforge.net/). Some of the webpages are quite big (up to a few hundred MB), but even so the memory used by the component seemed to grow inexplicably, ending in a Java heap out-of-memory error. I spent a good deal of time trying to figure out the source of the leak, thinking it was in my code. After a few attempts to identify the problem, I took a heap dump with the command below and opened it in the IBM Support Assistant workbench:

jmap -dump:format=b,file=heap.bin processID

I was able to identify a lot of Finalizer objects referencing classes in org.htmlparser.lexer. Could this be a memory leak, where the garbage collector can’t collect the objects properly?

Well, the fact of the matter is that I hadn’t spent an enormous amount of time reading the documentation or the source code of the project. It turns out there is a close() method that can be called on the lexer’s Page reference, and I hadn’t been calling it. So, at the end of my parsing method, I added:

parser.getLexer().getPage().close();
parser.setInputHTML("");

The first statement closes the Page object. I added the second statement just to be on the safe side, even though it’s probably redundant.
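One thing I would change: if the parsing code throws, the two lines above never run and the page stays open. Below is a minimal sketch of wrapping the cleanup in try/finally. The class PageCleanup and the method parseAndClose are names I made up for illustration, and a plain java.io.Closeable stands in for the real Page so that the example runs without the library on the classpath.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

class PageCleanup {

    /**
     * Runs the parsing work and then closes the page, whether or not
     * the work threw. Both names are hypothetical; "page" stands in for
     * anything that exposes a close() method.
     */
    public static <T> T parseAndClose(Supplier<T> work, Closeable page) {
        try {
            return work.get();
        } finally {
            try {
                page.close();
            } catch (IOException e) {
                // Rethrow unchecked so callers are not forced to handle it.
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        boolean[] closed = {false};
        // Fake page that only records that close() was called.
        Closeable fakePage = () -> closed[0] = true;

        String data = parseAndClose(() -> "parsed data", fakePage);
        System.out.println(data + ", page closed: " + closed[0]);
    }
}
```

With the real parser, the second argument would presumably be a lambda that calls parser.getLexer().getPage().close(), but I have not tested it in that form.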

And the “Memory Leak” seems to have vanished…
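A cheap way to check that claim without taking another heap dump is to log the used heap after each page and watch whether it keeps climbing. This is only a sketch: HeapWatch, usedHeapMb and the page count in the loop are made-up names and values, and the number printed will wobble with garbage collection, so only the long-run trend means anything.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

class HeapWatch {

    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    /** Returns the heap currently in use, in whole megabytes. */
    public static long usedHeapMb() {
        return MEMORY.getHeapMemoryUsage().getUsed() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        int pages = 5; // hypothetical; the real component handles about 2000
        for (int i = 1; i <= pages; i++) {
            // The real per-page parsing would be called here.
            System.out.println("after page " + i + ": " + usedHeapMb() + " MB used");
        }
    }
}
```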

2 thoughts on “org.htmlparser ‘Memory Leak’?”

  1. It collects some reports in HTML format over the internet, uses org.htmlparser to parse some data from the reports, and stores the collected data in a DB. But I am afraid any more details are confidential 😀


Page last modified: 05:52 on November 6, 2013 (UTC+2)