org.htmlparser “Memory Leak”? « KYRIAKOS ANASTASAKIS – ΚΥΡΙΑΚΟΣ ΑΝΑΣΤΑΣΑΚΗΣ

I have developed a component in Java that requires an HTML parser. The component goes through around 2000 webpages and gets some data.

It was quite easy to implement using org.htmlparser (http://htmlparser.sourceforge.net/). However, even allowing for the fact that some of the webpages are quite big (up to a few hundred MB), the memory used by the component seemed to grow inexplicably, eventually leading to a Java heap out-of-memory error. I spent a good deal of time trying to figure out the source of the leak, thinking it was in my code. After a few attempts to identify the problem, I used the IBM Support Assistant workbench and took a heap dump using the command:

jmap -dump:format=b,file=heap.bin processID

I was able to identify a lot of Finalizer objects referencing classes in org.htmlparser.lexer. Could this be a memory leak, where the garbage collector can’t collect the objects properly?

Well, the fact of the matter is that I hadn’t spent much time reading the documentation or the source code of the project. It turns out there is a close() method that can be called on the Page reference of the lexer, and I hadn’t been using it. So, at the end of the method that does the parsing, I added:

parser.getLexer().getPage().close();
parser.setInputHTML("");

The first statement closes the Page object. I added the second statement just to be on the safe side, even though it’s probably redundant.

And the “Memory Leak” seems to have vanished…
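One caveat with putting the close() call at the end of the method is that it gets skipped if the parsing throws an exception part-way through. Below is a minimal sketch of the same cleanup wrapped in try/finally. To keep it runnable without the library on the classpath, it uses hypothetical stand-ins: PageResource, ParseStep and parseThenClose are names made up for this illustration and are not part of org.htmlparser. In the real component, the close step would be the parser.getLexer().getPage().close() call shown above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CloseAfterParse {

    /** Hypothetical stand-in for anything that must be closed after parsing (e.g. the lexer's Page). */
    interface PageResource {
        void close() throws IOException;
    }

    /** Hypothetical stand-in for the parsing work done on a single web page. */
    interface ParseStep<T> {
        T run() throws Exception;
    }

    /**
     * Runs the parse step and closes the page in a finally block,
     * so the cleanup also happens when the parse step throws.
     */
    static <T> T parseThenClose(PageResource page, ParseStep<T> step) throws Exception {
        try {
            return step.run();
        } finally {
            // In the real component: parser.getLexer().getPage().close();
            page.close();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> log = new ArrayList<>();

        // Normal case: the result is returned and the page is closed afterwards.
        String result = parseThenClose(() -> log.add("closed"), () -> "data");
        System.out.println(result + " " + log); // prints "data [closed]"

        // Failure case: the page is still closed even though parsing failed.
        try {
            parseThenClose(() -> log.add("closed after failure"), () -> {
                throw new IllegalStateException("parse failed");
            });
        } catch (IllegalStateException e) {
            System.out.println(log.size()); // prints "2"
        }
    }
}
```

With around 2000 pages processed in one run, a single page that fails to parse would otherwise leave its Page unclosed, so the finally block seems a cheap safeguard.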

2 thoughts on “org.htmlparser “Memory Leak”?”

harry said on June 29, 2011 at 02:11:

So what does your tool do? Sounds quite interesting.

Kyriakos Anastasakis said on June 29, 2011 at 21:16:

It collects some reports in HTML format over the internet, uses org.htmlparser to extract some data from the reports, and stores the collected data in a DB. But I am afraid any further details are confidential 😀