
Yesterday, a sysadmin noticed that a server running Tomcat for us was filling up with log files, and I started digging around in the Tomcat docs to figure out how to set up log file rotation. I eventually gave up and decided to just delete old log files after a week, since this is a development server we’re talking about. Anyway, that’s not the story. The story is that, while skimming the Tomcat documentation, I spied the section on the Crawler Session Manager Valve. For years, we’ve had random out-of-memory errors when someone tries to run a big job on our application, and we’ve always traced the cause to the server also being hit by a web crawler at the same time. Crawlers can spawn many sessions during their crawl of a site. Amazingly, the Tomcat developers have figured out a clever solution to this dilemma: the valve gives a bot a single session, and even if its IP address changes, the deal is still one session per bot. It’s a great idea. And the Valve is just a single line that gets dropped into your server.xml file, most likely in the Engine block, though it can go elsewhere if you want. Or, you know, two lines if you want to explain it with a comment, like so:

    <!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers -->
    <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve" />

I don’t really understand why I had to blunder across this amazing thing in a skim of the documentation. People should be shouting this from the rooftops! You don’t have to wait for some random web crawler to kick over your website. You don’t.
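
If you want to convince yourself it is actually doing something, here is a rough check I would try (the host, context path, and bot User-Agent below are placeholders, and it assumes your app creates a session on that page): hit the app twice with a crawler-style User-Agent and watch the Set-Cookie headers. Without the valve, each cookieless crawler request tends to get a brand-new JSESSIONID; with it, the second request should get folded into the first one’s session.

# hypothetical host and path; the User-Agent just needs to match the valve's default bot pattern
UA="Googlebot/2.1 (+http://www.google.com/bot.html)"
curl -s -D - -o /dev/null -A "$UA" http://localhost:8080/myapp/ | grep -i '^Set-Cookie'
curl -s -D - -o /dev/null -A "$UA" http://localhost:8080/myapp/ | grep -i '^Set-Cookie'
# with the valve in place, the second request should not come back with a fresh JSESSIONID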

Server performance has been on my mind lately, specifically Solr server performance. We use Solr as the search and browse infrastructure for MOspace, our DSpace-based institutional repository. I’ve heard that search performance on MOspace has been slow, so I wanted to get some data on that and then find ways to improve it. I discovered that our Solr log files record how long the server takes to respond to each query, which is great; I just needed a way to analyze that data. I’ve been meaning to play with Miller, a handy little text-parsing tool with a bit of data analysis added in. It’s a lovely tool, but I couldn’t figure out the syntax to do what I wanted. Normally, when I run into this kind of issue with open source software, I’ll reach out on a related mailing list and ask for help. But Miller is new; it doesn’t have a community or a mailing list yet. So I sent an e-mail to John Kerl, the developer of Miller. Astoundingly, John responded right away, with a little snippet that turned into this:

for f in solr.log*; do echo "Processing $f..."; grep search $f | sed 's/.*} //' | \
mlr --repifs --ifs space stats1 -a min,mode,mean,stddev,max -f QTime; echo; echo; done

Which outputs something like this:

Processing solr.log...
Processing solr.log.2015-08-26...
Processing solr.log.2015-08-27...
Processing solr.log.2015-08-28...
Processing solr.log.2015-08-29...
Processing solr.log.2015-08-30...
Processing solr.log.2015-08-31...
[min/mode/mean/stddev/max QTime stats followed for each log file]

Which is kinda cool. But those max response times!
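
A single max can be dominated by one pathological request, so before reading too much into it I tried a variation of John’s loop (my tweak, not his) that adds a few percentiles; mlr’s stats1 verb accepts percentile accumulators like p50, p95, and p99:

for f in solr.log*; do
  echo "Processing $f..."
  # same extraction as above, but report percentiles alongside min and max
  grep search "$f" | sed 's/.*} //' | \
    mlr --repifs --ifs space stats1 -a min,p50,p95,p99,max -f QTime
  echo
done

If p95 stays low while the max is huge, the slowness is coming from a handful of outlier queries rather than from search in general.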

I googled around a bit and found this StackOverflow thread, which in turn led me to this wiki page and this wiki page.

That last page mentions these settings for tuning the GC to maximize Solr performance:

-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:+AggressiveOpts

This is info I can use. So, I tried those options in our Tomcat server’s setenv.sh script:


# suggestion from http://stackoverflow.com/questions/5839359/java-lang-outofmemoryerror-gc-overhead-limit-exceeded
# and here https://plumbr.eu/outofmemoryerror/gc-overhead-limit-exceeded
# though, really, we ought to figure out where the memory leak is at some point...

# This is interpreted as a hint to the garbage collector that pause times
# of <nnn> milliseconds or less are desired. The garbage collector will
# adjust the  Java heap size and other garbage collection related parameters
# in an attempt to keep garbage collection pauses shorter than <nnn> milliseconds.
# http://java.sun.com/docs/hotspot/gc5.0/ergo5.html
export CATALINA_OPTS="$CATALINA_OPTS -XX:MaxGCPauseMillis=250"

# A hint to the virtual machine that it's desirable that not more than
# 1 / (1 + GCTimeRatio) of the application execution time be spent in
# the garbage collector.
# http://themindstorms.wordpress.com/2009/01/21/advanced-jvm-tuning-for-low-pause/

# more tweaks for Solr performance optimization, from
# http://wiki.apache.org/solr/ShawnHeisey
export CATALINA_OPTS="$CATALINA_OPTS -XX:+PerfDisableSharedMem"
export CATALINA_OPTS="$CATALINA_OPTS -XX:+ParallelRefProcEnabled"
export CATALINA_OPTS="$CATALINA_OPTS -XX:G1HeapRegionSize=8m"
export CATALINA_OPTS="$CATALINA_OPTS -XX:InitiatingHeapOccupancyPercent=75"
export CATALINA_OPTS="$CATALINA_OPTS -XX:+AggressiveOpts"

And… so far… so good. Watching our system profiler, I can see we are having far fewer big GC events and many more small ones.
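
For a JVM-level view of the same thing, here is a sketch (it assumes a JDK with jstat on the PATH, and the pgrep pattern is just one way to find the Tomcat process):

# sample GC counters from the running Tomcat every 5 seconds:
# YGC/YGCT are young-generation collections and time, FGC/FGCT are full collections and time
TOMCAT_PID=$(pgrep -f org.apache.catalina.startup.Bootstrap | head -n 1)
jstat -gcutil "$TOMCAT_PID" 5000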

Today, I’ve been playing a bit more with Miller, specifically grouping and sorting, so I can get a more fine-grained picture of query response with the tuning noted above in place.

for f in solr.log; do echo "Processing $f..."; grep search $f | sed 's/.*} //' | \
mlr --repifs --ifs space stats1 -a min,mode,mean,stddev,max -f QTime -g hits then sort -nr hits; echo; echo; done

Which gets me lots of info like this (grouping by hits, reverse-sorting by hits):

[QTime stats for each hits group]

I can then see that certain queries might be taking an unusually long time (a mode of 184 against a minimum of 6). I will see if I can find out what that query is and why it’s taking so long.
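
One rough way to chase that down, using the same extraction as above (the QTime value in the final grep is hypothetical): list the largest QTime values Miller sees, then go back to the raw log to pull the full request line, params and all, for one of them.

# list the ten slowest responses in a log (mlr's sort and head verbs)
grep search solr.log | sed 's/.*} //' | \
mlr --repifs --ifs space sort -nr QTime then head -n 10

# then grep the raw log for one of those times to see the whole query
# (the value here is hypothetical; substitute one reported above)
grep 'QTime=4215' solr.log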

Oh, I should note that John Kerl, the developer of Miller, is right now asking for suggestions for his FAQ page. Miller is awesome; if you have a text parsing or data analysis task that you think Miller might be just the tool for, post your challenge to that issue. Someone might help you answer your question.

10/28/2015 UPDATE: it was brought to my attention that the Lucene folks have posted a rather amusing warning against using the G1GC garbage collector, and that the Elasticsearch folks have done the same (though that is likely just an echo of the Lucene warning). Now, I don’t know what to make of this, but I do know that search performance has measurably improved since we switched to G1GC. For now, I think I will choose to quietly worry.