Thursday, July 14, 2011

Working through Oracle WebCenter Analytics 10.3.0.1 VISITID Issues

First, a little back story...
For the past 6 years we've been running a standard intranet/extranet environment, but in May we launched a public site (https://www.prematurityprevention.org) running WCI, and one of the requirements was that we track which documents users are downloading.  As it turned out, being on 10.3.0 prevented us from retrieving a list of who downloaded what, so I did some research and realized this was a bug that could only be solved by upgrading to 10.3.0.1.

No sweat!
A little bit of testing in dev, a relatively quick upgrade in prod, and the who/what document metrics came through for the currently recorded data! Yay!

Which led me to...
Each month I run a series of reports from the out-of-the-box Analytics Console.  I toss the numbers into a spreadsheet, run some elementary formulas and display some charts which track things like our top 100 searches, popularity of collab, and stuff like that.  When July rolled around and I ran the June reports I noticed that the data wasn't quite complete - in fact barely anything came through!  Obviously my first impression was: here we go again, another bug (on a side note, dear Oracle, please hire me to do your testing; I seem to be a bug magnet, just ask Support!).  Now before anyone thinks I'm bashing the big guy, let me be clear: this is a really big bug, not some obscure little thing that only a goofy kid from NJ who drinks too much coffee (me) would find.

The bug involves a null value being inserted into each entry in the ASFACT tables for the VISITID field (Critical fix 373256, resolved by patch 10364127) which results in a lack of results being displayed in the Analytics Console.
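If you want to see how bad the damage is, a quick count of null VISITIDs per fact table does the trick. Below is a minimal Python sketch, assuming any DB-API connection to the Analytics database. The table name ASFACT_EXAMPLE is made up for the demo (substitute your real ASFACT table names), and the demo runs against an in-memory SQLite database rather than our actual SQL Server instance.

```python
import re
import sqlite3


def count_null_visitids(connection, table_names):
    """Return {table_name: number of rows where VISITID IS NULL}."""
    counts = {}
    cursor = connection.cursor()
    for table in table_names:
        # Table names cannot be bound as parameters, so validate them first.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
            raise ValueError(f"unexpected table name: {table!r}")
        cursor.execute(f"SELECT COUNT(*) FROM {table} WHERE VISITID IS NULL")
        counts[table] = cursor.fetchone()[0]
    return counts


# Demo data only: ASFACT_EXAMPLE is a hypothetical stand-in for a real fact table.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE ASFACT_EXAMPLE (ID INTEGER, VISITID INTEGER)")
demo.executemany(
    "INSERT INTO ASFACT_EXAMPLE VALUES (?, ?)",
    [(1, 10), (2, None), (3, None)],
)
print(count_null_visitids(demo, ["ASFACT_EXAMPLE"]))  # prints {'ASFACT_EXAMPLE': 2}
```

The same SELECT COUNT(*) ... WHERE VISITID IS NULL query can be pasted straight into a SQL query window if you'd rather skip Python.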

Whew that was a lot...
OK! So I waded through the 10.3.0.1 patches and noticed they all claim to be critical and that each must be installed on all instances of ptcollector and ptanalytics.  That turns out to be slightly inaccurate, as some of the patches only apply to specific environments (Oracle DB customers, etc).
For our vanilla SQL/Windows environment it turned out that the key patch was the one I already mentioned, 10364127, which fixed 7 different bugs.  Patches 8764675 and 9095551 also looked worthwhile, so I applied them as well.

7/14 Update
Going through all of this patching forced me to open lots of log files, analyze PTSPY and generally investigate settings that are normally ignored.  While going through all of this I noticed that our collector log files were generating well over 100 MB of data a day because 2 of the entries in the \ptcollector\10.3.0.1\settings\config\collector-log4j.properties file were set to DEBUG! I swapped them down to WARN and that instantly fixed the issue; logs are now generating less than 10 MB per day.
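If you'd rather not eyeball the properties file, here is a rough Python sketch that lists every active entry whose level is DEBUG. The logger names in the sample are invented for illustration; the real keys in collector-log4j.properties will differ, and the parsing only handles simple key=value lines.

```python
def find_debug_loggers(properties_text):
    """Return (property_name, value) pairs whose log level is DEBUG."""
    hits = []
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        # The level is the first comma-separated token, e.g. "DEBUG, file".
        level = value.split(",")[0].strip().upper()
        if level == "DEBUG":
            hits.append((name, value))
    return hits


# Hypothetical sample; these logger names are NOT from the real file.
SAMPLE = """\
# log4j.logger.example.commented=DEBUG
log4j.rootLogger=WARN, file
log4j.logger.example.collector=DEBUG
log4j.logger.example.cluster=debug, file
"""

for name, value in find_debug_loggers(SAMPLE):
    print(f"{name} is set to {value}")
```

To check a real file, read it with open(path).read() and pass the text in.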

I added a server CPU alert monitor a few months ago to our various boxes and I've been noticing the CPU spiking fairly often, but much more so recently.  In looking through the logs I made a correlation between a JVM timeout and these spikes.  The log error looks something like this:
 Exception in thread "PMB Message Processor" java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object size: 65552, Num elements: 65536
 Exception in thread "Thread-10" java.lang.OutOfMemoryError: nativeGetNewTLA
 Server daemon died!
 JVM appears hung: Timed out waiting for signal from JVM.
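To get a feel for how often this happens, something like the following Python sketch can tally the tell-tale lines in a collector log. It only does plain substring matching on the messages shown above; the function and counter names are my own invention, not anything that ships with the product.

```python
from collections import Counter


def summarize_memory_errors(log_text):
    """Count out-of-memory, daemon-death, and hung-JVM lines in a log."""
    counts = Counter()
    for line in log_text.splitlines():
        if "java.lang.OutOfMemoryError" in line:
            counts["out_of_memory"] += 1
        if "Server daemon died" in line:
            counts["daemon_died"] += 1
        if "JVM appears hung" in line:
            counts["jvm_hung"] += 1
    return dict(counts)


# The four lines from the log excerpt above.
SAMPLE_LOG = """\
 Exception in thread "PMB Message Processor" java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object size: 65552, Num elements: 65536
 Exception in thread "Thread-10" java.lang.OutOfMemoryError: nativeGetNewTLA
 Server daemon died!
 JVM appears hung: Timed out waiting for signal from JVM.
"""

print(summarize_memory_errors(SAMPLE_LOG))
```

Run it against each day's log and you get a rough count to line up against the CPU alerts.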
Ok, so out of memory issues shouldn't be that hard to fix, right? All you need to do is edit the ptcollector\10.3.0.1\bin\collectord.bat file and change the JVM parameters from the default to something larger (I jumped right to "large"!). 
rem default settings:
rem set JVM_1=-Xms48m -Xmx160m -Xgc:genpar

rem Medium settings
rem set JVM_1=-Xms96m -Xmx256m

rem Large Settings -- For large scale deployments with very heavy component usage
set JVM_1=-Xms300m -Xmx300m
Restarting the collector service is insufficient for this change to go through.  The .bat file has a note that says "Changing the below arguments will take effect immediately if run from the console.  If run as a service, the service must be removed and re-installed ("<mystartupbat>.bat remove" and "<mystartupbat>.bat install")".
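With three versions of the same line in the file (two of them commented out), it's easy to lose track of which one is live. Here is a small Python sketch, written under the assumption that the file uses the plain "set JVM_1=..." form shown above, that reports the active value and its heap sizes. The helper names are hypothetical.

```python
import re


def active_jvm_settings(bat_text, variable="JVM_1"):
    """Return the value of the last uncommented `set VARIABLE=` line, or None."""
    pattern = re.compile(rf"^\s*set\s+{re.escape(variable)}=(.*)$", re.IGNORECASE)
    active = None
    for line in bat_text.splitlines():
        match = pattern.match(line)  # lines starting with "rem" never match
        if match:
            active = match.group(1).strip()
    return active


def heap_sizes(jvm_args):
    """Extract the (-Xms, -Xmx) values from a JVM argument string."""
    found = dict(re.findall(r"-(Xms|Xmx)(\d+[kKmMgG]?)", jvm_args))
    return found.get("Xms"), found.get("Xmx")


# The settings block from collectord.bat as shown above.
SAMPLE_BAT = """\
rem default settings:
rem set JVM_1=-Xms48m -Xmx160m -Xgc:genpar

rem Medium settings
rem set JVM_1=-Xms96m -Xmx256m

rem Large Settings -- For large scale deployments with very heavy component usage
set JVM_1=-Xms300m -Xmx300m
"""

settings = active_jvm_settings(SAMPLE_BAT)
print(settings)              # prints -Xms300m -Xmx300m
print(heap_sizes(settings))  # prints ('300m', '300m')
```

Keep in mind this only tells you what the file says; a service that was not removed and re-installed will still be running with the old values.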

In case you are curious what the change looks like in Task Manager: we are running a standard 32-bit, 4 GB RAM server which has the Analytics UI/collector, ptadaws, ptntcws, ptsearchserver, and ptupload on it, and this is what the switch did to the RAM graph:


A few hours after boosting the memory I was disappointed to see another out of memory error (7/14 11:34AM), but fortunately that has been the last one as of this posting (7/15 1pm).

The next thing to figure out is why new ASFACT entries are STILL showing null VISITIDs, and why results from the Analytics Console are choking on queries dealing with the last 2 months. Oy vey... Analytics, you are my bane.

My next step is to install Oracle JRockit Mission Control 3.1.2 which "includes tools to monitor, manage, profile, and eliminate memory leaks in your Java application without introducing the performance overhead normally associated with tools of this type." Instructions on configuring Mission Control can be found on the Oracle Support site under knowledge base article "JRockit Mission Control Monitoring/logging For Use With WCI Products [ID 1222573.1]" (login required)

BTW: the "JRockit JDK 28.1.3" link includes Mission Control 4.1.  However, when I tried running recorders on Mission Control 4.1 I ran into a ton of licensing complaints, so I figured I would just stick with 3.1.2 and try to tough it out.

To get it running you need to install it (anywhere) and configure the ptcollector to post information. This is done fairly simply by modifying the bea\alui\ptcollector\10.3.0.1\bin\collectord.bat file and adding the following at the end of the memory set line:
-Xmanagement:autodiscovery=true,authenticate=false,ssl=false,port=7091

After that is in place you'll need to remove and install the service again (just as we did before for the memory increase).  Then launch the Mission Control tool, configure a connector to point to that host and port and you are off to the races!
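Before blaming Mission Control for a failed connection, it's worth confirming that something is actually listening on the management port. A minimal Python sketch, assuming port 7091 from the setting above; the function name is my own. Note that an open port only proves a listener exists, not that it is the JRockit management agent.

```python
import socket


def management_port_open(host, port=7091, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed host lookups.
        return False


if __name__ == "__main__":
    # "localhost" is a placeholder; use the host that runs ptcollector.
    if management_port_open("localhost"):
        print("Something is listening on 7091")
    else:
        print("Nothing is listening on 7091 - was the service re-installed?")
```

If this reports nothing listening, the usual suspect is that the service was restarted instead of being removed and installed again.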

Unfortunately I ran into a license issue when I tried to record a session.
The Mission Control FAQ says, "If you are running on a JRockit R27.6 or later, you do not need a license file. However, if you are using an version earlier than R27.6, you will need to obtain and install a legacy license."
The FAQ is accurate in that Mission Control itself doesn't need a license, but the JRE does.
The URL in the Mission Control FAQ section points to a dead link: http://download2.bea.com/pub/license/All%20Products/BEA_WebLogic.zip


Solution: Posted on Twitter and got a response within 15 minutes pointing me to a license file discussion #kudosTwitter


I downloaded the license files from the link @marcushirt provided but wasn't sure exactly where to put the file. The instructions say to put it under %JROCKIT_HOME%/jre/, but since I don't have a JRockit folder it became a trial-and-error exercise with the series of JREs installed under C:\bea\alui\common\jre.



After testing on dev I figured out that the only spot the license file needs to be added (at least for ptcollector) is bea\alui\common\jre\1.5.0_32.  Once the file is in there, restart the collector service, and when you run Mission Control you'll be able to get a report (assuming you've properly set the ptcollector to notify Mission Control).  If you aren't familiar with Java processes, it will be very difficult to make heads or tails of the report dashboard that is generated.


7/15 Update
Today the analytics console is returning results as expected and the VISITIDs are appearing in the database for the current month - as expected.  Shame on me <slap/> for creating a Level 1 support ticket today (due to lots of pressure from above) only to find out that everything was working perfectly!
I can only assume that the out of memory issues were holding back a process that attaches the VISITIDs to the current month database, and some other processes.

There is only one final issue (worth addressing), and that is that when I run the analytics on document views the results fail to return when a folder is specified. Ironically this is the reason we had to upgrade from 10.3.0 to 10.3.0.1 in the first place... we have now gone full circle and are back to the original problem, although I think we can manage until 10.3.0.2 comes out, which rumor has it is relatively soon.

As always, working with Oracle's WCI support team was a great experience that helped fill in a lot of the gaps on the issues above! Thanks Brandon and Jon!
