GigaSpaces uses the notion of event containers for matching and handling data in the Space as events. This can be done either in Polling mode (the container polls the Space for data matching the criteria) or in Notify mode (the Space pushes matching data as events to the container).
In the case of a polling container, the event handler should change the object and return it to the Space so that it no longer matches the criteria. Otherwise the event would be triggered again.
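To make the re-triggering behaviour concrete, here is a minimal simulation in plain Java. It is a sketch, not the GigaSpaces API: the "space" is a plain list, and `PollingSimulation`, `Order` and the `"NEW"`/`"PROCESSED"` states are hypothetical names. An object matches the polling criteria while its state is `"NEW"`. The loop is capped at `maxPolls` so the unchanged case terminates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation, not the GigaSpaces API: the "space" is a plain list
// and an object matches the polling criteria while its state is "NEW".
class PollingSimulation {

    static final class Order {
        final String id;
        String state;

        Order(String id, String state) {
            this.id = id;
            this.state = state;
        }
    }

    // Runs at most maxPolls polling cycles and returns how many times the
    // event was handled. With changeState == false the handler writes the
    // object back unchanged, so it keeps matching the criteria.
    static int poll(boolean changeState, int maxPolls) {
        List<Order> space = new ArrayList<>();
        space.add(new Order("order-1", "NEW"));
        int handled = 0;
        for (int i = 0; i < maxPolls; i++) {
            // "Take" the first object matching the polling criteria
            Order match = null;
            for (Order o : space) {
                if (o.state.equals("NEW")) {
                    match = o;
                    break;
                }
            }
            if (match == null) {
                break; // nothing matches any more
            }
            space.remove(match);
            handled++;
            if (changeState) {
                match.state = "PROCESSED"; // no longer matches the criteria
            }
            space.add(match); // return the object to the "space"
        }
        return handled;
    }

    public static void main(String[] args) {
        System.out.println(poll(true, 10));  // 1: handled once
        System.out.println(poll(false, 10)); // 10: re-triggered on every cycle
    }
}
```

Changing the state is what stops the loop: the object is still in the "space", it simply no longer matches.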
Now, what happens if the event handler encounters a (permanent or temporary) failure during processing, for example because a downstream system is unavailable?
Well, clearly, the object is not handled, transactions are rolled back, etc., but most importantly: the object is returned to the Space in its original state. This means the object still matches the polling criteria. This can cause endless loops in your polling container, needlessly consuming resources, and the loop only stops once the underlying problem is resolved.
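The loop can be illustrated with another small plain-Java simulation. This is a sketch under simplifying assumptions: a failed attempt "rolls back" by putting the object back unchanged, and the `IntPredicate` named `downstreamUp` is a hypothetical stand-in for the availability of the downstream system on each attempt. `maxPolls` is only a safety cap for the simulation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntPredicate;

// Hypothetical simulation of a poison message: on failure the "transaction"
// is rolled back and the object returns to the space in its original state.
class PoisonLoopSimulation {

    // downstreamUp.test(attempt) decides whether the downstream system is
    // available on a given attempt. Returns the number of attempts made.
    static int run(IntPredicate downstreamUp, int maxPolls) {
        Deque<String> space = new ArrayDeque<>();
        space.add("order-1");
        int attempts = 0;
        while (attempts < maxPolls && !space.isEmpty()) {
            String taken = space.poll(); // take the object under a transaction
            attempts++;
            if (downstreamUp.test(attempts)) {
                continue; // success: commit, the object no longer matches
            }
            // Failure: rollback returns the object unchanged, so it matches again
            space.add(taken);
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Downstream system permanently down: loops until the safety cap
        System.out.println(run(attempt -> false, 1000));
        // Downstream system available again from the 5th attempt onwards
        System.out.println(run(attempt -> attempt >= 5, 1000));
    }
}
```

With a permanent failure the loop only ends because of the artificial cap; with a temporary one it ends exactly when the downstream system recovers, having burned every attempt before that.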
This problem is known as the Poison Message problem (see for example this link).
One solution is to add try-catch blocks to your event handler and handle problematic events appropriately to avoid loops. Simply catching and handling the exception has a drawback, though: it does not take into account that the problem may only be temporary. Therefore adding a retry mechanism may make sense in your application.
However, adding retryCount and maxRetries fields to the problematic object will not work: we cannot modify the erroneous object and write it back to the Space, because we must roll back the transaction under which it failed. There may have been very valid reasons for the exception in the first place!
For this reason we should use a separate object, which I call a Hospital object. This Hospital object holds a uid reference to the erroneous object and can include fields such as retryCount and maxRetries.
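Why the retry state has to live outside the failing object can be shown with a toy model. In this sketch (an assumption, not how GigaSpaces implements transactions internally) the event transaction works on a copy of the taken message, and only a commit would make that copy visible. The `HashMap` plays the role of the separate Hospital store, keyed by the message uid. `RetryStateSimulation` and `Message` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation: the event transaction works on a copy of the
// taken message, and only a commit would make that copy visible again.
class RetryStateSimulation {

    static final class Message {
        final String id;
        int retryCount; // retry state stored ON the message itself

        Message(String id, int retryCount) {
            this.id = id;
            this.retryCount = retryCount;
        }
    }

    // Runs `failures` failing attempts and returns
    // {retryCount seen on the message, retryCount in the separate store}.
    static int[] simulate(int failures) {
        Message inSpace = new Message("msg-1", 0);
        Map<String, Integer> hospital = new HashMap<>(); // keyed by message uid
        for (int i = 0; i < failures; i++) {
            // Inside the event transaction: bump the counter on a copy
            Message txCopy = new Message(inSpace.id, inSpace.retryCount + 1);
            // Separate record, written outside the event transaction
            hospital.merge(inSpace.id, 1, Integer::sum);
            boolean committed = false; // the handler failed, so we roll back
            if (committed) {
                inSpace = txCopy;
            }
            // After rollback, inSpace is still the original, unmodified message
        }
        return new int[] { inSpace.retryCount, hospital.getOrDefault(inSpace.id, 0) };
    }

    public static void main(String[] args) {
        int[] counts = simulate(3);
        System.out.println(counts[0]); // 0: the increment was rolled back each time
        System.out.println(counts[1]); // 3: the separate record kept counting
    }
}
```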
Here’s a snippet of what this Hospital object might look like:
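```java
import java.io.Serializable;

public class Hospital implements Serializable {
    private String id;              // uid of this Hospital entry
    private String poisonObjectId;  // uid of the object that failed
    private Integer retryCount;     // retries performed so far
    private Integer maxRetries;     // retries allowed before giving up
    private Integer routingId;      // routing value, e.g. for partitioning
    // Integer rather than int: null fields act as wildcards when this
    // class is used as a template (see the read in the handler below)

    public Hospital() {
    }

    // Getters and setters beyond this line
}
```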
Although a try-catch block in your event containers will work, I have used an Aspect that intercepts any exception thrown from a (polling) event container:
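```java
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Pointcut;

@Aspect
public class EventContainerExceptionAspect { // class name is illustrative

    // Set via dependency injection (setter omitted)
    private EventContainerExceptionHandler eventContainerExceptionHandler;

    // Matches every method annotated with @SpaceDataEvent
    @Pointcut("@annotation(org.openspaces.events.adapter.SpaceDataEvent)")
    private void spaceDataEvent() { }

    @Around("spaceDataEvent()")
    public Object exceptionHandling(ProceedingJoinPoint pjp)
            throws EventContainerRetryException {
        Object o = null;
        try {
            o = pjp.proceed();
        } catch (Throwable e) {
            // The first argument of the listener method is the event object
            eventContainerExceptionHandler.handleException(
                    (BaseObject) pjp.getArgs()[0], e);
        }
        return o;
    }
}
```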
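If AspectJ is not available, the same "around" idea can be sketched with a JDK dynamic proxy (`java.lang.reflect.Proxy`). This is a swapped-in technique for illustration, not what the aspect above does internally. `EventListener`, `withExceptionHandling` and the `failures` list are hypothetical: the list stands in for the exception handler, and the proxy returns `null` after recording the failure, whereas the real handler may instead rethrow to force a rollback.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the @Around advice: a JDK dynamic proxy wraps a
// listener, catches any exception and hands it to a (hypothetical) handler.
class InterceptionSketch {

    interface EventListener {
        Object onEvent(Object event);
    }

    static EventListener withExceptionHandling(EventListener target, List<String> failures) {
        InvocationHandler around = (proxy, method, args) -> {
            try {
                return method.invoke(target, args); // like pjp.proceed()
            } catch (InvocationTargetException e) {
                // Unwrap the listener's own exception and record it,
                // where the aspect would call handleException(...)
                failures.add(args[0] + ": " + e.getCause().getMessage());
                return null;
            }
        };
        return (EventListener) Proxy.newProxyInstance(
                EventListener.class.getClassLoader(),
                new Class<?>[] { EventListener.class },
                around);
    }

    public static void main(String[] args) {
        List<String> failures = new ArrayList<>();
        EventListener listener = withExceptionHandling(event -> {
            throw new IllegalStateException("downstream unavailable");
        }, failures);
        System.out.println(listener.onEvent("order-1")); // null
        System.out.println(failures); // [order-1: downstream unavailable]
    }
}
```

Unlike the pointcut, the proxy intercepts every method of the interface rather than only annotated ones, which is one reason the AOP approach is more convenient in a Spring-based setup.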
The aspect delegates to an exception handler, which is invoked whenever an exception occurs. This handler takes care of finding or creating the Hospital object, incrementing the retry count, etc.:
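```java
import org.openspaces.core.GigaSpace;

public class EventContainerExceptionHandler { // class name is illustrative

    // Hypothetical default; choose a value that suits your application
    private static final int MAX_RETRIES = 3;

    // Must not share the event container's transactional context (see below)
    private GigaSpace hospitalGigaSpace;

    public void handleException(BaseObject baseObject, Throwable e)
            throws EventContainerRetryException {
        Hospital template = new Hospital();
        template.setPoisonObjectId(baseObject.getId());
        Hospital hospital = hospitalGigaSpace.read(template);
        if (hospital == null) {
            // First failure for this object: initialise the counters,
            // otherwise the null Integers below cause a NullPointerException
            hospital = new Hospital();
            hospital.setPoisonObjectId(baseObject.getId());
            hospital.setRetryCount(0);
            hospital.setMaxRetries(MAX_RETRIES);
            // Set other fields, e.g. routingId
        }
        if (hospital.getRetryCount() >= hospital.getMaxRetries()) {
            // Maybe you want to do something generic here to prevent the event
            // from being triggered again, for example setting a status flag.
        } else {
            // Increase the retry count on the hospital object
            hospital.setRetryCount(hospital.getRetryCount() + 1);
            // Write the hospital object back, if you want with a lease time
            hospitalGigaSpace.write(hospital, 120000);
            // Rethrow the initial exception caused in the event container.
            // This will cause its transaction to be rolled back.
            throw new EventContainerRetryException(e);
        }
    }
}
```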
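The retry bookkeeping itself can be exercised without a Space. The following is a minimal in-memory sketch, assuming a `HashMap` in place of `hospitalGigaSpace`; `HospitalRecord`, `RetrySignal` and `HospitalRetrySketch` are hypothetical stand-ins for `Hospital`, `EventContainerRetryException` and the handler, and leases and transactions are not modelled. Each `RetrySignal` corresponds to one rollback and redelivery of the event.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of the retry bookkeeping done by handleException.
// All names are hypothetical stand-ins; leases and transactions are not modelled.
class HospitalRetrySketch {

    static final class HospitalRecord {
        final String poisonObjectId;
        final int maxRetries;
        int retryCount;

        HospitalRecord(String poisonObjectId, int maxRetries) {
            this.poisonObjectId = poisonObjectId;
            this.maxRetries = maxRetries;
        }
    }

    static final class RetrySignal extends Exception {
        RetrySignal(Throwable cause) {
            super(cause);
        }
    }

    private final Map<String, HospitalRecord> hospitalStore = new HashMap<>();
    private final int maxRetries;

    HospitalRetrySketch(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Returns normally once the retry budget is exhausted ("give up");
    // otherwise records the retry and throws to trigger a rollback.
    void handleException(String objectId, Throwable e) throws RetrySignal {
        HospitalRecord record = hospitalStore.computeIfAbsent(
                objectId, id -> new HospitalRecord(id, maxRetries));
        if (record.retryCount >= record.maxRetries) {
            return; // give up: e.g. flag the object so it stops matching
        }
        record.retryCount++;
        throw new RetrySignal(e);
    }

    // Feeds the same failing event through the handler until it gives up
    // and returns how many times the event was rolled back for a retry.
    int retriesUntilGiveUp(String objectId) {
        int rollbacks = 0;
        while (true) {
            try {
                handleException(objectId, new IllegalStateException("downstream down"));
                return rollbacks;
            } catch (RetrySignal signal) {
                rollbacks++;
            }
        }
    }

    int retryCount(String objectId) {
        HospitalRecord record = hospitalStore.get(objectId);
        return record == null ? 0 : record.retryCount;
    }

    public static void main(String[] args) {
        HospitalRetrySketch sketch = new HospitalRetrySketch(3);
        System.out.println(sketch.retriesUntilGiveUp("order-1")); // 3
        System.out.println(sketch.retryCount("order-1"));         // 3
    }
}
```

Because the record survives across calls, a later failure of the same object gives up immediately instead of starting a fresh retry budget; a lease on the real Hospital entry is one way to let that budget reset after a while.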
It is important to realize that the operations on the hospitalGigaSpace bean must run under a different transactional context than the event container: the polling container's transaction must be rolled back, while the operations on the Hospital object must succeed.
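The consequence of getting this wrong can be sketched with a toy transaction model. This is an illustrative assumption, not a GigaSpaces API: `Tx` buffers writes until `commit()`, `rollback()` discards them, and the `shared` flag decides whether the Hospital update joins the failing event transaction or commits in its own context.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of two transactional contexts: writes made under a
// transaction are buffered and only reach the store on commit.
class TxContextSketch {

    static final class Tx {
        private final Map<String, Integer> store;
        private final List<Runnable> pending = new ArrayList<>();

        Tx(Map<String, Integer> store) {
            this.store = store;
        }

        void put(String key, int value) {
            pending.add(() -> store.put(key, value));
        }

        void commit() {
            pending.forEach(Runnable::run);
            pending.clear();
        }

        void rollback() {
            pending.clear();
        }
    }

    // Simulates `failures` failing events. With shared == true the hospital
    // update joins the event transaction; otherwise it commits on its own.
    static int hospitalRetryCount(boolean shared, int failures) {
        Map<String, Integer> hospitalStore = new HashMap<>();
        for (int i = 0; i < failures; i++) {
            Tx eventTx = new Tx(hospitalStore);
            int current = hospitalStore.getOrDefault("order-1", 0);
            if (shared) {
                eventTx.put("order-1", current + 1); // rolled back with the event
            } else {
                Tx hospitalTx = new Tx(hospitalStore); // independent context
                hospitalTx.put("order-1", current + 1);
                hospitalTx.commit();
            }
            eventTx.rollback(); // the event handler failed
        }
        return hospitalStore.getOrDefault("order-1", 0);
    }

    public static void main(String[] args) {
        System.out.println(hospitalRetryCount(true, 5));  // 0: retries never counted
        System.out.println(hospitalRetryCount(false, 5)); // 5
    }
}
```

In the shared case the retry count never grows, so the retry limit is never reached and the poison message loops forever, exactly the problem the Hospital object was meant to solve.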
BTW, an upcoming release of GigaSpaces will include a generic exception handler, which can be used instead of try-catch blocks or AOP.