BPMS Lessons Learned – Design

This blog contribution tries to summarize experiences gained when designing a BPMS solution using ActiveVOS v. 6.1 and to highlight key design decisions. It describes the consequences of these decisions in the context of the bigger picture (impact on production, maintainability, day-to-day routines, …). As the overall architecture is a set of design decisions that depend on the business context and the objectives you are trying to achieve, there is no simple copy-paste solution applicable in every situation. To make a long story short, let's have a look at the key areas:
  • BPMS is not a web flow framework! It can be tempting to use a WS-HT interface for each interaction with a client/user. Besides the fact that you somehow have to initiate an interaction (create a human task) that the client does not know about in advance, this has a negative impact on DB space consumption. Every human task has several underlying processes that consume space depending on the log level and persistence settings, and this data holds no business-relevant information. Newer versions handle this better.
  • Think about your error handling strategy. The implementation we used offers a “suspend on fail” feature: in case of an error the process is put into a suspended state and waits for manual intervention. The operator can then replay the process. Be very careful with this feature. Overusing this pattern leads to high database space consumption, because it requires full process logging and persistence to be enabled. Moreover, what can the operator actually do with the failing process? In a “properly tested” system a failure has two main causes: either data is missing, or there is a technical reason such as a service being unavailable. In the first case, do you really think the operator will find your client’s insurance number somewhere in the system? Certainly not. And what about auditing in such a system? That is another really important question. Failing over to a human task seems like a reasonable solution. In the second case, the implementation offers a feature called “retry policy”. Taking advantage of it, you can make the system immune to short outages.
  • Structure your processes into smaller blocks. Although I can understand why people tend to design long processes, experience speaks against it. With one long process that realizes the whole business you gain readability, but you lose reusability, maintainability and, more importantly, scalability of the system. Not all processes and sub-processes are equally important, from either the technical or the business perspective, so you can use different levels of logging and persistence and different policies, such as retry policy, for each process or sub-process. All this lowers database space consumption and improves the stability and robustness of the system. One important thing we shouldn’t leave aside is spatial scalability. Every sub-process is just another web service, so if we need to improve the throughput of the system we are free to set up a new instance and deploy those processes there. We are completely free in terms of infrastructure such as load balancing, clustering, … The only thing to keep in mind is that human tasks run within the “BPEL engine” itself (not in a separate component). The version we used was not able to read human tasks from more than one “BPEL engine”.
  • Sort your data. Every business domain has legal requirements such as auditing and archiving information for a number of years. A naive approach I have come across is: “BPMS holds all the info”. It certainly does. But how is the data structured? Is all the data stored in BPMS relevant to the business and hence worth archiving? BPMS definitely keeps a lot of system information that is not relevant to the business in its own internal structure, e.g. incoming messages stored as XML objects in variables. To get a specific piece of information you would need to find the right variable and then locate the value within the XML object. Moreover, the history of this information is not guaranteed: if the process doesn’t have full logging enabled, only the latest state is kept, with no track of changes. The best approach is to address auditing requirements in advance. Store all your audit info in a separate DB schema in a well-defined, readable structure. You save a significant amount of DB space and you don’t have to build a “data mining” solution on top of the BPMS schema, not to mention the expense of additional disk capacity.
  • Think through every feature you use in human tasks. Human tasks offer many features that support interaction with users, e.g. email, task escalation, task notification, etc. Pay special attention to the task notification feature. Internally, a notification is implemented as a human task that simply shows up in the inbox as a notification rather than as a new task. Measurements showed that one human task consumes roughly 1 MB, so overusing this feature can have a big impact on the disk space consumed by the DB. The most dangerous thing about notifications is that the underlying system processes stay in the running state until the user confirms delivery, which makes them difficult to delete from the console. There is also no feature for associating a notification with a human task, so the only way to cope with the problem is to cancel the internal system processes manually. This can be really time-consuming when your system generates three thousand messages overnight.
  • Use the “proxy back-end call pattern”. Every back-end system has its own domain model and message structure. It is really tedious and time-consuming to build messages on the BPMS side, where the only tools available are XPath, XQuery, etc. Experience has shown that a better and more efficient approach is to call a “proxy method” that is responsible for building the message, sending it out and processing the result that is passed back to BPMS.
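
To make the “proxy back-end call pattern” from the last point more concrete, here is a minimal sketch in Python. It is only an illustration under assumptions, not the code from the project: the message names (PolicyRequest, PolicyResponse), the field names and the fake_backend function are all hypothetical, and the real transport (SOAP, JMS, …) is replaced by a plain function so the example runs on its own. The point is that the process hands over a few flat fields and gets a few flat fields back, while all message building and parsing lives in ordinary code outside the BPMS.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict


def build_policy_request(fields: Dict[str, str]) -> str:
    """Map the flat fields coming from the process to the back-end message.

    "PolicyRequest" and its children are a hypothetical back-end format.
    """
    root = ET.Element("PolicyRequest")
    client = ET.SubElement(root, "Client")
    ET.SubElement(client, "Id").text = fields["clientId"]
    ET.SubElement(root, "Product").text = fields["product"]
    return ET.tostring(root, encoding="unicode")


def parse_policy_response(xml_text: str) -> Dict[str, str]:
    """Reduce the back-end response to the few values the process needs."""
    root = ET.fromstring(xml_text)
    return {
        "status": root.findtext("Status", default="UNKNOWN"),
        "policyNumber": root.findtext("PolicyNumber", default=""),
    }


def create_policy(fields: Dict[str, str], send: Callable[[str], str]) -> Dict[str, str]:
    """The proxy method: build the message, send it, process the result.

    `send` stands in for the real transport and is an assumption of this sketch.
    """
    request_xml = build_policy_request(fields)
    response_xml = send(request_xml)
    return parse_policy_response(response_xml)


def fake_backend(request_xml: str) -> str:
    """Hypothetical back-end used only to make the sketch runnable."""
    client_id = ET.fromstring(request_xml).findtext("Client/Id")
    return (
        "<PolicyResponse>"
        "<Status>OK</Status>"
        f"<PolicyNumber>P-{client_id}</PolicyNumber>"
        "</PolicyResponse>"
    )


# What the process would see: flat input, flat output, no XPath required.
result = create_policy({"clientId": "42", "product": "life"}, fake_backend)
```

In a setup like this the process only has to copy a handful of simple values into and out of one service call, and a change in the back-end message structure is handled in the proxy without touching the process definition.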