Rome_institute
Conv High Performance Batch Processing with Java Enterprise Edition
belongs to ROME Insights  Home_xsm
by Colin Rank_participant on Jun 19, 2007 - 11:13 AM read 2362 times
 
Enterprise software developers and corporate IT architects have established the Java Enterprise Edition platform as a leading choice for building enterprise software applications. The platform is widely used for everything from eCommerce websites to back office data aggregation systems. Its versatility and reliability as an enterprise computing platform is well established. 

But this wasn’t always so. Sun initially trumpeted Java as a desktop platform that would bring rich content to web applications in the form of Java applets that run locally in a user’s web browser. It was also touted as a thick-client desktop application development tool that would be widely used to build applications that could run on any computer (remember write once, run anywhere?).
 

Sometime in the late nineties, Java application development took a ninety degree turn and ended up resulting in software that mostly runs on corporate servers instead of corporate workstations. Today, a substantial portion of web applications are delivered on the JEE platform. 

Despite the “Enterprise” in its name, the JEE platform was principally designed for handling HTTP requests from web browsers and performing some business logic in response to each request. It now includes many other technologies but most of them are related to this mission.
 

However, as the complexity and disparate uses of web applications has grown, users and designers of these systems have found many users for JEE beyond just responding to requests from a browser. Many of these uses include common enterprise back office tasks such as batch processing of large volumes of data and while the JEE platform was not originally designed for such purposes, it is versatile enough to provide viable solutions to these problems.
 

What is a Batch?
Batch scenarios arise often in business software applications because of a conflict between the enterprise’s desire to respond immediately to customer requests but also analyze the resulting transactions. This requires the speedy capture of the initial transaction with no analysis and then a later batch process to aggregate or optimize the data for reporting, analysis, archive or some other large volume process. It is a safe assumption that every business in the world does some kind of batch processing on their data.
 

The characteristics of the typical batch process include:
  • It is a long-running process that must occur on a regularly scheduled basis.
  • The volume of data to be processed is high, usually on the order of thousands to millions of database rows.
  • There may be complex logic or calculations to perform on the data.
  • The process may require a large set of data from some other system that is delivered at a specific time in a large set.
  • The process is run asynchronously from user interactions. It is not part of a user session in an online system. A user does not start it and is not waiting on it to complete.
Why do Batch Processing in JEE?
The JEE specification was designed for online web applications and has several limitations with respect to batch processing. For instance, JEE containers are required to manage the lifecycle of Enterprise JavaBeans (EJB) and as such might limit the ability to create threads from within these classes.
 

However, this limitation can be overcome in a couple ways. First, while most JEE containers discourage developers from creating and managing their own threads, they do not prohibit the practice especially outside the bounds of EJB classes. Therefore, the batch process can do its own threading using the java.util.Concurrent package (available as of Java 5) and on most JEE platforms this causes no trouble. This package provides user-friendly thread pool classes and thread management facilities that make it easier than ever before to create multi-threaded application in Java.
 

Second, a more spec compliant approach to multithreading is to use Java Message Service (JMS) messages to create worker threads within the JEE context. This approach is a little more complex to implement but provides the benefits of complying with the JEE specification while also allowing the batch process to span multiple Java Virtual Machine (JVM) instances in a clustering situation. This will be discussed in more detail below.
 

Another issue with batch processing on the JEE platform is that by default the container manages transactions and session timeouts. The JEE container is inclined to limit how long resources such as database connections, transactions and beans can be monopolized. This is meant to guarantee a high level of service to all users within an online application, but can be problematic for a long-running batch process.
 

This issue can be addressed by correctly configuring a batch process to not require transactions and to avoid the use of entity beans and stateful session beans that might have timeout or locking problems.  Also, be sure to use the pooled resources such as database connections judiciously, releasing them back to the pool when not in use.
 

In addition to these limitations there is a performance question. Other methods can achieve higher performance than the JEE platform. Batch processing typically involves operations on large volumes of rows stored in a relational database and a stored procedure implemented directly in the database might offer the fastest performance for most applications. However, there are legitimate reasons to implement the logic in JEE instead.
  • Stored procedures are typically implemented in the version of SQL specific to the database platform and are not portable to other databases. This may not matter for a departmental application but is usually not acceptable for an enterprise software product that must be supported on many different databases.
  • The JEE platform provides complimentary technology such as JCA connections to other systems, Web Service calls to other services and other features that might be useful.
  • Logic implemented in Java can reuse other application logic that is also present in the business layer tier of the application.
  • Well-written Java code is usually easier to understand, maintain and enhance than a collection of stored procedures.
  • JEE servers usually include clustering capabilities providing the ability to federate multiple, cheap, commodity servers to improve batch processing performance.
These benefits will often outweigh any performance gain that might be achieved using stored procedures. Furthermore, the difference in performance between a Java solution and a database stored procedure solution can be minimized using the techniques described below. 

Techniques for High Performance Batch Processing on JEE
Now that we’ve covered the limitations and the alternatives let’s discuss how to architect a batch process on the JEE platform for maximum performance. Batch problems are clearly candidates for multi-threaded solutions because the objective is to complete as much work as possible in the shortest time possible and no human user interaction is necessary. Parallel processing using multiple threads is necessary to bring all available computing resources to bear on the problem. Today’s multiple core, multiple CPU servers are especially well-suited for multi-threaded processing.
 

The basic software architecture pattern to follow is a dispatcher-worker pattern. This pattern consists of a centralized controller (the dispatcher) that sends chunks of work to multiple worker threads. This basic pattern is often used in software applications that must do a high volume of work. 


App Server.JPG

The central problem is to determine how to partition the chunks of work to maximize the efficiency of threads and database interactions. Clearly, each thread must operate on an independent logical unit of work. Otherwise, concurrent threads might end up waiting on one another or incorrectly altering results related to other threads. Each thread must be able to complete its job independently from all other threads. 

Subsequently, the units of work should be designed to minimize database interactions because these are very expensive. They involve asking the database to retrieve some data (which could require spinning a physical hard drive) and then moving that data over a network to the JEE server. Minimizing the frequency and volume of these interactions is the single most important factor in JEE batch processing performance. There are many ways to minimize database access and optimize the necessary database interactions.
 

Be Lazy (Avoid unnecessary work)
Only get the required data
The easiest way to minimize database interactions is to carefully construct the batch algorithm to only get and operate upon the data it really needs. This may seem obvious but modern software often has layers such as Data Access Objects (DAOs) or perhaps an object model based on Hibernate that tend to return fully populated objects rather than just the few fields required. It’s often convenient to reuse an existing data layer that does these things, but only do so if the time required to retrieve the extra data is acceptable. Otherwise, create new DAOs or JDBC statements to get just the specific data required.
 

Only do the Work Required for each Run
Another way to avoid bringing back unnecessary data and doing pointless work is to create a configurable batch process. Batch processes often do several different but related operations and not all of them are always necessary. A little extra development work is required to provide input parameters that allow certain operations to be switched off for certain batch runs, but avoiding unnecessary work can provide worthwhile performance improvements.
 

Only work on data that has changed
In this same vein of avoiding unnecessary work, it is often possible to implement a feature that tracks what data has changed (and requires new batch operations) and what data has not changed (and can safely be ignored). Depending on the rate of change of the data, ignoring unchanged values can lead to a large performance improvement.
 

Use data warehousing techniques to compress data over time
The size of the dataset can further be reduced if by exploiting common data warehouse data modeling techniques such as the concept of slowly changing dimensions. Data warehouses are often modeled to contain dimension tables and fact tables. The dimension tables contain all the descriptive attributes upon which data is sliced. Fact tables contain the actual aggregated data. For example, there may be a fact table containing order totals with a foreign key to a dimension table that captures the name of the salesperson for the order allowing the creation of a report to slice order totals by salesperson.
 

Slowly changing dimensions and slowly changing facts are methods that can be used to compress the volume of this data if the data changes over time. The idea is to put date ranges on the dimensions and facts rather than repeating the same values for each date in the time period. For example, a salesperson’s name could change over time if she gets married. Without date range effectiveness on this dimension, it is necessary to capture the name as it was at each batch run to preserve historical data even though the data likely does not change often. This is repetitive and wasteful. If the dimension has a date range, then the batch process need only store a row for each different value.
 

The same can be done with facts if the model requires storing facts at different points in time. If the result of the computation happens to be the same value as it was the last time, the batch process could just store a date range with the answer rather than storing the same answer multiple times.
 

Optimize Database Interactions
Eliminating unnecessary work is the best way to limit database interactions, but clearly, some interactions must happen. Further strategies can be used to make sure those interactions are as efficient as possible.
 

Caching
One approach is to take advantage of caching technologies. Batch processes often require access to some set of master data that is reused throughout the process. This master data should be loaded from the database just once and then cached in memory within the application server context and reused. This can be done using singletons or static variables that hold the data, or caching tools like JBoss Cache, GigaSpaces, Tangosol Coherence, etc. These latter tools provide benefits such as replicating the cached values across multiple JVM instances but introduce added complexity to the application.
 

One caveat for caching is that it may solve a database interaction problem but create a memory constraint problem because the in-memory cache in the application server tier may grow too large. RAM has become much cheaper in recent years, but most JVMs are still limited to 2-4GB of heap space. Be careful that the cache will not exceed the memory space available and cause disk swapping on the application server.
 

Data Streaming
Another approach for optimizing database interactions is to favor a smaller number of denormalized queries that retrieve large volumes of data over a larger number of more granular queries that retrieve small volumes of data. Relational databases are very good at creating an execution plan for a few complex queries and then streaming back the results as quickly as possible. They perform less well when asked to execute lots of small queries that appear to be randomly organized.
 

For example, consider a batch process that needs to compute the shipping cost on a large set of orders. One could choose to define each chunk of work to be a single order. The dispatcher could ask the database for the master list of order IDs and send each ID to a worker thread for processing. That worker thread could then ask the database for the details of each order, do its work, and save the answer back to the database.
 

To the database, this approach will feel like it’s getting slammed by lots of concurrent users asking for different orders all at the same time. There will be high contention for resources such as database connections and access to the order table. It will sort of look like a denial of service attack.
 

On the other hand, one could choose to define the chunks as an arbitrary number of orders, perhaps 1000. The dispatcher could query the database for all required order columns rather than just the order ID. As each row is returned the dispatcher would send all the order data required to compute shipping costs to a worker. Each worker would NOT have to query the database to do its work because everything it needs is provided as input.
 

As each worker completes its work it would place the results on a persistence queue rather than immediately sending an individual insert or update to the database. Every time this queue reaches 1000 entries, the batch process would send a bulk insert/update statement to the database. The result is that the database is allowed to do a few, high volume things as fast as it is able, rather than swapping between numerous small tasks.
 

Optimize Physical Database Access
Databases often respond slowly when they receive multiple requests that contend for data located on the same physical media. Avoiding this contention will speed the batch process. It is often possible to specify how database tables are segregated on different physical disk drives and divide tables that are likely to receive large numbers of concurrent requests onto different physical drives.
 

Use Database Tricks
Relational databases offer many configuration options and interaction methods that can be used intelligently to optimize a batch process. Performance monitoring tools should be used to watch the behavior of the database as the batch process runs. This will allow the optimal configuration of settings such as how much memory to commit to the database’s shared cache and so on.
 

Transactions
A database uses transactions to group multiple changes into a single logical unit of work. These changes are then all committed and stored or all rolled back and thrown away together. The database must maintain a log of these changes to keep track of what things belong together. Large transactions result in a large transaction log. Large logs can negatively impact performance. There are a couple ways to avoid this problem in a batch process. One could choose to not use transactions at all. Most databases include an autocommit feature allowing all changes to be committed immediately. Alternatively, one could make sure that each thread independently commits its own relatively small transaction. In any case, it is not wise to have large, long-running transactions as part of a batch process. 

Prepared Statements
Most databases support the idea of pre-compiled SQL statements called prepared statements. A prepared statement is a SQL statement with placeholders for parameters that will be supplied later on with actual data values. The statement can be compiled once and then reused even if the parameters change. This saves compilation time on the database platform and improves performance.
 

Batch processes usually involve multiple executions of the same SQL statements over and over with different parameter values. This is a perfect situation for prepared statements. Dynamic statements should always be avoided.
 

Application Server Clustering
One of the benefits of using the JEE platform for batch processing is that one  can leverage its ability to cluster multiple application servers. If JMS is used as the transport mechanism to move messages from the dispatcher to the workers and the JMS implementation supports clustered, distributed queues (as many do), then the workers can reside on different physical machines. This provides a method to scale the performance of the batch process by adding application servers. A powerful cluster can be assembled using multiple, inexpensive commodity application servers and it can grow with the requirements of the batch process.  

Engine Controller.JPG

Conclusion
While the JEE platform was originally designed for building enterprise web applications it has grown into a versatile Java server platform that can successfully solve many problems. Batch processing is a common enterprise requirement. The JEE platform can provide an excellent batch processing platform as long as care is taken to optimize database interactions.


Sponsors

Portfolio

Author Profile

Colin  Feed_small Secure_feed_16

Participant Rank_participant

Recent

Subscribe

Feed for ROME Institute:
Feed_small Public Secure_feed_16 Secure