
|
High Performance Batch Processing with Java Enterprise Edition
belongs to ROME Insights ![]() by Colin on 2007-06-19 04:13 PM read 2458 times
|
The central problem is to determine how to partition the chunks of work to maximize the efficiency of threads and database interactions. Clearly, each thread must operate on an independent logical unit of work. Otherwise, concurrent threads might end up waiting on one another or incorrectly altering results related to other threads. Each thread must be able to complete its job independently from all other threads.
Subsequently, the units of work should be designed to minimize database interactions because these are very expensive. They involve asking the database to retrieve some data (which could require spinning a physical hard drive) and then moving that data over a network to the JEE server. Minimizing the frequency and volume of these interactions is the single most important factor in JEE batch processing performance. There are many ways to minimize database access and optimize the necessary database interactions.
Be Lazy (Avoid unnecessary work) Only get the required data
The easiest way to minimize database interactions is to carefully construct the batch algorithm to only get and operate upon the data it really needs. This may seem obvious but modern software often has layers such as Data Access Objects (DAOs) or perhaps an object model based on Hibernate that tend to return fully populated objects rather than just the few fields required. It’s often convenient to reuse an existing data layer that does these things, but only do so if the time required to retrieve the extra data is acceptable. Otherwise, create new DAOs or JDBC statements to get just the specific data required.
Only do the Work Required for each Run
Another way to avoid bringing back unnecessary data and doing pointless work is to create a configurable batch process. Batch processes often do several different but related operations and not all of them are always necessary. A little extra development work is required to provide input parameters that allow certain operations to be switched off for certain batch runs, but avoiding unnecessary work can provide worthwhile performance improvements.
Only work on data that has changed
In this same vein of avoiding unnecessary work, it is often possible to implement a feature that tracks what data has changed (and requires new batch operations) and what data has not changed (and can safely be ignored). Depending on the rate of change of the data, ignoring unchanged values can lead to a large performance improvement.
Use data warehousing techniques to compress data over time
The size of the dataset can further be reduced if by exploiting common data warehouse data modeling techniques such as the concept of slowly changing dimensions. Data warehouses are often modeled to contain dimension tables and fact tables. The dimension tables contain all the descriptive attributes upon which data is sliced. Fact tables contain the actual aggregated data. For example, there may be a fact table containing order totals with a foreign key to a dimension table that captures the name of the salesperson for the order allowing the creation of a report to slice order totals by salesperson.
Slowly changing dimensions and slowly changing facts are methods that can be used to compress the volume of this data if the data changes over time. The idea is to put date ranges on the dimensions and facts rather than repeating the same values for each date in the time period. For example, a salesperson’s name could change over time if she gets married. Without date range effectiveness on this dimension, it is necessary to capture the name as it was at each batch run to preserve historical data even though the data likely does not change often. This is repetitive and wasteful. If the dimension has a date range, then the batch process need only store a row for each different value.
The same can be done with facts if the model requires storing facts at different points in time. If the result of the computation happens to be the same value as it was the last time, the batch process could just store a date range with the answer rather than storing the same answer multiple times.
Optimize Database Interactions
Eliminating unnecessary work is the best way to limit database interactions, but clearly, some interactions must happen. Further strategies can be used to make sure those interactions are as efficient as possible.
Caching
One approach is to take advantage of caching technologies. Batch processes often require access to some set of master data that is reused throughout the process. This master data should be loaded from the database just once and then cached in memory within the application server context and reused. This can be done using singletons or static variables that hold the data, or caching tools like JBoss Cache, GigaSpaces, Tangosol Coherence, etc. These latter tools provide benefits such as replicating the cached values across multiple JVM instances but introduce added complexity to the application.
One caveat for caching is that it may solve a database interaction problem but create a memory constraint problem because the in-memory cache in the application server tier may grow too large. RAM has become much cheaper in recent years, but most JVMs are still limited to 2-4GB of heap space. Be careful that the cache will not exceed the memory space available and cause disk swapping on the application server.
Data Streaming
Another approach for optimizing database interactions is to favor a smaller number of denormalized queries that retrieve large volumes of data over a larger number of more granular queries that retrieve small volumes of data. Relational databases are very good at creating an execution plan for a few complex queries and then streaming back the results as quickly as possible. They perform less well when asked to execute lots of small queries that appear to be randomly organized.
For example, consider a batch process that needs to compute the shipping cost on a large set of orders. One could choose to define each chunk of work to be a single order. The dispatcher could ask the database for the master list of order IDs and send each ID to a worker thread for processing. That worker thread could then ask the database for the details of each order, do its work, and save the answer back to the database.
To the database, this approach will feel like it’s getting slammed by lots of concurrent users asking for different orders all at the same time. There will be high contention for resources such as database connections and access to the order table. It will sort of look like a denial of service attack.
On the other hand, one could choose to define the chunks as an arbitrary number of orders, perhaps 1000. The dispatcher could query the database for all required order columns rather than just the order ID. As each row is returned the dispatcher would send all the order data required to compute shipping costs to a worker. Each worker would NOT have to query the database to do its work because everything it needs is provided as input.
As each worker completes its work it would place the results on a persistence queue rather than immediately sending an individual insert or update to the database. Every time this queue reaches 1000 entries, the batch process would send a bulk insert/update statement to the database. The result is that the database is allowed to do a few, high volume things as fast as it is able, rather than swapping between numerous small tasks.
Optimize Physical Database Access
Databases often respond slowly when they receive multiple requests that contend for data located on the same physical media. Avoiding this contention will speed the batch process. It is often possible to specify how database tables are segregated on different physical disk drives and divide tables that are likely to receive large numbers of concurrent requests onto different physical drives.
Use Database Tricks
Relational databases offer many configuration options and interaction methods that can be used intelligently to optimize a batch process. Performance monitoring tools should be used to watch the behavior of the database as the batch process runs. This will allow the optimal configuration of settings such as how much memory to commit to the database’s shared cache and so on.
Transactions
A database uses transactions to group multiple changes into a single logical unit of work. These changes are then all committed and stored or all rolled back and thrown away together. The database must maintain a log of these changes to keep track of what things belong together. Large transactions result in a large transaction log. Large logs can negatively impact performance. There are a couple ways to avoid this problem in a batch process. One could choose to not use transactions at all. Most databases include an autocommit feature allowing all changes to be committed immediately. Alternatively, one could make sure that each thread independently commits its own relatively small transaction. In any case, it is not wise to have large, long-running transactions as part of a batch process.
Prepared Statements
Most databases support the idea of pre-compiled SQL statements called prepared statements. A prepared statement is a SQL statement with placeholders for parameters that will be supplied later on with actual data values. The statement can be compiled once and then reused even if the parameters change. This saves compilation time on the database platform and improves performance.
Batch processes usually involve multiple executions of the same SQL statements over and over with different parameter values. This is a perfect situation for prepared statements. Dynamic statements should always be avoided.
Application Server Clustering
One of the benefits of using the JEE platform for batch processing is that one can leverage its ability to cluster multiple application servers. If JMS is used as the transport mechanism to move messages from the dispatcher to the workers and the JMS implementation supports clustered, distributed queues (as many do), then the workers can reside on different physical machines. This provides a method to scale the performance of the batch process by adding application servers. A powerful cluster can be assembled using multiple, inexpensive commodity application servers and it can grow with the requirements of the batch process.
Conclusion
While the JEE platform was originally designed for building enterprise web applications it has grown into a versatile Java server platform that can successfully solve many problems. Batch processing is a common enterprise requirement. The JEE platform can provide an excellent batch processing platform as long as care is taken to optimize database interactions.