The conversation usually starts with confidence. "We need to load 1M records from a CSV file on startup. Should be straightforward. Stream it in, batch insert, we're done." Then production happens: startup hangs, the database connection pool is exhausted, and your API becomes unresponsive because a background task is saturating the database.
The lie is this: you can load massive datasets without architecture. You can't. The naive approach dies not because Java is slow but because you have made invisible assumptions about memory, concurrency, and what the database can actually do at startup time.
As AI implementations handle more of the routine work, this kind of architectural judgment becomes the differentiator. Knowing when to serialize, when to parallelize, and how to build systems that don't masquerade as simple when they are not matters more than ever.
The Problem: Naive Bulk Loading Doesn't Scale
You have seen this pattern:
List<Record> all = csvParser.parseAll();
repository.saveAll(all);
On a small dataset, it works. With 1M records:
The parser loads everything into heap memory. JVM heap exhaustion. Out of memory errors.
Even if it does fit, one saveAll() call means one database transaction, one lock acquisition, and N individual row inserts.
The database connection pool becomes a bottleneck. Concurrent operations (API requests, other batch jobs) starve.
Startup hangs and your system appears broken.
The real problem is not the data volume. It is that you have made the system's resource constraints invisible. You have pretended the load is simple when it demands coordination.
The Pattern: Event-Driven Batching
The solution is to decompose the load into stages and coordinate them through events.
Stage 1: Parse and Batch
Read the CSV file and accumulate records in memory, but only in chunks. When a chunk reaches a threshold (say, 20,000 records), emit an event and reset the accumulator.
Stage 2: Async Event Processing
Listen for the batch event on a separate, non-blocking thread pool. Process each batch independently. If batch 1 fails, batch 2 can still proceed (or retry, or be logged as failed).
Stage 3: Concurrency Control
Before any batch writes to the database, it acquires a permit from a semaphore. The semaphore limits concurrent database operations to a safe number (e.g., 85 permits when your connection pool has 100 total). This prevents one bulk job from starving other workloads.
The result: Your startup no longer blocks. The load proceeds asynchronously, respects database constraints, and other operations continue normally.
Implementation: The Bootstrap Layer
The entry point is an ApplicationRunner. This hook runs after the Spring context has fully initialized but before ApplicationReadyEvent is published, so readiness-based traffic routing waits for it to return.
@Component
public class BulkDataBootstrap implements ApplicationRunner {
private final BulkDataRepository repository;
private final ApplicationEventPublisher eventPublisher;
private final CsvParser csvParser;
private final RecordMapper mapper;
@Override
public void run(ApplicationArguments args) throws Exception {
if (repository.count() == 0) {
loadData();
}
}
private void loadData() throws IOException {
ClassPathResource resource = new ClassPathResource("data/bulk-data.csv.zip");
try (ZipInputStream zis = new ZipInputStream(resource.getInputStream())) {
ZipEntry entry = zis.getNextEntry();
while (entry != null) {
if (!entry.isDirectory() && entry.getName().endsWith(".csv")) {
parseAndBatch(zis);
}
entry = zis.getNextEntry();
}
}
}
private static final int BATCH_SIZE = 20_000;
private void parseAndBatch(InputStream stream) {
List<Entity> batch = new ArrayList<>();
// Iterate lazily (assuming the parser exposes a streaming iterator);
// parsing the whole file into one List would defeat the streaming goal
for (Record record : csvParser.iterate(stream)) {
batch.add(mapper.toEntity(record));
if (batch.size() >= BATCH_SIZE) {
eventPublisher.publishEvent(new BatchLoadEvent(batch));
batch = new ArrayList<>();
}
}
if (!batch.isEmpty()) {
eventPublisher.publishEvent(new BatchLoadEvent(batch));
}
}
}
Key decisions:
Idempotent guard (count() == 0): Skip loading if data already exists. Safe for re-deployment.
ClassPathResource + ZipInputStream: Load the ZIP from within the JAR. Stream entries without extracting to disk.
Batch size of 20,000: Large enough to amortize database overhead, small enough to fit comfortably in heap.
Events, not direct saves: Decouple the parser from the database. The listener can now be tested independently, and you can add multiple listeners if needed.
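One assumption worth making explicit: the BatchLoadEvent class itself is never shown. A minimal sketch consistent with the listener's event.getRecords() call might look like this, written generically for illustration (the article's code would use the concrete Entity type rather than a type parameter, which also sidesteps generic-event resolution concerns in Spring):

```java
import java.util.List;

// Minimal event carrier for one batch. Generic only for illustration;
// in the article's code the element type would be Entity.
public class BatchLoadEvent<T> {

    private final List<T> records;

    public BatchLoadEvent(List<T> records) {
        // Immutable snapshot: the async listener must never observe
        // mutations made later by the parsing thread.
        this.records = List.copyOf(records);
    }

    public List<T> getRecords() {
        return records;
    }
}
```

Copying the list on construction matters because the event crosses a thread boundary; an immutable snapshot rules out cross-thread mutation by construction.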
Async Processing with Virtual Threads
The event listener runs asynchronously on a dedicated thread pool.
@Component
public class BatchLoadListener {
private final BulkDataRepository repository;
private final TransactionTemplate transactionTemplate;
@Async(value = "bulkLoadExecutor")
@EventListener
@DatabaseSemaphoreGuard
public void handleBatchLoad(BatchLoadEvent event) {
transactionTemplate.executeWithoutResult(status -> {
repository.saveAllAndFlush(event.getRecords());
});
}
}
The executor uses Java 21+ virtual threads:
@Configuration
public class BulkLoadConfig {
@Bean(name = "bulkLoadExecutor")
public AsyncTaskExecutor bulkLoadExecutor() {
SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor();
executor.setVirtualThreads(true);
executor.setThreadNamePrefix("BulkLoad-");
return executor;
}
}
Why virtual threads? They are lightweight threads scheduled by the JVM rather than mapped one-to-one onto OS threads, so they scale to thousands without the overhead of platform threads. For I/O-bound operations (like database inserts), a blocked virtual thread simply parks, allowing many batches to proceed concurrently without exhausting system resources.
I have seen teams spawn hundreds of background tasks and assume they need a massive thread pool. Virtual threads change that assumption. You can now spawn tasks freely without fear of exhausting threads; the scarce resource to guard becomes the database connections, not the threads.
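That claim scales down to a standalone experiment. This sketch (plain Java 21, no Spring; the class name and numbers are illustrative) runs ten thousand blocking tasks on a virtual-thread-per-task executor, something that would be painful with a fixed platform-thread pool:

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadDemo {

    // Runs `count` blocking tasks on virtual threads and returns how many finished.
    static int runBlockingTasks(int count) {
        AtomicInteger completed = new AtomicInteger();
        // try-with-resources: close() waits for all submitted tasks to finish
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < count; i++) {
                executor.submit(() -> {
                    try {
                        Thread.sleep(Duration.ofMillis(10)); // simulate blocking I/O
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    completed.incrementAndGet();
                });
            }
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("Completed: " + runBlockingTasks(10_000));
    }
}
```

Each sleeping task parks its virtual thread instead of occupying an OS thread, which is exactly the behavior the bulk-load listener relies on while a batch waits on the database.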
Concurrency Control: The Semaphore Guard
Multiple async operations can run simultaneously. Without coordination, they will exhaust the database connection pool.
The solution is an annotation-based semaphore guard:
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface DatabaseSemaphoreGuard {
}
An AOP aspect enforces it:
@Aspect
@Component
public class SemaphoreGuardAspect {
private final Semaphore semaphore;
@Around("@annotation(DatabaseSemaphoreGuard)")
public Object guard(ProceedingJoinPoint pjp) throws Throwable {
semaphore.acquire();
try {
return pjp.proceed();
} finally {
semaphore.release();
}
}
}
The semaphore is configured with a safe permit count:
@Bean
public Semaphore databaseSemaphore() {
return new Semaphore(85); // Leave headroom for other operations
}
This is transparent concurrency control. Any method annotated with @DatabaseSemaphoreGuard automatically queues when all permits are in use. The load slows slightly, but the database stays responsive and other workloads do not starve.
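The guard's effect can be observed without Spring or AOP at all. This standalone sketch (class and method names are illustrative) pushes 20 simulated writes through a 3-permit semaphore, using the same acquire/try/finally/release shape as the aspect, and records the peak number running at once:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class SemaphoreGuardDemo {

    // Runs `tasks` simulated writes through a `permits`-wide semaphore
    // and returns the highest number observed in flight at once.
    static int peakConcurrency(int permits, int tasks) {
        Semaphore semaphore = new Semaphore(permits);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        Thread[] workers = new Thread[tasks];
        for (int i = 0; i < tasks; i++) {
            workers[i] = Thread.ofVirtual().start(() -> {
                try {
                    semaphore.acquire(); // surplus tasks queue here
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(20); // simulate a database write
                        inFlight.decrementAndGet();
                    } finally {
                        semaphore.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("Peak concurrency: " + peakConcurrency(3, 20));
    }
}
```

However many tasks you throw at it, the in-flight count never exceeds the permit count; the rest wait in acquire(), which is exactly what happens to surplus batches during the bulk load.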
Structural Integrity: Spring Modulith
Keep the bulk load logic contained within a module. Spring Modulith's @ApplicationModuleListener is itself meta-annotated for asynchronous, transactional event handling, so it stands in for the @Async/@EventListener pair rather than stacking on top of it:
@Component
public class BatchLoadListener {
@ApplicationModuleListener
@DatabaseSemaphoreGuard
public void handleBatchLoad(BatchLoadEvent event) {
// ...
}
}
This prevents accidental cross-module coupling. The event is published within the module; it stays within the module.
Decisions and Why They Matter
Event-driven batching: Decouples parsing from persistence; allows async, independent batch processing. Easier to test.
Batch size = 20,000: Balances amortization (fewer round-trips) with memory footprint (fits safely in heap). Empirically fast.
Virtual threads: Allows many concurrent batches without spawning thousands of OS threads. Startup stays responsive.
Semaphore gating: Prevents one bulk job from starving other database operations. Maintains system responsiveness.
Idempotent initialization: Safe for re-deployment. No need for manual cleanup or schema versioning.
The shift in how engineers should think about bulk operations is clear. Implementation complexity moves from "how do I write the code" (now often handled by AI) to "what architecture prevents this from becoming a footgun later."
This pattern, event-driven batching with asynchronous processing and explicit concurrency control, is not novel. But it is increasingly important, because it is the difference between a system that appears simple on the surface while hiding real complexity, and one that is honest about what it demands.
Stop pretending your bulk loads are simple. Architect them properly. Make concurrency visible. Build in guards. Your future self, and your on-call team, will thank you.
If you work with large-scale data loads or build systems that coordinate multiple concurrent workloads, share this with your team. There is always someone learning this lesson the hard way.