Gap and Island Problems

Writing this blog reminded me that I need a holiday. Anyway, there’s a class of data problem that shows up all over the place:

  • Website analytics tracking consecutive visit streaks
  • Manufacturing systems detecting equipment downtime periods
  • Financial systems identifying continuous trading activity
  • Healthcare databases finding gaps in patient monitoring
  • Network logs grouping consecutive error bursts

On the surface, these seem like different problems. But they’re all the same problem in disguise:

Gaps and Islands

The gaps and islands problem is deceptively simple to describe. You have a sequence of values, usually timestamps or integers, and you need to identify the continuous groups (islands) separated by breaks in the sequence (gaps). A user logged in on days 1, 2, 3, then missed day 4, then logged in again on days 5, 6. That’s two islands of consecutive logins separated by a gap.

Simple concept, surprisingly tricky to implement efficiently because the majority of our data lives in a relational database.

[1,2,3], ,[5,6]

What makes gaps and islands interesting isn’t the problem itself, but what solving it reveals about database thinking. There are at least a half dozen fundamentally different approaches, each showcasing different SQL capabilities and performance characteristics. The approach you reach for first says something about how you conceptualize data problems and which database features you understand deeply. Ok cool, but why do we care and why is the tricky?

Why This Problem Is Hard

The challenge with gaps and islands is that SQL is fundamentally set-oriented, not sequence-oriented. Databases excel at operating on entire sets of rows simultaneously. They struggle with problems that require understanding the relationship between consecutive rows in a sequence. You can’t just say “give me groups of consecutive values” because SQL doesn’t have a native concept of consecutiveness. The underlying storage is not orientated in that manner. See my post on. B-Trees to get a sense of how a relational database orientates its data.

Your first instinct might be to use GROUP BY, since you’re trying to identify groups. But GROUP BY aggregates rows with identical values, not consecutive values. Consecutive days like 1, 2, 3 are different values, so GROUP BY doesn’t help directly. You need to somehow transform the problem so that consecutive sequences share a common grouping key, then GROUP BY that transformed value.

The other challenge is performance. A naive solution might join each row to the next row in the sequence, checking if they’re consecutive. It’s brute force and has a > O(n) time complexity. This works for small datasets but scales poorly. When you have millions of rows, these self-joins become expensive quickly. The database needs to compare each row against many other rows, leading to quadratic or worse complexity. Good luck with that.

Modern SQL provides window functions that operate across sequences of rows, but using them correctly for gaps and islands isn’t that intuitive. You need to think carefully about what information each row needs from surrounding rows to determine which island it belongs to. Different window functions provide different perspectives on the sequence, and choosing the right one is the key to an elegant solution.

The Classic Approach

The most elegant solution to gaps and islands is the difference method, which feels like magic the first time you see it. The insight is that if you subtract the row number from the sequential value, consecutive values produce the same difference while gaps produce different differences. Confused?

Consider a user login sequence of days:

1, 2, 3, 5, 6, 7

When you assign row numbers

(1, 2, 3, 4, 5, 6)

and subtract them from the day numbers, you get

0, 0, 0, 1, 1, 1.

The first island has a consistent difference of 0, the second island has a consistent difference of 1. Now you can GROUP BY that difference to identify islands.

This works because consecutive sequences maintain a constant delta between the sequential value and the row number. When a gap appears, that delta changes. Each island has its own unique delta that serves as its grouping key. It’s conceptually simple once you see it, but arriving at this solution requires a mental leap that isn’t obvious.

The SQL implementation is remarkably concise. You use ROW_NUMBER() to assign sequential row numbers, subtract it from your sequence column, then group by the result. The entire solution is typically under ten lines of SQL, and it performs well because it requires only a single pass through the data with window functions that databases optimize effectively.

WITH numbered AS (
  SELECT 
    user_id,
    login_date,
    login_date - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS island_id
  FROM user_logins
)
SELECT 
  user_id,
  island_id,
  MIN(login_date) AS streak_start,
  MAX(login_date) AS streak_end,
  COUNT(*) AS streak_length
FROM numbered
GROUP BY user_id, island_id

The beauty of this approach is that it scales linearly. The database scans the data once, applies window functions, and groups the results. There are no self-joins, no recursive queries, no complex comparisons. It’s efficient and elegant, which is why it’s become the standard solution for gaps and islands when you’re working with evenly spaced sequences.

When the Classic Approach Breaks Down

The difference method has a hidden assumption: your sequence values are evenly spaced integers or date values. Days are separated by 1, hours by 1, sequential IDs by 1. But real-world sequences aren’t always so clean.

What if you’re tracking equity trading and want to identify periods of continuous trading activity, defining “continuous” as trades within 5 minutes of each other? The difference method doesn’t work because the gaps aren’t uniform.

Consider trades at timestamps

100, 103, 108, 115, 117

The first three are within 5 minutes and should form an island. Then there’s a 7-minute gap to the next trade, starting a new island. The difference method would give you

99, 101, 105, 111, 112

These differences don’t group the timestamps the way you need.

Irregular sequences require different approaches that can handle variable gaps. You need to explicitly identify where gaps occur, mark those boundaries, and then group based on being between the same boundaries. This is conceptually more complex and typically requires multiple passes through the data or more sophisticated window functions.

The LAG Approach

When dealing with irregular sequences, the LAG window function provides a more explicit solution. LAG lets you access the previous row’s value within your window. You can directly compare each row to the previous row to determine if there’s a gap, mark those gap boundaries, and then group accordingly. According to Wikipedia;

Lag windowing is a technique that consists of windowing the autocorrelation coefficients prior to estimating linear prediction coefficients (LPC).

Those with a mathematical inclination may want to explore further, but in laymens terms, the logic is: look at the previous row’s value using LAG. If the current row’s value minus the previous row’s value exceeds your gap threshold, you’re starting a new island. Mark that with a flag. Then do a running sum of those flags to create island identifiers. Each new island increments the sum by one, giving each island a unique identifier.

WITH gaps_marked AS (
  SELECT 
    user_id,
    trade_timestamp,
    CASE 
      WHEN trade_timestamp - LAG(trade_timestamp) OVER (PARTITION BY user_id ORDER BY trade_timestamp) > INTERVAL '5 minutes'
        OR LAG(trade_timestamp) OVER (PARTITION BY user_id ORDER BY trade_timestamp) IS NULL
      THEN 1 
      ELSE 0 
    END AS is_gap
  FROM trades
),
islands_numbered AS (
  SELECT 
    user_id,
    trade_timestamp,
    SUM(is_gap) OVER (PARTITION BY user_id ORDER BY trade_timestamp) AS island_id
  FROM gaps_marked
)
SELECT 
  user_id,
  island_id,
  MIN(trade_timestamp) AS island_start,
  MAX(trade_timestamp) AS island_end,
  COUNT(*) AS island_size
FROM islands_numbered
GROUP BY user_id, island_id

This approach is more verbose than the difference method but more flexible. You can define “gap” however you need. Trades more than 5 minutes apart, log entries from different sessions, temperature readings that differ by more than a threshold. The logic is explicit and understandable, which matters for code that others will maintain.

The performance characteristics are similar to the difference method for moderately sized datasets. You’re still doing a single scan with window functions. However, you’re computing LAG twice in the first CTE Common Table Expressions), which some query optimizers handle better than others. The running sum in the second CTE adds a bit more work, but it’s still linear complexity.

The Recursive CTE Approach

Common Table Expressions with recursion offer another approach that’s particularly useful when you need complex logic for determining island membership or when you want to process islands incrementally.

The recursive approach builds islands one row at a time, explicitly deciding whether each row extends the current island or starts a new one.

The recursive CTE starts with a base case, typically the first row of each sequence. Then it recursively adds rows, maintaining state about the current island. Each iteration can look at the current island’s properties and decide whether the next row belongs to it or starts a new island. This gives you fine-grained control over island formation logic.

This approach is more powerful but also more complex and potentially slower. Recursive CTEs can be hard to reason about, and their performance depends heavily on how well the database optimizes recursion. For simple gaps and islands problems, recursion is overkill. But for problems with complex island membership rules or where you need to compute running statistics as islands form, recursion provides capabilities that other approaches struggle with.

The main downside is that recursive CTEs process data sequentially in logical terms, which doesn’t parallelize as well as set-based operations. If you have a huge dataset that could benefit from parallel processing, the recursive approach might be slower than window function approaches that the database can parallelize across multiple cores.

The Self-Join Approach

Before window functions became standard SQL, the typical approach to gaps and islands involved self-joins. You’d join the table to itself, comparing each row to other rows to identify which ones formed consecutive sequences. This is conceptually straightforward but typically performs poorly.

The self-join approach joins each row to the next row in the sequence, checking if they’re consecutive. Rows that successfully join are part of the same island. Rows that don’t find a next consecutive row are island boundaries. You then need additional logic to group the rows between boundaries into islands.

The performance problem is that self-joins on large tables are expensive. Even with proper indexes, the database needs to probe the index for each row to find matching rows. On a million-row table, you’re potentially doing millions of index lookups. Modern window functions avoid this by processing the sorted data sequentially, which is much more efficient.

That said, self-joins occasionally make sense for very specific scenarios. If your data is partitioned or you’re only analyzing a small subset, and if window functions aren’t available in your SQL dialect, self-joins might be the practical solution. They’re also sometimes easier to understand for people less familiar with window functions, which has value for maintainability.

Finding Gaps – The Inverse Problem

Sometimes you don’t care about the islands themselves but about the gaps between them. You want to identify missing values in a sequence, periods of inactivity, or breaks in data collection. In our Big Brother esque world some employers are tracking mouse movement and keyboard presses. A three hour gap is something they would be keen tonidentify. Finding gaps is actually easier than finding islands, because you’re looking for exceptions rather than grouping continuous sequences.

The typical approach is to join the sequence to a complete reference sequence and identify values that don’t exist in your data. For date sequences, you might generate all dates in a range and left join your data to it, filtering for nulls. For numeric sequences, you might use a numbers table or generate series to create the complete sequence.

WITH all_dates AS (
  SELECT generate_series(
    '2024-01-01'::date,
    '2024-12-31'::date,
    '1 day'::interval
  )::date AS date
)
SELECT d.date AS missing_date
FROM all_dates d
LEFT JOIN user_logins l ON d.date = l.login_date AND l.user_id = 123
WHERE l.login_date IS NULL

Another approach is to use LAG or LEAD to identify where the sequence breaks. If the difference between consecutive values exceeds one (or your expected interval), there’s a gap. You can then calculate the gap size and identify the missing values within it.

The LAG() function is used to get value from the row that precedes the current row

The LEAD() function is used to get value from a row that succeeds the current row

Gap finding is common in data quality checks, monitoring systems, and reporting. You’re not doing complex grouping; you’re simply identifying where expected values are missing. The techniques are simpler than island detection, but they’re closely related problems that often appear together in real applications.

Performance Considerations

The theoretical complexity of gaps and islands algorithms is similar across the main approaches, all roughly linear in the number of rows. But practical performance varies significantly based on how well your database optimizes different SQL constructs and how your data is structured.

Window functions generally perform well because databases have invested heavily in optimizing them. Modern query optimizers can often execute window functions in a single pass through sorted data, which is about as efficient as you can get. The difference method and LAG approach both benefit from these optimizations.

Partitioning matters enormously. If you’re finding islands per user across millions of users, partitioning by user_id in your window functions is critical. This lets the database process each user’s sequence independently, potentially in parallel. Without proper partitioning, you’re forcing the database to consider all rows together, which is much slower.

Indexes on the sequence column are important but not sufficient. The database needs to sort by that column for window functions, so having an index helps. But window functions still need to scan all relevant data; they just do it in sorted order. You can’t skip rows with indexes the way you can with range queries.

The actual gap and island patterns in your data affect performance too. If you have many short islands, you’re grouping many small sets, which is fast. If you have a few enormous islands, you’re grouping huge sets, which takes more memory and time. If gaps are rare, some approaches that explicitly track gaps waste computation. If gaps are common, approaches optimized for continuous sequences might struggle.

Real-World Applications

Gaps and islands problems appear constantly in production systems, often in disguised forms. Session tracking in web analytics is a classic example. You have a stream of user events with timestamps, and you need to group them into sessions, defined as sequences of events within some timeout period of each other. That’s an islands problem with variable gaps.

Manufacturing systems tracking equipment uptime and downtime face this continuously. A machine produces status events while running. Gaps in those events indicate downtime. You need to identify downtime periods, calculate their duration, and potentially trigger alerts. The reliability of your detection logic directly impacts operational efficiency.

Financial systems do this for trading activity, market hours, and transaction patterns. Fraud detection often involves identifying unusual gaps or islands in transaction sequences. A credit card suddenly used continuously after periods of inactivity might indicate fraud. The speed of detection matters because fraudulent transactions accrue costs quickly.

Healthcare monitoring systems track patient vitals continuously. Gaps in data collection might indicate sensor failures or patient disconnection, which could be clinically significant. Identifying these gaps quickly and reliably is literally a matter of life and death in intensive care settings.

Resource scheduling systems need to identify available time slots by finding gaps in reservation schedules. Meeting room booking systems, appointment schedulers, and resource allocation tools all deal with finding gaps where resources are available between periods where they’re reserved.

Database-Specific Considerations

Different database systems have different SQL dialects and optimization characteristics that affect how you should approach gaps and islands. PostgreSQL’s window function implementation is generally excellent, making the difference method and LAG approach highly performant. The generate_series function is invaluable for gap detection.

SQL Server has had window functions for a long time and optimizes them well. The LAG and LEAD functions are efficient, and SQL Server’s query optimizer often produces excellent plans for gaps and islands queries. The ability to create indexed computed columns can help in cases where you’re repeatedly querying for islands.

MySQL historically had weaker window function support, only adding them in version 8. Older MySQL versions required more creative solutions using user variables or self-joins, which were slower and more error-prone. Modern MySQL handles window functions reasonably well, though perhaps not as optimized as PostgreSQL or SQL Server.

Oracle has sophisticated window function capabilities including extensions like MODEL clauses that can simplify complex gaps and islands problems. Oracle’s optimizer is aggressive about rewriting queries, sometimes transforming gaps and islands logic into more efficient execution plans automatically.

Cloud data warehouses like Snowflake, BigQuery, and Redshift all support modern SQL window functions, but their optimization characteristics differ. BigQuery’s columnar storage and massive parallelism means different approaches may perform better than in row-based databases. Testing is important.

Common Mistakes and How to Avoid Them

A frequent mistake is forgetting to partition window functions properly. If you’re finding islands per user but don’t partition by user_id, you’re finding islands across all users simultaneously, which gives wrong results. Always partition window functions by the grouping column that defines independent sequences.

Null handling trips people up regularly. The first row in each partition has no previous row, so LAG returns NULL. If you don’t handle this explicitly, your gap detection logic might incorrectly mark the first row. Always account for NULLs from LAG when identifying gaps.

Off-by-one errors are common when defining what constitutes a gap. If you want days to be considered consecutive, the difference should be exactly 1. But if you’re using timestamps, you need to account for time zones, daylight saving, and the granularity of your data. A gap of 24 hours isn’t always one day.

Performance testing with realistic data volumes is critical. A query that works fine on a thousand rows might be unusably slow on a million rows. The approach that seems most elegant might not be the most performant. Always test with production-scale data before committing to an implementation.

Advanced Variations – When Simple Isn’t Enough

Some gaps and islands problems have additional constraints that complicate the basic approaches. You might need islands that meet certain criteria, like minimum length or maximum gap size. You might need overlapping islands based on different definitions. You might need to track additional attributes of islands beyond just their boundaries.

Conditional islands are common: find sequences of increasing values, or consecutive days where some condition holds. The basic difference method doesn’t work because the sequence itself is conditional. You typically need to first filter or flag qualifying rows, then apply gaps and islands logic to only those rows while maintaining sequence order.

Multi-dimensional islands extend the concept to multiple sequences simultaneously. Find time periods where multiple conditions hold concurrently across different data streams. These problems often require joining multiple gaps and islands results or applying the logic across multiple columns with coordinated window functions.

Hierarchical islands involve nested sequences, like finding consecutive weeks that contain consecutive days of activity. You solve the inner islands problem, then treat the island IDs as a sequence and solve islands at the next level. These problems require multiple passes through the data with different grouping and ordering.

The Mental Model – How to Think About Sequence Problems

What makes someone good at solving gaps and islands problems isn’t memorizing specific SQL patterns. It’s developing a mental model for sequence-based problems that lets you adapt approaches to specific situations. You need to think about what information each row needs to determine its island membership and how to efficiently provide that information.

The key insight is that island membership is determined by relationships to surrounding rows, not by absolute values. A row belongs to an island based on how it relates to the previous row, the next row, or both. Window functions exist to give each row access to information about its neighbors efficiently.

When you encounter a gaps and islands problem, the questions to ask are: What defines consecutive in this context? What information from surrounding rows does each row need to determine its island? Can I transform the problem into one where consecutive rows share a common attribute? If not, can I mark island boundaries explicitly and group based on those markers?

Practice helps enormously. Once you’ve solved gaps and islands problems a few times, you start recognizing the patterns in new problems. That user engagement question is islands. That data quality issue is gaps. The solution space becomes familiar, and you spend less time figuring out whether a problem is solvable and more time choosing the best approach.

Summary

Gaps and islands problems are everywhere in data work, and knowing how to solve them efficiently is a fundamental skill. The difference method is elegant and performant for uniform sequences. The LAG approach handles variable gaps with explicit logic. Recursive CTEs provide maximum control for complex rules. Self-joins are occasionally practical despite performance drawbacks.

The best approach depends on your specific situation: the database you’re using, the characteristics of your data, the complexity of your gap definition, and the performance requirements of your query. There’s no universally best solution, which is part of what makes these problems interesting.

Understanding multiple approaches and their trade-offs lets you choose appropriately for each situation. More importantly, it develops your ability to think about sequence-based problems in SQL, which extends beyond gaps and islands to all sorts of time-series analysis, trend detection, and sequential pattern recognition.

These problems reveal how you think about data because they require bridging the gap between SQL’s set-based nature and the sequential reality of many business problems. Getting good at this bridge is what separates data professionals who struggle with temporal queries from those who handle them fluently.

Discover more from Where Data Engineering Meets Business Strategy

Subscribe now to keep reading and get access to the full archive.

Continue reading