Skip to main content

Command Palette

Search for a command to run...

Detecting Connections As They Form: An Introduction to Streaming Graph Pattern Matching

Updated
β€’14 min read
Detecting Connections As They Form: An Introduction to Streaming Graph Pattern Matching

In my previous posts, I explored how SurrealDB blends document, graph, relational, and vector models into a single engine, and how you can combine it with Rig.rs to build LLM-native agents in Rust. I've written about SurrealDB more than once because I genuinely believe it's one of the most exciting databases out there right now β€” and the more I use it, the more I find myself reaching for it as the obvious foundation whenever I need something flexible, expressive, and Rust-native. Its graph capabilities β€” RELATE, record IDs, graph traversals β€” kept nagging at me with a question: what if you could continuously watch a graph as it grows and fire off actions the moment a meaningful pattern completes?

That question sent me down a rabbit hole. This post is the first in a series where I'll walk you through the concept, the configuration I'd want for such a tool, the actual implementation, and where things could go from there.

Let's start at the beginning. πŸ”


πŸ•ΈοΈ What Is a Labeled Property Graph?

If you've read my SurrealDB deep-dive, you already know that SurrealDB natively models data as a labeled property graph. But let's make it concrete for this series.

A labeled property graph has two building blocks:

  • Vertices β€” entities with a type and properties. Think person:clint_eastwood with a name field, or movie:unforgiven with a title and year.

  • Edges β€” directed, typed relationships between vertices. Think person:clint_eastwood -> acted_in -> movie:unforgiven.

Edges are first-class citizens: they have their own type (e.g., acted_in, directed) and can carry their own properties (e.g., the role a person played).

This model is perfect for representing the kind of messy, interconnected real-world data that relational tables struggle with β€” social networks, supply chains, knowledge graphs, movie databases.

If you want to get a feel for how expressive SurrealDB's query language gets when working with this kind of data, take a look at the idioms page in the docs. It's a compact showcase of what SurrealQL can do β€” and honestly, reading through it is a big part of what got me thinking about building this in the first place.


⚑ What Is Streaming Graph Pattern Matching?

A graph pattern is a template describing a subgraph you care about. For example:

"Find any person who both acted in and directed the same movie."

In a static graph, you'd run a query once and get your results. But what if the graph is continuously growing? Edges arrive one by one as events stream in. You want to know the moment that pattern becomes complete β€” not minutes later, not after a full scan.

That's streaming graph pattern matching: detecting when a pattern is satisfied in real time, incrementally, as new edges are added.

The classic use cases are things like:

  • Fraud detection β€” flag the moment a money flow completes a suspicious cycle

  • Recommendation engines β€” create a "you might also like" edge the second two users share enough common favorites

  • Knowledge graph enrichment β€” derive new facts automatically as raw data arrives

  • Event correlation β€” detect that a sequence of system events matches a known failure signature

The challenge is doing this without re-scanning the entire graph every time a single edge lands.


🎬 A Running Example: The Movie Database

Throughout this series I'll use a movie dataset. It's a flat CSV file where each row is tagged with an Entity type:

Entity,tmdbId,movieId,name,Work,role
Person,190,,Clint Eastwood,,
Person,192,,Morgan Freeman,,
Movie,,33,Unforgiven,,
Join,190,33,,Acting,William Munny
Join,192,33,,Acting,Ned Logan
Join,190,33,,Directing,

A few things to note:

  • Person rows describe people.

  • Movie rows describe films.

  • Join rows describe relationships β€” acting roles, directing credits β€” linking people to movies.

This flat structure is typical of real-world export formats. The goal is to lift it into a proper graph and then continuously match patterns against it.

The pattern I want to detect: a person who both acted in and directed the same movie. Clint Eastwood fits. Morgan Freeman does not.


πŸ—‚οΈ Imagining the Configuration

If I were to design a tool for this, I'd want the entire thing controlled by a single YAML configuration file. Two top-level sections: sources to describe how raw data becomes a graph, and patterns to express what I'm looking for in that graph. Let me walk through both in detail.

Sources β€” Loading the Graph

The sources section describes where your data comes from and how to turn each row into graph elements. Each source entry points to a file and defines a format block with two optional scripts: filter and load.

For the scripting language, I'd reach for VRL β€” Vector Remap Language β€” a safe, sandboxed expression language originally built for the Vector observability pipeline. It's expressive enough for real data wrangling, and it could be extended with a handful of custom functions to express graph operations directly.

Here's what a complete sources block would look like for the movie dataset:

sources:
  # --- Source 1: Load Person vertices ---
  - type: file
    path: $movie_file          # resolved at runtime via --variable movie_file=...
    format:
      filter: msg.Entity == "Person"   # only process rows where Entity is "Person"
      load: |
        .id = record_id!("person", [msg.tmdbId])   # build a typed record ID: person:190
        .name = msg.name
        + @                            # mark this object as a vertex to be created

  # --- Source 2: Load Movie vertices ---
  - type: file
    path: $movie_file
    format:
      filter: msg.Entity == "Movie"
      load: |
        .id = record_id!("movie", [msg.movieId])   # e.g. movie:33
        .title = msg.name
        + @

  # --- Source 3: Acting edges ---
  - type: file
    path: $movie_file
    format:
      filter: msg.Entity == "Join" && msg.Work == "Acting"
      load: |
        .id = record_id!("acted_in", [msg.tmdbId, msg.movieId, msg.role])
        .role = msg.role
        .person.id = record_id!("person", [msg.tmdbId])
        .movie.id  = record_id!("movie",  [msg.movieId])
        + @person->@->@movie    # create the edge; the root object becomes the edge record

  # --- Source 4: Directing edges ---
  - type: file
    path: $movie_file
    format:
      filter: msg.Entity == "Join" && msg.Work == "Directing"
      load: |
        .person.id = record_id!("person", [msg.tmdbId])
        .movie.id  = record_id!("movie",  [msg.movieId])
        + @person->directed->@movie    # anonymous edge β€” no edge record object

Let me unpack each key concept.

filter β€” Row-Level Gating

filter: msg.Entity == "Person"

This is a VRL boolean expression evaluated against each incoming row. If it returns false, the row is skipped entirely. The raw CSV row is available as msg. Multiple sources can read the same file with different filters β€” that's intentional. The file would be read once and fanned out to all matching sources in parallel.

record_id!() β€” Typed Identifiers

.id = record_id!("person", [msg.tmdbId])
# Produces: person:190

This custom VRL function would build a SurrealDB-style record ID from a table name and a list of key components. If you pass multiple components, they're combined into a composite key. The result is a strongly typed identifier like person:190 or acted_in:[190, 33, "William Munny"]. This ensures that re-ingesting the same data is idempotent β€” the same row always produces the same ID.

+ @ β€” Creating a Vertex

.id = record_id!("person", [msg.tmdbId])
.name = msg.name
+ @          # "upsert me as a vertex"

The @ symbol refers to the current root object β€” the thing you've been building up with .field = value assignments. Prefixing it with + marks it for creation in the graph.

One important thing to be explicit about: nested objects are not pushed automatically. If your root object has a .person sub-object, doing + @ only persists the root β€” .person is ignored. To push a nested object as its own vertex, you have to say so explicitly with + @person. This keeps the behaviour predictable and avoids accidentally creating vertices you didn't intend.

You can also push multiple nested objects in the same script β€” + @movie, + @genre, and so on β€” each as a separate vertex, containing only its own fields.

+ @from->edge->@to β€” Creating an Edge

.person.id = record_id!("person", [msg.tmdbId])
.movie.id  = record_id!("movie",  [msg.movieId])
+ @person->directed->@movie

This creates a directed edge between two vertices. The vertices themselves don't have to be created in the same source entry β€” they can come from a different one. What matters is that @person and @movie resolve to objects with a valid .id field at execution time.

Worth knowing: an edge can be written to the database even if the vertices it points to don't exist yet. The edge record is created regardless, but any pattern that tries to traverse it will come back empty until both endpoints are actually present. This is why source order doesn't have to be strict β€” edges landing before their vertices are harmless, they just won't produce matches until the graph is complete enough to satisfy the pattern.

If you want the edge itself to carry properties (like a role name), you use the root object as the edge record:

.id    = record_id!("acted_in", [msg.tmdbId, msg.movieId, msg.role])
.role  = msg.role
.person.id = record_id!("person", [msg.tmdbId])
.movie.id  = record_id!("movie",  [msg.movieId])
+ @person->@->@movie    # @ in the middle = "use the root object as the edge record"

Direction is explicit: -> means left-to-right, <- means right-to-left. You can chain them:

+ @movie<-acted_in<-@person->directed->@movie
# Two edges created from a single line

Patterns β€” Matching the Graph

The patterns section is where things get interesting. Each pattern entry has a match field containing a small DSL that describes the subgraph you're looking for.

patterns:
  - match: |-
      /
        m:movie<-acted_in<-p:person->directed->m:movie
        + p->acted_and_directed->m
      / -> m as movieId, m.title as movie, p as personId, p.name as actor

Let me break this apart piece by piece.

Aliases and Join Conditions

m:movie<-acted_in<-p:person->directed->m:movie
  • m:movie means: match any vertex from the movie table and call it m.

  • p:person means: match any vertex from the person table and call it p.

  • The same alias used in multiple places means the same vertex. Here m appears on both sides β€” so the movie that p acted in must be the exact same movie that p directed. That's your join condition, expressed naturally.

This single-line form is just a convenience. You can spread the same pattern across multiple lines and the meaning is identical β€” each line describes one edge, and shared aliases still enforce the join. All three of these are equivalent:

p:person->acted_in->m:movie
p:person->directed->m:movie
m:movie<-acted_in<-p:person
m:movie<-directed<-p:person
m:movie<-acted_in<-p:person
p:person->directed->m:movie

Pick the direction that reads most naturally for the relationship you're describing. The engine doesn't care β€” it resolves the aliases and finds the join either way.

This is the heart of the pattern language. You describe structure; the engine figures out the join.

Actions β€” Deriving New Edges

+ p->acted_and_directed->m

When the full pattern is satisfied, this line fires. It creates a new acted_and_directed edge between the matched p and m vertices.

Duplicate prevention is built into the ID scheme: the derived edge gets a composite record ID based on the two endpoint vertices β€” something like acted_and_directed:[person:190, movie:33]. That ID is always the same for the same pair, so re-triggering the match (say, because new edges arrived and the checker ran again) produces the exact same record rather than a second one. The operation is naturally idempotent.

Output Projections

/ -> m as movieId, m.title as movie, p as personId, p.name as actor

After the closing /, you declare what gets emitted when a match is found. You can project full vertices (m as movieId) or individual fields (m.title as movie). Omit this section entirely and all matched aliases come back.

More Pattern Expressiveness

The DSL could support more than simple linear paths. A few things worth being able to express:

Multiple edge types β€” match either rated or disliked:

p:person->(rated, disliked)->m:movie

Wildcard edge β€” match any edge type between two vertices:

p1:person->?->p2:person

Edge filters β€” only match edges that satisfy a property condition:

p:person->directed[WHERE award = "Oscar"]->m:movie

Vertex filters β€” only match vertices satisfying a condition:

p:person[WHERE birth_date >= "2000-01-01"]->acted_in->m:movie

Multiple independent patterns in a single block β€” all must be satisfied for the action to fire:

/
  p:person->acted_in->m:movie
  p:person->directed->m:movie
  + p->acted_and_directed->m
/ -> m.title as movie, p.name as actor

πŸ”„ What Would It Look Like to Run This?

At a high level, the tool would:

  1. Read the configuration file and resolve any $variable placeholders passed on the command line.

  2. Stream data from each source into the graph β€” filtering rows, transforming them into vertices and edges via the load scripts, and persisting everything to the database.

  3. Watch the graph for pattern completions β€” as edges land, the tool continuously checks whether any of the declared patterns are now fully satisfied.

  4. Fire actions and print results immediately β€” the moment a match is found, the + edges are created and the projected output is flushed to stdout right away, without waiting for ingestion to finish.

This last point matters: matches don't accumulate and print at the end. They appear as soon as they're found, interleaved with ongoing ingestion. If you're streaming a large dataset, you'd start seeing output long before the last row is processed.

For the movie example, as soon as both the acted_in and directed edges involving Clint Eastwood and Unforgiven are present, the pattern is satisfied and a line appears on stdout. Morgan Freeman only acted, so he never completes the pattern β€” no false match, no output for him.

The system is also tolerant of edges arriving out of order. If a directed edge lands before the corresponding acted_in edge exists yet, the check is retried automatically once the missing piece shows up. You don't have to pre-sort your data or reason about arrival order.

The key design intent is that you never have to write a query. You declare the shape of what you're looking for and let the tool figure out when it exists.


🧠 Why This Configuration Style?

A few deliberate design choices worth calling out:

  • One file per source, multiple sources per file. The same CSV can produce vertices, edges, and genre tags all at once β€” by duplicating the source entry with different filters. No preprocessing step required.

  • VRL for transformations. It's expressive enough to handle type coercions, array splits, conditional logic, and nested structures, while being safe and sandboxed. No need to invent yet another mini-language for data wrangling.

  • Aliases as join keys. Using the same alias name to enforce vertex identity across pattern lines is a small syntax choice that turns out to be remarkably readable. The pattern says what you mean.

  • Declarative actions. The + syntax for derived edges mirrors the + syntax in the load scripts. Once you understand one, you understand the other.

  • Idempotent by construction. Derived edges get a deterministic ID from their endpoints, so running the same match twice is harmless. No deduplication logic to write, no risk of polluting the graph with duplicate relationships.

  • Results stream, they don't batch. Output is flushed to stdout the moment a match is confirmed. You can pipe the output, process it downstream, or just watch it in a terminal β€” it behaves like any other streaming Unix tool.


πŸ‘‰ TL;DR

Streaming graph pattern matching is about detecting when a meaningful subgraph completes β€” incrementally, as edges arrive, without scanning the whole graph.

A YAML config maps raw data to graph elements via VRL, and a declarative DSL expresses the patterns you care about. When a pattern matches, derived edges are created automatically.

In the next post, I'll show you the actual implementation: the tech stack, how the pieces fit together under the hood, and what the database looks like once everything is loaded. Stay tuned. πŸ› οΈ