When asked whether a graph represents structured or unstructured data, the answer might not be immediately clear. At first glance, graphs appear unstructured. They lack fixed topologies and can exhibit various configurations, including hierarchical, cyclic, scale-free, sparse, or dense. Efforts have been invested in visualizing graphs using 2D and 3D layouts to enhance understanding.
However, the notation of a graph as G = (V, E), a collection of nodes (vertices) and edges, implicitly introduces a structured framework. Connectedness lies at the heart of graphs. Traversing a graph through relationships forms the core of sophisticated graph analysis. The depth of connections between entities (nodes) provides different dimensions of information. For instance, if node A can reach node B within 1, 10, or 20 steps, each path involves a different number and mix of intermediate players (nodes and edges), yielding unique analytical insights.
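To make the idea of connection depth concrete, here is a minimal Python sketch of a breadth-first search that reports how many hops separate two nodes. It is only an illustration: the function name `hop_distance`, the `max_hops` parameter, and the toy edge list are assumptions made for this example, and it says nothing about how any graph database traverses data internally.

```python
from collections import deque

def hop_distance(edges, source, target, max_hops=None):
    """Return the fewest hops from source to target, or None if the target
    is unreachable (or not reachable within max_hops). Edges are undirected."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_hops is not None and depth >= max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return depth + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return None

# Toy graph (hypothetical data): A - C - D - B, plus a direct A - E edge
edges = [("A", "C"), ("C", "D"), ("D", "B"), ("A", "E")]
print(hop_distance(edges, "A", "B"))              # shortest path A-C-D-B
print(hop_distance(edges, "A", "B", max_hops=2))  # B lies beyond the limit
```

Raising or lowering `max_hops` changes which intermediate nodes and edges take part in the answer, which is the sense in which depth adds a dimension to the analysis.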
In the era of AI, advanced models excel at processing unstructured data such as audio, images, and videos. However, even in this context, structured frameworks and algorithms remain essential for extracting meaningful insights. Whether dealing with structured or unstructured data, AI models rely on underlying structures.
A state-of-the-art graph system should encompass two critical components that are highly relevant to graph DBMS users/DBAs:
- Structured Storage: This involves a logical yet agile schema design.
- Powerful Computing Capabilities: These capabilities enable deep link analytics and performant data processing.
Ultipa addresses these requirements through its “demi-schema” design. In this post, let’s take a closer look at why this approach accelerates graph processing.
The Polarization
Under the longstanding reign of traditional SQL databases, the concept of schema has been a double-edged sword. On the one hand, a clearly defined schema (tables, the columns within each table, and constraints) provides excellent organization and governance over data. On the other hand, it adapts poorly to the ever-evolving diversity and scope of data, not to mention the rigidity of tables. The gist is that although the ER model itself is high-dimensional and represents the meta form of tabular data, RDBMS/SQL implementations are low-dimensional, and that mismatch has caused long-lasting problems.
Figure 1. Schema in SQL databases
Entering the world of NoSQL databases, which tend to champion varying levels of flexibility when it comes to schema management, you’ve likely encountered the buzzword “schema-less.” However, it’s crucial to clarify that “schema-less” doesn’t imply the complete absence of schema. While a schema-less NoSQL database theoretically allows you to write data without an upfront schema declaration, some form of schema or data model is still necessary during read operations. This schema might be managed by the database management system (DBMS) or defined at the application level.
Taking the document database MongoDB as an example, it is often hailed as schema-less. Imagine we have a collection named “user” that stores user profiles in the database. In MongoDB, there are no structural restrictions imposed on the documents within each collection. Consequently, we might encounter two user documents that appear vastly different:
// Document 1
{
  "_id": ObjectId("61e268da16a7ad8a843605d4"),
  "name": "John Doe",
  "age": 30
}
// Document 2
{
  "_id": ObjectId("61e268da16a7ad8a843605d5"),
  "first_name": "Joe",
  "last_name": "Yeoh",
  "address": {
    "country": "USA",
    "city": "New York"
  }
}
However, when reading data from the “user” collection, the applications that consume this data must interpret the retrieved documents and dynamically infer their schema. An application therefore expects each user document to contain an “_id” field and to optionally include fields like “name,” “first_name,” “last_name,” “age,” and the nested “address” object.
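As a rough illustration of what schema-on-read means for the consuming application, the Python sketch below maps both user documents onto a single shape. It works on plain dictionaries rather than a real MongoDB driver, and the function name `normalize_user` and the output field names are assumptions chosen for this example.

```python
def normalize_user(doc):
    """Map a raw user document onto one shape the application can rely on."""
    if "_id" not in doc:
        raise ValueError("user document is missing the required _id field")
    # Accept either a single "name" or a first_name/last_name pair.
    name = doc.get("name")
    if name is None:
        parts = [doc.get("first_name"), doc.get("last_name")]
        name = " ".join(p for p in parts if p) or None
    address = doc.get("address") or {}
    return {
        "id": str(doc["_id"]),
        "name": name,
        "age": doc.get("age"),
        "country": address.get("country"),
        "city": address.get("city"),
    }

# The two documents from above, with _id written as a plain string for brevity
doc1 = {"_id": "61e268da16a7ad8a843605d4", "name": "John Doe", "age": 30}
doc2 = {
    "_id": "61e268da16a7ad8a843605d5",
    "first_name": "Joe",
    "last_name": "Yeoh",
    "address": {"country": "USA", "city": "New York"},
}

for doc in (doc1, doc2):
    print(normalize_user(doc))
```

The schema has not disappeared here. It has moved out of the database and into application code, where every consumer has to agree on it.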
In the realm of Ultipa graph database, we’ve introduced the term “demi-schema” - a concept that underscores the unique fusion of schema-full and schema-less features inherent in our approach to constructing, storing, and querying graph data. This innovative approach ensures optimal performance and flexibility.
Structuring Graphs: A Framework with Schemas
A "schema" in Ultipa resembles a "table" in relational databases, but with the relationships directly stored (rather than joined) as edges. Therefore, Ultipa features both node schemas and edge schemas. Figure 2 illustrates an example graph structure corresponding to Figure 1. Additionally, each node or edge schema can be linked with a set of key-value properties (attributes) that describe it further.
Figure 2. Graph structure in Ultipa
When creating this graph structure with the UQL shown below, there are two important points to note. First, properties are defined within the context of a particular schema and cannot exist independently of it. Second, Ultipa does not enforce constraints on which edge schemas are allowed between two node schemas (or within the same node schema), nor does it limit the direction in which the edge should start and end.
For example, while @Places edges (the symbol @ denotes a schema in Ultipa) are expected to appear only between @User and @Order nodes, and from the former to the latter, Ultipa doesn’t mandate this limitation. Such restrictions can be implemented at the application level when writing data to Ultipa to maintain the desired graph structure.
// Create node and edge schemas
create()
.node_schema("User")
.node_schema("Order")
.node_schema("Product")
.node_schema("ProductCategory")
.edge_schema("Places")
.edge_schema("Contains")
.edge_schema("HasReview")
.edge_schema("BelongTo")
.edge_schema("Writes")
// Create properties for node and edge schemas
create()
.node_property(@User, "Name", string)
.node_property(@User, "Email", string)
...
.edge_property(@Contains, "UnitPrice", double)
.edge_property(@Contains, "TotalPrice", double)
This initial definition of schemas can then be utilized to accept inserted node and edge data. In the following UQL, the target (one and only one) schema is declared in the into() parameter, and you may include all or part of the property values in each node or edge object.
// Insert a @User node
insert().into(@User).nodes({Email: "[email protected]", RegistrationDate: "2022-10-2 12:30:25"})
// Insert two @Writes edges
insert().into(@Writes).edges([{_from: "User123", _to: "Review6756"}, {_from: "User123", _to: "Review6770"}])
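Because the database itself does not restrict which node schemas an edge may connect, an application that wants such a rule has to check it before issuing the insert. The Python sketch below shows one possible shape for that check. The rule table, the function name `check_edge`, and the @Contains rule are assumptions for illustration; only the @Places rule (from @User to @Order) comes from the text above.

```python
# Hypothetical rule table: edge schema -> (source node schema, target node schema)
EDGE_RULES = {
    "Places": ("User", "Order"),       # stated in the text above
    "Contains": ("Order", "Product"),  # assumption for illustration only
}

def check_edge(edge_schema, from_schema, to_schema, rules=EDGE_RULES):
    """Return True if the edge's endpoints and direction match the rule table."""
    expected = rules.get(edge_schema)
    if expected is None:
        return True  # no rule registered: allow it, as the database would
    return (from_schema, to_schema) == expected

# A @Places edge from a user to an order passes; the reverse direction does not.
print(check_edge("Places", "User", "Order"))
print(check_edge("Places", "Order", "User"))
```

An application would run a check like this on each edge object and only then send the corresponding insert() statement.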
Ultipa embraces explicit schema and property design for the benefits it brings to DBAs, developers, and everyone else involved in data processing. We believe that clarity in organizing and managing data ultimately promotes consistency, interoperability, and security, not to mention the efficiency gained when converting data between serialized disk storage formats and graph representations.
In comparison, schema-less database systems typically don’t require, or even support, explicit data modeling. The paradox is that a data model is still needed for effective data management. That’s why such systems offer other mechanisms for grouping or categorizing data, such as labels (a special form of index) in Neo4j. However, to avoid pre-defining a structure, they may couple data creation with data modeling.
For instance, the following Cypher query creates a node with the Person label and with the Name property set to a value:
CREATE (:Person {Name: "John Doe"})
The next one, though, creates another node with the person label and with the name property set to a value:
CREATE (:person {name: "Joe Ann"})
Since the names of both labels and properties are case sensitive, two different labels and two different properties are created. Without a written data model, the DBMS has nothing to validate the names against. Consequently, any unintentional deviation from naming conventions could result in data discrepancies.
Careful consideration is essential. While the schema-less structuring offers some flexibility, it can also be vexing.
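One way a team might catch this kind of drift is to audit the label and property names in use for spellings that differ only by letter case. The Python sketch below does that over hard-coded lists taken from the two Cypher statements above. In practice the names would come from the database's own metadata, and the function name `find_case_variants` is an assumption for this example.

```python
from collections import defaultdict

def find_case_variants(names):
    """Group names that differ only by letter case, e.g. Person vs person."""
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    # Keep only the groups that contain more than one spelling.
    return {key: sorted(variants)
            for key, variants in groups.items() if len(variants) > 1}

labels = ["Person", "person", "Order"]
property_keys = ["Name", "name", "Email"]

print(find_case_variants(labels))
print(find_case_variants(property_keys))
```

A check like this only reports the problem after the data has been written. An explicit schema rejects the stray name at write time.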
Manipulating Graphs: Schema-Agnostic Operations
Data modeling is never a one-time task, regardless of the database platform being used. Schema changes in SQL systems have long been painful or even risky. Depending on the application layer above and the type of change, structural evolution in Ultipa, or in any other graph database, can also become mission-critical.
Nevertheless, modifying schemas and their associated properties, or even completely changing the nature of the data (such as transforming a property into a node schema), can be accomplished more straightforwardly within Ultipa. This is largely attributed to the intuitive modeling process inherent in graph databases. You no longer need to "convert" your mental (or physical) whiteboard sketches into tabular representations, which require join tables and add extra complexity.
Ultipa provides a comprehensive suite of DDL and DML UQLs to facilitate structural and data changes, all without any downtime. This is where the demi-schema feature comes into play: a dynamic and fluid mechanism that allows an operation to focus on a single schema or span multiple schemas.
Here are some examples of UQLs for property management, data updating and deletion:
// Create a property for the node schema User
create().node_property(@User, "Password", int32)
// Create a property for all node schemas
create().node_property(@*, "level", int32)
// Rename a property in every node schema that has it
alter().node_property(@*.level).set({name: "Level"})
// Update the time property of edges in any schema whose weight exceeds 2
update().edges({weight > 2}).set({time: dateAdd(time, 1, "day")})
// Delete nodes in any schema whose tag property equals "cloud"
delete().nodes({tag == "cloud"})
This approach fundamentally alters the way data manipulation is performed compared to traditional SQL databases, where operations are confined to individual tables. It empowers users to conduct bulk operations across diverse schemas with minimal complexity, thereby enhancing administrative efficiency.
Querying Graphs: Maximizing Efficiency with Flexibility
The demi-schema feature also extends to querying data from the graph. On the one hand, expressions like @User.RegistrationDate, which specify the exact property of a schema, allow you to retrieve certain nodes or edges within the graph:
// Retrieve users registered before 2022
find().nodes({@User.RegistrationDate < "2022-1-1"}) as n
RETURN n{*}
In scenarios where properties with identical names coexist across different schemas, such as the Name property under @User, @Product, and @ProductCategory in Figure 2, you also have the option to target nodes or edges throughout all schemas that possess the property:
// Retrieve any node, in any schema that has a Name property, whose Name is "clark"
find().nodes({Name == "clark"}) as n
RETURN n{*}
While these queries may be taken for granted, keeping them performant is daunting on industrial-scale graphs with millions to billions of nodes and edges. Under the hood, Ultipa tackles this challenge by optimizing schema-less filters: it adds every schema that contains the given property back into the query.
That is to say, the filter {Name == "clark"} is parsed and optimized as {@User.Name == "clark" || @Product.Name == "clark" || @ProductCategory.Name == "clark"} before the UQL is sent to the computing engine to fetch data. This strategy leads to substantial traversal time savings, potentially up to several orders of magnitude. Consider a graph with 40 million nodes in total, of which only 20 million are @User, @Product, and @ProductCategory nodes. The optimized filter scans 20 million nodes instead of 40 million, saving roughly 50% of the traversal cost if that cost scales with the number of nodes scanned.
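The rewrite described above can be modeled in a few lines of Python. This is a toy sketch of the idea, not Ultipa's actual optimizer: the `catalog` dictionary, the function name `expand_filter`, and the OrderDate property are assumptions made for illustration.

```python
def expand_filter(prop, value, schema_properties):
    """Rewrite a schema-less equality filter as a union over every schema
    that defines the property."""
    owners = [schema for schema, props in schema_properties.items()
              if prop in props]
    return " || ".join(f'@{schema}.{prop} == "{value}"' for schema in owners)

# Hypothetical catalog of node schemas and their properties
catalog = {
    "User": {"Name", "Email"},
    "Order": {"OrderDate"},
    "Product": {"Name"},
    "ProductCategory": {"Name"},
}

print(expand_filter("Name", "clark", catalog))

# Registering a new schema with the same property extends the rewrite
# without any change to the original schema-less filter.
catalog["Supplier"] = {"Name"}
print(expand_filter("Name", "clark", catalog))

# Share of nodes skipped in the 40-million-node example above
print(1 - 20_000_000 / 40_000_000)
```

Since @Order has no Name property, its nodes never enter the scan. The catalog lookup is only possible because every property is declared under a schema.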
Going a step further, when indexes are created for the properties @User.Name, @Product.Name, and @ProductCategory.Name, query execution can be accelerated even more. Figure 3 shows the results of a schema-less query on a dataset with over 373 million nodes and 243 million edges, with the filtering property indexed. This query took only 52ms, which translates to traversing over 7 million nodes per millisecond.
Figure 3. A schema-less query performed in Ultipa Manager
As you may have already gathered, Ultipa’s optimization of schema-less searches is a direct result of its clear schema-property structure definition. In graph database systems that are purely schema-less, by contrast, such flexible queries, though possible, have no option but to scan the entire dataset, leading to significant resource overhead.
For example, the Neo4j Cypher query below is meant to retrieve any node, under any label, whose Name property has the value "clark". Since no label is specified, the query must read all nodes, and for the same reason no index can be utilized.
MATCH (n {Name: "clark"})
RETURN n
The workaround generally suggested is to manually add all labels that have the Name property, to narrow down the search scope and enable index usage:
MATCH (n:User|Product|ProductCategory {Name: "clark"})
RETURN n
Unfortunately, every time a new label containing this property is added, you have to go back to all of these queries and modify them, which can be cumbersome and error-prone.
In contrast, Ultipa performs implicit union operations behind the scenes, so the query remains consistent even when new schemas containing that very property are added. This capability accommodates evolving data models without sacrificing query performance or requiring manual intervention to update queries.
Bottom Line
Ultipa’s demi-schema approach deftly navigates the fine line between the inflexibility of traditional schemas and the fluidity of schema-less architectures. By doing so, Ultipa creates an environment that is both adaptable and responsive, fueling innovation and expediting organizations’ time-to-insight.