Overview
Cosine similarity treats the data objects in a dataset as vectors and uses the cosine of the angle between two vectors to indicate how similar they are. In a graph, N numeric node properties (features) are specified to form an N-dimensional vector for each node; two nodes are considered similar if their vectors are similar.
Cosine similarity ranges from -1 to 1: 1 means that the two vectors point in the same direction, 0 means that they are orthogonal, and -1 means that they point in opposite directions.
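In 2-dimensional space, the cosine similarity between vectors A(a1, a2) and B(b1, b2) is computed as:
$$\cos\theta = \frac{A \cdot B}{|A|\,|B|} = \frac{a_1 b_1 + a_2 b_2}{\sqrt{a_1^2 + a_2^2}\,\sqrt{b_1^2 + b_2^2}}$$
In 3-dimensional space, the cosine similarity between vectors A(a1, a2, a3) and B(b1, b2, b3) is computed as:
$$\cos\theta = \frac{A \cdot B}{|A|\,|B|} = \frac{a_1 b_1 + a_2 b_2 + a_3 b_3}{\sqrt{a_1^2 + a_2^2 + a_3^2}\,\sqrt{b_1^2 + b_2^2 + b_3^2}}$$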
The following diagram shows the relationship between vectors A and B in 2D and 3D spaces, as well as the angle θ between them:
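Generalized to N-dimensional space, the cosine similarity between vectors A(a1, ..., aN) and B(b1, ..., bN) is computed as:
$$\cos\theta = \frac{A \cdot B}{|A|\,|B|} = \frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^2}\,\sqrt{\sum_{i=1}^{N} b_i^2}}$$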
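The N-dimensional definition can be sketched in a few lines of Python. This is an illustration of the formula only, not the algorithm's actual implementation; the function name `cosine_similarity` and the choice to raise an error for a zero vector are assumptions made for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # Numerator: dot product of the two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    # Denominator: product of the two Euclidean lengths.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption for this sketch: a zero vector has no direction,
        # so the similarity is treated as undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way, such as (1, 2, 3) and (2, 4, 6), score 1; perpendicular vectors such as (1, 0) and (0, 1) score 0; opposite vectors such as (1, 0) and (-1, 0) score -1.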
Considerations
- Theoretically, the calculation of cosine similarity between two nodes does not depend on their connectivity.
- The value of cosine similarity depends only on the direction of the vectors, not on their length.
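The second point can be checked numerically. The sketch below uses two property vectors from the example graph on this page (price, weight, width, height) and multiplies one of them by an arbitrary factor; the helper `cosine_similarity` and the factor 10 are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


product1 = [50, 160, 20, 152]
product2 = [42, 90, 30, 90]

# Stretch product1 to ten times its length without changing its direction.
scaled_product1 = [10 * x for x in product1]

base = cosine_similarity(product1, product2)
after_scaling = cosine_similarity(scaled_product1, product2)

# The two values agree up to floating-point rounding.
print(math.isclose(base, after_scaling, rel_tol=1e-12))
```

The scaling factor appears once in the numerator and once in the denominator, so it cancels out.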
Example Graph
To create this graph:
// Run each statement separately, in order, in an empty graphset
create().node_schema("product")
create().node_property(@product, "price", int32).node_property(@product, "weight", int32).node_property(@product, "width", int32).node_property(@product, "height", int32)
insert().into(@product).nodes([{_id:"product1", price:50, weight:160, width:20, height:152}, {_id:"product2", price:42, weight:90, width:30, height:90}, {_id:"product3", price:24, weight:50, width:55, height:70}, {_id:"product4", price:38, weight:20, width:32, height:66}])
Creating HDC Graph
To load the entire graph to the HDC server hdc-server-1 as hdc_sim_prop:
CALL hdc.graph.create("hdc-server-1", "hdc_sim_prop", {
nodes: {"*": ["*"]},
edges: {"*": ["*"]},
direction: "undirected",
load_id: true,
update: "static",
query: "query",
default: false
})
hdc.graph.create("hdc_sim_prop", {
nodes: {"*": ["*"]},
edges: {"*": ["*"]},
direction: "undirected",
load_id: true,
update: "static",
query: "query",
default: false
}).to("hdc-server-1")
Parameters
Algorithm name: similarity
Name | Type | Spec | Default | Optional | Description |
---|---|---|---|---|---|
ids | []_id | / | / | No | Specifies the first group of nodes for computation by their _id; computes for all nodes if it is unset. |
uuids | []_uuid | / | / | No | Specifies the first group of nodes for computation by their _uuid; computes for all nodes if it is unset. |
ids2 | []_id | / | / | Yes | Specifies the second group of nodes for computation by their _id; computes for all nodes if it is unset. |
uuids2 | []_uuid | / | / | Yes | Specifies the second group of nodes for computation by their _uuid; computes for all nodes if it is unset. |
type | String | cosine | cosine | No | Specifies the type of similarity to compute; for Cosine Similarity, keep it as cosine. |
node_schema_property | []"<@schema.?><property>" | / | / | No | Numeric node properties to form a vector for each node; all specified properties must belong to the same label (schema). |
return_id_uuid | String | uuid, id, both | uuid | Yes | Includes _uuid, _id, or both to represent nodes in the results. |
order | String | asc, desc | / | Yes | Sorts the results by similarity. |
limit | Integer | ≥-1 | -1 | Yes | Limits the number of results returned; -1 includes all results. |
top_limit | Integer | ≥-1 | -1 | Yes | Limits the number of results returned for each node specified with ids/uuids in selection mode; -1 includes all results with a similarity greater than 0. This parameter is invalid in pairing mode. |
The algorithm has two calculation modes:
- Pairing: When both ids/uuids and ids2/uuids2 are configured, each node in ids/uuids is paired with each node in ids2/uuids2 (excluding self-pairing), and pairwise similarities are computed.
- Selection: When only ids/uuids is configured, pairwise similarities are computed between each target node and all other nodes in the graph. The results include all or a limited number of nodes with a similarity > 0 to the target node, ordered in descending similarity.
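The two modes can be modelled in plain Python as follows. This is a sketch of the behaviour as described above, not the server-side implementation; the names `VECTORS`, `pairing`, and `selection` are hypothetical, and the vectors are the price, weight, width, and height values from the example graph.

```python
import math

# Property vectors (price, weight, width, height) from the example graph.
VECTORS = {
    "product1": [50, 160, 20, 152],
    "product2": [42, 90, 30, 90],
    "product3": [24, 50, 55, 70],
    "product4": [38, 20, 32, 66],
}


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pairing(ids, ids2):
    """Pair every node in ids with every node in ids2, skipping self-pairs."""
    return [
        (u, v, cosine_similarity(VECTORS[u], VECTORS[v]))
        for u in ids
        for v in ids2
        if u != v
    ]


def selection(ids, top_limit=-1):
    """For each target node, rank all other nodes by descending similarity.

    Only similarities greater than 0 are kept; top_limit=-1 keeps them all.
    """
    rows = []
    for u in ids:
        scored = [
            (u, v, cosine_similarity(VECTORS[u], VECTORS[v]))
            for v in VECTORS
            if v != u
        ]
        scored = [row for row in scored if row[2] > 0]
        scored.sort(key=lambda row: row[2], reverse=True)
        rows.extend(scored if top_limit == -1 else scored[:top_limit])
    return rows
```

Calling `pairing(["product1", "product2"], ["product2", "product3", "product4"])` yields five pairs, because the product2/product2 self-pair is skipped. Calling `selection(["product1", "product3"], top_limit=1)` keeps only the single most similar node for each target.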
File Writeback
CALL algo.similarity.write("hdc_sim_prop", {
params: {
return_id_uuid: "id",
ids: "product1",
ids2: ["product2", "product3", "product4"],
node_schema_property: ["price", "weight", "width", "height"],
type: "cosine"
},
return_params: {
file: {
filename: "cosine"
}
}
})
algo(similarity).params({
project: "hdc_sim_prop",
return_id_uuid: "id",
ids: "product1",
ids2: ["product2", "product3", "product4"],
node_schema_property: ["price", "weight", "width", "height"],
type: "cosine"
}).write({
file: {
filename: "cosine"
}
})
Result:
_id1,_id2,similarity
product1,product2,0.986529
product1,product3,0.878858
product1,product4,0.816876
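As a check on the first row, the product1/product2 value can be reproduced by hand from the property values in the example graph, taking A = (50, 160, 20, 152) for product1 and B = (42, 90, 30, 90) for product2 in the order price, weight, width, height:

$$A \cdot B = 50 \cdot 42 + 160 \cdot 90 + 20 \cdot 30 + 152 \cdot 90 = 30780$$

$$|A| = \sqrt{50^2 + 160^2 + 20^2 + 152^2} = \sqrt{51604}, \qquad |B| = \sqrt{42^2 + 90^2 + 30^2 + 90^2} = \sqrt{18864}$$

$$\cos\theta = \frac{30780}{\sqrt{51604}\,\sqrt{18864}} \approx 0.986529$$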
Full Return
CALL algo.similarity("hdc_sim_prop", {
params: {
return_id_uuid: "id",
ids: ["product1","product2"],
ids2: ["product2","product3","product4"],
node_schema_property: ["price", "weight", "width", "height"],
type: "cosine"
},
return_params: {}
}) YIELD cs
RETURN cs
exec{
algo(similarity).params({
return_id_uuid: "id",
ids: ["product1","product2"],
ids2: ["product2","product3","product4"],
node_schema_property: ["price", "weight", "width", "height"],
type: "cosine"
}) as cs
return cs
} on hdc_sim_prop
Result:
_id1 | _id2 | similarity |
---|---|---|
product1 | product2 | 0.986529 |
product1 | product3 | 0.878858 |
product1 | product4 | 0.816876 |
product2 | product3 | 0.934217 |
product2 | product4 | 0.881988 |
Stream Return
CALL algo.similarity("hdc_sim_prop", {
params: {
return_id_uuid: "id",
ids: ["product1", "product3"],
node_schema_property: ["price", "weight", "width", "height"],
type: "cosine",
top_limit: 1
},
return_params: {
stream: {}
}
}) YIELD top
RETURN top
exec{
algo(similarity).params({
return_id_uuid: "id",
ids: ["product1", "product3"],
node_schema_property: ["price", "weight", "width", "height"],
type: "cosine",
top_limit: 1
}).stream() as cs
return cs
} on hdc_sim_prop
Result:
_id1 | _id2 | similarity |
---|---|---|
product1 | product2 | 0.986529 |
product3 | product2 | 0.934217 |