Overview
The Pearson correlation coefficient is the most common measure of the strength and direction of the linear relationship between two quantitative variables. In a graph, each node is described by N of its numeric properties (features), which together form the variable (vector) for that node.
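For two variables X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn), the Pearson correlation coefficient (r) is defined as the ratio of their covariance to the product of their standard deviations:

$$
r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

where x̄ and ȳ are the means of X and Y.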
The Pearson correlation coefficient ranges from -1 to 1:
Pearson correlation coefficient | Correlation type | Interpretation |
---|---|---|
0 < r ≤ 1 | Positive correlation | As one variable becomes larger, the other variable becomes larger |
r = 0 | No linear correlation | (Other, nonlinear types of correlation may still exist) |
-1 ≤ r < 0 | Negative correlation | As one variable becomes larger, the other variable becomes smaller |
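The definition and the three ranges above can be illustrated with a minimal Python sketch. The function name `pearson` and the sample vectors are illustrative assumptions, not part of the algorithm's interface; the sketch only mirrors the textbook formula.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Illustrative helper (not the engine's implementation).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]          # deviations from the mean of x
    dy = [v - mean_y for v in y]          # deviations from the mean of y
    cov = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    if ss_x == 0 or ss_y == 0:
        # A constant variable has zero standard deviation, so r is undefined.
        raise ValueError("r is undefined when a variable is constant")
    # The 1/n factors of the covariance and standard deviations cancel out.
    return cov / sqrt(ss_x * ss_y)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))         # perfectly increasing: r = 1
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))         # perfectly decreasing: r = -1
print(pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])) # y = x^2: r = 0, yet clearly related
```

The last call shows why r = 0 only rules out a linear relationship: y is fully determined by x, but the dependence is quadratic.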
Considerations
- Theoretically, the Pearson correlation coefficient between two nodes is computed from their properties only and does not depend on their connectivity.
Example Graph
To create this graph:
// Run each statement separately, in order, in an empty graphset
create().node_schema("product")
create().node_property(@product, "price", int32).node_property(@product, "weight", int32).node_property(@product, "width", int32).node_property(@product, "height", int32)
insert().into(@product).nodes([{_id:"product1", price:50, weight:160, width:20, height:152}, {_id:"product2", price:42, weight:90, width:30, height:90}, {_id:"product3", price:24, weight:50, width:55, height:70}, {_id:"product4", price:38, weight:20, width:32, height:66}])
Creating HDC Graph
To load the entire graph to the HDC server hdc-server-1 as hdc_sim_prop:
CALL hdc.graph.create("hdc-server-1", "hdc_sim_prop", {
nodes: {"*": ["*"]},
edges: {"*": ["*"]},
direction: "undirected",
load_id: true,
update: "static",
query: "query",
default: false
})
hdc.graph.create("hdc_sim_prop", {
nodes: {"*": ["*"]},
edges: {"*": ["*"]},
direction: "undirected",
load_id: true,
update: "static",
query: "query",
default: false
}).to("hdc-server-1")
Parameters
Algorithm name: similarity
Name | Type | Spec | Default | Optional | Description |
---|---|---|---|---|---|
ids | []_id | / | / | No | Specifies the first group of nodes for computation by their _id; computes for all nodes if it is unset. |
uuids | []_uuid | / | / | No | Specifies the first group of nodes for computation by their _uuid; computes for all nodes if it is unset. |
ids2 | []_id | / | / | No | Specifies the second group of nodes for computation by their _id; computes for all nodes if it is unset. |
uuids2 | []_uuid | / | / | No | Specifies the second group of nodes for computation by their _uuid; computes for all nodes if it is unset. |
type | String | pearson | cosine | No | Specifies the type of similarity to compute; for Pearson Correlation Coefficient, keep it as pearson. |
node_schema_property | []"<@schema.?><property>" | / | / | No | Numeric node properties to form a vector for each node; all specified properties must belong to the same label (schema). |
return_id_uuid | String | uuid, id, both | uuid | Yes | Includes _uuid, _id, or both to represent nodes in the results. |
order | String | asc, desc | / | Yes | Sorts the results by similarity. |
limit | Integer | ≥-1 | -1 | Yes | Limits the number of results returned; -1 includes all results. |
top_limit | Integer | ≥-1 | -1 | Yes | Limits the number of results returned for each node specified with ids/uuids in selection mode; -1 includes all results with a similarity greater than 0. This parameter is invalid in pairing mode. |
The algorithm has two calculation modes:
- Pairing: When both ids/uuids and ids2/uuids2 are configured, each node in ids/uuids is paired with each node in ids2/uuids2 (excluding self-pairing), and pairwise similarities are computed.
- Selection: When only ids/uuids is configured, pairwise similarities are computed between each target node and all other nodes in the graph. The results include all or a limited number of nodes with a similarity > 0 to the target node, ordered in descending similarity.
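The two modes can be sketched in plain Python to make the pairing and filtering rules concrete. This is only an illustration under stated assumptions: `NODES`, `pearson`, `pairing`, and `selection` are hypothetical names, the data is copied from the example graph, and details the text does not specify (such as the row order in pairing mode) are guesses rather than documented behavior.

```python
from math import sqrt

# Property vectors (price, weight, width, height) copied from the example graph.
NODES = {
    "product1": [50, 160, 20, 152],
    "product2": [42, 90, 30, 90],
    "product3": [24, 50, 55, 70],
    "product4": [38, 20, 32, 66],
}

def pearson(x, y):
    """Textbook Pearson correlation of two equal-length vectors (illustrative)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

def pairing(ids, ids2):
    """Pair each node in ids with each node in ids2, skipping self-pairs."""
    return [(a, b, pearson(NODES[a], NODES[b]))
            for a in ids for b in ids2 if a != b]

def selection(ids, top_limit=-1):
    """For each target, keep other nodes with similarity > 0, best first."""
    rows = []
    for a in ids:
        scored = [(a, b, pearson(NODES[a], NODES[b])) for b in NODES if b != a]
        scored = [row for row in scored if row[2] > 0]
        scored.sort(key=lambda row: row[2], reverse=True)
        # top_limit = -1 keeps every positive result for this target.
        rows.extend(scored if top_limit == -1 else scored[:top_limit])
    return rows

for row in pairing(["product1", "product2"], ["product2", "product3", "product4"]):
    print(row)
for row in selection(["product1", "product3"], top_limit=1):
    print(row)
```

With this data, the pairing call produces five rows rather than six because the product2/product2 self-pair is skipped, and the selection call keeps only the single most similar node per target.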
File Writeback
CALL algo.similarity.write("hdc_sim_prop", {
params: {
return_id_uuid: "id",
ids: "product1",
ids2: ["product2", "product3", "product4"],
node_schema_property: ["price", "weight", "width", "height"],
type: "pearson"
},
return_params: {
file: {
filename: "pearson"
}
}
})
algo(similarity).params({
project: "hdc_sim_prop",
return_id_uuid: "id",
ids: "product1",
ids2: ["product2", "product3", "product4"],
node_schema_property: ["price", "weight", "width", "height"],
type: "pearson"
}).write({
file: {
filename: "pearson"
}
})
Result:
_id1,_id2,similarity
product1,product2,0.998785
product1,product3,0.474384
product1,product4,0.210494
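As a cross-check of the first row, the value can be worked out by hand from the property vectors in the example graph, taking X = (50, 160, 20, 152) for product1 and Y = (42, 90, 30, 90) for product2, in the order price, weight, width, height:

$$
\begin{aligned}
\bar{x} &= \frac{50 + 160 + 20 + 152}{4} = 95.5, \qquad \bar{y} = \frac{42 + 90 + 30 + 90}{4} = 63 \\
\sum_{i} (x_i - \bar{x})(y_i - \bar{y}) &= 955.5 + 1741.5 + 2491.5 + 1525.5 = 6714 \\
\sum_{i} (x_i - \bar{x})^2 &= 15123, \qquad \sum_{i} (y_i - \bar{y})^2 = 2988 \\
r &= \frac{6714}{\sqrt{15123 \times 2988}} \approx 0.998785
\end{aligned}
$$

Note that the sums run over the four properties of the two nodes, not over the nodes in the graph.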
Full Return
CALL algo.similarity("hdc_sim_prop", {
params: {
return_id_uuid: "id",
ids: ["product1","product2"],
ids2: ["product2","product3","product4"],
node_schema_property: ["price", "weight", "width", "height"],
type: "pearson"
},
return_params: {}
}) YIELD p
RETURN p
exec{
algo(similarity).params({
return_id_uuid: "id",
ids: ["product1","product2"],
ids2: ["product2","product3","product4"],
node_schema_property: ["price", "weight", "width", "height"],
type: "pearson"
}) as p
return p
} on hdc_sim_prop
Result:
_id1 | _id2 | similarity |
---|---|---|
product1 | product2 | 0.998785 |
product1 | product3 | 0.474384 |
product1 | product4 | 0.210494 |
product2 | product3 | 0.507838 |
product2 | product4 | 0.253573 |
Stream Return
CALL algo.similarity("hdc_sim_prop", {
params: {
return_id_uuid: "id",
ids: ["product1", "product3"],
node_schema_property: ["price", "weight", "width", "height"],
type: "pearson",
top_limit: 1
},
return_params: {
stream: {}
}
}) YIELD top
RETURN top
exec{
algo(similarity).params({
return_id_uuid: "id",
ids: ["product1", "product3"],
node_schema_property: ["price", "weight", "width", "height"],
type: "pearson",
top_limit: 1
}).stream() as top
return top
} on hdc_sim_prop
Result:
_id1 | _id2 | similarity |
---|---|---|
product1 | product2 | 0.998785 |
product3 | product2 | 0.507838 |