Overview
TextRank, derived from PageRank, is a graph-based ranking model for text processing. It can be applied to various natural language processing tasks such as keyword extraction, keyphrase extraction, and text summarization.
- R. Mihalcea, P. Tarau, TextRank: Bringing Order Into Texts (2004)
Concepts
Text as a Graph
To apply the TextRank algorithm, the text must first be represented as a graph .The structure of the graph depends on the specific application:
- Nodes: Text units that best fit the task, such as words, collocations, or sentences, are added as nodes in the graph.
- Edges: Relationships between text units, such as semantic similarity, co-occurrence, or contextual overlap, are used to connect nodes with edges.
Sample graph build for keyphrase extraction: nodes are selected lexical units from the text, and edges are established based on co-occurrence within a defined window of words (Source: Original paper)
TextRank Model
The ranks of all text units are computed recursively using a "recommendation" mechanism, similar to the PageRank algorithm. However, TextRank incorporates edge weights, using a modified formula to integrate these weights effectively:
where,
- Out(v) is the set of nodes that node v points to;
- wvu is the edge weight between nodes v and u;
- d is the damping factor.
Considerations
- The rank of isolated text unit will stay the same as the value of (1 - d).
- Self-loop is regarded as a successor and a predecessor, a node would pass some rank to itself through self-loop. If a network has many self-loops, it will take more iterations to converge.
Example Graph
To create this graph:
// Runs each row separately in order in an empty graphset
create().edge_property(@default, "weight", int32)
insert().into(@default).nodes([{_id:"A"}, {_id:"B"}, {_id:"C"}, {_id:"D"}, {_id:"E"}, {_id:"F"}, {_id:"G"}])
insert().into(@default).edges([{_from:"A", _to:"E", weight:3}, {_from:"B", _to:"A", weight:3}, {_from:"B", _to:"E", weight:2}, {_from:"C", _to:"A", weight:1}, {_from:"C", _to:"D", weight:4}, {_from:"D", _to:"E", weight:5}, {_from:"E", _to:"G", weight:2}, {_from:"F", _to:"B", weight:1}, {_from:"F", _to:"G", weight:3}])
Creating HDC Graph
To load the entire graph to the HDC server hdc-server-1
as hdc_textrank
:
CALL hdc.graph.create("hdc-server-1", "hdc_textrank", {
nodes: {"*": ["*"]},
edges: {"*": ["*"]},
direction: "undirected",
load_id: true,
update: "static",
query: "query",
default: false
})
hdc.graph.create("hdc_textrank", {
nodes: {"*": ["*"]},
edges: {"*": ["*"]},
direction: "undirected",
load_id: true,
update: "static",
query: "query",
default: false
}).to("hdc-server-1")
Parameters
Algorithm name: text_rank
Name |
Type |
Spec |
Default |
Optional |
Description |
---|---|---|---|---|---|
init_value |
Float | >0 | 0.2 |
Yes | The initial rank assigned to all nodes. |
loop_num |
Integer | ≥1 | 5 |
Yes | The maximum number of iteration rounds. The algorithm will terminate after completing all rounds. |
damping |
Float | (0,1) | 0.8 |
Yes | The damping factor. |
max_change |
Float | ≥0 | 0 |
Yes | The algorithm terminates when the changes in all ranks between iterations are less than the specified max_change , indicating that the result is stable. Sets to 0 to disable this criterion. |
edge_schema_property |
[]"<@schema.?><property> " |
/ | / | No | Numeric edge properties as weights, summing values across the specified properties; edges without the specified properties are ignored. |
return_id_uuid |
String | uuid , id , both |
uuid |
Yes | Includes _uuid , _id , or both values to represent nodes in the results. |
limit |
Integer | ≥-1 | -1 |
Yes | Limits the number of results returned; -1 includes all results. |
order |
String | asc , desc |
/ | Yes | Sorts the results by rank . |
File Writeback
CALL algo.text_rank.write("hdc_textrank", {
params: {
return_id_uuid: "id",
init_value: 1,
loop_num: 50,
damping: 0.8,
edge_schema_property: "weight",
order: 'desc'
},
return_params: {
file: {
filename: "textrank"
}
}
})
algo(text_rank).params({
project: "hdc_textrank",
return_id_uuid: "id",
init_value: 1,
loop_num: 50,
damping: 0.8,
edge_schema_property: "weight",
order: 'desc'
}).write({
file: {
filename: "textrank"
}
})
Result:
_id,text_rank
G,0.973568
E,0.81696
A,0.3472
D,0.328
B,0.24
F,0.2
C,0.2
DB Writeback
Writes the text_rank
values from the results to the specified node property. The property type is float
.
CALL algo.text_rank.write("hdc_textrank", {
params: {
loop_num: 50,
edge_schema_property: "@default.weight"
},
return_params: {
db: {
property: "rank"
}
}
})
algo(text_rank).params({
project: "hdc_textrank",
loop_num: 50,
edge_schema_property: "@default.weight"
}).write({
db:{
property: 'rank'
}
})
Full Return
CALL algo.text_rank("hdc_textrank", {
params: {
return_id_uuid: "id",
init_value: 1,
loop_num: 50,
damping: 0.8,
edge_schema_property: "weight",
order: 'desc',
limit: 5
},
return_params: {}
}) YIELD TR
RETURN TR
exec{
algo(text_rank).params({
return_id_uuid: "id",
init_value: 1,
loop_num: 50,
damping: 0.8,
edge_schema_property: "weight",
order: 'desc',
limit: 5
}) as TR
return TR
} on hdc_textrank
Result:
_id | text_rank |
---|---|
G | 0.973568 |
E | 0.81696 |
A | 0.3472 |
D | 0.328 |
B | 0.24 |
Stream Return
CALL algo.text_rank("hdc_textrank", {
params: {
return_id_uuid: "id",
loop_num: 50,
damping: 0.8,
edge_schema_property: "weight",
order: 'desc',
limit: 5
},
return_params: {
stream: {}
}
}) YIELD TR
RETURN TR
exec{
algo(text_rank).params({
return_id_uuid: "id",
loop_num: 50,
damping: 0.8,
edge_schema_property: "weight",
order: 'desc',
limit: 5
}).stream() as TR
return TR
} on hdc_textrank
Result:
_id | text_rank |
---|---|
G | 0.973568 |
E | 0.81696 |
A | 0.3472 |
D | 0.328 |
B | 0.24 |