Processing
Reference for weblinx.processing
Permalink
load_candidate_elements
Permalink
weblinx.processing.load_candidate_elements(path, group_keys=('"demo_name"', '"turn_index"'), log_fn=None)
DescriptionPermalink
This will load the candidates from the path.
group_keys
is used to group the candidates by the specified keys, which will be returned
as a dictionary of lists, where the keys are the values of the specified keys, and the values
are the list of candidates that have the same value for the specified keys. If group_keys
is
None, then the candidates will not be grouped, and will be returned as a list.
For example, if group_keys
is (“demo_name”, “turn_index”), then the candidates will be
{
("demo1", 0): [c1, c2, c3, ...],
("demo1", 1): [c4, c5, c6, ...],
("demo2", 0): [c7, c8, c9, ...],
...
}
Whereas if group_keys
is None, then the candidates will be
[c1, c2, c3, c4, c5, c6, c7, c8, c9, ...]
Candidates are elements that could potentially be used as the target output for the model. For example, if the goal is to click on a button, you may find a button element as a candidate, if the original candidate ranking model does a good job at ranking the candidates.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
path |
str |
The path to the candidates file. This can be a JSONL file. | |
group_keys |
tuple[str] |
('"demo_name"', '"turn_index"') |
The keys to use to group the candidates. If None, then the candidates will not be grouped. |
ReturnsPermalink
list or dict
The candidates, either as a list or as a dictionary of lists.
ExamplePermalink
Here’s an example of how to use this function:
candidates = load_candidate_elements("candidates.jsonl", group_keys=("demo_name", "turn_index"))
This will load the candidates from the file “candidates.jsonl” and group them by the keys “demo_name” and “turn_index”. The candidates will be returned as a dictionary of lists.
Reference for weblinx.processing.dom
Permalink
get_tree_repr_simple
Permalink
weblinx.processing.dom.get_tree_repr_simple(tree, keep_html_brackets=False, copy=True, postfix="\nAbove are the pruned HTML contents of the page.")
DescriptionPermalink
This will return a simple representation of the tree in a string format.
This is useful if you want to pass the tree to a action model like a LLM, which
can only accept strings (and images in the case of multimodal models).
Parameters
----------
tree : lxml.html.HtmlElement
The tree to get the representation of.
keep_html_brackets : bool, optional
Whether to keep the HTML brackets or not. Defaults to False.
copy : bool, optional
Whether to copy the tree or not. Defaults to True.
postfix : str, optional
A string to append to the end of the tree representation. Defaults to " Above are the pruned HTML contents of the page.".
Returns
-------
str
The string representation of the tree.
Raises
------
ImportError
If the lxml library is not installed.
Note
----
This function is based on the works of Mind2Web (@xiang-deng). Copyrights belong
to the original author. The original code can be found here at osu-nlp-group/mind2web
sanitize_elem_attributes
Permalink
weblinx.processing.dom.sanitize_elem_attributes(tree, remove_data_attrs=True, remove_underscore_attrs=True, remove_angular_attrs=True, remove_alpine_attrs=True, remove_xml_attrs=True, remove_google_attrs=True, remove_id_attrs=True, uid_key="data-webtasks-id")
DescriptionPermalink
This will take a tree and sanitize the attributes of the elements. This is needed in order to remove any attributes that are not needed before the element selection stage of the ranking model.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
tree |
lxml.html.HtmlElement |
The tree to sanitize. | |
remove_data_attrs |
bool |
True |
Whether to remove data- attributes or not. Defaults to True. |
remove_underscore_attrs |
bool |
True |
Whether to remove _ attributes or not. Defaults to True. |
remove_angular_attrs |
bool |
True |
Whether to remove ng attributes or not. Defaults to True. |
remove_alpine_attrs |
bool |
True |
Whether to remove x- attributes or not. Defaults to True. |
remove_xml_attrs |
bool |
True |
Whether to remove xml attributes or not. Defaults to True. |
remove_google_attrs |
bool |
True |
Whether to remove js attributes or not. Defaults to True. |
remove_id_attrs |
bool |
True |
Whether to remove id attributes or not. Defaults to True. |
uid_key |
str |
"data-webtasks-id" |
The key to use as the unique identifier. Defaults to “data-webtasks-id”. |
RaisesPermalink
ImportError If the lxml library is not installed.
remove_uid_when_not_candidate
Permalink
weblinx.processing.dom.remove_uid_when_not_candidate(dom_tree, candidate_uids, uid_key="data-webtasks-id")
DescriptionPermalink
This will remove the uid from the tree if it is not in the candidate_uids. This is useful for removing any uids that are not in the candidate set.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
dom_tree |
lxml.html.HtmlElement |
The tree to remove the uids from. | |
candidate_uids |
list |
The list of candidate uids to keep. | |
uid_key |
str |
"data-webtasks-id" |
The key to use as the unique identifier. Defaults to “data-webtasks-id”. |
RaisesPermalink
ImportError If the lxml library is not installed.
remove_html_comments
Permalink
weblinx.processing.dom.remove_html_comments(dom_tree)
DescriptionPermalink
Will try to remove all HTML comments from the tree.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
dom_tree |
lxml.html.HtmlElement |
The tree to remove the comments from. |
get_descendants
Permalink
weblinx.processing.dom.get_descendants(node, max_depth, current_depth=0)
DescriptionPermalink
This function was originally written by @xiang-deng for the mind2web repository
It is kept as is and added here for convenience. All copyrights belong to the original author.
prune_tree
Permalink
weblinx.processing.dom.prune_tree(dom_tree, candidate_set, max_depth=5, max_children=50, max_sibling=3, uid_key="data-webtasks-id")
DescriptionPermalink
This function was originally written by @xiang-deng for the mind2web repository
It was modified to allow uid_key to be specified. All rights belong to the original author.
clean_and_prune_tree
Permalink
weblinx.processing.dom.clean_and_prune_tree(dom_tree, cands_turn, max_depth=1, max_children=5, max_sibling=2)
DescriptionPermalink
This function will clean and prune the tree based on the candidates in the cands_turn. This is useful for removing any elements that are not candidates, and for removing any elements that are not needed for the ranking model.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
dom_tree |
lxml.html.HtmlElement |
The tree to clean and prune. | |
cands_turn |
list |
The list of candidates for the turn. | |
max_depth |
int |
1 |
The maximum depth to prune the tree. Defaults to 1. |
max_children |
int |
5 |
The maximum number of children to keep for each candidate. Defaults to 5. |
max_sibling |
int |
2 |
The maximum number of siblings to keep for each candidate. Defaults to 2. |
ReturnsPermalink
lxml.html.HtmlElement
The cleaned and pruned tree.
RaisesPermalink
ValueError If cands_turn is None.
Reference for weblinx.processing.intent
Permalink
Intent.from_string
Permalink
weblinx.processing.intent.Intent.from_string(cls, intent)
Intent.get_element_intents
Permalink
weblinx.processing.intent.Intent.get_element_intents(cls, as_set=False)
Intent.get_text_intents
Permalink
weblinx.processing.intent.Intent.get_text_intents(cls, as_set=False)
Intent.get_tab_intents
Permalink
weblinx.processing.intent.Intent.get_tab_intents(cls, as_set=False)
Intent.get_eval_intents
Permalink
weblinx.processing.intent.Intent.get_eval_intents(cls, as_set=False)
Reference for weblinx.processing.outputs
Permalink
are_not_none
Permalink
weblinx.processing.outputs.are_not_none()
cast_to_float
Permalink
weblinx.processing.outputs.cast_to_float(value)
DescriptionPermalink
Checks if a value is an int or a float.
format_action_arg_value
Permalink
weblinx.processing.outputs.format_action_arg_value(arg_value)
list_all_non_alphanum_chars
Permalink
weblinx.processing.outputs.list_all_non_alphanum_chars(other_chars_allowed=None)
DescriptionPermalink
Returns a list of all non-alphanumeric characters in Python.
split_by_comma
Permalink
weblinx.processing.outputs.split_by_comma(string, check_quotes=False)
DescriptionPermalink
This function splits a string by commas, but ignores commas inside quotes. For example, if my string is ‘a, b, “c, d”, e’, then this function will return [‘a’, ‘b’, ‘“c, d”’, ‘e’] if check_quotes is True, but [‘a’, ‘b’, ‘“c’, ‘d”’, ‘e’] if check_quotes is False.
ExamplePermalink
# Let's test the function
test_string = 'a, b, "c, d", e'
print(split_by_comma(test_string, check_quotes=True))
print(split_by_comma(test_string, check_quotes=False))
find_last_non_alphanum_char
Permalink
weblinx.processing.outputs.find_last_non_alphanum_char(s, other_chars_allowed=None)
DescriptionPermalink
Given a string and a list of characters, find the last index of a non-alphanumeric character in the string.
find_intent_and_raw_args
Permalink
weblinx.processing.outputs.find_intent_and_raw_args(raw_output_string)
DescriptionPermalink
This iterative function will walk through a raw string that might contain an action
in the form intent(arg1="val1", arg2=num1, ...)
and return the intent and the unparsed args
parse_predicted_output_string
Permalink
weblinx.processing.outputs.parse_predicted_output_string(raw_output_string)
DescriptionPermalink
Given an output string, try to find a substring of format
get_element_info
Permalink
weblinx.processing.outputs.get_element_info(turn, uid, uid_key="data-webtasks-id", cache_dir=".cache/demonstrations/xpaths")
DescriptionPermalink
Given a uid_key for an element, retrieve additional information about the element from the HTML which can be used for evaluation.
Extracts only the information needed for evaluation.
get_element_uid_by_coords
Permalink
weblinx.processing.outputs.get_element_uid_by_coords(turn, x, y)
DescriptionPermalink
Given (x,y) coordinates for an element, find the smallest non-zero-sized element that contains the coordinates using bboxes and return its id.
get_xy_coords_corners
Permalink
weblinx.processing.outputs.get_xy_coords_corners(args)
dict_has_keys
Permalink
weblinx.processing.outputs.dict_has_keys(d, keys)
DescriptionPermalink
Checks if a dictionary has all the keys in a list.
infer_element_for_action
Permalink
weblinx.processing.outputs.infer_element_for_action(intent, args, turn, uid_key="data-webtasks-id")
DescriptionPermalink
Given an intent and args, infer the element that the action is performed on, if the element is not explicitly specified.
extract_action_from_turn
Permalink
weblinx.processing.outputs.extract_action_from_turn(turn, uid_key="data-webtasks-id")
DescriptionPermalink
Creates an action from a turn in a demonstration (i.e. the ground truth action).
sanitize_args
Permalink
weblinx.processing.outputs.sanitize_args(args)
DescriptionPermalink
This function is used to sanitize the arguments of an action.
check_pred_is_suitable
Permalink
weblinx.processing.outputs.check_pred_is_suitable(pred)
DescriptionPermalink
Given a prediction, check if it is suitable for evaluation.
Reference for weblinx.processing.prompt
Permalink
get_speaker
Permalink
weblinx.processing.prompt.get_speaker(utterance, instructor_name="User", navigator_name="Assistant", default_name=None)
DescriptionPermalink
This will return the speaker for the utterance. If the utterance does not start with “say”, then it will return the navigator’s name. If the utterance starts with “say”, then it will return the instructor’s name if the speaker is “instructor”, and the navigator’s name if the speaker is “navigator”. If the speaker is neither “instructor” nor “navigator”, then it will return the default name, if it is not None. If the default name is None, then it will raise a ValueError.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
utterance |
str |
The utterance to get the speaker for. | |
instructor_name |
str |
"User" |
The name of the instructor. Defaults to “User”. |
navigator_name |
str |
"Assistant" |
The name of the navigator. Defaults to “Assistant”. |
default_name |
str |
None |
The default name to use if the speaker is neither “instructor” nor “navigator”. If None, then it will raise a ValueError. Defaults to None. |
ReturnsPermalink
str
The speaker for the utterance.
identity
Permalink
weblinx.processing.prompt.identity(x)
DescriptionPermalink
Simply returns the input. Needed for the format_intent
parameter in format_prev_turns_truncated
.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
x |
any |
The input to return. |
ReturnsPermalink
any
The input.
format_prev_turns
Permalink
weblinx.processing.prompt.format_prev_turns(replay, turn, format_intent, turn_sep=" ; ", num_prev_turns=5)
DescriptionPermalink
Formats the previous turns (up until but not including turn
) as a string, or as
a list if turn_sep is None
. The previous turns are formatted using the format_intent
which is a function that takes a turn and returns a string or a dictionary. This function
is useful for displaying the previous turns as context for the current turn. This information
is used by the action model to predict the next action.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
replay |
Replay |
The replay to get the previous turns from. | |
turn |
Turn |
The turn to get the previous turns for. | |
format_intent |
Callable |
A function that takes a turn and returns a string or a dictionary. | |
turn_sep |
str |
" ; " |
The separator to use for joining the previous turns. If None, then the previous turns are returned as a list. Defaults to “ ; “. |
num_prev_turns |
int |
5 |
The number of previous turns to include. Defaults to 5. |
ReturnsPermalink
str or list
The previous turns formatted as a string, or as a list if turn_sep is None.
format_candidates
Permalink
weblinx.processing.prompt.format_candidates(candidates, max_char_len=300, use_uid_as_rank=False)
DescriptionPermalink
This will format the candidates as a string. The candidates are formatted as follows:
- If there are no candidates, return “”
- If there are candidates, return the candidates as a string, with the rank (e.g. uid) and document (text representation of the candidate)
- If the document is longer than
max_char_len
, then it will be truncated tomax_char_len
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
candidates |
list |
The candidates to format. | |
max_char_len |
int |
300 |
The maximum character length for the document. If the document is longer than this, then it will be truncated to this length. Defaults to 300. |
use_uid_as_rank |
bool |
False |
Whether to use the UID as the rank. If False, then it will use the rank from the candidates. Defaults to False. |
ReturnsPermalink
str
The candidates formatted as a string.
format_utterances
Permalink
weblinx.processing.prompt.format_utterances(turns, num_utterances=5, type_filter="chat", sep=" ", convert_to_minutes=True, template="[{timestamp}] {utterance}")
DescriptionPermalink
Formats utterances from a list of turns. The utterances are formatted as follows:
- If there are no utterances, return “No instructor utterance”
- If there are less than
num_utterances
utterances, return all utterances - If there are more than
num_utterances
utterances, return the first and lastnum_utterances-1
utterances
If sep is None, then the utterances are returned as a list. Otherwise, they are joined by sep
.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
turns |
list |
The list of turns to get the utterances from. | |
num_utterances |
int |
5 |
The number of utterances to include. Defaults to 5. |
type_filter |
str |
"chat" |
The type of turn to include. If None, then it will include all types. Defaults to “chat”. |
sep |
str |
" " |
The separator to use for joining the utterances. If None, then the utterances are returned as a list. Defaults to “ “. |
convert_to_minutes |
bool |
True |
Whether to convert the timestamp to minutes. Defaults to True. |
template |
str |
"[{timestamp}] {utterance}" |
The template to use for formatting the utterances. Defaults to “[{timestamp}] {utterance}”. |
ReturnsPermalink
str or list
The utterances formatted as a string, or as a list if sep is None.
format_prev_turns_truncated
Permalink
weblinx.processing.prompt.format_prev_turns_truncated(replay, turn, format_intent, tokenizer, num_tokens_to_remove, format_output_dict_fn=<_ast.Name object at 0x7fac2032a370>, num_prev_turns=5, turn_sep=" ; ", allow_iterative_reduction=False)
DescriptionPermalink
This performs the same function as format_prev_turns
, but it truncates the text
to fit within the max_tokens
. The truncation is not a regular truncation, instead
it uses the truncate_text_at_center
function (see strategic trunction part
in the paper).
find_turns_with_instructor_chat
Permalink
weblinx.processing.prompt.find_turns_with_instructor_chat(replay, turn, speaker="instructor", num_prev_turns=5)
DescriptionPermalink
This looks for all the turns in replay
up to turn
(minus num_prev_turns
turns)
that are by speaker
(default: instructor). This is used to find the instructor’s utterances
that are used as context for the current turn. The reason we have a num_prev_turns parameter
is because we want to limit the number of turns we look at, as the last num_prev_turns turns
can be displayed by another function, such as format_prev_turns
, separate from this.
This output of this function should be used by format_utterances to display the utterances.
multi_attempt_format_prev_turns_truncated
Permalink
weblinx.processing.prompt.multi_attempt_format_prev_turns_truncated(replay, turn, format_intent, tokenizer, max_tokens, num_prev_turns=5, turn_sep=" ; ", max_attempts=5, format_output_dict_fn=<_ast.Name object at 0x7fac2034a2b0>, warn_after_attempts=True, allow_iterative_reduction=False)
DescriptionPermalink
This function behaves the same as format_prev_turns_truncated
, but it will attempt to
truncate the text multiple times until it fits within the max_tokens
. This is useful
when the object is difficult to truncate, and the function is unable to truncate the text
using the approximation method described in the strategic truncation part of the paper.
format_utterances_truncated
Permalink
weblinx.processing.prompt.format_utterances_truncated(turns, tokenizer, max_tokens, format_utterances_fn, num_utterances=5, type_filter="chat", sep=" ", convert_to_minutes=True, template="[{timestamp}] {utterance}", allow_iterative_reduction=False)
DescriptionPermalink
Formats utterances from a list of turns. The utterances are formatted as follows:
- If there are no utterances, return “No instructor utterance”
- If there are less than
num_utterances
utterances, return all utterances - If there are more than
num_utterances
utterances, return the first and lastnum_utterances-1
utterances
If sep is None, then the utterances are returned as a list. Otherwise, they are joined by sep
.
find_prev_turn_with_candidates
Permalink
weblinx.processing.prompt.find_prev_turn_with_candidates(prev_turns, candidates, reverse=True)
DescriptionPermalink
For a given turn and candidates for all turns, this search in the reverse direction to find the previous turn that has candidates. This is useful for chat turns, since the candidates are only available for browser actions; though in rare cases, they may also be missing in browser turns too.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
prev_turns |
list |
The list of previous turns to search through. | |
candidates |
dict |
The candidates for all turns, as a dictionary of lists. | |
reverse |
bool |
True |
Whether to search in the reverse direction. Defaults to True. |
ReturnsPermalink
Turn
The previous turn that has candidates, or None if no such turn is found.
select_candidates_for_turn
Permalink
weblinx.processing.prompt.select_candidates_for_turn(candidates, turn, num_candidates=20)
DescriptionPermalink
This will select the top candidates for the given turn. The candidates are sorted by their rank,
and the top num_candidates
will be returned.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
candidates |
dict |
The candidates for all turns, as a dictionary of lists. | |
turn |
Turn |
The turn to select the candidates for. | |
num_candidates |
int |
20 |
The number of candidates to select. Defaults to 20. |
ReturnsPermalink
list
The top candidates for the given turn.
select_turns_and_candidates_for_prompts
Permalink
weblinx.processing.prompt.select_turns_and_candidates_for_prompts(demos, candidates=None, num_candidates=20, remove_turns_without_elements=True)
DescriptionPermalink
This will select the turns that will be used for building the prompts. It first filters turns based on the intents that will be predicted, whether the turn has a validated screenshot, and if it’s a chat turn, then whether the speaker is not the navigator. Then, we will select the candidates for each turn, and if we cannot find any candidates for a given turn, then we will find the previous turn that has candidates.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
demos |
list |
The list of demonstrations to select the turns from. | |
candidates |
dict |
None |
The candidates for all turns, as a dictionary of lists. If None, then the candidates will not be used. Defaults to None. |
num_candidates |
int |
20 |
The number of candidates to select for each turn. Defaults to 20. |
remove_turns_without_elements |
bool |
True |
Whether to remove turns that do not have elements. Defaults to True. |
ReturnsPermalink
list
A list of dictionaries, where each dictionary contains the following keys:
- replay: The replay for the turn
- turn: The turn to use for the prompt
- cands_turn: The candidates for the turn, or None if no candidates are found
build_input_record_for_single_turn
Permalink
weblinx.processing.prompt.build_input_record_for_single_turn(turn_dict, format_intent, build_prompt_records_fn, format_prompt_records_fn)
DescriptionPermalink
This builds the input record for a single turn. The input record is a dictionary, which contains the following
- demo_name: The name of the demonstration
- base_dir: The base directory of the demonstration
- turn_index: The index of the turn
- prompt: The prompt to use for the model
- output_target: The target output for the model
- output_target_dict: The target output for the model, but in dictionary format
If candidates
is not None, then the prompt includes the candidates as well.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
turn_dict |
dict |
A dictionary containing the following - replay: The replay for the turn - turn: The turn to use for the prompt - cands_turn: The candidates for the turn, or None if no candidates are found | |
format_intent |
Callable |
A function that takes a turn and returns a string or a dictionary. | |
build_prompt_records_fn |
Callable |
A function that takes a replay, turn, and cands_turn, and returns the prompt records. | |
format_prompt_records_fn |
Callable |
A function that takes the prompt records and formats them. |
ReturnsPermalink
dict
The input record for the model.
build_input_records_from_selected_turns
Permalink
weblinx.processing.prompt.build_input_records_from_selected_turns(selected_turns, format_intent, build_prompt_records_fn, format_prompt_records_fn)
DescriptionPermalink
This will build the input records for the model. The input records are a list of dictionaries, which contains the following keys:
- demo_name: The name of the demonstration
- base_dir: The base directory of the demonstration
- turn_index: The index of the turn
- prompt: The prompt to use for the model
- output_target: The target output for the model
- output_target_dict: The target output for the model, but in dictionary format
If candidates
is not None, then the prompt includes the candidates as well.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
selected_turns |
list |
The list of selected turns to build the input records from. | |
format_intent |
Callable |
A function that takes a turn and returns a string or a dictionary. | |
build_prompt_records_fn |
Callable |
A function that takes a replay, turn, and cands_turn, and returns the prompt records. | |
format_prompt_records_fn |
Callable |
A function that takes the prompt records and formats them. |
ReturnsPermalink
list
A list of input records for the model.
build_input_records_from_selected_turns_parallel
Permalink
weblinx.processing.prompt.build_input_records_from_selected_turns_parallel(selected_turns, format_intent, build_prompt_records_fn, format_prompt_records_fn, num_processes=4, chunksize=50)
DescriptionPermalink
This will build the input records for the model. The input records are a list of dictionaries, which contains the following keys:
- demo_name: The name of the demonstration
- base_dir: The base directory of the demonstration
- turn_index: The index of the turn
- prompt: The prompt to use for the model
- output_target: The target output for the model
- output_target_dict: The target output for the model, but in dictionary format
If candidates
is not None, then the prompt includes the candidates as well.
If num_processes
is greater than 1, then this will use multiprocessing to build the input records.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
selected_turns |
list |
The list of selected turns to build the input records from. | |
format_intent |
Callable |
A function that takes a turn and returns a string or a dictionary. | |
build_prompt_records_fn |
Callable |
A function that takes a replay, turn, and cands_turn, and returns the prompt records. | |
format_prompt_records_fn |
Callable |
A function that takes the prompt records and formats them. | |
num_processes |
int |
4 |
The number of processes to use to build the input records. If 1, then this will use a single process. Defaults to 4. |
chunksize |
int |
50 |
The chunksize to use for multiprocessing. Defaults to 50. This allows the processes to work on a chunk of chunksize turns at a time instead of one turn at a time, allowing for better performance. |
ReturnsPermalink
list
A list of input records for the model.
Reference for weblinx.processing.truncation
Permalink
get_bracket_length
Permalink
weblinx.processing.truncation.get_bracket_length(elem_open_bracket, elem_close_bracket, tokenizer)
DescriptionPermalink
This function calculates the length of the brackets.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
elem_open_bracket |
str |
The opening bracket. For example, “<” or “[”. | |
elem_close_bracket |
str |
The closing bracket. For example, “>” or “]”. | |
tokenizer |
transformers.PreTrainedTokenizer (PreTrainedTokenizer) |
The tokenizer to use to tokenize the brackets. This is used to calculate the length of the brackets. |
ReturnsPermalink
int
The length of the brackets.
reduce_list_of_lengths
Permalink
weblinx.processing.truncation.reduce_list_of_lengths(lengths, max_length, assert_exactly_max=True)
DescriptionPermalink
Given a list of lengths, reduce the lengths to a maximum length. To learn more about how this is achieved, please read the strategic truncation section in the paper.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
lengths |
list |
A list of lengths to reduce. | |
max_length |
int |
The maximum length to reduce the list of lengths to. | |
assert_exactly_max |
bool |
True |
If True, then we will assert that the total length of the reduced list of lengths is exactly equal to max_length. Defaults to True. If False, then we will not assert this condition. |
ReturnsPermalink
list
The reduced list of lengths.
RaisesPermalink
ValueError If the difference between the new total length and the max_length is negative.
AssertionError If assert_exactly_max is True and the total length of the reduced list of lengths is not exactly equal to max_length. Also, if the lengths are not monotonically increasing (i.e., the lengths are not sorted by length).
get_truncation_offsets
Permalink
weblinx.processing.truncation.get_truncation_offsets(tokens, length, num_tokens_to_remove)
DescriptionPermalink
This calculates the start and end offsets of the tokens to remove. This is a
helper function for truncate_text_at_center
.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
tokens |
dict |
The tokens to truncate. This should be a dictionary with the keys “input_ids” and “offset_mapping”. | |
length |
int |
The length of the tokens. | |
num_tokens_to_remove |
int |
The number of tokens to remove from the tokens. |
ReturnsPermalink
tuple
A tuple of the start and end offsets of the tokens to remove.
truncate_text_at_center
Permalink
weblinx.processing.truncation.truncate_text_at_center(text, tokenizer, tokens=None, max_tokens=10, ellipsis="...", ellipsis_length=None, assert_max_tokens=False, allow_retry_without_ellipsis=True, allow_iterative_reduction=False)
DescriptionPermalink
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
allow_iterative_reduction |
bool |
False |
If True, then we will allow the iterative reduction to continue until the max_tokens is reached. This is useful when the tokenizer output does not necessarily decrease when we remove tokens from the input. For example, if we remove a token that is part of a word, but the updated text is retokenized to the same number of tokens, then we will continue to remove tokens until we reach the max_tokens limit. |
build_records_of_tokens_for_dom_tree
Permalink
weblinx.processing.truncation.build_records_of_tokens_for_dom_tree(dom_tree, tokenizer, sorted_by_length=True)
truncate_dom_tree
Permalink
weblinx.processing.truncation.truncate_dom_tree(dom_tree, tokenizer, num_tokens_to_remove, ellipsis="...", remove_when_none=True, copy=True, allow_iterative_reduction=False)
DescriptionPermalink
This function takes a dom tree and truncate it based on the number of tokens to remove.
It is not guaranteed that the resulting dom tree will have the exact number of tokens
as specified, but it will be close to the specified number of tokens. Please see
multi_attempt_truncate_dom_tree
for a more robust approach.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
dom_tree |
lxml.html.HtmlElement |
The dom tree to truncate. | |
tokenizer |
transformers.PreTrainedTokenizer (PreTrainedTokenizer) |
The tokenizer to use to tokenize the text. | |
num_tokens_to_remove |
int |
The number of tokens to remove from the dom tree. | |
ellipsis |
str |
"..." |
The ellipsis to use to indicate that the text has been truncated. Defaults to “…”. |
remove_when_none |
bool |
True |
If True, then we will remove the attribute if the value is None. Defaults to True. |
copy |
bool |
True |
If True, then we will copy the dom_tree before modifying it. Defaults to True. |
allow_iterative_reduction |
bool |
False |
If True, then we will allow the iterative reduction to continue until the max_tokens is reached. This is useful when the tokenizer output does not necessarily decrease when we remove tokens from the input. For example, if we remove a token that is part of a word, but the updated text is retokenized to the same number of tokens, then we will continue to remove tokens until we reach the max_tokens limit. |
ReturnsPermalink
lxml.html.HtmlElement
The truncated dom tree.
multi_attempt_truncate_dom_tree
Permalink
weblinx.processing.truncation.multi_attempt_truncate_dom_tree(dom_tree, tokenizer, ellipsis="...", max_attempts=5, max_tokens=700, warn_after_attempts=True, copy_tree=True, compute_final_num_tokens=False, allow_iterative_reduction=False)
DescriptionPermalink
” This will attempt to truncate the dom_tree to the specified number of tokens. THe reason we need more than one attempt is because when we specify a number of tokens to remove based on a max_tokens value, the resulting number of tokens may be greater than the max_tokens value, due to how tokenizers work. Therefore, we need to try multiple times to truncate the dom_tree to the specified number. If after max_attempts, we are still unable to truncate the dom_tree to the specified number of tokens, then we will truncate the resulting text directly.
convert_elem_dict_to_str
Permalink
weblinx.processing.truncation.convert_elem_dict_to_str(elem_dict, remove_empty=False)
DescriptionPermalink
Convert an element dictionary to a string.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
elem_dict |
dict |
The element dictionary. | |
remove_empty |
bool |
False |
If True, then we will remove any empty elements. Defaults to False. |
ReturnsPermalink
str
The string representation of the element dictionary.
truncate_cands_turn
Permalink
weblinx.processing.truncation.truncate_cands_turn(cands_turn, tokenizer, num_tokens_to_remove, protected_elem_keys=('"tag"', '"bbox"'), copy=True, remove_empty=True, allow_iterative_reduction=False)
DescriptionPermalink
This truncates the candidates turn to the specified number of tokens. This is useful when the candidates turn is too long, and we need to truncate it to fit within the maximum token length.
ParametersPermalink
Name | Type | Default | Description |
---|---|---|---|
cands_turn |
list |
The candidates turn to truncate. | |
tokenizer |
transformers.PreTrainedTokenizer (PreTrainedTokenizer) |
Th tokenizer to use to tokenize the text. | |
num_tokens_to_remove |
int |
The number of tokens to remove from the candidates turn. | |
protected_elem_keys |
tuple |
('"tag"', '"bbox"') |
The keys to protect from truncation. Defaults to (“tag”, “bbox”). |
copy |
bool |
True |
If True, then we will copy the candidates turn before modifying it. Defaults to True. |
remove_empty |
bool |
True |
If True, then we will remove any empty elements. Defaults to True. |
allow_iterative_reduction |
bool |
False |
If True, then we will allow the iterative reduction to continue until the max_tokens is reached. This is useful when the tokenizer output does not necessarily decrease when we remove tokens from the input. For example, if we remove a token that is part of a word, but the updated text is retokenized to the same number of tokens, then we will continue to remove tokens until we reach the max_tokens limit. |
ReturnsPermalink
list
The truncated candidates turn.
multi_attempt_truncate_cands_turn
Permalink
weblinx.processing.truncation.multi_attempt_truncate_cands_turn(cands_turn, tokenizer, max_tokens, format_candidates_fn, max_attempts=5, warn_after_attempts=True, protected_elem_keys=('"tag"', '"bbox"'), allow_iterative_reduction=False)
DescriptionPermalink
This is a more robust version of truncate_cands_turn
. It will attempt to truncate
the candidates turn to the specified number of tokens. If after max_attempts, we are
still unable to truncate the candidates turn to the specified number of tokens, then
we will truncate the resulting text directly, as a last resort.