ASPL User Manual v 1.00
© 2025 SetSphere.com
The equivalence relation is generalized to act on the element names of a group and on the attributes of subgroups and elements. If the elements represent actual files on your system, it is also possible to compare the similarity of their contents. This is done through the dosi attribute, which can be set in an ASPL equivalence relation.
One of the standardized attributes available with every element grouping class is dosi. To list the attributes that describe the metadata in a workspace, type attributes at the ASPL prompt; the dosi attribute is described as "doc similarity measure", of type float.
In this chapter we show how to get the quotient set of the documents based on the similarity of their content.
ASPL provides a specialized processing facility to compare document similarity.
The named elements must represent actual files on the system where ASPL is running.
To find the elements that are equivalent to each other when comparing groups, one can define an equivalence relation.
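As a rough illustration of what a quotient set is, the following Python sketch partitions a list of names into equivalence classes. It is not ASPL code: the function quotient_set, the sample file names, and the "same extension" relation are all hypothetical and chosen only to show the idea.

```python
from collections import defaultdict

def quotient_set(elements, key):
    """Partition elements into equivalence classes.

    Two elements are treated as equivalent when key() returns the
    same value for both (an assumption made for this sketch).
    """
    classes = defaultdict(list)
    for element in elements:
        classes[key(element)].append(element)
    return list(classes.values())

# Hypothetical file names; the relation here is "same extension".
files = ["a.txt", "b.log", "c.txt", "d.log", "e.md"]
classes = quotient_set(files, key=lambda name: name.rsplit(".", 1)[-1])
print(classes)
```

Each inner list is one equivalence class; together the classes cover every element exactly once.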
■ Defining an Equivalence Relation for Document Similarity
When the set variables represent groups of files in directories on your system, you can define an equivalence relation that puts the dosi attribute into action by specifying a matching descriptor for comparing two elements (files), for example:
dosi=i0m3lv2p1
Putting dosi in the context of an equivalence relation, we can type:
q r7 := frx=.*txt$,dosi=i0m3lv2p1
Set the dosi for document similarity
dosi=i[01]p[0-N]m[0-N]lv[0-N]
i FOLLOWED BY 0 OR 1 TO IGNORE CASE
p FOLLOWED BY 0,1,2,.. FOR THE PROXIMITY
m FOLLOWED BY 0,1,2,.. FOR THE MINIMUM LENGTH
lv FOLLOWED BY 0,1,2,.. FOR THE LEVENSHTEIN MATCHING DISTANCE
FOR EXAMPLE THE FOLLOWING dosi=i0m3lv2 IS A VALID DOCUMENT SIMILARITY THAT WILL
IGNORE THE CASE, MATCH MINIMUM THREE CHARACTERS, AND DO A LEVENSHTEIN
DISTANCE OF TWO
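The descriptor syntax above can be made concrete with a small parser. The Python sketch below is not part of ASPL; parse_dosi and the dictionary key names are hypothetical. It assumes that the fields may appear in any order and may be omitted, as the examples in this chapter suggest, and it does not assume any default values.

```python
import re

# One field per code; "lv" is listed first so it is matched as a unit.
_FIELD = re.compile(r"(lv|i|p|m)(\d+)")
_NAMES = {"i": "ignore_case", "p": "proximity",
          "m": "min_length", "lv": "levenshtein"}

def parse_dosi(descriptor):
    """Parse a descriptor such as 'i0m3lv2p1' into a dict of integers.

    Absent fields are left out of the result, because the defaults
    used by ASPL are not stated in this chapter.
    """
    result = {}
    pos = 0
    while pos < len(descriptor):
        match = _FIELD.match(descriptor, pos)
        if match is None:
            raise ValueError(f"unexpected text at offset {pos}: {descriptor[pos:]!r}")
        code, value = match.groups()
        if _NAMES[code] in result:
            raise ValueError(f"duplicate field {code!r}")
        if code == "i" and value not in ("0", "1"):
            raise ValueError("i must be followed by 0 or 1")
        result[_NAMES[code]] = int(value)
        pos = match.end()
    return result

print(parse_dosi("i0m3lv2p1"))
```

Rejecting unknown or duplicate fields is a design choice of this sketch; how ASPL itself reacts to a malformed descriptor is not described here.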
The following example shows how to use the dosi attribute in an equivalence relation:
# aspl WS1
# aspl -wsname JUNKTEST -groupingclass POSIX
aspl> q r5 := frx=.*txt$,dosi=i0m3p0
aspl> q r6 := frx=.*txt$,dosi=i0m3lv2p0
aspl> q r7 := frx=.*txt$,dosi=i0m3lv2p1
aspl> dir1 = ggdir(dir,/tmp/docdoc1)
aspl> dir2 = ggdir(dir,/tmp/docdoc2)
aspl> ks dosi mtime chksum ppdd ffl
aspl> f&/~r5 dir1 dir2
aspl> f&/~r6 dir1 dir2
aspl> f&/~r7 dir1 dir2
aspl> y&/~r7 dir1 dir2
DOCUMENT SIMILARITY dosi
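The lv field of the descriptors used in r6 and r7 refers to the Levenshtein (edit) distance. The Python sketch below shows the standard distance computation and one possible way a case flag, a minimum length, and a maximum distance could be combined when comparing two strings. The helper words_match and its parameters are hypothetical; how ASPL actually applies these fields internally is not documented in this chapter.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def words_match(a, b, ignore_case, min_length, max_distance):
    """Hypothetical matcher: an assumption, not ASPL's algorithm."""
    if ignore_case:
        a, b = a.lower(), b.lower()
    # Strings shorter than the minimum length are never matched here.
    if len(a) < min_length or len(b) < min_length:
        return False
    return levenshtein(a, b) <= max_distance

print(levenshtein("kitten", "sitting"))
print(words_match("IBM", "ibm", ignore_case=True, min_length=3, max_distance=2))
```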
The set variables in WS1 represent groupings of directories on the UNIX system. The elements in a group can be either subdirectories or files. Here we want to select those elements (files) whose names contain oo, whose mtime attributes differ, and whose content includes the string ibm.com anywhere in the body (files without it are skipped). We want to measure the similarity between two elements (files) using the descriptor i0m3lv2.
Let us define the r9 relation:
aspl> q r9 := frx=.*oo.*$,mtm~,body=ibm.com,dosi=i0m3lv2
aspl> q r9
3:12:9 root@vienna /tmp aspl:2 > q r9
QUOTIENT SET BUILDER
{f&/~r9 A12 B3 C2} <=> f&/frx=.*oo.*$,mtm~,body=ibm.com,dosi=i0m3lv2 A12 B3 C2
Detailed view:
{f&/~r9 A12 B3 C2} <=>
f& / frx=.*oo.*$,mtm~,body=ibm.com,dosi=i0m3lv2 A12 B3 C2
|  | |           |    |            |            |   |  +----> set-variable
|  | |           |    |            |            |   +-------> set-variable
|  | |           |    |            |            +-----------> set-variable
|  | |           |    |            +------------------------> document similarity
|  | |           |    +-------------------------------------> file content regular expression
|  | |           +------------------------------------------> have different mtime (modification times)
|  | +------------------------------------------------------> file name regular expression
|  +--------------------------------------------------------> stroking the Quotient Relation
+-----------------------------------------------------------> gets the elements intersection
Set builder syntax is read from left to right, or from bottom to top.
All ASPL setops are setadic: they take a setop followed by set variables.
Note that when typing the command: the setop, the stroke, and the quotient
relation predicates must not include any space.
- named files to match: .*oo.*$
- body content matching on the set operation: ibm.com
- predicate condition on the set operation: mtm~
- document similarity matching with the following parameters:
1)string minimum length: 3
2)levenshtein matching distance: 2
3)ignore case: 0
4)proximity: 0
The f& is used as an example.
The set-variables A12 B3 C2 are dummies and used for explanation
The frx predicate selects the elements whose names match the regular expression .*oo.*$, that is, any element whose name contains oo.
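To see which names such a pattern selects, the expression can be tried outside ASPL. The Python sketch below applies .*oo.*$ and .*txt$ to a hypothetical list of file names using Python's re module. The regular-expression dialect used by ASPL may differ in details, so this is only an approximation.

```python
import re

# Hypothetical element names, used only for illustration.
names = ["foo.txt", "notes.txt", "book.pdf", "README"]

contains_oo = re.compile(r".*oo.*$")   # same pattern as frx=.*oo.*$
ends_in_txt = re.compile(r".*txt$")    # same pattern as frx=.*txt$

selected_oo = [n for n in names if contains_oo.match(n)]
selected_txt = [n for n in names if ends_in_txt.match(n)]

print(selected_oo)
print(selected_txt)
```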
The effect of some simple equivalence relations can be the same as that of the precoded predicates, which can be affixed to set operators following the tick (`) operator.
In particular, the attributes mtime, chksum, and entropy, which form a common denominator across the attributes of all grouping classes, are tickable predicates that can be attached to set operators. For instance, f&`mtm= is equivalent to f&/mtm=. Consult the "ASPL Operations Guide" for the set operators and their predicates.
It is convenient to define quotient relations of your own. You can define your own equivalence relation by using the q operator, followed by an identifier as the relation name:
aspl> q r1 := frx=.*wsadmin.*,mtm~
define q quotient relation r1
aspl> q r2 := frx=.*\.txt$
define q quotient relation r2
aspl> q r1
print r1 definition
aspl> f&/~r1