ASPL User Manual v 1.00
© 2025 SetSphere.com



   Quotient Set for Document Similarity

The equivalence relation is generalized to act on the element names of a group and on the attributes of subgroups and elements. However, if the elements represent actual files on your system, their contents can also be compared for similarity. This is provided through the attribute dosi, which can be set in an ASPL equivalence relation.

One of the standardized attributes available with every element grouping class is dosi. To list the attributes describing the metadata in a workspace, type attributes at the ASPL prompt; the dosi attribute is described as "doc similarity measure" of type float.

In this chapter we show how to get the quotient set of the documents based on the similarity of their content.

ASPL provides a specialized processing facility to compare document similarity.
The named elements must represent actual files on the system where ASPL is running.



To find the elements that are equivalent to each other when comparing groups, one can define an equivalence relation.
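
As a reminder of what a quotient set is, the following minimal Python sketch partitions a collection into equivalence classes. It is an illustration only, not ASPL code: the function quotient_set, the key function, and the file names are hypothetical, and two elements are treated as equivalent when the key function returns the same value for both.

```python
from collections import defaultdict


def quotient_set(elements, key):
    """Partition elements into classes: a ~ b exactly when key(a) == key(b)."""
    classes = defaultdict(list)
    for element in elements:
        classes[key(element)].append(element)
    # Classes appear in the order in which their first member was seen.
    return list(classes.values())


# Hypothetical file names; two names are equivalent when they share an extension.
names = ["a.txt", "b.log", "c.txt", "d.log", "e.cfg"]
classes = quotient_set(names, key=lambda name: name.rsplit(".", 1)[-1])
print(classes)
```

Every element lands in exactly one class, which is what makes the result a partition of the original collection.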


■ Defining an Equivalence Relation for Document Similarity

Whenever the set variables represent groups of files in directories on your system, you can define an equivalence relation and put the attribute dosi into action by specifying the matching descriptor to use when comparing two elements (files) for similarity, for example:
  dosi=i0m3lv2p1
Putting dosi in the context of an equivalence relation, we can type:
  q r7 := frx=.*txt$,dosi=i0m3lv2p1
Set the dosi for document similarity
   dosi=i[01]p[0-N]m[0-N]lv[0-N] 
       i FOLLOWED BY 0 OR 1 TO IGNORE CASE 
       p FOLLOWED BY 0,1,2,.. FOR THE PROXIMITY
       m FOLLOWED BY 0,1,2,.. FOR THE MINIMUM LENGTH
       lv FOLLOWED BY 0,1,2,.. FOR THE LEVENSHTEIN MATCHING DISTANCE
     FOR EXAMPLE THE FOLLOWING dosi=i0m3lv2 IS A VALID DOCUMENT SIMILARITY THAT WILL
     IGNORE THE CASE, MATCH MINIMUM THREE CHARACTERS, AND DO A LEVENSHTEIN 
     DISTANCE OF TWO
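
To make the descriptor layout concrete, here is a minimal Python sketch that splits a dosi descriptor into its four fields. It is not ASPL's own parser. The function name parse_dosi and the result keys are hypothetical. The sketch assumes that fields may appear in any order (the examples in this chapter write i0m3lv2p1 while the syntax line lists i, p, m, lv) and that an omitted field defaults to 0.

```python
import re

# "lv" is listed first so that it is tried as a two-letter field name.
_FIELD = re.compile(r"(lv|i|p|m)(\d+)")

_NAMES = {
    "i": "ignore_case",
    "p": "proximity",
    "m": "min_length",
    "lv": "levenshtein",
}


def parse_dosi(descriptor):
    """Parse a descriptor such as 'i0m3lv2p1' into a dict of integers."""
    # Assumption: any field left out of the descriptor defaults to 0.
    result = {name: 0 for name in _NAMES.values()}
    pos = 0
    while pos < len(descriptor):
        match = _FIELD.match(descriptor, pos)
        if match is None:
            raise ValueError(f"bad dosi descriptor at offset {pos}: {descriptor!r}")
        field, value = match.groups()
        if field == "i" and value not in ("0", "1"):
            raise ValueError("the i field must be followed by 0 or 1")
        result[_NAMES[field]] = int(value)
        pos = match.end()
    return result


print(parse_dosi("i0m3lv2p1"))
```

For the descriptor i0m3lv2p1 this yields a minimum length of 3, a Levenshtein distance of 2, and a proximity of 1.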


The following example shows how to use the dosi attribute in an equivalence relation:

aspl WS1

aspl -wsname JUNKTEST -groupingclass POSIX

aspl>  q r5 := frx=.*txt$,dosi=i0m3p0

aspl>  q r6 := frx=.*txt$,dosi=i0m3lv2p0

aspl>  q r7 := frx=.*txt$,dosi=i0m3lv2p1

aspl>  dir1 = ggdir(dir,/tmp/docdoc1)

aspl>  dir2 = ggdir(dir,/tmp/docdoc2)

aspl>  ks dosi mtime chksum ppdd ffl

aspl>  f&/~r5 dir1 dir2

aspl>  f&/~r6 dir1 dir2

aspl>  f&/~r7 dir1 dir2

aspl>  y&/~r7 dir1 dir2


DOCUMENT SIMILARITY dosi
The set variables in WS1 represent groupings of directories on the UNIX system. The elements in a group can be either subdirectories or files. Here we want to select those elements (files) whose names contain oo, whose mtime attributes differ, and whose content has the string ibm.com anywhere in its body (otherwise the file is skipped). We want to measure the similarity between two elements (files) by using the descriptor i0m3lv2.
Let us define the relation r9:

aspl>  q r9 := frx=.*oo.*$,mtm~,body=ibm.com,dosi=i0m3lv2

To see how ASPL will process the equivalence relation r9 we type q r9 at the ASPL prompt, and the output is shown below:

aspl>  q r9

3:12:9 root@vienna /tmp  aspl:2 > q r9

  QUOTIENT SET BUILDER

   {f&/~r9  A12 B3 C2} <=>  f&/frx=.*oo.*$,mtm~,body=ibm.com,dosi=i0m3lv2  A12   B3   C2 

  Detailed view:

   {f&/~r9  A12 B3 C2} <=>

    f& / frx=.*oo.*$,mtm~,body=ibm.com,dosi=i0m3lv2  A12   B3   C2 
     | |      |        |        |            |        |     |    +----> set-variable
     | |      |        |        |            |        |     +---------> set-variable
     | |      |        |        |            |        +---------------> set-variable
     | |      |        |        |            +------------------------> document similarity
     | |      |        |        +-------------------------------------> file content regular expression
     | |      |        +----------------------------------------------> have different make times
     | |      +-------------------------------------------------------> file name regular expression
     | +--------------------------------------------------------------> stroking the Quotient Relation
     +----------------------------------------------------------------> gets the elements intersection

  Set builder syntax is read from left to right, or from bottom to top.
  All ASPL setops are setadic: they take a setop followed by set variables.

  Note that when typing the command: the setop, the stroke, and the quotient
  relation predicates must not include any space.

   - named files to match: .*oo.*$
   - body content  matching on the set operation: ibm.com
   - predicate condition on the set operation: mtm~
   - document similarity matching with the following parameters:
      1)string minimum length: 3
      2)levenshtein matching distance: 2
      3)ignore case: 0
      4)proximity: 0

  The f& is used as an example.
  The set-variables A12 B3 C2 are dummies and used for explanation
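
The Levenshtein matching distance reported in the output above is the classic edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The Python sketch below shows the standard dynamic-programming computation. How ASPL applies this distance to file contents (for example, which strings it compares) is not described here, so the helper within_distance and its ignore_case flag are illustrative assumptions only.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] holds the distance between the current prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def within_distance(a, b, max_distance, ignore_case=False):
    """Hypothetical helper: do a and b match within max_distance edits?"""
    if ignore_case:
        a, b = a.lower(), b.lower()
    return levenshtein(a, b) <= max_distance


print(levenshtein("kitten", "sitting"))
print(within_distance("color", "colour", 2))
```

With a maximum distance of 2, as in lv2, the strings color and colour would count as a match because they differ by one insertion.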


The frx predicate selects the elements whose names match the regular expression .*oo.*$, that is, any element whose name contains oo.
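
The effect of such a name pattern can be tried outside ASPL. The following Python sketch applies the patterns used in this chapter to a hypothetical list of file names. It assumes that ASPL's regular expression dialect behaves like Python's re module for these simple patterns, which this manual does not confirm; the function select_names is illustrative.

```python
import re


def select_names(names, pattern):
    """Return the names in which the regular expression finds a match."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.search(name)]


# Hypothetical element names.
names = ["foo.txt", "notes.txt", "book.pdf", "readme.md", "txt.bak"]

contains_oo = select_names(names, r".*oo.*$")   # names containing "oo"
ends_in_txt = select_names(names, r".*\.txt$")  # names ending in ".txt"
print(contains_oo)
print(ends_in_txt)
```

Note that the trailing $ anchors the match at the end of the name, so txt.bak is not selected by the second pattern.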

The effect of some simple equivalence relations can be the same as that of the precoded predicates (which can be affixed to set operators, following the tick ` operator). In particular, the attributes mtime, chksum, and entropy (which form a common denominator across the attributes of all grouping classes) are tickable predicates that can be attached to set operators. For instance, f&`mtm= is equivalent to f&/mtm=. Consult the "ASPL Operations Guide" for the set operators and their predicates.

It is convenient to define quotient relations of your own. You can define your own equivalence relation by using the q operator, followed by an identifier as the relation name:

aspl>  q r1 := frx=.*wsadmin.*,mtm~
     define q quotient relation r1

aspl>  q r2 := frx=.*\.txt$
     define q quotient relation r2

aspl>  q r1
     print r1 definition

Once an equivalence relation is defined with the symbolic name r1, you can use it in a set operation by following the set operator with /~r1, as follows:

aspl>  f&/~r1

This is read as: get the elements intersection such that the equivalence relation r1 is satisfied.
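
One possible reading of that sentence, for a relation such as r1 (name matches a pattern, mtimes differ), is sketched below in Python. This is an interpretation for illustration, not ASPL's implementation: each group is modeled as a hypothetical dictionary mapping element names to mtime values, and the function intersect_under_relation is an assumed name.

```python
import re


def intersect_under_relation(group_a, group_b, name_pattern):
    """Names present in both groups that match the pattern and differ in mtime.

    Each group is a dict mapping an element name to its mtime (assumed model).
    """
    compiled = re.compile(name_pattern)
    common = group_a.keys() & group_b.keys()
    return sorted(
        name for name in common
        if compiled.search(name) and group_a[name] != group_b[name]
    )


# Hypothetical groups with made-up mtime values.
dir1 = {"wsadmin.sh": 100, "wsadmin.cfg": 200, "notes.txt": 300}
dir2 = {"wsadmin.sh": 150, "wsadmin.cfg": 200, "other.txt": 400}

result = intersect_under_relation(dir1, dir2, r".*wsadmin.*")
print(result)
```

Here only wsadmin.sh is returned: wsadmin.cfg has the same mtime in both groups, and the remaining names are not present in both.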