/* 
RANKTIE: computes ranks, with ties, of the columns of a matrix.

FORMAT:  r = ranktie(x,tiemethd,descend,groups,normal,csavage);

INPUTS:

        x --       matrix containing variables (columns) to be ranked. 

 tiemethd --       If == 1, then tied values will be assigned mean rank.
                    If == 2, then tied values will be assigned high rank.
                    If == 3, then tied values will be assigned low rank.
                    For instance, if the data values were: 4   5   5 3 1 6
                         With tiemethd==1, ranks would be: 3 4.5 4.5 2 1 6
                         With tiemethd==2, ranks would be: 3   5   5 2 1 6
                         With tiemethd==3, ranks would be: 3   4   4 2 1 6 
 descend --      If == 1, then the highest value is given the rank of 1,
                    and so on.  If == 0, then the lowest value is given the
                    rank of 1.
 groups --       If GROUPS=n, where n is some positive integer, then
                    a matrix _QURANKS will be computed containing the
                    quantile ranks based upon n groups.  _QURANKS will be
                    returned to memory.

                    If n=4, then quartile ranks will be computed; if
                    n=10, decile ranks will be computed; if n=100,
                    percentile ranks will be computed.

                    The quantile values start with 0  -- thus, if n=4,
                    the observations falling into the lowest quantile
                    are assigned the value 0, and the observations
                    falling into the highest quantile are assigned
                    the value 3. 
 normal --         If > 0, will compute a matrix _NORMS containing normal
                    scores from the ranks. _NORMS will be returned to memory.

                    Normal scores are approximations to the exact expected 
                    order statistics for the normal distribution.
                    If == 1, will compute normal score using BLOM:
                           norms=cdfinvn((r-3/8)/(n+1/4))
                    If == 2, will compute normal score using TUKEY:
                           norms=cdfinvn((r-1/3)/(n+1/3))
                    If == 3, will compute normal score using VW
                    (van der Waerden):
                           norms=cdfinvn(r/(n+1))
                    where n is the number of non-missing observations (for
                    each variable separately), and where
                    cdfinvn is the inverse of the cumulative distribution
                    function for the standard normal distribution.

                    Of these, the first option (BLOM) is reported to
                    give the best fit in most cases.
 csavage --        If == 1, will compute Savage (exponential)
                    scores from the ranks, and place them in a matrix 
                    _SAVAGE in memory.  The score for the ith observation,
                    si, is computed from the formula:
                    si = [sum(1/j)]-1, where summation is over
                    j=n-ri+1 to n, with ri=rank of the ith observation,
                    and with n the number of non-missing observations.
                    Fractional ranks (from tied values) are truncated
                    before the sum is formed.

  OUTPUT:  

      r --         NxK matrix (same size as X), containing the ranks.

     /* the following will be returned to memory under specified conditions */

   _quranks --     if GROUPS > 0; see GROUPS

    _norms --      if NORMAL > 0; see NORMAL

   _savage --      if CSAVAGE == 1; see CSAVAGE

    _nmiss --      Kx1 vector containing the number of missings in each
                   column of the input (and output) matrix; all zeros if
                   no missings are encountered.

REMARKS:

This proc will compute the ranks of the columns of a matrix. It is 
functionally similar to the SAS RANK procedure.

This proc ranks variables from smallest to largest (or the reverse,
if specified), with the smallest value assigned the rank 1, and the largest
assigned n, the number of observations.

The observations in the matrix RANKS will be in the same order as
in the input data matrix (X) -- but the initial values will be replaced by
ranks.  The matrix RANKS will always have the same number of rows and
columns as the original data matrix.

This proc will work for a matrix with an arbitrary number of columns
(variables).  However, there is the restriction that the entire data
matrix must fit into memory at one time.  Because of this, there can
be no more than 4090 rows (observations) in the data matrix.  If there
are this many rows, then there can be at most 2 variables.  However,
it is always possible to read in the data 1 or 2 variables at
a time, and to compute the ranks for these variables.  Then these ranks
can be saved to disk in matrix or data files, and the ranks for all
variables can then be concatenated into one data set.  Since the order
of observations is preserved (even if there are missing values), it
is easy to concatenate these matrices.

Missing values are passed through all of the computations and are
returned as missing values in the corresponding rows and columns of the
output matrices.  (Internally, missings are converted to a very large
number before ranks are computed, and the resulting ranks are then
converted back to missings.)  Thus, the results are what they would be
if the missing values had not been present, but the output matrices
always have the same size as the input matrices.  If it is necessary to
get rid of the missing values before using the output matrices, use
MISSRV to convert them to numeric values, or PACKR to delete rows that
contain any missings.

If missing values are encountered in any variable, a message will be
printed at the completion of the proc specifying the number of missings
encountered in each variable.  At the completion of the proc, the
vector _NMISS contains the number of missing values in each variable.

Three methods are available for handling ties.  Ties can be assigned
the mean value, the high value, or the low value.  This feature is
controlled by setting the variable TIEMETHD to the values 1, 2, or 3
respectively. See TIEMETHD above for details.

Setting the variable DESCEND=1 will cause the ranks to be assigned in
descending order, so that the highest value is assigned the rank of
1, and so on.
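
As a minimal sketch of the tie and ordering options, assuming the proc has
been compiled and using the data from the TIEMETHD description above (the
names x and r are illustrative only):

    x = { 4, 5, 5, 3, 1, 6 };
    r = ranktie(x,1,0,0,0,0);
    print r';

With TIEMETHD=1 and DESCEND=0 this should give 3 4.5 4.5 2 1 6, as in the
table under TIEMETHD.  Changing the third argument to 1,

    r = ranktie(x,1,1,0,0,0);

reverses the order, so that the value 6 receives rank 1 and the two tied
5's share the mean of ranks 2 and 3, giving 4 2.5 2.5 5 6 1.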

Setting the variable GROUPS equal to some positive integer will cause
a matrix _QURANKS to be computed, containing the quantile ranks of the
variables, with the number of quantiles equal to the number specified
by GROUPS.  Thus, if GROUPS=4, quartile ranks will be computed.  At the
completion of the proc, _QURANKS will be in memory as well as RANKS
and any other matrices specified. See GROUPS above for more details.
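
As an illustration, taking the formula from the SQURANKS subroutine below
(quantile rank = floor(rank*GROUPS/(n+1)), with n the number of
non-missing observations): for n=6 and GROUPS=4, the ranks 1 2 3 4 5 6
map to the quartile ranks 0 1 1 2 2 3.  A sketch of retrieving the
result, assuming the global was written successfully (x, r, and q are
illustrative names only):

    x = { 4, 5, 5, 3, 1, 6 };
    r = ranktie(x,1,0,4,0,0);
    q = varget("_quranks");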

Setting the variable NORMAL equal to 1, 2, or 3 will cause the matrix
_NORMS to be computed.  This will contain normal scores, computed by
one of 3 methods (these are discussed above).
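
As a worked check of the BLOM option: with n=3 non-missing observations,
the ranks 1, 2, 3 give the probabilities

    (1-3/8)/(3+1/4) = 0.1923
    (2-3/8)/(3+1/4) = 0.5
    (3-3/8)/(3+1/4) = 0.8077

so the middle observation should have a normal score of 0, and the other
two should have scores of equal size and opposite sign (approximately
-0.87 and 0.87).  A sketch of the corresponding call, with x, r, and z
as illustrative names only:

    x = { 10, 20, 30 };
    r = ranktie(x,1,0,0,1,0);
    z = varget("_norms");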

Setting the variable CSAVAGE equal to 1 will cause the matrix _SAVAGE
to be computed.  This will contain Savage (exponential) scores,
computed from the ranks.  See the discussion above for more details on this.
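
As a worked check of the Savage formula: with n=3 non-missing
observations and no ties,

    rank 1:  1/3 - 1            = -0.6667
    rank 2:  1/2 + 1/3 - 1      = -0.1667
    rank 3:  1 + 1/2 + 1/3 - 1  =  0.8333

and the three scores sum to zero.  A sketch of the corresponding call,
with x, r, and s as illustrative names only:

    x = { 10, 20, 30 };
    r = ranktie(x,1,0,0,0,1);
    s = varget("_savage");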

--------------------------------------------------------------------------- */ 
proc ranktie(x,tiemethd,descend,groups,normal,csavage);
local havemiss, ranks, nmiss, norms, dvpt, k, n, ranki, missval, i, nmissi,
      notmiss, quranks, p, psv, rsav, savage, ni, rnk, rseq, cx, rx, obs,
      sx, mask, mask1, i0, i1, xrank, lim, p0, q0, p1, q1, p2, q2, p3,
      q3, p4, q4, maskgt, maskeq, sgn, pn, y;
clear havemiss, ranks, nmiss, norms; dvpt = 1;
@ Computation -- computation is controlled by the following calls
                 to subroutines.  This makes it easier to put this
                 section of code within a data loop that reads in
                 all the data, selects one variable at a time, and
                 then saves the results in another data set.    @

   gosub init;
   gosub sranks;
   gosub squranks;
   gosub snorms;
   gosub ssavage;
   dvpt = varput(nmiss,"_nmiss");
   if not dvpt;
"ERROR: Symbol table full. Could not write matrices to memory."; 
   endif;
if havemiss;
   ndpclex;   @ This clears the 8087 exceptions when there are missings. @
endif;
retp( ranks );
@ ********************************************************************** @
@       It is not usually necessary to make changes below here.          @
@                                                                        @
@ ---------------------- SUBROUTINES FOLLOW ---------------------------- @


@ Initializations -- data matrix must be called X. @
init:
k=cols(x); n=rows(x);
clear ranks, ranki, havemiss;
missval=pi*1e+300;@ This specifies the numeric value to be used for
                    missing values.  If a very large value is chosen,
                    and if DESCEND=0, then the observations with missing
                    values will be assigned the highest ranks.  If any
                    missing values have been encountered, a message will
                    be printed at the end of the proc specifying how
                    many missing values there are for each variable. @
return;

 sranks:
@ Compute ranks separately for every variable, and return a matrix RANKS
  that is the same size as the original data matrix. This section of
  code does most of the work.   It is arranged as a set of subroutines
  that are called from the main body of the proc. @

 i=1;
 do until i > k;
    gosub comprank(x[.,i]);
    nmissi=n-rows(packr(x[.,i]));  @ Number of missings in this var @
   if nmissi;
    ranki=miss(ranki,maxc(ranki)); @ Convert ranks of missings to missing. @
   endif;
   if i == 1; ranks=ranki; nmiss=nmissi;
      else;   ranks=ranks~ranki; nmiss=nmiss|nmissi;
   endif; clear ranki;
  i=i+1;
  endo;
  notmiss=n-nmiss;
  if not nmiss == 0; havemiss=1; endif;  @ General flag for missings. @

return;

@ If the GROUPS variable has been specified, then compute quantiles
  based upon the number of groups given. @

squranks:
if groups > 0;
   quranks=floor((ranks*groups)./(notmiss+1)');
   dvpt = varput(quranks,"_quranks");
endif;
return;

@ If NORMAL > 0, then compute normal scores using one of the three
  options.   @
snorms:
if normal > 0;
       if normal == 1;   p=(ranks-3/8)./(notmiss+1/4)';
   elseif normal == 2;   p=(ranks-1/3)./(notmiss+1/3)';
   elseif normal == 3;   p=ranks./(notmiss+1)';
    else; "ERROR: The control variable NORMAL can only take on the values";
          "0, 1, 2, or 3.";
   endif;
 gosub cdfinvn;
 dvpt = varput(norms,"_norms");

endif;
return;

@ If CSAVAGE == 1, then compute the savage scores @
ssavage:
if csavage;
  if not havemiss;    @ True number of obs is the same for all variables. @
  @ Vector of partial sums of sequence 1/(n+1-i), i=1,...,n --
    given this, just need to pull the correct elements out of
    this vector to obtain SAVAGE. @

 psv = 1/(n+1-seqa(1,1,n));
 rsav=recserar(psv,psv[1,1],1);

   @ The following code pulls out the partial sums corresponding to
     the values of RANKS, and reshapes them into a matrix the same size
     as RANKS.  If there are ties, the ranks can be fractional; these
     are truncated before being used as indices. @

  savage=reshape(submat(rsav,trunc(ranks),0)-1,n,k);  clear rsav;

  else;      @ Missings -- true number of obs differs among variables. @

 i=1;        @ Compute SAVAGE for each variable in turn. @
 do until i > k;
       ni=notmiss[i,.]; rnk=missrv(ranks[.,i],n);  @ Convert missing
                                                     ranks to highest val @
       rseq=ni+1-seqa(1,1,n);
       rseq=1/miss(rseq.*(rseq .> 0),0); @ Convert 0 or negative values to M @

       rsav=recserar(rseq,rseq[1,1],1);

    if i == 1;
       savage=submat(rsav,trunc(rnk),0)-1;  clear rsav;
       else;
       savage=savage~submat(rsav,trunc(rnk),0)-1;  clear rsav;
    endif;
  i=i+1;
  endo;

 endif;
dvpt = varput(savage,"_savage");
endif;
return;

send:
cls;
"  The RANKS proc is completed.  The matrix RANKS remains in memory,
  along with the matrices QURANKS, NORMS, and/or SAVAGE, if these
  have been requested using the appropriate options.";
if not nmiss == 0;
?;?;
"  NOTE: Missing values have been encountered.  The number of
  missing values, by variable, is:";
  format 3,0; nmiss'; format 10,6;
endif;
return;

@ COMPRANK -- This subroutine computes the ranks for a vector, and
              returns the rank vector in the original order.   @

comprank: pop cx;
@ Step 1 -- sort and retain rank index @

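if descend == 0;
   cx=missrv(cx,missval);    @ Missings become very large: sorted last,
                               so they receive the highest ranks. @
else;
   cx=missrv(cx,-missval);   @ Missings become very small: sorted first,
                               which gives the highest ranks when the
                               order is descending. @
endif;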
rx = sortc(seqa(1,1,rows(cx))~cx,2);
clear cx;

obs=rx[.,1];  @ Pull out the original observation number @

if descend == 0;          @ Initial ranks. @
   ranki=seqa(1,1,n);      @ Ascending order @
 else;
   ranki=seqa(n,-1,n);     @ Descending order @
endif;

@ Step 2 -- lag and trim sorted x to compute vector of 1's and 0's
            corresponding to ties @

sx=rx[.,2];     @ Pull out sorted data vector @
   clear rx;
mask = 0|(sx[2:rows(sx),.] .== sx[1:rows(sx)-1,.]);
                    @ Create vector of 1's and 0's, corresponding to
                     equal contiguous elements in the sorted data
                     vector. This identifies the positions of the ties.
                     This vector is the same length as the original data
                     vector. The first element is set equal to 0
                     automatically. @
    clear sx;
@ Step 3 -- Go through a loop that deals with each set of ties
            in turn. @

first:
mask1=zeros(n,1);
i0=maxindc(mask);       @ This  produces  the index of the first
                          max value.  If this == 1, then there are no 1's
                          in MASK, and therefore no ties in the current
                          mask.  @

if i0 == 1; goto last; endif;     @ Quit if there are no ties in current
                                    mask. @

    mask[1:i0-1,1]=ones(i0-1,1)*2;   @ Fill in first  i0-1 elements with
                                       2's @

    i1=minindc(mask);             @ Find the first 0, if one exists, in
                                     the transformed mask. @

   if i1 == i0;                  @ There are no zeros, which means
                                   that all values after i0 are tied. @

       mask1[i0-1:n,1]=ones(n-i0+2,1);

         gosub tieranks;     @ Jump to the subroutine that computes
                               the appropriate ranks for the tied
                               set. @
         goto last;

       else;                      @ There is a 0 after a set of 1's, so
                                    that not all remaining values are tied.@

        mask1[i0-1:i1-1,1]=ones(i1-i0+1,1);  @ Create new mask with 1's only
                                               corresponding to current set
                                               of ties. @

          gosub tieranks;      @ Jump to the subroutine that computes
                                 the appropriate ranks for the tied
                                 set. @

   mask[1:i1-1,1]=zeros(i1-1,1);   @ Fill in the first i1-1 elements of the
                                     mask with zeros. This eliminates the ones
                                     corresponding to the current set of tied
                                     values. @
    endif;

goto first;     @ Go back to the top and start the procedure all over
                  again. @
last:

ranki=submat(sortc(ranki~obs,2),0,1);  @ Sort by OBS vector to get back
                                        in original order. @
clear mask, mask1, obs;
return;

@ TIERANKS -- This subroutine computes the correct ranks for tied
              values. @
tieranks:

xrank=packr(miss(ranki.*mask1,0));  @ This uses the fact that ranki can
                                      contain no 0's. @
if tiemethd == 1;     @ Assign mean rank to ties. @

   rnk=meanc(xrank);     @ Compute mean rank @

  elseif tiemethd == 2;  @ Assign high value to ranks. @

   rnk=maxc(xrank);

  elseif tiemethd == 3;  @ Assign low value to ranks. @

   rnk=minc(xrank);

  else; "ERROR: TIEMETHD must be in the range 1-3.";
endif;

   ranki=(.not mask1).*ranki+mask1*rnk;  @ Substitute rnk for ranks @
clear xrank;
return;

@ CDFINVN -- Subroutine to compute the inverse normal cdf.
             The algorithm used here is taken from
             Kennedy and Gentle, STATISTICAL COMPUTING,
             p. 95.

             This code can be called as a subroutine as:

             gosub cdfinvn;  It requires a matrix p in memory
             and will return a matrix norms,
             where p contains probabilities (that
             is, 0 < p < 1), and where norms contains the corresponding
             values such that: CDFN(norms)=p;

             This algorithm seems to be very accurate.  It is
             rated to be accurate to 1e-7.  However, it appears
             to be substantially more accurate -- approximately
             1e-10.    @

cdfinvn:
@ Constants @

lim =  1e-20;

p0  = - 0.322232431088;             q0  =   0.0993484626060;
p1  = - 1.0;                        q1  =   0.588581570495;
p2  = - 0.342242088547;             q2  =   0.531103462366;
p3  = - 0.0204231210245;            q3  =   0.103537752850;
p4  = - 0.453642210148*1e-4;        q4  =   0.38560700634*1e-2;

@ Main body of code @

 @ Create masks for handling p > .5 and p = .5 @

maskgt=(p .> 0.5);
maskeq=(p .ne 0.5);
sgn = missrv(miss(maskgt,0),-1);

 @ Convert p > .5 --> 1-p @

pn=(maskgt-p).*sgn;   clear maskgt;

 @ Computation of function for p < 0.5 @

y=sqrt(abs((-2*ln(pn))));     clear pn;

y=miss(-missrv(-y,missval),-missval);    @ Get rid of +NAN's @

norms=y + ((((y*p4 + p3).*y + p2).*y + p1).*y + p0)./
       ((((y*q4 + q3).*y + q2).*y + q1).*y + q0);  clear y;

 @ Convert results for p > .5 and p = .5 @

norms=(norms.*sgn).*maskeq;
clear sgn, maskeq, p;
return;

endp;
