| Title: | Mathematics at DEC | 
| Moderator: | RUSURE::EDP | 
| Created: | Mon Feb 03 1986 | 
| Last Modified: | Fri Jun 06 1997 | 
| Last Successful Update: | Fri Jun 06 1997 | 
| Number of topics: | 2083 | 
| Total number of notes: | 14613 | 
         Hello,
         I was wondering if any of the readers of this conference
         would have some ideas on the types of functions best suited
         for a curve fit to a NON-DECREASING INTEGRAL data set. The
         problem that I'm running into is that the data has lots of
         flat spots which makes things more difficult. All comments,
         suggestions, or references are welcome.
         Restrictions:
         1. There MUST be a way to reconstruct the original data points 
            (integers) from the function.
            i.e. data(i) = round( f(i) ) 
            The behavior of the function between the data points does 
            not matter.  
         2. The function must be somewhat easy to express, i.e. A one
            one hundred term polynomial is NOT desirable.  I'd prefer
            a function that has < 10 coefficients, exponents, and
            constants.  That is, the number of coefficients + the number
            of exponents + a constant is less than 10 or so.
         Here's an example data set and a plot of it.
          i      :  0  1  2  3  4  5  6  7  8  9
          DATA(i):  0  1  1  4  5  5  5  6  7  8
                         9
                         8                   *
                         7                 *
                         6               *
                 DATA(i) 5         * * *
                         4       * 
                         3
                         2
                         1   * *       
                         0 *       
                           0 1 2 3 4 5 6 7 8 9 
                                    i
         Find a function that fits this data such that:
             1. data(i) = round(f(i))
             2. f(i) is not too messy (I know this is vague!).
         Note that in reality the data set will be much larger,
         perhaps n > 500.
         Thanks in advance for your input.
         Sincerely,
         John Argentati
| T.R | Title | User | Personal Name | Date | Lines | 
|---|---|---|---|---|---|
| 977.1 | Let's solve something else - this one's too messy | AKQJ10::YARBROUGH | I prefer Pi | Wed Nov 16 1988 09:32 | 14 | 
| Your only hope here appears to be in segmenting the function so that each segment can be fitted with a low-order polynomial. The 'flat spots', or regions with constant derivative, suggest that an overall polynomial fit will have wild ups and downs between points. In other words, if you fit a continuous differentiable function Y through 500 points at random, there probably exists an x-value for which Y(x) is substantially larger in absolute value than any of the input data points. However, we are probably already looking too closely at the details of (one possible) implementation before really knowing what *problem* you are trying to solve. If you were to describe that, we might be able to see an approach that completely avoids the issue you raised. Lynn Yarbrough | |||||
| 977.2 | HPSTEK::XIA | Wed Nov 16 1988 11:52 | 8 | ||
|     re. 0
    
    SPLINE should be a good way of doing it.  As .1 said, if you use
    high order polynomials, there is a danger of large fluctuation between
    data points.  On the other hand, if you are sure intermediate points
    do not matter, then....  Check any numerical analysis book on how
    to do SPLINE.
    Eugene
 | |||||
| 977.3 | DWOVAX::YOUNG | Note early. Note Often. | Thu Nov 17 1988 00:46 | 7 | |
|     Re .2:
    
    Splines are too messy.  An exact fit spline for a 500 data point
    set would probably be even more complex, in number of terms and
    segments, than a polynomial.
    
    --  Barry
 | |||||
| 977.4 | HPSTEK::XIA | Thu Nov 17 1988 01:31 | 10 | ||
|     re .2 .3
    Well, if as .1 said that 500 points are already fine enough,
    why not just link these points with straight lines.  That will be
    easy enough.  On the other hand 500 is not unmanagable by SPLINE.
    What it amounts to is solving a tri-diagnal matrix (maybe even simpler
    than that.  I have not touch the stuff for two years.  So....) of 
    500X500 which shouldn't be too bad if you have an efficient matrix 
    package.  
    
    Eugene
 | |||||
| 977.5 | HPSTEK::XIA | Thu Nov 17 1988 01:34 | 5 | ||
|     re .2
    An important point I want to emphasize is that a polynomial
    of degree 499 is unsafe.
      
    Eugene
 | |||||
| 977.6 | AITG::DERAMO | Daniel V. {AITG,ZFC}:: D'Eramo | Thu Nov 17 1988 18:43 | 120 | |
|      I remember a calendar day-of-the-week algorithm that came up
     with a good representation for the running total of the
     number of days in a month, modulo seven.  This was clever,
     but having only twelve data points must have helped a lot.
     
     The calendar stuff follows.  It is interesting but won't
     help for five hundred data points.
     
     Dan
     
     I saw something like this used to encode an almost linear
     function used in a day-of-the-week algorithm.  Our calendar
     has a period of 400 years.  Leap year occurs in years
     divisible by 4, unless also divisible by 100 when there is
     no leap year, unless also divisible by 400 when there is.
     So every 400 years, the leap year cycle repeats.  It so
     happens that these 400 years have a number of days which is
     an exact multiple of 7.  So after 400 years the next leap
     year cycle starts on the same day of the week as the first
     previous cycle.  This led someone to develop a day of the
     week formula.
     So number the days of the modulo 7, with Sunday = 0 up to
     Saturday = 6.  Let y, m, and d be the year, month, and day
     numbers.  So for today, Thursday November 17, 1988, we have
     y = 1988, m = 11, d = 17, and the day of the week is 4.
     It is easiest to handle leap day as the last day of the
     year, so we slightly massage the y, m, d inputs by treating
     March as the first month, and February as the twelfth month
     of the previous year.  So the first step in the formula is
     to make this transformation from y, m, d to Y, M, D:
              { y - 1          if m = 1 or m = 2
          Y = {
              { y              if 3 <= m <= 12
              { M + 10         if m = 1 or m = 2
          M = {
              { M - 2          if 3 <= m <= 12
          D = d
     Now the day of the week number will be the number of days
     past since some reference date added to the day of the week
     number of that reference date, modulo 7.  The contribution
     of Y to the sum will be 365 = 1 mod 7 days per year, plus 1
     for every leap year, for a total of
          Y + [Y/4] - [Y/100] + [Y/400]
     Here the notation [x] means the greatest integer <= x.
     The contribution for the day will just be D.  Now here is
     where the "almost linear" function stuff starts, in
     computing the contribution for the adjusted month number M.
     For M=1 (March) the contribution will be 0; for M=2 the
     contribution will be the number of days in March, modulo 7;
     for M=3 the contribution will be the number of days in March
     and April, modulo 7; etc.  Now write this out, for each next
     value of M adding the number of days of the previous month
     modulo 7, but not reducing 7's from the sum:
           M      days     days modulo 7     sum
          --      ----     -------------     ---
           1         0                 0       0
           2        31                 3       3
           3        30                 2       5
           4        31                 3       8
           5        30                 2      10
           6        31                 3      13
           7        31                 3      16
           8        30                 2      18
           9        31                 3      21
          10        30                 2      23
          11        31                 3      26
          12        31                 3      29
     or
       M    1  2  3  4  5  6  7  8  9 10 11 12
     f(M)   0  3  5  8 10 13 16 18 21 23 26 29
     The problem was to express this function in an easy to
     compute way.  The solution, and I don't know who to credit
     this to, is to use:
          f(M) = [2.6 M - 2.1] = [2.6M - 2.2]
     
     I had not memorized this, I just worked it out again, it
     isn't too hard for this particular example.  By the way,
     in most programming languages, use truncating integer
     arithmetic and let f(M) = (26 * M - 21)/10 or (13 * M - 11)/5
     So the overall formula becomes
          Y + [Y/4] - [Y/100] + [Y/400] + [2.6M - 2.1] + D + k
     where k is the constant that gives the correct value for the
     initial day.  For today, with expected result 4, we get
          y, m, d = 1988, 11, 17     Y, M, D = 1988, 9, 17
     4 = 1988 + [1988/4] - [1988/100] + [1988/400] + [2.6 * 9 - 2.2] + 17 + k
     (all that modulo 7, of course) or k = 2.
     The [... - 2.1] and + 2 can be combined to give
     >>        f(Y, M, D) = Y + [Y/4] - [Y/100] + [Y/400] + [2.6M - 0.1] + D
     You can precompute the result of the Y terms and remember
     them for the next year or two.  So, for example,
     
          f(1988, M, D) = [2.6M - 0.1] + D + 6 modulo 7
          f(1989, M, D) = [2.6M - 0.1] + D     modulo 7
     Of course, don't forget the transformation from y,m -> Y,M.
     Dan
 | |||||