Here's detail on what I'm looking for:
1. Total number of variables:
Each data stream needs to be separately labeled as to its validity based on
how it compares to all other data. Each data stream to be tested for
validity is, in effect, a series of rectangles marked on a larger fixed
square area. The rectangles would be approximately 1/20th to 1/5th the size
of the larger square. The data stream would include the four sets of
coordinates for the four corners of each rectangle. Some of the rectangles
might overlap within the same data stream. For each rectangle, in addition to
its 4 pairs of corner coordinates, there are 5 variables; let’s call them
A, B, C, D and T. Variables A, B and C are fixed for each data stream (but
likely different for different data streams) and remain the same across the
entire square area. D is “true” within the marked rectangles and “false”
outside those rectangles. T is the date and time the data stream arrived.
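
To make the layout above concrete, here is a minimal Python sketch of one data stream. The class names (Rect, DataStream), the string types for A, B and C, and the corners being listed in order around the perimeter are all assumptions made for illustration; nothing above fixes them.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Rect:
    """One marked rectangle: four (x, y) corners, assumed to be listed
    in order around the perimeter (either direction)."""
    corners: Tuple[Point, Point, Point, Point]

    def contains(self, p: Point) -> bool:
        # Convex-polygon test: p is inside (or on the edge) when it lies
        # on the same side of all four edges. Works for rotated rectangles.
        crosses = []
        for i in range(4):
            (x1, y1), (x2, y2) = self.corners[i], self.corners[(i + 1) % 4]
            crosses.append((x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1))
        return all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses)


@dataclass
class DataStream:
    a: str            # A, B, C: fixed for the whole stream (types assumed)
    b: str
    c: str
    t: datetime       # T: when the stream arrived
    rects: List[Rect]

    def d(self, p: Point) -> bool:
        # D is "true" inside any marked rectangle, "false" everywhere else.
        return any(r.contains(p) for r in self.rects)
```

With this shape, D never has to be stored: it can be recomputed for any point of the square from the rectangles alone.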
I include variable T (time) because over time, the “bell curve” of the value
of D may change for a particular A, B and C for the same area of the plane.
This change will be gradual (over a period of say 6 months to a year or
more, for example). If such a change occurs, then data that was “valid” a
year ago will be “invalid” closer to the present.
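
One possible way to handle this gradual drift is to give older data streams less weight when building the consensus. The sketch below uses exponential decay; the 180-day half-life and both function names are assumptions chosen only to match the "6 months to a year" scale mentioned above, not something I have settled on.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def recency_weight(t: datetime, now: datetime,
                   half_life_days: float = 180.0) -> float:
    """Weight of a stream that arrived at time t: 1.0 when brand new,
    halving every half_life_days (assumed value)."""
    age_days = max((now - t) / timedelta(days=1), 0.0)
    return 0.5 ** (age_days / half_life_days)


def weighted_true_fraction(observations: Iterable[Tuple[bool, datetime]],
                           now: datetime,
                           half_life_days: float = 180.0) -> float:
    """Recency-weighted fraction of streams reporting D = true at one
    location. observations holds one (D, T) pair per data stream."""
    num = 0.0
    den = 0.0
    for d, t in observations:
        w = recency_weight(t, now, half_life_days)
        num += w if d else 0.0
        den += w
    return num / den if den else 0.0
```

A stream that agreed with the consensus a year ago can then score poorly today simply because the recent streams, which dominate the weighted fraction, have moved.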
I’m looking for three correlations –
a) For a given A, B and C, how well does a new D correlate (for the entire
100-meter square area) with all other D data for the same A, B and C and
for the same 100-meter square area?
i.) For example, if data stream X shows some instances where D is “true”
for areas where all other data streams show D as “false” – then data stream
X should be given a very low (if not zero) value of validity or correlation
with the other data.
b) For a given A and B, regardless of C, how well does D
correlate in a particular 100-meter square area with all other D data
streams for the same A and B and in the same 100-meter square area?
c) For a given A, regardless of B and C, the same thing -- how well does D
correlate in a particular 100-meter square area with all other D data
streams for the same A and in the same 100-meter square area?
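
To illustrate the kind of figure I mean, here is a rough Python sketch of a leave-one-out agreement score. It is not a true statistical correlation coefficient, just one simple stand-in. It assumes the 100-meter square has been divided into grid cells and that each stream has been reduced to the set of cells where its D is "true"; the grid, the dict layout and the function names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col) of a grid cell in the square


def support_score(target: FrozenSet[Cell],
                  others: List[FrozenSet[Cell]]) -> Optional[float]:
    """Average, over the cells where the target stream says D is true,
    of the fraction of other streams that also say true there.
    Returns None when there is nothing to compare."""
    if not target or not others:
        return None
    total = 0.0
    for cell in target:
        total += sum(cell in o for o in others) / len(others)
    return total / len(target)


def score_all(streams: List[dict], key_len: int) -> Dict[int, Optional[float]]:
    """streams: [{"abc": (A, B, C), "cells": frozenset_of_cells}, ...]
    key_len = 3 compares within the same A, B and C   (case a),
    key_len = 2 within the same A and B               (case b),
    key_len = 1 within the same A                     (case c)."""
    groups = defaultdict(list)
    for idx, s in enumerate(streams):
        groups[s["abc"][:key_len]].append(idx)
    scores: Dict[int, Optional[float]] = {}
    for members in groups.values():
        for idx in members:
            others = [streams[j]["cells"] for j in members if j != idx]
            scores[idx] = support_score(streams[idx]["cells"], others)
    return scores
```

A stream X that marks D "true" only where every other stream in its group says "false" gets a score of 0.0, which matches the example in i.) above. Note that this score is one-sided: it does not penalize a stream for leaving out areas that the others mark as true.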
2. Approx # of data points and how they will be increasing:
There will be several large square areas within which data streams will be
recorded. All areas are the same size – say – a 100-meter square. Over a
period of a year or so, some square areas will receive no data. Others
could receive up to 1,000,000 data streams.
For each square area, I suppose I'm looking for a bell curve (as it
develops) and then want to be able to run a program periodically on the
data, which will assign a validity factor to each record (for a particular
100-meter square) as to how well that record correlates with all other
records for that same 100-meter square. I could see different levels of
correlation for different records within the same data stream.
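
As a sketch of how the periodic run might assign a separate figure to each record, the Python below counts once per grid cell how many streams mark it "true", then scores each rectangle against the other streams. It assumes each rectangle has already been converted to a set of grid cells; that representation and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]
Stream = List[Set[Cell]]  # one set of grid cells per rectangle (record)


def consensus_counts(streams: List[Stream]) -> Counter:
    """For each cell, the number of streams whose D is true there.
    A stream counts once per cell even if its own rectangles overlap."""
    counts: Counter = Counter()
    for rects in streams:
        counts.update(set().union(*rects))
    return counts


def record_scores(stream_index: int, streams: List[Stream],
                  counts: Counter) -> List[Optional[float]]:
    """One score per rectangle of the chosen stream: the average fraction
    of OTHER streams that also mark the rectangle's cells as true."""
    n_others = len(streams) - 1
    scores: List[Optional[float]] = []
    for rect in streams[stream_index]:
        if n_others == 0 or not rect:
            scores.append(None)  # nothing to compare against
            continue
        # Subtract 1 per cell to remove this stream's own vote.
        agree = sum(counts[cell] - 1 for cell in rect)
        scores.append(agree / (n_others * len(rect)))
    return scores
```

Because the counts are built in a single pass, scoring one record only touches the cells of that rectangle, which should matter if a square really does receive on the order of 1,000,000 data streams.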
I'm looking for the set of formulas which will give me a reliable figure for
the validity of the data.
bipulsin@gmail.com