How to sample data without returning duplicates?

Guest

I have installed the Analysis ToolPak and am using the Data Analysis -
Sampling feature. I have two issues I am trying to resolve:

1) Most important: when I run a sample of my range, the process will
return duplicate values in the sample. For example, if I have the values 1
to 100 and I take a sample of 10 items, it may return the number 45 several
times. Is there a way to prevent this, so that every value returned appears
only once in the sample?

2) The data I want to sample is alpha, not numeric. However the Sampling
feature apparently only works with numeric input data. How can I get around
this limitation?

To sum up, I need a sampling method that works on text fields and only
selects an item once for inclusion in the sample.

Thanks to anyone who can help!

Ralph
 
Guest

You are probably using the random sample portion. A random sample cannot be
taken from data while ignoring what has already been taken; it wouldn't be
random. You will have to use another method to get what you want.
 
joeu2004

1) Most important is when I run a sample of my range, the process will
return duplicate values in the sample.
[....] Is there a way to prevent this, so that every value returned appears
only once in the sample?

2) The data I want to sample is alpha, not numeric. However the Sampling
feature apparently only works with numeric input data. How can I get around
this limitation?

There are at least two common approaches. Arguably, the simplest one
is as follows....

Assume your data is in one column. In each cell in an adjacent
column, put the formula =RAND(). Note: The value of those cells will
change every time you modify the worksheet. Sigh. No matter: the
actual values do not matter, only that they are random.

Now select the range that includes your data and the adjacent column
of random values. Click on Data >> Sort and sort by the random column.
This will reorder your data as well. The first "n" rows of the data
column are then a random sample without duplication (assuming all
of your data are unique).
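
For anyone who wants to try the same idea outside Excel, here is a minimal
Python sketch of the random-key-and-sort trick. It is only an illustration:
the function name, the list of names, and the seed are made up, and the
random key simply plays the role of the =RAND() helper column.

```python
import random

def sample_without_replacement(items, n, seed=None):
    """Pick n items with no repeats by sorting on a random key."""
    rng = random.Random(seed)  # seed is optional; used here for repeatability
    # Pair each item with a random key, like the adjacent =RAND() column.
    keyed = [(rng.random(), item) for item in items]
    # Sort by the random key only, as Data >> Sort does on the helper column.
    keyed.sort(key=lambda pair: pair[0])
    # The first n rows of the reordered list are the sample.
    return [item for _, item in keyed[:n]]

# Hypothetical text data; the method does not care whether items are numeric.
names = ["Alice", "Bob", "Carol", "Dave", "Erin",
         "Frank", "Grace", "Heidi", "Ivan", "Judy"]
picked = sample_without_replacement(names, 4, seed=1)
print(picked)
```

Each row is used once, so an item can only repeat if it appears more than
once in the input. Python's standard library also offers
random.sample(names, 4), which likewise samples without replacement.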
 
joeu2004

You are probably using the random sample portion. A random sample
cannot be taken from data while ignoring what has already been taken;
it wouldn't be random.

So by your definition, a Powerball-like lottery does not do random
selection?(!)

Of course that's wrong. You can do random sampling with and without
replacement.

Arguably, without replacement is the most common form of random
sampling. Can you imagine a political survey where the opinions of
one person might be counted more than once? Can you imagine a jury
pool where one person might go through voir dire twice for the same
jury panel?
 
Guest

Not to split hairs, as I think I answered the question that it could not be
done with this formula... the definition of random is:
Statistics. Of or characterizing a process of selection in which each item
of a set has an equal probability of being chosen.

So by definition, no, Powerball is NOT random. The machine randomly selects a
number, but each number does not have the same chance of being chosen.
Choosing 5 out of 10 numbers excluding duplicates, the first person picked was
picked with 1:10 odds and the 5th 1:5.
 
joeu2004

Not to split hairs, as I think I answered the question that it could not
be done with this formula... the definition of random is:
Statistics. Of or characterizing a process of selection in which each
item of a set has an equal probability of being chosen.

So by definition, no, Powerball is NOT random. The machine randomly
selects a number, but each number does not have the same chance
of being chosen. Choosing 5 out of 10 numbers excluding duplicates,
the first person picked was picked with 1:10 odds and the 5th 1:5.

Of course it is a random selection. Nowhere in the definition does it
say that the probability is equal for all selections; merely that for
each selection, the probability is equally distributed. That is, for
each selection, the size of the set has changed. So as you pointed
out, the probability is 1 in 49 for any ball in the first selection
and 1 in 48 for any ball in the second selection (assuming there are
49 balls to begin with). Each selection is (uniformly) random.
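
The arithmetic can be checked with a short Python sketch. The 49 balls
come from the example above; the 6 draws and the function name are
assumptions made only for illustration. It computes, for one particular
ball, the exact chance of being picked on each draw, counting the chance
that the ball is still in the pool at that point.

```python
from fractions import Fraction

def draw_probabilities(total, draws):
    """For one specific item, return P(it is picked on draw k), k = 1..draws."""
    probs = []
    still_available = Fraction(1)  # chance the item has not been picked yet
    for k in range(draws):
        remaining = total - k
        # If it is still in the pool, it is picked with chance 1/remaining.
        probs.append(still_available * Fraction(1, remaining))
        # Chance it survives this draw and stays in the pool.
        still_available *= Fraction(remaining - 1, remaining)
    return probs

per_draw = draw_probabilities(49, 6)   # 6 draws is an assumed figure
in_sample = sum(per_draw)              # chance of ending up in the sample
print(per_draw)
print(in_sample)
```

Seen from before any ball is drawn, a given ball has a 1 in 49 chance of
coming up on each of the draws, so every ball has the same overall chance
(6 in 49 here) of being in the sample.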
 
Guest

Thanks! That solved both problems at the same time. I know about the rand()
function, but it never occurred to me to use it in this way. Thanks again!
 
