random split dataset

  • Thread starter Thread starter news.leidenuniv.nl
  • Start date Start date
N

news.leidenuniv.nl

For modelling purposes I have a list with records of plant species. I need
of each species randomly 50% to run in my modelling software and the other
50% I use for testing. Any suggestions on how to split this dataset?
Regards,

Niels
 
Add two helper columns (Say A and B).

Put 1 in A1 and 2 in A2.
select A1:A2 and drag down.
(you can use this to resort to the original order).

Put:
=rand()
in B1
and drag down.

Sort your data by column B and take the top 50%.

Sort by column A to return to the original order.

Delete/hide columns A:B when you're done.
 
This works if all records are of 1 species, but my list contains 50 records
of species 1, 25 of species 2 etc. What I need is randomly 50% of the
records of each species. And, I need the remainder (other 50%) of each
species for testing my model.
 
Do you have a column that indicates the species?

If yes, then add a third column. I used C as that next helper column and I used
column D for the column with the species indicator.

I put this in C2 (headers in row 1).

=COUNTIF(D$2:$D2,D2)/COUNTIF(D:D,D2)
and dragged down.

Then I could apply data|filter|autofilter to show just the values less than or
equal to 50%.

I selected that range.
edit|goto|special|visible cells only
edit|copy
and edit|Paste into a new worksheet.
 
Back
Top