Garbage date data now my headache

  • Thread starter Thread starter Gary F Shelton
  • Start date Start date
G

Gary F Shelton

Okay folks I have a painful .csv file that I recieve daily that is now my
problem. Granted I am only as good as my data so let me show you an example
of the data:

Column A Column B
Date Code Date Code Answer
X71229 12/29/2007
X71203330382 12/03/2007
X71023 G3 10/23/2007
X71003DWOODS 10/03/2007
X70913 D5 09/13/2007
X70913 09/13/2007
20071215 12/15/2007
20071005 10/05/2007
81007 08/10/2007
80211 08/02/2011
071220 12/20/2007

So as you can see are some of the problems with the data:
the string length of the data changes, the format of some cells are text and
some are numbers, some of the data has letter in front and others don't, some
have the full year typed out then month then day and others don't. the data
is what I call garbage but like I stated it is my headache to fix.

So my goal is to try and come up with some methodology to fix it. Any takers
ready for this headache.
 
Excel, as all computer programs, is a beast of logic. I can think of
various techniques to clean up the first column and put those entries into a
more manageable format, but the logic of some of your examples is beyond
logic. For instance, why is 80211 08/02/11 and not 02/11/2008 while 71229
is 12/29/2007 and not 07/12/2029? That particular twist of logic (or lack
of) is not something that a computer can handle. Perhaps you have some
insight that you might be able to share that could be used by a computer to
differentiate those values. Perhaps some data in some other column could
point to a particular logic pattern to use for that entry.
A cleanup of the first column would be like removing leading and trailing
text, as well as removing the G3 and D5. Then, with a logic pattern
furnished by you, Excel could convert all the numbers to dates. Post back
if any of the above could help you. Otto
 
Otto,

Let me give a brief of what I know and then I will lead into some
assumptions of mine.

My data is for consumable products that have a shelf life. My company pays a
vendor to store our product. The vendor is geographically located in Atlanta,
Chicago, New York, Los Angeles, etc. When I get data from this vendor I go to
their website where the data is consolidated. So what I assume is happening
and some of the patters I am seeing is that the date data formats change when
the their is a change in Warehouses which leads me to believe that each
geographically located Warehouse inputs and uploads their data in different
formats. As you pointed out some of the data is just down right garbage data
like the 080211 stuff as that had to be a human error in someone on their end
mistypig it... But you are right that I am going to have to use some other
records in my data set to try and make logical patterns out of the date code
data. I think the big thing I will need to do is get the data parsed out to
the most logical pattern I can and then manually alter the outliers... I
guess that I can do is work the.csv file before I try and import it into
excel ... The other thing that is frustrating is that this vendor must make
lots of changes to their website titles as last month they had certain title
names and they had them in certain columns and now in December they made
changes... So as you can imagine it makes it even harder for me try and
standardize a process on my end... I will be working on this file some more
and if I do get stuck I am sure I will ask some more questions...
 
Okay folks I have a painful .csv file that I recieve daily that is now my
problem. Granted I am only as good as my data so let me show you an example
of the data:

Column A Column B
Date Code Date Code Answer
X71229 12/29/2007
X71203330382 12/03/2007
X71023 G3 10/23/2007
X71003DWOODS 10/03/2007
X70913 D5 09/13/2007
X70913 09/13/2007
20071215 12/15/2007
20071005 10/05/2007
81007 08/10/2007
80211 08/02/2011
071220 12/20/2007

So as you can see are some of the problems with the data:
the string length of the data changes, the format of some cells are text and
some are numbers, some of the data has letter in front and others don't, some
have the full year typed out then month then day and others don't. the data
is what I call garbage but like I stated it is my headache to fix.

So my goal is to try and come up with some methodology to fix it. Any takers
ready for this headache.


The formats that you have are translatable and consistent except for the next
to last two that start with an eight.

All of the others are basically in a YMD format that is either
yymmdd or
ymmdd or
yyyymmdd
(with extraneous information possible before and after)

The two dates that start with an eight are both in mddyy format.

How do you know that the Date Code Answer for those two entries is what it is?

If we could apply the same logic, we might have a chance. Maybe there is
something in another field?
--ron
 
Okay folks I have a painful .csv file that I recieve daily that is now my
problem. Granted I am only as good as my data so let me show you an example
of the data:

Column A Column B
Date Code Date Code Answer
X71229 12/29/2007
X71203330382 12/03/2007
X71023 G3 10/23/2007
X71003DWOODS 10/03/2007
X70913 D5 09/13/2007
X70913 09/13/2007
20071215 12/15/2007
20071005 10/05/2007
81007 08/10/2007
80211 08/02/2011
071220 12/20/2007

So as you can see are some of the problems with the data:
the string length of the data changes, the format of some cells are text and
some are numbers, some of the data has letter in front and others don't, some
have the full year typed out then month then day and others don't. the data
is what I call garbage but like I stated it is my headache to fix.

So my goal is to try and come up with some methodology to fix it. Any takers
ready for this headache.

As I think about this more, it occurs to me that you will also need to know the
purpose of the "X". If it is a placeholder for a "0", that would be good.

However, if it is some other kind of indicator, that might be followed by
either a five or six digit date code, then other entries become ambiguous.

For example, an entry similar to your #2:

X710101330382 could be
2007-10-10 or
1971-01-01



--ron
 
Back
Top