Regex question

T

tshad

This is really a regex question.

I am wonding if anyone knows a good Regex expression that would pull a valid
date from a string.

I have used:

strValue = Regex.Replace(valueIn, @"[^\d/]", "");

which works most of the time.

But I have some cases where I have strings like:

05/07/08(-4%)
09/19/08 DOM 55
09/19/2008 DOM 53
FOR 09/15/08 -23

Thanks,

Tom
 
T

tshad

What is really strange is that 05/07/08(-4%) is converting to:

05/07/084

Convert.ToDateTime is change this to: 05/07/0084 12:00:00 AM????

But Sql says it is invalid and gives me an error:

Message: SqlDateTime overflow. Must be between 1/1/1753 12:00:00 AM and
12/31/9999 11:59:59 PM.

It turns out all of these examples die in SQL but pass in VS.

The code I use is:

public static DateTime? CleanDates(string valueIn)
{
DateTime? datValue = null;
string strValue;

strValue = Regex.Replace(valueIn, @"[^\d/]", "");
try
{
datValue = Convert.ToDateTime(strValue);
}
catch
{
datValue = null;
}
return datValue;
}

I would have assumed it would take the Catch, but doesn't.

Why is this?

Thanks,

Tom
 
A

Arne Vajhøj

tshad said:
What is really strange is that 05/07/08(-4%) is converting to:

05/07/084

Convert.ToDateTime is change this to: 05/07/0084 12:00:00 AM????

But Sql says it is invalid and gives me an error:

Message: SqlDateTime overflow. Must be between 1/1/1753 12:00:00 AM and
12/31/9999 11:59:59 PM.

It turns out all of these examples die in SQL but pass in VS.

SQLServer DATETIME apparently has a different range than
..NET System.DateTime - that is not particular surprising.

Arne
 
A

Arne Vajhøj

tshad said:
This is really a regex question.

I am wonding if anyone knows a good Regex expression that would pull a valid
date from a string.

I have used:

strValue = Regex.Replace(valueIn, @"[^\d/]", "");

which works most of the time.

But I have some cases where I have strings like:

05/07/08(-4%)
09/19/08 DOM 55
09/19/2008 DOM 53
FOR 09/15/08 -23

For inspiration see below.

Arne

=======================

using System;
using System.Globalization;
using System.Text.RegularExpressions;

namespace E
{
public class Program
{
private static readonly Regex re = new
Regex(@"\d{2}/\d{2}/\d{2}", RegexOptions.Compiled);
public static DateTime Parse(string s)
{
return DateTime.Parse(re.Match(s).Value, new
CultureInfo("en-US"));
}
public static void Main(string[] args)
{
Console.WriteLine(Parse("05/07/08(-4%)"));
Console.WriteLine(Parse("09/19/08 DOM 55"));
Console.WriteLine(Parse("09/19/2008 DOM 53"));
Console.WriteLine(Parse("FOR 09/15/08 -23"));
Console.ReadKey();
}
}
}
 
T

tshad

Peter Duniho said:
Why is that strange? Your regex basically removes anything that's not a
digit or a '/' character.


Well, what did you expect it to do with the year "084"?

I know that. I didn't know the range would be that much different from SQL
but it obviously is.
I don't know much about SQL, but it looks like the range of valid dates in
SQL does not include the year 84. Not that I see any reason you'd _want_
it to accept the year 84, given the original string you started with.
I don't want it to accept this date. The problem is that my regex only gets
rid of everything that is not a number and a "/". This works for about 99%
of my dates (out of about 200,000 at the moment).

But I would like to figure out a better regex option.
What does that mean? It's clear why the result of your conversion isn't
accepted by SQL. But what does it mean for something to "pass in VS"?

I mean that it considers it a valid date - which it is but not really THE
valid date.
The code I use is:

public static DateTime? CleanDates(string valueIn)
{
DateTime? datValue = null;
string strValue;

strValue = Regex.Replace(valueIn, @"[^\d/]", "");
try
{
datValue = Convert.ToDateTime(strValue);
}
catch
{
datValue = null;
}
return datValue;
}

I would have assumed it would take the Catch, but doesn't.

Why would you have assumed that the execution would wind up in the "catch"
clause? You had a successful conversion from the string to a DateTime
value.

I agree and that is my problem. It passes (valid) in VS and then dies in
SQL. But SQL is only dropping it because it is outside the date range.
That doesn't necessarily making it correct. Being a valid date doesn't make
it the correct date. Which is my problem.
As far as the original question goes, you're not being specific enough
about what it is you really want to do. It seems to me that if you want
to actually extract dates of the form "mm/dd/yy" or "mm/dd/yyyy", you
should write a regex pattern that matches _that_, rather than trying to
remove all the characters from the string that aren't digits or '/'
characters.

I agree. Which was what I was asking. I just wanted to see if anyone knew
of one off hand.
From your examples, it appears that you always have two-digit months and
dates, but may have two- or four-digit years. So, perhaps rather than
using the Replace() method, you could use the Match() method, with a
string like this: @"\d\d/\d\d/\d\d(\d\d)?" Keeping in mind that I'm no
regex expert, and might have something not quite right in that one. But
hopefully you get the idea; that is, that there is in fact a specific
pattern that you're looking for, so matching against that pattern is
likely to be more fruitful than trying to exclude specific characters.

I agree. I am actually looking for 1 or 2 characters in the days and months
and 2 or 4 in the last. Exactly that pattern. Anything else I want to get
rid of.

Thanks,

Tom
 
T

tshad

Arne Vajhøj said:
SQLServer DATETIME apparently has a different range than
.NET System.DateTime - that is not particular surprising.

I agree. I don't know what the range is, but that wouldn't help anyway. It
can be a valid date and not be THE valid date - which this showed. I wasn't
expecting this - but should have.

Thanks,

Tom
 
T

tshad

Arne Vajhøj said:
tshad said:
This is really a regex question.

I am wonding if anyone knows a good Regex expression that would pull a
valid date from a string.

I have used:

strValue = Regex.Replace(valueIn, @"[^\d/]", "");

which works most of the time.

But I have some cases where I have strings like:

05/07/08(-4%)
09/19/08 DOM 55
09/19/2008 DOM 53
FOR 09/15/08 -23

For inspiration see below.

That looks good. Would that handle 1 and 2 digits for days and months and 2
and 4 for years?

Thanks,

Tom
Arne

=======================

using System;
using System.Globalization;
using System.Text.RegularExpressions;

namespace E
{
public class Program
{
private static readonly Regex re = new Regex(@"\d{2}/\d{2}/\d{2}",
RegexOptions.Compiled);
public static DateTime Parse(string s)
{
return DateTime.Parse(re.Match(s).Value, new
CultureInfo("en-US"));
}
public static void Main(string[] args)
{
Console.WriteLine(Parse("05/07/08(-4%)"));
Console.WriteLine(Parse("09/19/08 DOM 55"));
Console.WriteLine(Parse("09/19/2008 DOM 53"));
Console.WriteLine(Parse("FOR 09/15/08 -23"));
Console.ReadKey();
}
}
}
 
A

Arne Vajhøj

tshad said:
Arne Vajhøj said:
tshad said:
This is really a regex question.

I am wonding if anyone knows a good Regex expression that would pull a
valid date from a string.

I have used:

strValue = Regex.Replace(valueIn, @"[^\d/]", "");

which works most of the time.

But I have some cases where I have strings like:

05/07/08(-4%)
09/19/08 DOM 55
09/19/2008 DOM 53
FOR 09/15/08 -23
For inspiration see below.

That looks good. Would that handle 1 and 2 digits for days and months and 2
and 4 for years?
private static readonly Regex re = new Regex(@"\d{2}/\d{2}/\d{2}", RegexOptions.Compiled);

@"\d{1,2}/\d{1,2}/\d{2,4}"

is more flexible.

Arne
 
A

Arne Vajhøj

tshad said:
I agree. I don't know what the range is, but that wouldn't help anyway. It
can be a valid date and not be THE valid date - which this showed. I wasn't
expecting this - but should have.

http://msdn.microsoft.com/en-us/library/system.datetime.aspx

The DateTime value type represents dates and times with values ranging
from 12:00:00 midnight, January 1, 0001 Anno Domini (Common Era) through
11:59:59 P.M., December 31, 9999 A.D. (C.E.)

Time values are measured in 100-nanosecond units called ticks, and a
particular date is the number of ticks since 12:00 midnight, January 1,
0001 A.D. (C.E.) in the GregorianCalendar calendar.

http://msdn.microsoft.com/en-us/library/ms187819.aspx

Date range January 1, 1753, through December 31, 9999

Time range 00:00:00 through 23:59:59.997

Accuracy Rounded to increments of .000, .003, or .007 seconds

Arne
 
J

Jeff Johnson

This is really a regex question.

I am wonding if anyone knows a good Regex expression that would pull a
valid date from a string.

I have used:

strValue = Regex.Replace(valueIn, @"[^\d/]", "");

which works most of the time.

But I have some cases where I have strings like:

05/07/08(-4%)
09/19/08 DOM 55
09/19/2008 DOM 53
FOR 09/15/08 -23

Stop using Replace to get rid of the stuff you don't want, because clearly
it's causing problems. Instead, examine all the possible inputs you might
get and then craft a regex to EXTRACT those parts. Then TEST what you've
extracted to see if it's a date. After all, 99/76/23 might fit the regex,
but it isn't a valid date.

If you are positive that the separator will always be a slash and that
you'll only have digits (not 10/Dec/2008), you might get away with this:

Regex dateRegex = New Regex(@"\d{1,2}/\d{1,2}/\d{2,4}");

Then you'll use the Match() method (or Matches) and see if you get anything,
and then Date.TryParse[Exact]() to see if it's a real date.
 
A

Arne Vajhøj

Jeff said:
I was trying to figure out why, and I learned that Britain didn't adopt the
Gregorian calendar until 1752, which I thought might be significant, but
that's still a year earlier than 1753....

Britain did not switch at new year - they jumped from September 2nd
to September 14th in 1752.

And if MS wanted to start with January 1st, then ...

Arne
 
T

tshad

Arne Vajhøj said:
tshad said:
Arne Vajhøj said:
tshad wrote:
This is really a regex question.

I am wonding if anyone knows a good Regex expression that would pull a
valid date from a string.

I have used:

strValue = Regex.Replace(valueIn, @"[^\d/]", "");

which works most of the time.

But I have some cases where I have strings like:

05/07/08(-4%)
09/19/08 DOM 55
09/19/2008 DOM 53
FOR 09/15/08 -23
For inspiration see below.

That looks good. Would that handle 1 and 2 digits for days and months
and 2 and 4 for years?
private static readonly Regex re = new
Regex(@"\d{2}/\d{2}/\d{2}", RegexOptions.Compiled);

@"\d{1,2}/\d{1,2}/\d{2,4}"

is more flexible.

Great. That really helps.

Thanks,

Tom
 
T

tshad

Jeff Johnson said:
This is really a regex question.

I am wonding if anyone knows a good Regex expression that would pull a
valid date from a string.

I have used:

strValue = Regex.Replace(valueIn, @"[^\d/]", "");

which works most of the time.

But I have some cases where I have strings like:

05/07/08(-4%)
09/19/08 DOM 55
09/19/2008 DOM 53
FOR 09/15/08 -23

Stop using Replace to get rid of the stuff you don't want, because clearly
it's causing problems. Instead, examine all the possible inputs you might
get and then craft a regex to EXTRACT those parts. Then TEST what you've
extracted to see if it's a date. After all, 99/76/23 might fit the regex,
but it isn't a valid date.

And how would you suggest I do that??? These are just examples of some of
the inputs I am getting. I can't really get rid of any parts as I don't
know what will be where.

I have no control over what the user will enter in this case.

I need to be able to be able to find a date in the input. Using a variety
of possible (probable) date formats, I should be able to extract the date
from the input - if one exists.
If you are positive that the separator will always be a slash and that
you'll only have digits (not 10/Dec/2008), you might get away with this:
That was what I was looking for.
Regex dateRegex = New Regex(@"\d{1,2}/\d{1,2}/\d{2,4}");

Then you'll use the Match() method (or Matches) and see if you get
anything, and then Date.TryParse[Exact]() to see if it's a real date.
What I was planning to do - just wasn't sure of the regex.

Thanks,

Tom
 
J

Josh Einstein

What Jeff is saying is that your approach is backwards. Take a step back and
think about what you're trying to do. If you want to pull a valid date from
a string, then you have to write a pattern that defines the lexical
structure of the date you're trying to extract.

First, here's the easy button: www.regexlib.com

Second, regular expressions are not by any means easy. What you're trying to
do is simple by Regex standards, but still requires a lot more specificity
than "a digit or a slash". For example, in Regex you can specify quantifiers
and alternation constructs.

\d{1,2}(-|/)\d{1,2}(-|/)(\d{2}|\d{4})

Will match a 1 or 2 digit month, a 1 or 2 digit day, and a 2 or 4 digit
year - each separated by a - or /.

However, that doesn't guarantee you a valid date. For example, it matches
99/99/9999. Not to mention the fact that it's only valid for US dates with a
pretty specific format.

So another poster had recommended a second pass using DateTime.TryParse.
Regex will help you get most of the way there but trying to construct a
pattern that will ensure a valid date within the range allowed by T-SQL
would be a really lousy use of Regex in the first place.

Josh

tshad said:
Jeff Johnson said:
This is really a regex question.

I am wonding if anyone knows a good Regex expression that would pull a
valid date from a string.

I have used:

strValue = Regex.Replace(valueIn, @"[^\d/]", "");

which works most of the time.

But I have some cases where I have strings like:

05/07/08(-4%)
09/19/08 DOM 55
09/19/2008 DOM 53
FOR 09/15/08 -23

Stop using Replace to get rid of the stuff you don't want, because
clearly it's causing problems. Instead, examine all the possible inputs
you might get and then craft a regex to EXTRACT those parts. Then TEST
what you've extracted to see if it's a date. After all, 99/76/23 might
fit the regex, but it isn't a valid date.

And how would you suggest I do that??? These are just examples of some of
the inputs I am getting. I can't really get rid of any parts as I don't
know what will be where.

I have no control over what the user will enter in this case.

I need to be able to be able to find a date in the input. Using a variety
of possible (probable) date formats, I should be able to extract the date
from the input - if one exists.
If you are positive that the separator will always be a slash and that
you'll only have digits (not 10/Dec/2008), you might get away with this:
That was what I was looking for.
Regex dateRegex = New Regex(@"\d{1,2}/\d{1,2}/\d{2,4}");

Then you'll use the Match() method (or Matches) and see if you get
anything, and then Date.TryParse[Exact]() to see if it's a real date.
What I was planning to do - just wasn't sure of the regex.

Thanks,

Tom
 
T

tshad

Josh Einstein said:
What Jeff is saying is that your approach is backwards. Take a step back
and think about what you're trying to do. If you want to pull a valid date
from a string, then you have to write a pattern that defines the lexical
structure of the date you're trying to extract.

First, here's the easy button: www.regexlib.com

Second, regular expressions are not by any means easy. What you're trying
to do is simple by Regex standards, but still requires a lot more
specificity than "a digit or a slash". For example, in Regex you can
specify quantifiers and alternation constructs.

\d{1,2}(-|/)\d{1,2}(-|/)(\d{2}|\d{4})

Will match a 1 or 2 digit month, a 1 or 2 digit day, and a 2 or 4 digit
year - each separated by a - or /.

However, that doesn't guarantee you a valid date. For example, it matches
99/99/9999. Not to mention the fact that it's only valid for US dates with
a pretty specific format.

That is fine. I only need to be able to extract something that looks like a
date. The information is being added into a form that is completely
freeform (have no control over that) and the user can enter whatever he
wants. The user knows it is a date field so most of the time it will have a
valid date in the text somewhere. We then get the information via XML that
we put into SQL. We then have a table with cleaned values, including these
date values that we need for reporting purposes. If it isn't a valid date -
it won't get reported on.
So another poster had recommended a second pass using DateTime.TryParse.

I like this as well and am using both now which gets me what I need. There
will be some problem records but they are a very small amount and won't
affect anything.
Regex will help you get most of the way there but trying to construct a
pattern that will ensure a valid date within the range allowed by T-SQL
would be a really lousy use of Regex in the first place.
I know know why it is a lousy use of it. Seems like a better way to find a
date pattern in my text field then trying to parse the field with for/loops
to find a date.

Thanks,

Tom
Josh

tshad said:
Jeff Johnson said:
This is really a regex question.

I am wonding if anyone knows a good Regex expression that would pull a
valid date from a string.

I have used:

strValue = Regex.Replace(valueIn, @"[^\d/]", "");

which works most of the time.

But I have some cases where I have strings like:

05/07/08(-4%)
09/19/08 DOM 55
09/19/2008 DOM 53
FOR 09/15/08 -23

Stop using Replace to get rid of the stuff you don't want, because
clearly it's causing problems. Instead, examine all the possible inputs
you might get and then craft a regex to EXTRACT those parts. Then TEST
what you've extracted to see if it's a date. After all, 99/76/23 might
fit the regex, but it isn't a valid date.

And how would you suggest I do that??? These are just examples of some
of the inputs I am getting. I can't really get rid of any parts as I
don't know what will be where.

I have no control over what the user will enter in this case.

I need to be able to be able to find a date in the input. Using a
variety of possible (probable) date formats, I should be able to extract
the date from the input - if one exists.
If you are positive that the separator will always be a slash and that
you'll only have digits (not 10/Dec/2008), you might get away with this:
That was what I was looking for.
Regex dateRegex = New Regex(@"\d{1,2}/\d{1,2}/\d{2,4}");

Then you'll use the Match() method (or Matches) and see if you get
anything, and then Date.TryParse[Exact]() to see if it's a real date.
What I was planning to do - just wasn't sure of the regex.

Thanks,

Tom
 
J

Josh Einstein

I didn't say that finding date patterns was a lousy use of Regex. I said
that trying to write such a complex Regex that it ensured it only matched
valid dates (ie. did not match 99/99/9999) would be a lousy use. A blend of
the two approaches is best - use Regex to get something that "looks" like a
date and use DateTime to ensure it is valid.

You see this type of thing all the time when you see people write IP address
or Email Address regex's. They go crazy trying to make it RFC compliant or
ensure that it won't match anything invalid. But the regex quickly becomes
unreadable and actually much slower than parsing.

So in summary - a simple regex gets you most of the way there and a parse
confirms the validity. Sounds like you've got it working the way you want.

Good luck! Merry Christmas and Happy New Year!

Josh Einstein
 
T

tshad

Josh Einstein said:
I didn't say that finding date patterns was a lousy use of Regex. I said
that trying to write such a complex Regex that it ensured it only matched
valid dates (ie. did not match 99/99/9999) would be a lousy use. A blend of
the two approaches is best - use Regex to get something that "looks" like a
date and use DateTime to ensure it is valid.
Sounds reasonable.
You see this type of thing all the time when you see people write IP
address or Email Address regex's. They go crazy trying to make it RFC
compliant or ensure that it won't match anything invalid. But the regex
quickly becomes unreadable and actually much slower than parsing.
True.

You can overuse anything. Just because it is good for one thing doesn't
mean it is good for everything.
So in summary - a simple regex gets you most of the way there and a parse
confirms the validity. Sounds like you've got it working the way you want.

Good luck! Merry Christmas and Happy New Year!

Same to you,

Thanks,

Tom
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top