Parse and merge data

Guest

Hi all,

Need some tips on how to approach this problem. I would consider myself
advanced in Access and VBA, so I am just looking for conceptual pointers.

In essence, I have a number of text files that need to be parsed and
merged. Each text file will contain around 70,000 records. For now, let's
assume that I have only two text files that will need merging.

I have managed to successfully parse the files and sort each line of the text
file into a database record containing three fields: a time stamp field
(converted to a long number, as time is measured in thousandths of a second),
a DataSource field (i.e. who actually generated the text file), and a string
field (containing a text string of up to 255 characters).

Basically, what I would like to do is merge the data from the two files in
such a way that I am left with only one unique set of records: any
duplicates in the second datasource that are identical to the first
datasource will be deleted, and any records that are in the second datasource
but not in the first will be added. I will therefore end up with a "superset"
of the data.

The primary difficulty in the above is that the time stamp field in each
datasource depends on the user's own PC clock. Therefore, there may be
time differences between records that would otherwise be identical. These
time differences are never more than 30 seconds. Moreover, the time
difference is not exactly the same for each matched record. A SELECT DISTINCT
query would have done the job for me were it not for the time stamp field.
So I thought of finding out the difference in times between the two
recordsources, but as it turns out the difference varies ever so slightly
through the datasources, so some records may be out by 0.15 of a second,
others by 0.14, and others by 0.13, for example.
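To make the matching idea concrete, here is a small sketch of tolerance-based matching (Python used purely for illustration, not Access VBA; the record layout and names are my own invention). It pairs a record from one source with the first unmatched record from the other source that has the same message text and a timestamp within 30 seconds, exploiting the fact that both lists are sorted by time so the inner scan can stop early:

```python
TOLERANCE_MS = 30_000  # timestamps may disagree by up to 30 seconds

def match_records(source_a, source_b, tolerance=TOLERANCE_MS):
    """Pair each record in source_a with the first unmatched record in
    source_b that has the same message and a timestamp within tolerance.
    Both inputs are lists of (time_ms, message), sorted by time_ms."""
    matched = [False] * len(source_b)
    pairs = []
    start = 0  # b-records before this index are too old to ever match again
    for t_a, msg_a in source_a:
        # advance past b-records that fell out of the tolerance window
        while start < len(source_b) and source_b[start][0] < t_a - tolerance:
            start += 1
        for j in range(start, len(source_b)):
            t_b, msg_b = source_b[j]
            if t_b - t_a > tolerance:
                break  # sorted by time, so no later record can be closer
            if not matched[j] and msg_b == msg_a:
                matched[j] = True
                pairs.append((t_a, t_b, msg_a))
                break
    return pairs, matched

a = [(1000, "hit"), (5000, "miss")]
b = [(1150, "hit"), (40000, "miss")]  # second record is 35 s late: no match
pairs, matched = match_records(a, b)  # pairs -> [(1000, 1150, "hit")]
```

Records left unmatched in the second source would then be the ones to append to the superset.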

So, what I have done up until now is loop through the recordset, but as
expected, this takes a long time (note that in the following code, records
are not yet deleted or added; they are just marked as being matched).

I would love to hear if there are better ways of approaching this problem.
Many thanks for any help that you can provide.

Dim cnn As ADODB.Connection
Dim rs1 As ADODB.Recordset
Dim varArray As Variant
Dim strSql1 As String
Dim strMsg As String
Dim strDataSource1 As String, strDataSource2 As String
Dim lngRecordCount1 As Long, lngParseID As Long
Dim lngTime1 As Long, lngCurrRecord As Long
Dim i As Long, j As Long
Dim CurrTime As Single

CurrTime = Timer   'Timer returns seconds since midnight

'Recordset to retrieve the combat log
Set cnn = CurrentProject.Connection
Set rs1 = New ADODB.Recordset

strSql1 = "SELECT ParseID, lngTime, strCharName, strMsg, lngParseID " & _
          "FROM tblParsedData ORDER BY lngTime ASC"

rs1.Open strSql1, cnn, adOpenKeyset, adLockOptimistic

varArray = rs1.GetRows
lngRecordCount1 = rs1.RecordCount - 1   'GetRows array is zero-based

strDataSource1 = "Leandra"
strDataSource2 = "Spuge"
lngCurrRecord = 0

For i = 0 To lngRecordCount1
    If Not IsNull(varArray(4, i)) Then
        'Already matched, so move on to the next i
    Else
        If varArray(2, i) = strDataSource1 Then
            strMsg = varArray(3, i)
            lngParseID = varArray(0, i)
            lngTime1 = varArray(1, i)
            For j = lngCurrRecord To lngRecordCount1
                If (varArray(1, j) - lngTime1) > 30000 Then
                    Exit For   '>30 seconds, stop looking for a match
                Else
                    If IsNull(varArray(4, j)) Then   'not yet matched?
                        If varArray(2, j) = strDataSource2 Then
                            If varArray(3, j) = strMsg Then
                                varArray(4, j) = lngParseID   'cross-link pair
                                varArray(4, i) = varArray(0, j)
                                lngCurrRecord = j
                                Exit For
                            End If
                        End If
                    End If
                End If
            Next j
        End If
    End If
Next i

Debug.Print Timer - CurrTime
 
John Spencer

I might try truncating the "time" field to the nearest second (or other
appropriate interval). Int(LngTime/1000) as NewTime

Then you could either import ALL the records and use distinct in the query

SELECT Distinct Int(LngTime/1000) as CalcLongTime, ...


Or do the truncation during the Append
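The truncation idea can also be sketched outside SQL (Python for illustration only; the tuple layout is hypothetical). Truncating each timestamp to whole seconds makes records that differ only by sub-second clock skew compare equal, so a set union removes the duplicates:

```python
def merged_superset(records_1, records_2):
    """Union of both sources with timestamps truncated to whole seconds.
    Each record is (time_ms, message); time_ms // 1000 mirrors the SQL
    expression Int(LngTime/1000)."""
    truncated = {(t // 1000, msg) for t, msg in records_1 + records_2}
    return sorted(truncated)

r1 = [(1200, "hit"), (2500, "miss")]
r2 = [(1350, "hit"), (9000, "crit")]  # "hit" differs by 150 ms: same second
result = merged_superset(r1, r2)      # -> [(1, "hit"), (2, "miss"), (9, "crit")]
```

One caveat of any truncation scheme: two records only a few milliseconds apart can still straddle a second boundary and truncate to different values.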



'====================================================
John Spencer
Access MVP 2002-2005, 2007
Center for Health Program Development and Management
University of Maryland Baltimore County
'====================================================
 
Guest

Hi John,

Thanks very much for replying.

Obviously we are thinking along the same lines, because I have been playing
around with the *exact* methodology before I read your post. Weird.

It does give me very good results, but the issue remains that it's still not
100% accurate, because the time differences are not always the same. So,
whilst most duplicates are correctly removed by the SELECT DISTINCT, some
remain because they are out by a fraction of a second that straddles a second
boundary, so after truncation those matching records end up one second apart.

Not quite sure how to get around this, and I am not sure if it's actually
possible.

Thanks again.
 
Guest

OK, so what I have done is loop through the first 100 or so records to
obtain an average time differential, and then apply that as a time offset.
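That approach could be sketched as follows (Python for illustration only; the function names and the pairing step are my own, and it assumes you already have a sample of record pairs known to describe the same events):

```python
def average_offset_ms(matched_pairs, sample_size=100):
    """matched_pairs: list of (time_a_ms, time_b_ms) for records known to be
    the same event. Returns the mean of (time_b - time_a) over the first
    sample_size pairs."""
    sample = matched_pairs[:sample_size]
    return sum(t_b - t_a for t_a, t_b in sample) / len(sample)

def apply_offset(records_b, offset_ms):
    """Shift source-B timestamps back onto source-A's clock before
    comparing or truncating."""
    return [(round(t - offset_ms), msg) for t, msg in records_b]

pairs = [(1000, 1140), (2000, 2150), (3000, 3160)]
offset = average_offset_ms(pairs)   # mean of 140, 150, 160 -> 150.0
```

Averaging over a sample smooths out the small per-record jitter (0.13–0.15 s in the example above), which is why this works better than assuming a single exact difference.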

It seems pretty accurate.

Thanks again.
 
