Mike
List,
I call this a "parsing problem", but it could just as well be called a
formatting or regular-expression problem. I have a set of data that was
formerly processed on OS/390 (hence with a lot of horsepower). It has
now been moved to a database, from which I can retrieve it via a web
service with a C# client. The data looks like this:
ABLATION ENDOMETRIAL (HYSTEROSCOPIC)
ABLATION HEART (CONDUCTION DEFECT)
ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND)
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION
ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER
ABLATION LESION HEART ENDOVASCULAR APPROACH
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR APPROACH
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC) APPROACH
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC APPROACH
ABLATION PITUITARY
ABLATION PITUITARY BY COBALT-60
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK)
ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL
ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY
ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION (TUNA)
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL ABLATION (RCSA)
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL ABLATION (RCSA)
ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART
ABLATION VESICLE NECK (ANAT = 60.02)
I need to format 24,000 lines of such data into a tree like the one
below. The key is that any leading run of words shared by several
records becomes a heading. For example, ABLATION is common to all the
rows and so is the heading for all of them. HEART (CONDUCTION DEFECT)
is common to two rows and so becomes a heading with a sub-entry. Any
ideas on efficient approaches in C# would be greatly appreciated.
ABLATION
  ENDOMETRIAL (HYSTEROSCOPIC)
  HEART (CONDUCTION DEFECT)
    WITH CATHETER
  INNER EAR (CRYOSURGERY) (ULTRASOUND)
    BY INJECTION
  LESION HEART
    BY PERIPHERALLY INSERTED CATHETER
    ENDOVASCULAR APPROACH
    MAZE PROCEDURE (COX-MAZE)
      ENDOVASCULAR APPROACH
      OPEN (TRANS-THORACIC) APPROACH
      TRANS-THORACIC APPROACH
  PITUITARY
    BY
      COBALT-60
      IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC
      PROTON BEAM (BRAGG PEAK)
  PROSTATE (ANAT = 59.02)
    BY
      LASER, TRANSURETHRAL
      RADIOFREQUENCY THERMOTHERAPY
      TRANSURETHRAL NEEDLE ABLATION (TUNA)
    PERINEAL BY
      CRYOABLATION
      RADICAL CRYOSURGICAL ABLATION (RCSA)
    TRANSURETHRAL
      BY LASER
      CRYOABLATION
      RADICAL CRYOSURGICAL ABLATION (RCSA)
  TISSUE HEART - SEE ABLATION, LESION, HEART
  VESICLE NECK (ANAT = 60.02)
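
One possible way to get this grouping, offered only as a rough sketch:
treat each record as a sequence of space-separated words, insert the
records into a word-level prefix tree, and when printing, merge any run
of nodes that neither branches nor ends a record into a single heading.
The names PrefixNode, PrefixTree, Build and Render are made up for
illustration. The sketch assumes one complete record per line (wrapped
lines already rejoined), exact word-for-word matching, and two-space
indentation per level.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public sealed class PrefixNode
{
    // Children keyed by the next word; ordinal sorting keeps the output order stable.
    public SortedDictionary<string, PrefixNode> Children { get; } =
        new SortedDictionary<string, PrefixNode>(StringComparer.Ordinal);

    // True when some input record ends exactly at this node.
    public bool IsRecordEnd { get; set; }
}

public static class PrefixTree
{
    // Inserts every record word by word; cost is linear in the total number of words.
    public static PrefixNode Build(IEnumerable<string> records)
    {
        var root = new PrefixNode();
        foreach (var record in records)
        {
            var words = record.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            var node = root;
            foreach (var word in words)
            {
                if (!node.Children.TryGetValue(word, out var next))
                {
                    next = new PrefixNode();
                    node.Children.Add(word, next);
                }
                node = next;
            }
            if (words.Length > 0)
            {
                node.IsRecordEnd = true;
            }
        }
        return root;
    }

    // Renders the tree as indented text, one heading or entry per line.
    public static string Render(PrefixNode root, string indentUnit = "  ")
    {
        var sb = new StringBuilder();
        Write(root, 0, indentUnit, sb);
        return sb.ToString();
    }

    private static void Write(PrefixNode node, int depth, string indentUnit, StringBuilder sb)
    {
        foreach (var pair in node.Children)
        {
            var label = new StringBuilder(pair.Key);
            var current = pair.Value;

            // Merge a chain that neither branches nor ends a record into one heading.
            while (!current.IsRecordEnd && current.Children.Count == 1)
            {
                var only = current.Children.First();
                label.Append(' ').Append(only.Key);
                current = only.Value;
            }

            for (int i = 0; i < depth; i++)
            {
                sb.Append(indentUnit);
            }
            sb.Append(label).Append('\n');
            Write(current, depth + 1, indentUnit, sb);
        }
    }
}
```

Calling PrefixTree.Render(PrefixTree.Build(lines)) on the rows above
should give headings such as "PERINEAL BY" and "MAZE PROCEDURE
(COX-MAZE)", because those word runs are shared by several records and
never branch in between. Matching is purely textual, so a row like
"TISSUE HEART - SEE ABLATION, LESION, HEART" stays a single entry.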