Kaki said:
Given a file, how do I know if it's ascii or unicode or binary ?
And how do I know if it's rtf or html or etc? In other words,
how do I find the stream type or mime type?
(No, file extension cannot be the answer)
Only large-system operating systems such as VMS [DEC / Compaq] and MVS [IBM]
make any formal distinction between file types. In these systems there are
even physical differences between file types, insofar as they are stored
differently and are accessed by different code routines.
Under operating systems such as DOS / Windows-family, and *NIX / Linux, a
'file' is merely a named, persistent collection of bytes, and the only way
to tell whether a file contains data that is to be interpreted as text or
as binary is by adherence to some convention, such as file extension usage
[e.g. '.txt' indicates a text file etc], or schemes such as searching files
for 'magic numbers' [i.e. byte sequences known to identify particular file
types], an approach heavily used in the *NIX / Linux world [the latter systems
also make distinctions between things like sockets, and devices at the
operating system level, but this hardly helps in identifying file types].
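
To make the 'magic number' idea concrete, here is a minimal sketch in Python [my choice of language for illustration only]. The table and the names sniff_magic and sniff_file are hypothetical, and the handful of signatures listed is nowhere near complete; a real tool would consult a far larger database.

```python
# Hypothetical, deliberately tiny table of well-known leading byte sequences.
MAGIC_NUMBERS = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
    (b"{\\rtf", "application/rtf"),
]


def sniff_magic(data: bytes):
    """Return a guessed MIME type for the leading bytes, or None if unknown."""
    for signature, mime in MAGIC_NUMBERS:
        if data.startswith(signature):
            return mime
    return None


def sniff_file(path):
    """Read only the first few bytes of a file and guess its type."""
    with open(path, "rb") as handle:
        return sniff_magic(handle.read(16))
```

A match is only a hint: several formats share a signature [many document formats are ZIP archives underneath], and nothing stops a file from starting with these bytes by accident.
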
Thus, the answer is: there is no way of guaranteeing what a file's 'type'
actually is. All you can do is adhere to some convention, and hope that
everyone else follows suit. When attempting to access a particular file you
would check to ensure that the data read in conforms to the expected pattern
/ format for that file type.
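
For the ascii / unicode / binary part of the question, the same 'check that the data conforms' approach can be sketched as a heuristic. This is only a guess-maker under stated assumptions: the function name classify_bytes and its labels are hypothetical, a NUL byte is assumed to mean 'not 8-bit text', and text in a legacy 8-bit encoding such as Latin-1 will be reported as 'binary' because it cannot be told apart from arbitrary bytes.

```python
import codecs


def classify_bytes(data: bytes) -> str:
    """Guess 'ascii', 'utf-8', 'utf-16', or 'binary' from raw bytes."""
    # A byte order mark is a strong hint that the data is UTF-16 text.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # NUL bytes almost never occur in ordinary 8-bit text.
    if b"\x00" in data:
        return "binary"
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "binary"
```

If only the first part of a file is passed in, a multi-byte UTF-8 character cut off at the end of the sample will make valid text look like 'binary', so it is safer to classify the whole file or to trim the sample at a line break.
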
For example, an HTML file could be expected to contain a <HTML> tag
somewhere near the start of the file, while many proprietary file formats
[e.g. MS Excel, Word etc] would sport a byte collection known as a 'header'
containing 'fields' with version information and the like. If, in reading
such files, the expected tags are found, or 'sensible' values for each
field are read in, then you can be reasonably sure [though not absolutely
certain] that the 'correct' file type has been accessed.
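
Both kinds of check can be sketched in Python. The function names are hypothetical, and the second check assumes the older binary MS Office files use the OLE compound file container, with its 8-byte signature followed by version and byte-order fields at offsets 24 to 29 as described in Microsoft's published specification. Treat it as an illustration of 'sensible field values', not as a validator.

```python
import re
import struct

# Look for an <html> or <!DOCTYPE html> tag, ignoring case.
HTML_HINT = re.compile(rb"<\s*(!doctype\s+html|html)\b", re.IGNORECASE)

# Signature at the start of an OLE compound file.
OLE_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_html(data: bytes, window: int = 1024) -> bool:
    """True if an HTML tag appears within the first `window` bytes."""
    return HTML_HINT.search(data[:window]) is not None


def looks_like_ole_document(data: bytes) -> bool:
    """True if the header has the signature and plausible field values."""
    if len(data) < 30 or not data.startswith(OLE_SIGNATURE):
        return False
    # Three little-endian 16-bit fields: minor version, major version,
    # byte-order marker.
    minor, major, byte_order = struct.unpack_from("<HHH", data, 24)
    return major in (3, 4) and byte_order == 0xFFFE
```

Passing the second check tells you the container looks right, not whether the contents are a Word document, a spreadsheet, or something else entirely; that would need further reading of the file.
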
Note that I made no mention of 'streams' which are nothing more than
program objects that are temporarily connected or linked to file(s) for
purposes of file data access / updating. Now, it might be possible for such
objects to report information about the file, or the current connection /
linkage status. However, when first establishing a link to a
specified file, such objects can merely make the checks mentioned earlier to
ascertain the 'correctness' of the file.
I'm not sure this is the type of response you were after, but the rather
general nature of your query seemed to warrant it. Additionally, it is the
type of issue that transcends any one programming language / environment.
I hope this helps.
Anthony Borla