Different systems and programs have different ways of identifying the 'type' of a file. Examples include:
- File magic. More than just the first two bytes of a file. If you look at how the unix 'file' command works (or mod_mime_magic in apache) you'll find that it examines bytes throughout the file to determine e.g. the bitrate of an mp3, as well as the fact that it is an mp3. However, generally the first few bytes of a file are used. e.g %! (postscript), #! (shell script), 0xCAFEBABE (java class), PK (zip file). A kind of MagicNumbers.
- MacOs file type and creator FileSystem MetaData (the ResourceFork of an application can contain a description of the types of data files it can create and open)
- OS/2 Extended Attributes
- OLE container (e.g. Word files). The first 512 bytes contains a header block which contains the file metadata you see when you right click on a file and choose 'Properties' in windows.
- SGML DTD
- XML schema
- Pluggable typing system. Some systems like Apache can do 'all of the above'.
- With object persistence, the "type" of a file may be the "class" of the first or "root" instance in the file.
The type information extracted clearly varies a lot
. Which means using the extracted information can be a pain. The MIME type system was supposed to help with this, providing a common level of metadata external to files themselves. By supplying MIME metadata along with an attached file, mail systems (originally) were supposed to be spared the pain of implementing any or all of the above systems. However, MIME can be fairly useless - since few software companies bother to register their file types with IANA, most of the time data will be described as text/plain, text/html, application/xml, or binary/octet-stream.
RE: "With object persistence, the "type" of a file may be the "class" of the first or "root" instance in the file."
[Feel free to delete this one if it doesn't count, but it seems to me that...]
The ultimate 'file' typing system would be an object store of persistent objects.
Here each object has a full type hierarchy and can be accessed as either the base type or any super types.
Many object persistence systems prefix instance data with some kind of class identifier. The first or "root" class of a file could reasonably be considered the file's type.
when this point decided]
[It is worth noting that in some OS, eg PalmOs
, the notion of a filesystem doesn't exist, replaced with a checkpointed view of objects in memory. However the 'file/object typing' issue still comes into play if the system is dynamically typed]
This last bullet raises several good points. We should start thinking beyond the traditional
file system paradigm, which holds us back in the days of separation of data and code. At the very least, we should be thinking in terms of objects, where the raw data is encapsulated, and we invoke methods to access, modify, navigate, and interpret the data. I prefer to think even higher, in terms of services. My Palm doesn't just "hold my data;" it provides storage and management services for my personal information. Note that it doesn't need a file system to do this; I suppose there is an implied concept of "types" of data, but it handles all of that transparently to me -- if I do everything through its user interface, it takes care of keeping appointment data associated with the calendar function, and separate from phone numbers...
I propose that the name of this page be modified to DataTypingSystem?.
[I - the original author - disagree. This page describes file type determination and it is useful to keep it that way. Start a new discussion on DataTypingSystem?
and refer back to here if you want to extend it out. I guess that answers JeffGrigg
's question... You might want to look around first, there's a lot of relevant info on the wiki about type inference in compilers etc - which is irrelevant to files]
is all relevant. Magic numbers are StaticTyping
since they can't be easily changed by the owners of those objects, whereas FilenameExtension
s are DynamicTyping
In DynamicTyping, type is an attribute of the value/object, while in StaticTyping type is an attribute of the variable/container, and must be made manifest externally to the object itself. Magic numbers therefore are DynamicTyping while FilenameExtensions are StaticTyping. Consider "foo.jpeg" and "JpegFile? foo". Extensions are though WeaklyTyped, in that you can "cast" foo.jpeg to foo.exe and do interesting things. ;)