Today I attended a lecture by Ray R. Larson, Krishna Janakiraman, and Brian Tingle of the University of California, School of Information, describing their efforts to create a new cross-indexing system for massive libraries and archives such as the Library of Congress, the Vatican, the Online Archive of California, Virginia Heritage and many others. It is a remarkably complex task, requiring a great many flexible tags to combine the blizzard of data coming from all these sources and then redisplay it in a comprehensible form. As wonderful as their efforts are, it seemed to me that they were too fixed in their categories and didn’t address the trustworthiness of information or offer a sufficiently workable way of disambiguating an infinity of things. Take the common name John Anderson, for example. With no more than that name, it would be a very long struggle to find additional linkages that were meaningful, because almost all of them would be false associations. With large databases the problem grows massively and perhaps becomes exponential in its complexity, because no matter how carefully the terms are defined there will soon be exceptions. Rules are difficult to apply even to simple cases, and unfortunately a large percentage of cases will not be simple. A method must be found which is infinitely expandable and easily tractable by computers.

What might work for a massive search of complex data would be to attach a stickiness charge to each piece of information, based on how reliable the information is considered to be. Carefully documented sources like the Library of Congress (LOC) would receive a high stickiness score, and an undocumented single-source mention would receive a very low one. Thus when a John Anderson of Oakland, CA is mentioned in a letter, we have three pieces of information, all with low stickiness scores. But when they are combined with an envelope postmarked July 4, 1932, and addressed to 890 53rd Street, Oakland, CA, the address and all of the other pieces of information adhere together, and together they achieve a high stickiness score. When another bit of data comes in about a John Anderson attending a high school near that address in 1930, it would at first have a low stickiness, but when we discover an Oakland resident named John Anderson attending UC Berkeley in 1935, these things begin to form a consistent pattern and we may suspect it is the same person. It would make sense to attach a JulianA tag similar to: JulianA~John Anderson (location – mail), 2426892.5, 37.8377, -122.2728 SN55. Here 2426892.5 is the Julian date of the postmark, 37.8377 and -122.2728 are the latitude and longitude of the address, and SN55 is the stickiness number. By attaching the 30-digit JulianA tag to this John Anderson we can disambiguate him from all other John Andersons, except possibly a Sr. or Jr. living at the same residence.
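As a rough sketch of how such a tag might be assembled, the Python below converts a calendar date to a Julian date and formats it with a latitude, longitude and stickiness number. The function names (julian_date, make_juliana_tag) and the exact field layout are my own assumptions for illustration; the only thing taken from the example above is the general shape of the tag.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal (0001-01-01 == 1)
# and the Julian date at 0h UT of the same calendar day.
_JD_OFFSET = 1721424.5


def julian_date(d: date) -> float:
    """Return the Julian date at 0h UT for a Gregorian calendar date."""
    return d.toordinal() + _JD_OFFSET


def make_juliana_tag(name, kind, when, lat, lon, stickiness):
    """Format a JulianA-style tag (hypothetical layout, for illustration only)."""
    if not 0 <= stickiness <= 99:
        raise ValueError("stickiness must be a two-digit number, 00-99")
    return (
        f"JulianA~{name} ({kind}), "
        f"{julian_date(when):.1f}, {lat:.4f}, {lon:.4f} SN{stickiness:02d}"
    )


# The postmarked envelope from the example: July 4, 1932, Oakland, CA.
tag = make_juliana_tag(
    "John Anderson", "location – mail", date(1932, 7, 4), 37.8377, -122.2728, 55
)
print(tag)
```

Run as written, this should reproduce the example tag, since July 4, 1932 at 0h UT corresponds to Julian date 2426892.5.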

Once we can attach the JulianA tag to any bit of information, then as other mentions of John Anderson enter our data bank they can quickly be identified as the same person and given the same JulianA number with a stickiness number. Those stickiness numbers could run from 00 to 99, with the first digit being the quality of the source and the second digit being the quality of the particular data, with 33 being neutral. 00 would be attached to a proven fraud, 10 would indicate an unknown source and unknown quality, and 99 would be reserved for a fully responsible, easily verified source with carefully documented original information. In the John Anderson example above, SN55 would indicate a good but not excellent stickiness number. If a newly found mention of John Anderson has some quality which can be spatially or temporally located elsewhere, then he can be tagged with an additional 30-digit JulianA tag number. Each newly discovered mention can be easily linked back to the tag with the best stickiness number. When our John Anderson's birth date is discovered and his birth certificate verified, he would be tagged with a (birth) tag and an SN of 88, and each of these new locations in time and space could be easily identified and linked with a 30-digit number to this person and their data location. After a while this person would have hundreds, perhaps millions, of these 30-digit tags (from cell phone tracking, for example) linking back to an exact time and place for this person. This person could now be uniquely identified and tracked through all time and space, and any data could be reliably attached to his file. This method is not limited to persons but can be used on any document. Anything could be similarly identified and tracked.
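To make the two-digit scheme concrete, here is a minimal Python sketch that splits a stickiness number into its source-quality and data-quality digits and picks the best-scored record for a person. The class name Stickiness, the helper best_record and the sample record labels are all hypothetical, invented only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stickiness:
    source: int  # first digit: quality of the source, 0-9
    data: int    # second digit: quality of the particular data, 0-9

    def __post_init__(self):
        for digit in (self.source, self.data):
            if not 0 <= digit <= 9:
                raise ValueError("each stickiness digit must be 0-9")

    @classmethod
    def parse(cls, code: str) -> "Stickiness":
        """Parse 'SN55' or '55' into its two digits."""
        digits = code[2:] if code.startswith("SN") else code
        if len(digits) != 2 or not digits.isdigit():
            raise ValueError(f"not a two-digit stickiness number: {code!r}")
        return cls(int(digits[0]), int(digits[1]))

    def __str__(self) -> str:
        return f"SN{self.source}{self.data}"


def best_record(records):
    """Return the (label, Stickiness) pair with the highest two-digit score."""
    return max(records, key=lambda r: (r[1].source, r[1].data))


# Hypothetical records for one John Anderson, scored as in the text.
records = [
    ("letter mention", Stickiness.parse("SN10")),
    ("postmarked envelope", Stickiness.parse("SN55")),
    ("verified birth certificate", Stickiness.parse("SN88")),
]
label, score = best_record(records)
print(label, score)
```

Ordering by the source digit first and the data digit second is the same as ordering by the two-digit number itself, so a new mention would be linked back to the record carrying SN88 here.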