Structured vs. Unstructured Digital Indexing Guide

Naming your digital files, or “indexing” them, is an essential part of any good digitization project. If you don’t name them coherently and can’t find them after they’re scanned, what would be the point of scanning in the first place?

And although there’s an infinite number of ways you could index your records, we’re going to describe the two most common methods: unstructured and structured indexing.

When you’re done with this article, you’ll know how both methods can be good or not so good for your project, and when it may make sense to choose one over the other.

What Is Indexing?

Indexing is naming a file once it’s been digitized (scanned into a digital format).

It’s a simple concept, but when you get down into the details it can be nuanced. Aside from the quantity of records, indexing as part of a digital conversion project is typically one of the main aspects that can make it either expensive or cost-effective.

One part of it is the “level” of indexing. If you take a roll of microfilm, you could do roll level indexing, which is capturing the information from the microfilm box, or file level indexing, which is going into the individual images and finding pieces of information to capture and “bookmark.” File level indexing will be more expensive than roll level.

In paper scanning, you can equate that to a box. You can have box level indexing, or you could have file or page level indexing. There are many ways to peel the orange here, and it comes down to what level of indexing you need for your specific situation.

Second, there’s how you’re going to structure the data once it’s captured, regardless of where that information came from.

This means that once you’ve captured the information, either from the roll label, the box, the microfiche title, or individual files, that data needs to be formatted and applied to the digital images for future identification and retrieval.

The two types of formatting are unstructured and structured. Said another way, unfielded versus fielded data. We’ll describe each of these in the sections below.

Unstructured Data

Unstructured data is when you capture information for naming a file and use it as the digital name as-is (exactly how it appeared on the physical record).

Let’s say we’re doing a microfilm scanning project and you want “roll level” indexing. We’re going to scan the film and then use the information from the microfilm roll box to create the digital file name. But the key part is that the data captured will be replicated as-is and not manipulated or reformatted.

If there’s a first name, a last name, a student ID number, and then a date on the microfilm label, the digitized file will be named exactly as it was on that roll label, with no special formatting or fielding. It’ll be just like it was on the microfilm box.

Unstructured indexing is capturing exactly what is presented and replicating it in a digital format.

The benefit of unstructured data is that you don’t have to think too much about making it nice and fancy or worry about how it’s going to look when it’s done, because you’re just replicating what was on the physical copies! Replicating what you use now is familiar, simple, and cost-effective.

Capturing this microfiche title in an unstructured format would look like this:
113 ABBOTT STREET 78-1018 1 OF 1
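
As a minimal sketch in Python, unstructured naming amounts to passing the captured text straight through. The function name and the `.pdf` extension here are assumptions for illustration, not part of any particular scanning system.

```python
def unstructured_file_name(label_text: str, extension: str = ".pdf") -> str:
    """Build a file name by replicating the captured label text as-is."""
    # Only stray whitespace at the ends is trimmed; the words, their order,
    # and the spacing between them stay exactly as captured.
    return label_text.strip() + extension


print(unstructured_file_name("113 ABBOTT STREET 78-1018 1 OF 1"))
# prints: 113 ABBOTT STREET 78-1018 1 OF 1.pdf
```

Nothing is parsed or rearranged, which is why this approach tends to be the simplest to produce.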

Structured Data

Structured data, on the other hand, is when you manipulate and tweak the data after you capture the information from the record.

Let’s say you’ve got boxes of personnel files, all in standard folders. The folders are taken out of the box, scanned, and then the information is keyed from the folder to create the file name. Assuming that the information on the folder is first name, last name, middle initial, date of birth, and social security number, you capture all that information and then, instead of presenting it as it was on the physical copy, you restructure it to fit the layout you’ve determined.

Rather than keeping the original order of first name, middle initial, last name, date of birth, and social security number, you may want the data arranged a certain way so it can load or import into a system of yours.

That can mean you need the information to be arranged exactly as last, first, middle, SSN, DOB (or whatever you decide). It needs to be fielded a certain way.

Other aspects of structured data may be that you need a certain punctuation type to separate the info, such as an underscore (_) or a dash (-) or two dashes (--) or something else that is specific to the system you’re using or how you need these file names formatted.
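
To make the idea of field order and separators concrete, here is a small Python sketch. The function name, the field names, and the sample values (including the placeholder SSN) are all made up for illustration; your own layout and separator would come from the system you’re importing into.

```python
def build_structured_name(fields: dict[str, str], order: list[str], separator: str = "_") -> str:
    """Join captured fields in the requested order with the requested separator."""
    # A record that lacks a required field cannot be named with this layout.
    missing = [name for name in order if not fields.get(name)]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return separator.join(fields[name].strip() for name in order)


# Fields as keyed from a (fictional) personnel folder.
captured = {
    "first": "Jane",
    "middle": "Q",
    "last": "Doe",
    "dob": "1980-01-31",
    "ssn": "000-00-0000",
}

# The layout you decided on: last, first, middle, SSN, DOB.
layout = ["last", "first", "middle", "ssn", "dob"]

print(build_structured_name(captured, layout))
# prints: Doe_Jane_Q_000-00-0000_1980-01-31
```

Changing the layout list or the separator changes the file name without touching the captured data.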

Reiterating what was mentioned before: you want to use structured data when you have a certain methodology to your indexing that you want to replicate that is different from how the data is originally presented on the physical records. Or, you do have the file name presented how you like on the original records but you need to add a formatting item (such as an underscore or comma) to make sure the file imports properly into your document management system.

Capturing this microfiche title in a structured format might look like this,
depending on which fields need to be captured and the presentation order:
78-1018_113 ABBOTT STREET
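
One way to get from the as-presented title to a fielded name like this is to split the title into fields and then reassemble them. The Python sketch below assumes the title always follows the layout “street address, a NN-NNNN number, sheet N OF N”; that layout, the field names, and the function name are assumptions made for this example.

```python
import re

# Assumed layout: "<street address> <NN-NNNN record number> <sheet> OF <total>"
TITLE_PATTERN = re.compile(
    r"(?P<address>.+?)\s+(?P<record_number>\d{2}-\d{4})\s+(?P<sheet>\d+)\s+OF\s+(?P<total>\d+)"
)


def structured_name_from_title(title: str) -> str:
    """Parse a title strip into fields and return '<record number>_<address>'."""
    match = TITLE_PATTERN.fullmatch(title.strip())
    if match is None:
        # Titles that do not fit the layout need a person to look at them.
        raise ValueError(f"title does not fit the assumed layout: {title!r}")
    return f"{match['record_number']}_{match['address']}"


print(structured_name_from_title("113 ABBOTT STREET 78-1018 1 OF 1"))
# prints: 78-1018_113 ABBOTT STREET
```

The sheet count is parsed but left out of the name here, which mirrors choosing which fields need to be captured.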

Is One Better Than The Other?

Neither structured nor unstructured data is inherently better than the other. It comes down to what you need, what you’re happy with, and what will work when your files are digitized.

The problem that people run into when they do a conversion is that they often assume that just because a file is digitized, “of course it’s going to be named a certain way!” and that it’ll be incredibly easy to locate and use. It might turn out that way, if you’ve specified how you want the files named and all the records contain the information necessary to create that index. But it’s not always the case, especially when we’re capturing the record information as presented (unstructured). If the information on the physical records is messy, the file names will be messy too. But that’s how the records originally existed anyway!

Also, just because you change the indexing to be fielded does not mean it’s going to work or be better than how you have it on the physical records. Unless there’s a reason to field the data, you may be shooting yourself in the foot by making it more complicated and changing how you access the records versus what you do now.

We often recommend that our clients use the building block approach:

Start by replicating the indexing methodology you use now. That may mean capturing file information as-is and testing it out a bit. If it works and you find this acceptable, then you’re done! You already know how to find the data this way and now it’s even better since it’s digital, so you don’t have to do anything else.

If you find that you’d like more granular file indexing, then move on to structuring and fielding your records to fit your needs.

You can always increase the level of indexing later on, but if you do it at the beginning of a project and it doesn’t turn out like you’d hoped, you’ve already spent the money and created an even bigger mess! Try to avoid this.

How Does Choosing One Affect Your Project?

Choosing structured (fielded) data as your digital indexing methodology will typically be more expensive than unstructured data (capturing as presented). It isn’t guaranteed to be, though. Pricing will come down to the complexity of what you’re requesting.

For instance, we’ll assume you have 100,000 microfiche sheets and we’re capturing name, date of birth, and social security number from microfiche title strips. You ask for structured data by simply adding an underscore between fields:

Name_DOB_SSN

If those title strips are consistent across the majority of the project, this is not going to break the bank.

The key is consistency. If there are 50 different variations/layouts of the title, or we find many fiche with additional pieces of information that have to be parsed, it might be more costly. But based on three fields, consistently named on the fiche titles, this isn’t an expensive request.
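
A rough way to gauge consistency before committing to a structured layout is to check a sample of captured titles against the layout you expect. In this Python sketch, the expected layout (“NAME MM/DD/YYYY NNN-NN-NNNN”), the function name, and the sample titles are all hypothetical.

```python
import re

# Hypothetical expected layout: "NAME MM/DD/YYYY NNN-NN-NNNN"
EXPECTED_LAYOUT = re.compile(r".+ \d{2}/\d{2}/\d{4} \d{3}-\d{2}-\d{4}")


def split_by_layout(titles):
    """Separate titles that fit the expected layout from those needing review."""
    consistent, needs_review = [], []
    for title in titles:
        fits = EXPECTED_LAYOUT.fullmatch(title.strip()) is not None
        (consistent if fits else needs_review).append(title)
    return consistent, needs_review


# Made-up sample of title strips; the last one has a different layout.
sample = [
    "DOE, JANE 01/31/1980 000-00-0000",
    "ROE, RICHARD 07/04/1975 000-00-0001",
    "DOE, JOHN 000-00-0002 BOX 12",
]

consistent, needs_review = split_by_layout(sample)
print(f"{len(consistent)} of {len(sample)} titles fit the expected layout")
for title in needs_review:
    print("needs review:", title)
```

The larger the share of titles that land in the review pile, the more parsing effort the project is likely to need.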

The complexity comes when you want to rearrange data, or change data, or add things such as padding numbers with zeroes to make them a certain number of digits. When you start adding complexity, price typically goes up.
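
Padding numbers with zeroes is a good example of a small rule that adds work to every record. The Python sketch below pads a box number to four digits and a file number to six; those widths, the field names, and the function names are assumptions chosen only for illustration.

```python
def pad_number(value: str, width: int) -> str:
    """Left-pad a captured number with zeroes to a fixed number of digits."""
    digits = value.strip()
    if not digits.isdigit():
        # Anything that is not purely digits needs a decision from a person.
        raise ValueError(f"not a plain number: {value!r}")
    # zfill leaves the value unchanged if it is already at least `width` long.
    return digits.zfill(width)


def padded_name(box: str, file_number: str) -> str:
    """Build '<4-digit box>_<6-digit file number>' from captured values."""
    return f"{pad_number(box, 4)}_{pad_number(file_number, 6)}"


print(padded_name("7", "42"))
# prints: 0007_000042
```

Each extra rule like this is another place where a record can fail to fit, which is part of why added complexity tends to raise the price.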

Lastly, the timeline can have an impact on price depending on the complexity of the project and the complexity of the indexing. If the naming requirement includes loads of information (such as numerous fields), it may take longer to capture the data than if it were a single piece of data. For instance, if we’re working on planning department records and all we need to capture is a street address, that’s a lot quicker than capturing street address, permit number, and an assessor’s parcel number (APN).

The more information you’re capturing, the longer it takes, so if you need the project done faster, more resources have to be assigned, and this can increase cost.

Next Steps

Reach out to us today! Click the “Get Your Quote” button below, fill out the form, and we’ll quickly reply to you to discuss your project.

GET YOUR QUOTE!
