Managing Unstructured Data with Big Data Notions

  • Written by Sam Edens

Data is traditionally structured. It makes the data easy to read and simple to consume. Are those statements actually true? I have a hard time accepting that perception. What if I told you we collectively perceive data as traditionally structured because we choose to ignore unstructured data? All that remains is the structured data, and because that becomes the primary focus, it feels natural to claim data is traditionally structured. It makes the data easier to read and simpler to consume.

Unstructured data is defined as information that does not have a pre-defined model or is not organized in any sensible manner. This is not a new phenomenon. However, the sheer quantity of unstructured data is on the rise across all industry verticals, and there is growing interest in ingesting that data. One of the largest examples of this is in the healthcare space. Take a minute to think about all the pre-printed medical forms, questionnaires, insurance forms, procedure and diagnosis forms, and claim forms you are exposed to as a patient or guarantor. That is only what you see. Those same forms can exist in a variety of formats and layouts depending on the insurance company, the hospital or medical facility, or the state or locality where the services are performed. The difference may be a single field or a completely different-looking form altogether. Whether typed into an electronic form or scanned into a system as an image, the data is likely unstructured, making it nearly impossible to develop software that reads, recognizes, and accurately ingests all of the data all of the time.

Instead of setting unstructured data aside, we need to use technology to reduce the constraints and structure the data. This will allow for the consumption of more data, which is really what enterprises are after. By understanding the data, you are creating a modern architecture that is ideally easy to use, repeatable, and secure. What could you accomplish with more data? What could society accomplish with more data, specifically in the healthcare space? Could we prevent outbreaks? Could we cure disease?

All enterprises deal with unstructured data. Some have more than others, but unstructured data exists everywhere. There are many ways of thinking about unstructured data, getting it into a structured format, and ultimately loading it into the system or data warehouse. It can be done manually, with users reading and inputting the data; it can be done electronically, with extract, transform, load (ETL) processes handling most of the data; or it can be a combination of electronic and manual effort. Making the process as electronic and systematic as possible will be very helpful, but remember it is nearly impossible to consume 100% of the data this way. Some manual effort will still be required when, for example, a form field is blank or its value has been entered in the next box.
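The combined electronic and manual approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production design: the field names in `REQUIRED_FIELDS`, the `IngestResult` container, and the `normalize` and `ingest` functions are all hypothetical. The idea is that records the automated step can fully populate are loaded, while records with blank required fields are set aside for a person to review.

```python
from dataclasses import dataclass, field

# Hypothetical field names; a real form would define its own schema.
REQUIRED_FIELDS = ("patient_name", "date_of_service", "diagnosis_code")


@dataclass
class IngestResult:
    loaded: list = field(default_factory=list)         # records ready to load
    manual_review: list = field(default_factory=list)  # records a person must check


def normalize(raw: dict) -> dict:
    """Transform step: keep only the expected fields and trim whitespace.

    Missing or empty values become empty strings so they can be detected later.
    """
    return {name: str(raw.get(name) or "").strip() for name in REQUIRED_FIELDS}


def ingest(raw_records: list) -> IngestResult:
    """Route each record to the load list or to the manual review queue."""
    result = IngestResult()
    for raw in raw_records:
        record = normalize(raw)
        missing = [name for name, value in record.items() if not value]
        if missing:
            # Keep the untouched original so the reviewer sees what was captured.
            result.manual_review.append({"record": raw, "missing": missing})
        else:
            result.loaded.append(record)
    return result


# Made-up sample data for illustration only.
forms = [
    {"patient_name": " Jane Doe ", "date_of_service": "2015-03-02", "diagnosis_code": "J10.1"},
    {"patient_name": "John Roe", "date_of_service": "", "diagnosis_code": "E11.9"},
]
result = ingest(forms)
print(len(result.loaded), "loaded;", len(result.manual_review), "sent to manual review")
```

Keeping the original record alongside the list of missing fields means the manual step starts from what was actually captured rather than from a blank screen, which is one way to keep the manual portion as small as possible.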

You need to ask the right questions when deciding to ingest and make sense of unstructured data. It is important to know, or at least question, what you may be missing and what you stand to gain from this additional data. Do not go blindly into the effort of managing unstructured data. Also, do not get caught up in the immediate issues or in only a few use cases. Architect the process for a long-term and comprehensive build-out. You may know the challenges you are facing today, but consider what challenges you may face down the road. Regulation, security, and technology are examples of ever-changing items in our industry that may present future challenges. Finally, data governance must be addressed. This is particularly true in the healthcare industry. The days of only a few fields being deemed personally identifiable information (PII) are fading away. If you are dealing with healthcare data, you are better off treating everything as PII. Secure access to the data on a need-to-know basis, or at least with the understanding that not everyone requires access to everything.

The concept of ingesting unstructured data is relatively new in the debt collection industry. Much of what I have presented crosses the line between what is considered data integration and what is considered Big Data. Managing unstructured data requires integration processes, but understanding why unstructured data is important is a Big Data notion. Regardless of your approach, plan to pay for the sins of the past. Past habits of not capturing all the data and of randomly dropping data into fields that “look” good may end up skewing your results as you ingest unstructured data.

Sam Edens has been with Emprise Technologies since 2006 and is currently serving as Vice President. Prior to his time with Emprise, Sam designed and developed performance and flow management software for UPS.