Parsing fixed-width data columns using enums
This week at work I had to deal with fixed-width column data. That is: a plain text file where each line holds a single record with a predefined order of columns/fields. Each column has a predefined length (in characters), and, if the value is shorter, it is padded with spaces. Every line contains every column, and, therefore, every line has exactly the same length. For example,
Field A  |Field B    |Field C      |…
01234567890123456789AB0123456789ABCD…
X         YYY         ZZZ           …
…
with the column lengths of 10, 12, and 14 respectively, would be parsed into
[{
  A: "0123456789",
  B: "0123456789AB",
  C: "0123456789ABCD"
}, {
  A: "X",
  B: "YYY",
  C: "ZZZ"
}, …]
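To make the layout concrete, the most naive way to slice such a line would be a handful of hard-coded offsets. Here is a throwaway sketch for the hypothetical 10/12/14 layout above (the field names A, B, C are made up):

// Naive slicing with hard-coded offsets for the 10/12/14 example layout.
// trim() removes the space padding.
fun sliceNaively(line: String): Triple<String, String, String> = Triple(
    line.substring(0, 10).trim(),
    line.substring(10, 22).trim(),
    line.substring(22, 36).trim()
)

This works, but the offsets are easy to get wrong and tedious to update whenever the format changes, which is exactly what the approach below takes care of.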
While parsing the data, I came up with a very natural approach that relies on Kotlin/Java enums. Nothing groundbreaking or novel here; I'm sure this trick is familiar to many. But I liked how naturally the tool suits the problem, and decided to share it.
fun parseLine(rawData: String): BankRecord {
    val parseStringBound = { f: RawBankDataField -> parseString(rawData, f) }
    return BankRecord(
        clearingNumber = parseInt(rawData, RawBankDataField.CLEARING_NUMBER),
        name = parseStringBound(RawBankDataField.NAME),
        postalAddress = parseStringBound(RawBankDataField.POSTAL_ADDRESS),
        postalCode = parseStringBound(RawBankDataField.POSTAL_CODE),
        city = parseStringBound(RawBankDataField.CITY)
    )
}
data class BankRecord(
    val clearingNumber: Int,
    val name: String,
    val postalAddress: String,
    val postalCode: String,
    val city: String
)
private fun parseInt(rawData: String, f: RawBankDataField): Int {
    return parseString(rawData, f).toInt()
}

private fun parseString(rawData: String, f: RawBankDataField): String {
    return rawData.substring(indicesRange(offset(f), f.length)).trim()
}
private fun indicesRange(start: Int, length: Int) = start.until(start + length)
// Offset of a field = sum of the lengths of all fields declared before it.
private fun offset(f: RawBankDataField): Int {
    var result = 0
    for (i in RawBankDataField.values()) {
        if (i == f) return result
        result += i.length
    }
    return result
}
private enum class RawBankDataField(val length: Int) {
    GROUP(2), // Gruppe
    CLEARING_NUMBER(5), // BCNr
    SUBSIDIARY_ID(4), // Filial-ID
    NEW_CLEARING_NUMBER(5), // BCNr neu
    SIC_NUMBER(6), // SIC-Nr
    MAIN_OFFICE_CLEARING_NUMBER(5), // Hauptsitz
    CLEARING_NUMBER_TYPE(1), // BC-Art
    VALID_SINCE(8), // gültig ab
    SIC(1), // SIC
    EURO_SIC(1), // euroSIC
    LANGUAGE(1), // Sprache
    SHORT_NAME(15), // Kurzbez.
    NAME(60), // Bank/Institut
    DOMICILE_ADDRESS(35), // Domizil
    POSTAL_ADDRESS(35), // Postadresse
    POSTAL_CODE(10), // PLZ
    CITY(35) // Ort
}
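To see the whole thing in action, here is a quick sketch that assembles a synthetic input line by padding each field to its declared width and then parses it back. The sample values are invented, and the function has to live in the same file, since the enum is private:

// Demo: build a fake input line by padding every field to its width
// (invented sample values), then parse it.
fun main() {
    val sample = mapOf(
        RawBankDataField.CLEARING_NUMBER to "8148",
        RawBankDataField.NAME to "Some Bank AG",
        RawBankDataField.POSTAL_ADDRESS to "Postfach 42",
        RawBankDataField.POSTAL_CODE to "8022",
        RawBankDataField.CITY to "Zürich"
    )
    val rawLine = RawBankDataField.values()
        .joinToString("") { f -> (sample[f] ?: "").padEnd(f.length) }
    println(parseLine(rawLine))
    // BankRecord(clearingNumber=8148, name=Some Bank AG, postalAddress=Postfach 42, postalCode=8022, city=Zürich)
}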
Note that, as in this example, we may need only a subset of the fields present in the data source. We still have to enum-erate all the columns, though, up to the rightmost one relevant to us. I particularly like how easy it is to adapt to changes in the data source format and to support additional columns.
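For instance (a hypothetical extension, not something my task needed): if we later also wanted the bank's short name, SHORT_NAME is already enum-erated, so it is just one more call, and no offsets have to change:

// Hypothetical extension: pull out one more column.
// Nothing about the offset arithmetic changes.
fun parseShortName(rawData: String): String =
    parseString(rawData, RawBankDataField.SHORT_NAME)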
One last remark: if you have a lot of relevant fields packed into each data line, and a lot of lines, you may want to memoize the offset() function for better performance.
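For example (a sketch of one way to do it, not the code I actually used), the offsets could be precomputed once into a map keyed by the enum constant, with the loop-based offset() above replaced by a lookup:

// Memoized offsets: computed once, in enum declaration order.
private val fieldOffsets: Map<RawBankDataField, Int> = run {
    val offsets = mutableMapOf<RawBankDataField, Int>()
    var offset = 0
    for (f in RawBankDataField.values()) {
        offsets[f] = offset
        offset += f.length
    }
    offsets
}

// Drop-in replacement for the loop-based offset() above.
private fun offset(f: RawBankDataField): Int = fieldOffsets.getValue(f)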