Correcting the column name definitions in a Theobald BEx Query

Spaces and other special characters in the column names of a Theobald BEx Query extract cause errors in Azure Synapse Analytics when the data is loaded into a data warehouse.

We worked on a requirement where a Theobald job extracts the BEx query result into a Parquet* file and places it in an Azure Data Lake Storage (ADLS) Gen2 container. From there, an Azure Synapse pipeline picks up the Parquet file and loads it into a Synapse table.

Parquet: Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads. It is similar to other columnar-storage file formats available in Hadoop, namely RCFile and ORC.
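
To make the hand-off in the flow above concrete, here is a minimal sketch of the read side, assuming a Synapse Spark notebook; the storage account, container, and folder names are placeholders rather than the project's actual values.

```python
# Minimal read-side sketch (storage account, container, and folder names are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Theobald lands the BEx query result as a Parquet file in the ADLS Gen2 container;
# downstream steps read it over the abfss:// endpoint.
path = "abfss://<container>@<storageaccount>.dfs.core.windows.net/Landing/FTP/SLBO/"
df = spark.read.parquet(path)
df.printSchema()  # the column names seen here come straight from the BEx query's Custom Names
```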

Advantages of Storing Data in a Columnar Format:

  • Columnar storage such as Apache Parquet is designed to be more efficient than row-based formats such as CSV. When querying columnar storage, you can skip over non-relevant data very quickly (see the sketch after this list), so aggregation queries take less time than in row-oriented databases. This kind of storage translates into hardware savings and lower latency when accessing data.
  • Apache Parquet is built from the ground up to support advanced nested data structures. The layout of Parquet data files is optimized for queries that process large volumes of data, in the gigabyte range for each individual file.
  • Parquet is built to support flexible compression options and efficient encoding schemes. Because the data in each column is of the same type, compressing a column is straightforward (which makes queries even faster). Data can be compressed using one of several available codecs; as a result, different data files can be compressed differently.
  • Apache Parquet works best with interactive and serverless technologies such as Amazon Athena, Amazon Redshift Spectrum, Google BigQuery, and Google Dataproc.
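
As a small illustration of the first point, the sketch below writes a tiny Parquet file with pandas and reads back a single column with pyarrow; the file and column names are made up for the example.

```python
# Column pruning sketch (file and column names are illustrative only). Because Parquet
# stores each column separately, a reader can fetch just the columns a query needs
# instead of scanning whole rows.
import pandas as pd
import pyarrow.parquet as pq

# Write a small example file.
pd.DataFrame({
    "Plant": ["P100", "P200"],
    "Material": ["M-01", "M-02"],
    "BOQuantity": [5, 12],
}).to_parquet("backorders.parquet")

# Read back only one column; the other columns are never deserialized.
table = pq.read_table("backorders.parquet", columns=["BOQuantity"])
print(table.to_pandas())
```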

Difference between Parquet and CSV

CSV is a simple and common format used by many tools such as Excel, Google Sheets, and numerous others. Even though CSV files are the default format for many data processing pipelines, they have some disadvantages, particularly in cloud environments where pricing depends on how much data is stored and scanned:

  • Amazon Athena and Redshift Spectrum will charge based on the amount of data scanned per query.
  • Google and Amazon will charge you according to the amount of data stored on Google Cloud Storage or Amazon S3.
  • Google Dataproc charges are time-based.

Parquet has helped its users reduce storage requirements by at least one-third on large datasets; in addition, it greatly improves scan and deserialization times and hence overall costs.
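
As a hedged sketch of such a conversion (file names and codec choice are illustrative), pandas can rewrite a CSV extract as compressed Parquet:

```python
# Illustrative CSV-to-Parquet conversion with pandas (file names are placeholders).
# On large datasets this rewrite is where the storage and scan savings shown in the
# table below come from.
import os
import pandas as pd

df = pd.read_csv("backorders.csv")
df.to_parquet("backorders.parquet", compression="snappy")  # snappy is a common default codec

print("CSV size:    ", os.path.getsize("backorders.csv"), "bytes")
print("Parquet size:", os.path.getsize("backorders.parquet"), "bytes")
```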

The following table compares the savings as well as the speedup obtained by converting data into Parquet from CSV.

| Dataset | Size on Amazon S3 | Query Run Time | Data Scanned | Cost |
| --- | --- | --- | --- | --- |
| Data stored as CSV files | 1 TB | 236 seconds | 1.15 TB | $5.75 |
| Data stored in Apache Parquet format | 130 GB | 6.78 seconds | 2.51 GB | $0.01 |
| Savings | 87% less when using Parquet | 34x faster | 99% less data scanned | 99.7% savings |

If spaces or special characters such as “(“ or “)” are present in the column names of the BEx query result that generates the Parquet file, the file throws an error in the Data Preview of the Data Flow activity in the Azure Synapse pipeline that loads the data into the Synapse table.

For example:

Fig 1. Column Name in Bex Query
Fig 2. The result displayed upon clicking Execute BEX Query button

The above screenshots are from Theobald and appear when we enter the query in the Cube Name field to search for and select it. Clicking the Execute BEx Query button and then the detect-columns button displays the query result.

The columns get populated when we close the query result dialog box.

Under Custom Name, we see column names made up of multiple words separated by spaces or enclosed in parentheses.

There is an option to preview the data in the Data Flow activity of the Synapse pipeline.

For example: Pipeline name = PL_EDW_Backorder_Comment_To_HollisterDW

Error:

execute, tree: Exchange SinglePartition, [id=#44] +- *(1) LocalLimit 1000 +- *(1) FileScan parquet [Plant#461,Material#462,Col2#463,Product Line#464,Product Type#465,Material Group#466,Corporate Brand#467,Ext. Matl Group#468,MRP controller#469,Col9#470,Proj ATP Date#471,Available To Promise#472,Comments#473,Allocation#474,Base Unit#475,BO Quantity in BUM#476,BO Value#477,OHI in BUM#478,Transit In BUM#479,Days on Backorder#480] Batched: true, Format: Parquet, Location: InMemoryFileIndex[abfss://[email protected]/Landing/FTP/SLBO/…, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Plant:string,Material:string,Col2:string,Product Line:string,Product Type:string,Material …

In the Source tab, while previewing the data, we encounter the above error.
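
Synapse Data Flows execute on Spark, and Spark's Parquet support has historically rejected attribute names containing characters such as spaces, commas, or parentheses. The sketch below is a hedged, local reproduction of the same class of failure, assuming a local PySpark installation; it is not the actual pipeline code, and the exact exception depends on the Spark runtime version.

```python
# Hedged reproduction sketch (assumes a local PySpark installation; not the actual
# Synapse Data Flow). pyarrow happily writes column names containing spaces and
# parentheses, but older Spark runtimes refuse to work with them.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Column names mimic the BEx query Custom Names seen above.
pd.DataFrame({"Product Line": ["A"], "BO Value (USD)": [10.0]}).to_parquet("bad_names.parquet")

df = spark.read.parquet("bad_names.parquet")
# On affected Spark versions this fails with an AnalysisException along the lines of:
#   Attribute name "Product Line" contains invalid character(s) among " ,;{}()\n\t=".
df.show()
```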

Taking a closer look at the BEx query in Theobald, we see that the column names contain spaces, dots (“.”), and special characters such as strings enclosed in parentheses.

Unless we fix these column names, the data will not get loaded into the Synapse table.
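
One way to catch this before the pipeline runs is to inspect the Parquet footer and flag offending names; a minimal sketch with pyarrow (the file path is illustrative):

```python
# Pre-flight check sketch (file path is illustrative): read only the Parquet schema
# and flag column names the downstream Data Flow is likely to reject.
import re
import pyarrow.parquet as pq

# Characters Spark's Parquet support has historically rejected, plus "." which also
# appears in the BEx Custom Names above.
INVALID = re.compile(r"[ ,;{}()=.\t\n]")

schema = pq.read_schema("backorders.parquet")
bad = [name for name in schema.names if INVALID.search(name)]
if bad:
    print("Rename these columns in the BEx query before running the job:", bad)
```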

The Solution Provided:

To resolve the problem, before running the job we removed the spaces and any special characters from the Custom Name fields of the BEx query, so that all the characters in each column name run continuously without spaces or other separators. Running the job then places a Parquet file whose column names contain no spaces or other special characters in the ADLS Gen2 location. The screenshot below shows how the column names should look:

Fig 4. Spaces and special characters such as dots and parentheses removed
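
The actual fix is made in the Theobald BEx query's Custom Name field as shown above. Purely as an illustration of the same naming rule, the cleanup could also be expressed in code (file names are placeholders):

```python
# Illustration only: apply the same naming rule programmatically by stripping spaces,
# dots, and parentheses so every column name becomes one continuous token.
import re
import pyarrow.parquet as pq

table = pq.read_table("bad_names.parquet")
clean = table.rename_columns(
    [re.sub(r"[ ,;{}()=.\t\n]", "", name) for name in table.column_names]
)
pq.write_table(clean, "clean_names.parquet")  # e.g. "Product Line" -> "ProductLine"
```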

In the Data Flow activity in the Azure Synapse pipeline, we can now preview the incoming file, and the error is resolved:

Fig 5. Data Preview showing incoming file data

Fig 6. Pipeline ran successfully

We can see that the data is loaded into the Synapse table.

Table name = STG_FTP_SLBO.ConsolidatedBO

Fig 7. Data loaded into the Synapse table

The solution described in this blog will help correct the column name definitions so that the BEx query extract loads into Azure Synapse without errors.

"We partnered with TekLink based upon a previous positive experience where they did a great job for us as well as hearing positive feedback about their excellent level of service. We’ve also interviewed many of their experts and consistently found their candidates to be the most technically sound with excellent BW and HANA knowledge above and beyond that of their competitors. Teklink has certainly helped us to stabilize and improve the reliability of our BI operations."
Patrick Bachman
Patrick Bachman

IT Architect

“TekLink’s team exceeded Kellogg Latin America’s expectations with the implementation of Anaplan. Not only their diligence and technical mastery were evident, but also provided critical and out-of-the-box solutions to meet the project’s criteria and expand its scope.”
Francisco Ibarra
Francisco Ibarra

Sr. Manager

“TekLink provided worry free BEx to AO Migration by analyzing and converting our 500+ BEx workbooks to Analysis for Office while also increasing adoption by running power user workshops.”
Lakshmi Thota
Lakshmi Thota

Sr. Manager

Contact Us