Timing Out Redshift

Backstory: We have a dataset provided by an outside consultant, which is an export from SPSS.

11.3K rows by 1,174 columns.

The columns contain replies to a survey.

The results are in categorical values; for example, the answers to "How much do you agree with..." are coded from 1 (strongly disagree) to 5 (strongly agree).

The end user of the dataset wants the values recoded to store the text rather than the code, and the columns renamed from the question number to the plain text of the question (which breaks my relational database thinking a bit, but I understand the reasoning).

We have a datamap file which gives the plain-text meaning of the different values, and I have to recode the columns.

I had to split the dataset into 6 subsets to get an SQL dataflow to run at all.

I've managed to get 4 of the 6 working as MySQL dataflows, but 2 of them will not run.
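For reference, a minimal sketch of that kind of column split (grouping columns into subsets by question-number prefix, and keeping the respondent key in every subset so the pieces can be re-joined later). The column names here are made up for illustration, not the real schema:

```python
from collections import defaultdict

# Hypothetical column names; the idea is to bucket a very wide survey
# export into per-subject subsets by question-number prefix.
columns = ["RespondentID", "Q50_1", "Q50_21", "Q51", "Q52_1", "Q52_2"]
key_cols = ["RespondentID"]  # keep the join key in every subset

subsets = defaultdict(list)
for col in columns:
    if col in key_cols:
        continue
    prefix = col.split("_")[0]  # "Q50_21" -> "Q50"
    subsets[prefix].append(col)

# Each subset keeps the respondent key so the parts can be re-joined later.
split = {subject: key_cols + cols for subject, cols in subsets.items()}
print(split["Q50"])  # ['RespondentID', 'Q50_1', 'Q50_21']
```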

 

My code is 250-odd SELECT statements, each selecting the appropriate value from the datamap based on the possible responses.

E.g. a few lines, with identifying information removed:

 

(select "Selection" from "dataset" where "Source Value"="Q50_21" and "Check"='Yes/No') as "Value Q50 From the list below, plea",
(select "Selection" from "dataset" where "Source Value"="Q51" and "Check"='Q51') as "Now please think about",
(select "Selection" from "dataset" where "Source Value"="Q52_1" and "Check"='Appeal') as "Value Q52 X- How appealing do",
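Roughly what each of those scalar subqueries is doing, sketched as a single join-style lookup instead. This is a standalone illustration in Python with sqlite3, with invented table names, column names, and sample values (not the real schema or datamap), just to show the lookup pattern:

```python
import sqlite3

# Hypothetical datamap: (source value, check, raw code) -> plain-text selection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datamap (source_value TEXT, chk TEXT, code TEXT, selection TEXT)"
)
conn.executemany(
    "INSERT INTO datamap VALUES (?, ?, ?, ?)",
    [
        ("Q52_1", "Appeal", "1", "Not at all appealing"),
        ("Q52_1", "Appeal", "5", "Very appealing"),
    ],
)

# Hypothetical survey table: one raw coded answer per respondent.
conn.execute("CREATE TABLE survey (respondent INTEGER, q52_1 TEXT)")
conn.executemany("INSERT INTO survey VALUES (?, ?)", [(1, "1"), (2, "5")])

# Recode with a plain LEFT JOIN against the datamap, rather than one
# correlated scalar subquery per output column.
rows = conn.execute(
    """
    SELECT s.respondent, d.selection
    FROM survey s
    LEFT JOIN datamap d
      ON d.source_value = 'Q52_1' AND d.chk = 'Appeal' AND d.code = s.q52_1
    """
).fetchall()
print(rows)  # [(1, 'Not at all appealing'), (2, 'Very appealing')]
```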

 

When I run it as MySQL, I get: "Row size is wider than MySQL supports. Please reduce the row size or number of columns."

When I run it in Redshift, it runs for an hour, reaches 74% progress, and then fails with: "The database reported a syntax error. [Amazon](600001) The server closed the connection."

I'm pretty sure I'm not doing it in the most efficient way possible, but this was a one-off survey which won't be repeated, so the dataset will never change.

I literally need it to run correctly once, and to be honest, after two days I don't care about efficiency anymore; I just want it done.

 

Any help anyone can suggest would be great.


Comments

  • Oh man, survey data in a relational database can be so difficult.  I started having heart palpitations and sweaty palms while reading your problem description.

    When processing surveys for a client I had some of the same issues.  Too many columns.  Column names too long.  Manual mapping. Vague errors. Etc.

     

    What worked for us was a combination of things.

    1. Focusing on the important answers.  Not every column is necessarily actionable.
    2. Splitting the dataset into multiple datasets by subject. Not only is it challenging to address the whole dataset at once, it's hard to visualize.  Domo isn't going to let you display a thousand columns at the same time, and building that out in a table card could be really challenging. That's a lot of columns to sift through.
    3. Related to the first two points, getting something out there and building on it.  It's better to have something incomplete but actionable than to forever spin on resolving every issue before publication.  It's easier to find errors this way, too.

    I don't know if this necessarily helps, and others might have better approaches, but it's a real challenge you have.