
django - PostgreSQL: How to store and fetch historical SQL data from Azure Data Lake Storage (ADLS)

I have a single Django web application deployed on Azure with a transactional SQL DB, i.e. PostgreSQL.

Within the Django application, this historical data needs to be accessed every day (e.g. to show patterns over a period of years, months, etc.) from ADLS.

However, ADLS will only return files (one or many), so my application needs an intermediary such as Azure Synapse to convert this unstructured data into a structured DB, in order to run queries on the historical data and show it within the web application.

Question A) Would Azure Synapse fulfil this 'unstructured to structured' conversion requirement, or is there another Azure alternative?

Question B) Since Django is inherently tied to its ORM (Object-Relational Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (e.g. ArrayField, JSONField, etc.)?

This entire exercise is being undertaken in order to store older historical data in a large repository and also access/query data from that ADLS repository whenever required.

Please advise which Azure alternatives may work in this case.



1 Answer


You need to break down your problem. For each piece you have multiple choices, with different cost implications, implementation complexity, and degrees of control/flexibility.


Question A) Would Azure Synapse fulfil this 'unstructured to structured' conversion requirement, or is there another Azure alternative?

Synapse Serverless SQL Pool lets you query JSON files in the Data Lake without a physical DB. It is compute only, with no storage of its own.

This suits infrequent access to large datasets, because every query reads and parses the data in the Data Lake again.

If you want, you can also COPY INTO some_table all the data from the files and then run queries more efficiently against some_table (which is stored in the DB, with indexes, partitions, ...) using a dedicated Synapse SQL Pool, as sketched below.
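A minimal sketch of such a load, assuming hypothetical account, container, and table names (note that COPY INTO accepts CSV, Parquet, and ORC, so raw JSON files would first need converting, e.g. by the pipeline that lands them):

-- Dedicated SQL Pool: bulk-load files from ADLS into a regular table
COPY INTO dbo.some_table
FROM 'https://myaccount.blob.core.windows.net/history/books/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);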

E.g. the following JSON

    {  
       "_id":"ahokw88",
       "type":"Book",
       "title":"The AWK Programming Language",
       "year":"1988",
       "publisher":"Addison-Wesley",
       "authors":[  
          "Alfred V. Aho",
          "Brian W. Kernighan",
          "Peter J. Weinberger"
       ],
       "source":"DBLP"
    }

can be queried with the following SQL (FORMAT = 'CSV' with 0x0b terminators is the usual serverless trick: 0x0b never occurs in the files, so each JSON document arrives as a single jsonContent value for JSON_VALUE to parse):

SELECT
    JSON_VALUE(jsonContent, '$.title') AS title
  , JSON_VALUE(jsonContent, '$.publisher') AS publisher
  , jsonContent
FROM OPENROWSET
  (
      BULK 'json/books/*.json',
      DATA_SOURCE = 'SqlOnDemandDemo'
    , FORMAT='CSV'
    , FIELDTERMINATOR ='0x0b'
    , FIELDQUOTE = '0x0b'
    , ROWTERMINATOR = '0x0b'
  )
WITH
  ( jsonContent varchar(8000) ) AS [r]
WHERE
  JSON_VALUE(jsonContent, '$.title') = 'The AWK Programming Language'

Question B) Since Django is inherently tied to its ORM (Object-Relational Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (e.g. ArrayField, JSONField, etc.)?

Synapse offers good old JDBC/ODBC drivers, so as long as your ORM layer can talk to such a source you should be good to go (Django, being Python, would typically go through an ODBC-based SQL Server backend rather than JDBC). Remember that the underlying data source (Synapse) is built for MPP (massively parallel processing), not for transactional processing. Inserting 1000 rows in a for loop with individual INSERT INTO ... statements could take 1000 seconds, while querying 10 million rows with a single SELECT ... statement would probably take less than 100. So know what you use it for.
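To make the MPP point concrete, a small illustration with hypothetical tables:

-- Anti-pattern on an MPP warehouse: one round trip per row
INSERT INTO history_events VALUES (1, '1988-01-01', 'published');
INSERT INTO history_events VALUES (2, '1989-06-01', 'reprinted');
-- ...a 1000-iteration loop pays the distributed-query overhead 1000 times

-- Preferred: a single set-based statement that moves all rows at once
INSERT INTO history_events (id, event_date, event_type)
SELECT id, event_date, event_type
FROM staging_events;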


Does Synapse have to be configured with both the App DB and ADLS in a pipeline system through Azure Data Factory? And is this achievable for a PostgreSQL DB? Since I could not find Azure docs that talk specifically about PostgreSQL DB <---> ADLS connections. – Simran 14 hours ago

You're mixing things here. You can NOT use Synapse to give a single view of data across two data sources: 1) PostgreSQL, 2) ADLS.

The only source for Serverless is ADLS.

You can do this using Data Factory, which lets you create two data sources (ADLS and PostgreSQL), read from both, merge them into a new data set, and write the output to an output data sink such as PostgreSQL. Your Django code would then read it from PostgreSQL as usual (see the sketch below).
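As a very rough sketch, the heart of such a pipeline is a Copy activity along these lines; all names here are placeholders, and the exact dataset and sink type strings should be verified against the Data Factory docs:

    {
       "name": "CopyHistoricalDataToPostgres",
       "type": "Copy",
       "inputs": [ { "referenceName": "AdlsHistoricalJsonFiles", "type": "DatasetReference" } ],
       "outputs": [ { "referenceName": "PostgresHistoryTable", "type": "DatasetReference" } ],
       "typeProperties": {
          "source": { "type": "JsonSource" },
          "sink": { "type": "AzurePostgreSqlSink" }
       }
    }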


Understand the cost and performance implications of each piece before you make a decision:

  • Serverless SQL Pool
  • Dedicated SQL Pool
  • Data Factory
