
python 3.x - Pandas .to_sql fails silently randomly

I have several large pandas dataframes (30k+ rows each) and need to upload a different version of them to an MS SQL Server database every day. I am trying to do so with the pandas to_sql function. On occasion it works, but other times it fails silently: the code behaves as if all of the data had been uploaded, yet not a single row reaches the table.

Here is my code:

import urllib.parse

import sqlalchemy


class SQLServerHandler(DataBaseHandler):
    
    ...


    def _getSQLAlchemyEngine(self):
        '''
            Get an sqlalchemy engine
            from the connection string

            The fast_executemany fails silently:

            https://stackoverflow.com/questions/48307008/pandas-to-sql-doesnt-insert-any-data-in-my-table/55406717
        '''
        # escape special characters as required by sqlalchemy
        dbParams = urllib.parse.quote_plus(self.connectionString)
        # create engine
        engine = sqlalchemy.create_engine(
            'mssql+pyodbc:///?odbc_connect={}'.format(dbParams))

        return engine

    @logExecutionTime('Time taken to upload dataframe:')
    def uploadData(self, tableName, dataBaseSchema, dataFrame):
        '''
            Upload a pandas dataFrame
            to a database table <tableName>
        '''
        engine = self._getSQLAlchemyEngine()

        dataFrame.to_sql(
            tableName,
            con=engine,
            index=False,
            if_exists='append',
            method='multi',
            chunksize=50,
            schema=dataBaseSchema)

Switching the method to None seems to work properly, but then the data takes an insane amount of time to upload (30+ minutes). With 20 or so tables of this size every day, that rules this solution out.

The solution proposed here, adding the schema as a parameter, doesn't work. Neither does creating a SQLAlchemy session and passing it to the con parameter with session.get_bind().

I am using:

  • ODBC Driver 17 for SQL Server
  • pandas 1.2.1
  • sqlalchemy 1.3.22
  • pyodbc 4.0.30

Does anyone know how to make it raise an exception if it fails?

Or why it is not uploading any data?
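
One way to at least make the failure loud would be to count the rows before and after the upload and raise if the difference doesn't match the dataframe length. This is only a minimal sketch reusing the names from uploadData above; uploadDataChecked and _rowCount are hypothetical helpers, not from the original code:

import sqlalchemy

def _rowCount(conn, schema, table):
    # schema/table come from trusted code, not user input,
    # so plain string formatting is acceptable here
    return conn.execute(
        sqlalchemy.text('SELECT COUNT(*) FROM {}.{}'.format(schema, table))
    ).scalar()

def uploadDataChecked(self, tableName, dataBaseSchema, dataFrame):
    engine = self._getSQLAlchemyEngine()

    with engine.connect() as conn:
        before = _rowCount(conn, dataBaseSchema, tableName)

    dataFrame.to_sql(
        tableName,
        con=engine,
        index=False,
        if_exists='append',
        schema=dataBaseSchema)

    with engine.connect() as conn:
        after = _rowCount(conn, dataBaseSchema, tableName)

    inserted = after - before
    if inserted != len(dataFrame):
        # surface the problem instead of letting it pass silently
        raise RuntimeError(
            'Expected {} rows, but only {} were inserted into {}.{}'.format(
                len(dataFrame), inserted, dataBaseSchema, tableName))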

Question from: https://stackoverflow.com/questions/65940670/pandas-to-sql-fails-silently-randomly


1 Answer


In rebuttal to this answer: if to_sql() were falling victim to the issue described in

SQL Server does not finish execution of a large batch of SQL statements

then it would have to be constructing large anonymous code blocks of the form

-- Note no SET NOCOUNT ON;
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (0, 'row0');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (1, 'row1');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (2, 'row2');
…

and that is not what to_sql() is doing. If it were, then it would start to fail well below 1_000 rows, at least on SQL Server 2017 Express Edition:

import pandas as pd
import pyodbc
import sqlalchemy as sa

print(pyodbc.version)  # 4.0.30

table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}")  # 400

cnxn = pyodbc.connect("DSN=mssqlLocal64", autocommit=True)
crsr = cnxn.cursor()

crsr.execute(f"TRUNCATE TABLE {table_name}")

sql = "".join(
    [
        f"INSERT INTO {table_name} ([id], [txt]) VALUES ({i}, 'row{i}');"
        for i in range(num_rows)
    ]
)
crsr.execute(sql)

row_count = crsr.execute(f"SELECT COUNT(*) FROM {table_name}").fetchval()
print(f"row_count: {row_count}")  # 316

Using to_sql() for that same operation works

import pandas as pd
import pyodbc
import sqlalchemy as sa

print(pyodbc.version)  # 4.0.30

table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}")  # 400

df = pd.DataFrame(
    [(i, f"row{i}") for i in range(num_rows)], columns=["id", "txt"]
)

engine = sa.create_engine(
    "mssql+pyodbc://@mssqlLocal64", fast_executemany=True
)

df.to_sql(
    table_name,
    engine,
    index=False,
    if_exists="replace",
)

with engine.connect() as conn:
    row_count = conn.execute(
        sa.text(f"SELECT COUNT(*) FROM {table_name}")
    ).scalar()
    print(f"row_count: {row_count}")  # 400

and indeed will work for thousands and even millions of rows. (I did a successful test with 5_000_000 rows.)
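
Translating that back to the handler in the question, one sketch (untested, and only a suggestion consistent with the test above, since the question's docstring reports fast_executemany having failed silently before) would be to enable fast_executemany on the engine and drop method='multi' and chunksize, letting pyodbc handle the bulk insert:

import urllib.parse

import sqlalchemy


class SQLServerHandler(DataBaseHandler):

    ...

    def _getSQLAlchemyEngine(self):
        dbParams = urllib.parse.quote_plus(self.connectionString)
        # fast_executemany is a pyodbc-dialect option accepted by
        # create_engine() since sqlalchemy 1.3
        return sqlalchemy.create_engine(
            'mssql+pyodbc:///?odbc_connect={}'.format(dbParams),
            fast_executemany=True)

    def uploadData(self, tableName, dataBaseSchema, dataFrame):
        engine = self._getSQLAlchemyEngine()
        dataFrame.to_sql(
            tableName,
            con=engine,
            index=False,
            if_exists='append',
            schema=dataBaseSchema)  # no method='multi', no chunksize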

