[Data Preprocessing] Reading SQL Data with pandas (Supports Reading Millions of Rows)

ChenVast

32,196 views · Published 2018-08-14 18:14:51

Two pandas functions do most of the work:

1. read_sql

Signature:

pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)

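Purpose: reads a SQL query or a database table into a DataFrame.

This function is a convenience wrapper around read_sql_table and read_sql_query (kept for backward compatibility). It delegates to the specific function depending on the input: a SQL query is routed to read_sql_query, while a database table name is routed to read_sql_table. Note that the delegated functions may have more specific notes about their behavior that are not listed here.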
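As a rough illustration of the query path of this routing, the sketch below compares read_sql with read_sql_query on the same query string. It assumes an in-memory SQLite database opened with the standard-library sqlite3 module so that it runs without a server; the `city` table and its rows are invented for the example. The table-name path is not shown because it requires a SQLAlchemy connectable.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a small, made-up table (illustrative data only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE city (name TEXT, population INTEGER)")
con.executemany(
    "INSERT INTO city VALUES (?, ?)",
    [("Kabul", 1780000), ("Qandahar", 237500), ("Herat", 186800)],
)
con.commit()

query = "SELECT name, population FROM city ORDER BY population DESC"

# read_sql receives a query string here, so it behaves like read_sql_query.
via_read_sql = pd.read_sql(query, con)
via_read_sql_query = pd.read_sql_query(query, con)

# Both calls should produce identical DataFrames.
print(via_read_sql.equals(via_read_sql_query))
con.close()
```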

Parameters:

sql : string or SQLAlchemy Selectable (select or text object)

SQL query to be executed or a table name.

con : SQLAlchemy connectable (engine/connection), database string URI, or DBAPI2 connection (fallback mode)

Using SQLAlchemy makes it possible to use any database supported by that library. If a DBAPI2 object is passed, only sqlite3 is supported.

index_col : string or list of strings, optional, default: None

Column(s) to set as index (MultiIndex).

coerce_float : boolean, default True

Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.

params : list, tuple or dict, optional, default: None

List of parameters to pass to the execute method. The syntax used to pass parameters is database driver dependent. Check your database driver documentation for which of the five syntax styles, described in PEP 249's paramstyle, is supported. For example, psycopg2 uses %(name)s, so use params={'name': 'value'}.

parse_dates : list or dict, default: None

List of column names to parse as dates.

Dict of {column_name: format string}, where the format string is strftime-compatible when parsing string times, or is one of (D, s, ns, ms, us) when parsing integer timestamps.

Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime(). Especially useful with databases without native datetime support, such as SQLite.

columns : list, default: None

List of column names to select from SQL table (only used when reading a table).

chunksize : int, default None

If specified, return an iterator where chunksize is the number of rows to include in each chunk.

Returns:

DataFrame, or an iterator of DataFrames if chunksize is specified.
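To make a few of these parameters concrete, here is a minimal sketch that combines params, index_col, and parse_dates. It assumes an in-memory SQLite database (standard-library sqlite3, which uses the "qmark" placeholder style); the `orders` table, its columns, and its rows are hypothetical and exist only for the example. With another driver the placeholder syntax would differ, as described under params.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with made-up rows (illustrative data only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, customer TEXT, created TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "alice", "2018-08-01 09:30:00"),
        (2, "bob", "2018-08-02 14:00:00"),
        (3, "alice", "2018-08-14 18:14:51"),
    ],
)
con.commit()

# sqlite3 uses the "qmark" paramstyle, so the placeholder is written as "?".
orders = pd.read_sql(
    "SELECT id, customer, created FROM orders WHERE customer = ? ORDER BY id",
    con,
    params=("alice",),                              # value bound to the "?"
    index_col="id",                                 # use the id column as the index
    parse_dates={"created": "%Y-%m-%d %H:%M:%S"},   # strftime-style format string
)

print(orders)
print(orders.dtypes)
con.close()
```

Passing values through params rather than formatting them into the SQL string lets the driver handle quoting.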

Usage example

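import pandas as pd
import pymysql

# Open a DBAPI2 connection to a local MySQL server.
# Note: pandas only tests SQLAlchemy connectables and sqlite3 connections;
# recent versions warn when given other DBAPI2 objects such as this one.
con = pymysql.connect(host="127.0.0.1", user="root", password="password", database="world")
# Run the query and load the result into a DataFrame.
# "SELECT * FROM city" is a placeholder (assumes a city table); use your own query.
data_sql = pd.read_sql("SELECT * FROM city", con)
# Save the result to a CSV file.
data_sql.to_csv("test.csv")
con.close()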
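For result sets with millions of rows, loading everything into one DataFrame may not fit in memory. One possible approach, sketched below, is to pass chunksize so that read_sql returns an iterator and each chunk can be appended to the CSV file in turn. The sketch assumes an in-memory SQLite database with a made-up `readings` table of 10,000 rows and a temporary output path, all of which are stand-ins for a real database and file.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# In-memory SQLite database filled with 10,000 made-up rows (illustrative only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (id INTEGER, value REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    ((i, i * 0.5) for i in range(10_000)),
)
con.commit()

# Hypothetical output file inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "readings.csv")
total_rows = 0

# With chunksize set, read_sql returns an iterator of DataFrames instead of
# a single DataFrame; each chunk holds at most 2,500 rows here.
chunks = pd.read_sql("SELECT * FROM readings", con, chunksize=2_500)
for i, chunk in enumerate(chunks):
    # Write the header only with the first chunk, then append the rest.
    chunk.to_csv(out_path, mode="w" if i == 0 else "a", header=(i == 0), index=False)
    total_rows += len(chunk)

con.close()
print(total_rows)
```

chunksize bounds the size of each DataFrame, but whether the driver itself buffers the full result set depends on the driver and its cursor settings, so it is worth checking the driver documentation for very large queries.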

2. read_sql_table

Signature:

pandas.read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None)

Purpose: reads a SQL database table into a DataFrame.

Given a table name and a SQLAlchemy connectable, returns a DataFrame. This function does not support DBAPI connections.

Parameters:

table_name : string

Name of SQL table in database.

con : SQLAlchemy connectable (or database string URI)

SQLite DBAPI connection mode not supported.

schema : string, default None

Name of SQL schema in database to query (if database flavor supports this). Uses default schema if None (default).

index_col : string or list of strings, optional, default: None

Column(s) to set as index (MultiIndex).

coerce_float : boolean, default True

Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point. Can result in loss of precision.

parse_dates : list or dict, default: None

List of column names to parse as dates.

Dict of {column_name: format string}, where the format string is strftime-compatible when parsing string times, or is one of (D, s, ns, ms, us) when parsing integer timestamps.

Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime(). Especially useful with databases without native datetime support, such as SQLite.

columns : list, default: None

List of column names to select from the SQL table.

chunksize : int, default None

If specified, returns an iterator where chunksize is the number of rows to include in each chunk.

Returns:

DataFrame, or an iterator of DataFrames if chunksize is specified.

Usage example

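import pandas as pd
from sqlalchemy import create_engine

# The "mysql+pymysql" URL makes SQLAlchemy load the pymysql driver, so pymysql
# must be installed even though it is not imported here.
# user_name, password, and database_name are placeholders.
con = create_engine("mysql+pymysql://user_name:password@127.0.0.1:3306/database_name")
# Read the whole table into a DataFrame.
data = pd.read_sql_table("table_name", con)
# Save the table to a CSV file.
data.to_csv("table_name.csv")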
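The columns, index_col, and chunksize parameters of read_sql_table can be tried without a MySQL server. The sketch below is one way to do that, assuming SQLAlchemy is installed and using an in-memory SQLite engine as a stand-in for the MySQL engine; the `city` table and its values are made up for the example.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite engine standing in for a real database (assumption:
# SQLAlchemy is installed; no database server is needed).
engine = create_engine("sqlite://")

# Create a small, made-up table to read back (illustrative data only).
pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "name": ["Kabul", "Qandahar", "Herat", "Mazar-e-Sharif"],
        "population": [1780000, 237500, 186800, 127800],
        "district": ["Kabol", "Qandahar", "Herat", "Balkh"],
    }
).to_sql("city", engine, index=False)

# Read only two columns of the table and use "id" as the index.
cities = pd.read_sql_table(
    "city", engine, index_col="id", columns=["name", "population"]
)
print(cities)

# With chunksize set, the table is returned as an iterator of DataFrames.
sizes = [len(chunk) for chunk in pd.read_sql_table("city", engine, chunksize=2)]
print(sizes)
```

Selecting only the needed columns reduces the amount of data transferred, which matters more as tables grow.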
